CN114724226A - Expression recognition model training method, electronic device and storage medium - Google Patents
Expression recognition model training method, electronic device and storage medium
- Publication number
- CN114724226A (application CN202210439565.4A)
- Authority
- CN
- China
- Prior art keywords
- expression
- training
- recognition model
- type
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to the technical field of facial recognition, and in particular to an expression recognition model training method, an electronic device and a storage medium. In the expression recognition model training method, a basic feature extraction network and model initial data are first obtained, after which a first training image set and a second training image set are obtained. Further, the expression feature weights in the first training image set and in the second training image set are obtained, and first type action units and second type action units are then obtained by screening. The first type action units are taken as positive sample data and the second type action units as negative sample data, and the two are mixed to form an input sample set. Finally, the initial recognition model is optimized using the input sample set to obtain an optimized recognition model. The optimized recognition model can reasonably adjust the expression feature weights of the first type action units and the second type action units, thereby improving accuracy in the expression recognition process.
Description
Technical Field
The invention relates to the technical field of facial recognition, in particular to an expression recognition model training method, electronic equipment and a storage medium.
Background
In recent years, with the large-scale acquisition of face data in the industry, application scenarios built on face data have proliferated. Among these applications, facial expression recognition has become an important component and is widely used in human-computer interaction systems such as social robots, live video streaming and driver fatigue monitoring.
Currently, there is some related research on facial expression recognition in academia and industry. Expression recognition model training methods based on Action Units (AUs) aim to learn different expressions from the combinations of changes at different facial positions, so as to optimize a facial expression recognition model at a fine-grained characterization level. However, owing to the richness of facial expressions and the subtle inter-class differences among the various expressions, the recognition accuracy of the expression recognition model training methods in the related art remains low. Therefore, how to further improve the expression recognition accuracy of facial expression recognition models is still an urgent problem for the industry.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides an expression recognition model training method, an electronic device and a storage medium, which can improve the accuracy of facial expression recognition.
The expression recognition model training method according to the embodiment of the first aspect of the invention comprises the following steps:
acquiring a basic feature extraction network and model initial data according to an initial recognition model, wherein the basic feature extraction network is used for extracting facial features in the initial recognition model;
acquiring a first training atlas and a second training atlas according to the model initial data, wherein the first training atlas is a training atlas on which the initial recognition model recognizes expressions incorrectly, and the second training atlas is a training atlas on which the initial recognition model recognizes expressions correctly;
acquiring expression feature weights in the first training image set and expression feature weights in the second training image set based on the basic feature extraction network, wherein the expression feature weights are recognition judgment weights of various facial action units in an expression recognition process;
screening the facial action units with the expression feature weights higher than a first preset threshold value in the first training image set to obtain first type action units;
screening the facial action units with the expression feature weights lower than a second preset threshold value in the second training image set to obtain second type action units;
taking the first type action units as positive sample data and the second type action units as negative sample data, and mixing the positive sample data and the negative sample data to form an input sample set;
and performing optimization training on the initial recognition model by using the input sample set to obtain an optimized recognition model.
Optionally, according to some embodiments of the present invention, the obtaining, based on the basic feature extraction network, the expression feature weights in the first training image set and the expression feature weights in the second training image set includes:
respectively extracting features of the facial regions in the first training image set and the facial regions in the second training image set;
acquiring a first feature extraction vector for describing a facial region in the first training image set and a second feature extraction vector for describing the facial region in the second training image set based on the basic feature extraction network;
obtaining the expression feature weight of the first training atlas according to the first feature extraction vector;
and obtaining the expression feature weight of the second training atlas according to the second feature extraction vector.
Optionally, according to some embodiments of the present invention, taking the first type of action units as positive sample data and the second type of action units as negative sample data and mixing the positive sample data and the negative sample data to form an input sample set includes:
mixing the positive sample data and the negative sample data to obtain mixed data;
adjusting the proportion of the positive sample data and the negative sample data in the mixed data;
and when the proportion of the positive sample data in the mixed data is smaller than that of the negative sample data in the mixed data, taking the adjusted mixed data as the input sample set.
Optionally, according to some embodiments of the present invention, the performing optimization training on the initial recognition model by using the input sample set to obtain an optimized recognition model includes:
loading the input sample set into the base feature extraction network of the initial recognition model;
vectorizing the input sample set based on the basic feature extraction network to generate an input feature extraction vector;
and importing the input feature extraction vector into a classification layer of the initial recognition model.
Optionally, according to some embodiments of the present invention, the performing optimization training on the initial recognition model with the input sample set to obtain an optimized recognition model further includes:
acquiring the recognition classification scores of the classification layers, and acquiring the expression recognition accuracy of the classification prediction in the current round according to the recognition classification scores;
after the expression recognition accuracy of the current round of classification prediction is obtained, adjusting the expression feature weight of the first type of action units and adjusting the expression feature weight of the second type of action units;
after the expression feature weights of the first type of action units and the expression feature weights of the second type of action units are adjusted, performing iterative training on the initial recognition model based on the input sample set;
after each round of iterative training, counting the change condition of the expression recognition accuracy rate;
and when the expression recognition accuracy rate is converged to a first fixed value, stopping the iterative training and obtaining the optimized recognition model.
Optionally, according to some embodiments of the present invention, after obtaining the expression recognition accuracy of the current round of classification prediction, adjusting the expression feature weight of the first type of action unit and adjusting the expression feature weight of the second type of action unit includes:
when the expression recognition accuracy gradually increases, reducing the expression feature weight that the initial recognition model assigns to the first type of action units;
and increasing the expression feature weight of the second type of action units in the initial recognition model in step with the decreasing trend of the expression feature weight of the first type of action units.
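For illustration, the weight adjustment described above might be sketched as follows; the step size and the dictionary representation of per-action-unit weights are assumptions, not part of the patent.

```python
# A minimal sketch, assuming Python dicts of per-action-unit weights;
# `step` is an illustrative adjustment increment.
def adjust_weights(first_class_weights, second_class_weights,
                   accuracy_increasing, step=0.05):
    if accuracy_increasing:
        # reduce the weight given to the unreliable first type of action units ...
        for au in first_class_weights:
            first_class_weights[au] = max(0.0, first_class_weights[au] - step)
        # ... and raise the weight of the reliable second type of action units in step
        for au in second_class_weights:
            second_class_weights[au] = min(1.0, second_class_weights[au] + step)
    return first_class_weights, second_class_weights
```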
Optionally, according to some embodiments of the invention, the method further comprises:
counting the change condition of the loss function output value in each iteration training;
and when the expression recognition accuracy rate is converged to a first fixed value or the loss function output value is converged to a second fixed value, stopping the iterative training and obtaining the optimized recognition model.
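A minimal sketch of the stopping rule is shown below, under the assumption of a hypothetical train_one_round interface: iterative training stops once either the expression recognition accuracy or the loss function output value converges to a fixed value.

```python
# A minimal sketch; the tolerance, window and train_one_round API are assumptions.
def has_converged(history, tol=1e-4, window=3):
    """history: list of per-round accuracy (or loss) values."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return max(recent) - min(recent) < tol

def optimization_training(model, input_sample_set, max_rounds=100):
    acc_history, loss_history = [], []
    for _ in range(max_rounds):
        acc, loss = model.train_one_round(input_sample_set)  # hypothetical API
        acc_history.append(acc)
        loss_history.append(loss)
        if has_converged(acc_history) or has_converged(loss_history):
            break                      # accuracy or loss converged to a fixed value
    return model                       # the optimized recognition model
```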
Optionally, according to some embodiments of the invention, the method further comprises:
receiving gradient feedback on the expression recognition accuracy based on the statistics of the change in the expression recognition accuracy;
and when the statistics show that the expression recognition accuracy is gradually decreasing, correcting the input sample set.
In a second aspect, an embodiment of the present invention provides an electronic device, including: a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the expression recognition model training method according to any one of the embodiments of the first aspect of the invention.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores a program, and the program is executed by a processor to implement the expression recognition model training method according to any one of the embodiments of the first aspect of the present invention.
The expression recognition model training method, the electronic device and the storage medium provided by the embodiment of the invention at least have the following beneficial effects:
according to the expression recognition model training method, basic feature extraction network and model initial data are obtained according to an initial recognition model, and then a first training image set and a second training image set are obtained according to the model initial data. Further, the basic feature extraction network obtains expression feature weights in the first training image set and expression feature weights in the second training image set, then facial action units with the expression feature weights higher than a first preset threshold value in the first training image set are screened to obtain first type action units, and facial action units with the expression feature weights lower than a second preset threshold value in the second training image set are screened to obtain second type action units. And mixing the positive sample data and the negative sample data to form an input sample set by taking the first type action units as positive sample data and taking the second type action units as negative sample data. And finally, performing optimization training on the initial recognition model by using the input sample set to obtain an optimized recognition model. In the method, the first training atlas is a training atlas with wrong expression recognition of the initial recognition model, the second training atlas is a training atlas with correct expression recognition of the initial recognition model, therefore, the first type action units should not influence the expression recognition accuracy rate of the expression recognition with larger expression feature weight, the second kind of action units should influence the expression recognition accuracy rate of expression recognition by larger expression feature weight, so, the method comprises the steps of mixing positive sample data and negative sample data to form an input sample set by taking a first type of action unit as positive sample data and a second type of action unit as negative sample data, performing optimization training by using the input sample set to obtain an optimized recognition model, the optimized recognition model can reasonably adjust the expression feature weights of the first type of action units and the second type of action units, so that the accuracy in the expression recognition process is improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 2 is another schematic flow chart illustrating a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 3 is another schematic flow chart illustrating a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 4 is another schematic flow chart diagram of a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 6 is another schematic flow chart diagram illustrating a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart illustrating a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 8 is a schematic flow chart illustrating a method for training an expression recognition model according to an embodiment of the present invention;
FIG. 9 is a schematic view of an electronic device for implementing the expression recognition model training method according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "more than" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including it. Where "first" and "second" are used, they are only for distinguishing technical features and are not to be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, left, right, front, rear, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present specification, reference to the description of "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
In the description of the present invention, it should be noted that, unless otherwise explicitly defined, terms such as arrangement, installation and connection should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of these terms in the present invention in combination with the specific content of the technical solutions. In addition, the reference numerals applied to specific steps below do not limit the order of the steps or the execution logic; the execution order and logic between steps should be understood and inferred with reference to the corresponding explanatory text.
In recent years, with the large-scale collection of face data in the industry, application scenarios built on face data have proliferated. Among these applications, facial expression recognition has become an important component and is widely used in human-computer interaction systems such as social robots, live video streaming and driver fatigue monitoring.
Currently, there is some related research on facial expression recognition in academia and industry. Expression recognition model training methods based on Action Units (AUs) aim to learn different expressions from the combinations of changes at different facial positions, so as to optimize a facial expression recognition model at a fine-grained characterization level. At present, most deep-learning-based facial expression recognition learns facial expression features through convolutional neural networks and has achieved good results. However, because facial expressions are rich and the inter-class differences among the various expressions are subtle, and because expressions are affected by factors such as age group, gender and living background, each person interprets the same expression differently; as a result, the inter-class differences of some expressions are large while those of other expressions are slight, which is not conducive to expression recognition. Most conventional convolutional neural networks cannot extract discriminative features, which does not help improve the accuracy of facial expression recognition algorithms. In addition, improving recognition accuracy with a more complex neural network structure generally means more data labels, a more complicated training process and more computing resources, which limits the generality of expression recognition and makes real-time recognition of facial expressions impossible. Therefore, how to further improve the accuracy, robustness, stability and universality of facial expression recognition models is still an urgent problem to be solved in the industry.
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides an expression recognition model training method, electronic equipment and a storage medium, which can improve the expression recognition accuracy rate of facial expression recognition.
This is further explained below with reference to the drawings.
Referring to fig. 1, an expression recognition model training method according to an embodiment of the first aspect of the present invention includes:
Step S101, acquiring a basic feature extraction network and model initial data according to an initial recognition model;
It should be noted that the initial recognition model refers to an expression recognition model that has undergone a conventional expression recognition training process but has not yet undergone the expression recognition model training process of the present invention. The initial recognition model may adopt a basic feature extraction network established on an Encoder-Decoder model, or a basic feature extraction network established on a unidirectional multilayer Convolutional Neural Network (CNN). It should be understood that the basic feature extraction network is used for extracting facial features in the initial recognition model: its input is a training picture, and its output is the corresponding feature extraction vector obtained after the training picture has been processed for recognition. In addition, the model initial data refers to the training data obtained after the initial recognition model has completed conventional expression recognition training, and it can reflect the expression recognition capability of the initial recognition model. In some embodiments of the present invention, the model initial data includes the first training atlas, on which the initial recognition model recognizes expressions incorrectly, and the second training atlas, on which the initial recognition model recognizes expressions correctly, and may further include the expression feature weight of each facial action unit according to which the initial recognition model recognizes expressions. It should be understood that the model initial data includes, but is not limited to, the types of data mentioned above.
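For illustration, a minimal basic feature extraction network of the unidirectional multilayer CNN kind mentioned above might look like the following sketch; the framework (PyTorch) and all layer sizes are assumptions, not requirements of the description.

```python
# A minimal sketch of a basic feature extraction network, assuming PyTorch.
import torch.nn as nn

class BasicFeatureExtractor(nn.Module):
    """Maps a training picture to a feature extraction vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, feat_dim)

    def forward(self, x):                              # x: (B, 3, H, W) image batch
        return self.proj(self.conv(x).flatten(1))      # (B, feat_dim) feature vector
```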
Step S102, acquiring a first training image set and a second training image set according to model initial data;
It should be understood that the expression recognition level of the initial recognition model increases during the conventional expression recognition training. In some embodiments of the present invention, after the initial recognition model has undergone conventional expression recognition training, it needs to be tested with a labeled test atlas; after the conventionally trained initial recognition model performs expression recognition on the test atlas, recognition result labels are obtained. Further, the recognition result labels are compared with the test atlas's own labels: the images whose recognition result labels differ from their own labels form the training atlas on which the initial recognition model recognizes expressions incorrectly, namely the first training atlas, while the images whose recognition result labels are consistent with their own labels form the training atlas on which the initial recognition model recognizes expressions correctly, namely the second training atlas. It should be noted that, according to some embodiments provided by the present invention, the labels carried by the test atlas and the recognition result labels may fall into the following categories: Anger, Contempt, Disgust, Fear, Happy, Neutral, Sadness and Surprise. It should be understood that the labels of the test atlas and the recognition result labels can also follow other classification standards for facial expression recognition.
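A minimal sketch of this split into the first and second training atlases is shown below; the predict_label helper is hypothetical and simply stands in for one round of expression recognition by the conventionally trained initial recognition model.

```python
# A minimal sketch of step S102; predict_label is a hypothetical helper.
def split_training_atlas(initial_model, test_atlas):
    """Separate images the initial model labels wrongly from those it labels correctly."""
    first_atlas, second_atlas = [], []            # wrong / correct recognitions
    for image, own_label in test_atlas:           # own_label: the image's annotation
        result_label = initial_model.predict_label(image)
        if result_label != own_label:
            first_atlas.append((image, own_label))    # first training atlas
        else:
            second_atlas.append((image, own_label))   # second training atlas
    return first_atlas, second_atlas
```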
Step S103, acquiring expression feature weights in a first training image set and expression feature weights in a second training image set based on a basic feature extraction network;
According to some embodiments of the present invention, an intermediate layer of the basic feature extraction network in the initial recognition model is used to obtain feature extraction vectors of the face regions in the input image, and these feature extraction vectors correspond to the facial expression input, that is, to the facial action units of the different face regions. It should be noted that the two main approaches to facial expression recognition are message judgment and sign judgment. Message judgment investigates what a facial expression conveys, such as happiness, anger or sadness, while sign judgment investigates the physical signals used to convey the message, such as raised cheeks or sunken lips. The most common descriptors used in the sign judgment approach are those specified by the Facial Action Coding System (FACS), which defines a number of atomic facial muscle actions named Facial Action Units (AUs). Since any facial expression is produced by the activation of one or more groups of facial muscles, every possible facial expression can be completely described by a combination of facial action units. However, some facial action units exhibit high distinctiveness during expression recognition; for example, in recognizing the expression "happy", slight changes of the facial action units related to the mouth corners and the eye corners largely cover the influence of the other facial action units.
In some embodiments of the present invention, the influence exhibited by the facial action units during expression recognition may be quantified by expression feature weights, where the expression feature weights refer to recognition determination weights of various facial action units during expression recognition, for example, when the expression "happy" is recognized, slight changes of facial action units related to the corners of the mouth and eyes may largely cover the influence of other facial action units, so that the facial action units related to the corners of the mouth and eyes in this example have higher expression feature weights, and the other facial action units have lower expression feature weights.
It should be noted that the facial action units corresponding to the same facial area may have a higher expression feature weight in the process of identifying multiple expressions. For example: in the process of recognizing the expression of 'happy', the facial action units related to the mouth corners and the eye corners have higher corresponding expression feature weights, and other facial action units have lower corresponding expression feature weights; in the process of recognizing the expression of angry, the facial action units with higher expression feature weights are still the facial action units related to the mouth corner and the eye corner, so that in the process of recognizing the two expressions of happy expression and angry expression, the recognition accuracy rate of the recognized expression is low due to unreasonable presetting of the expression feature weights. Therefore, to improve the expression recognition accuracy of the initial recognition model, it is necessary to reasonably adjust the expression feature weight distribution of different facial action units in the initial recognition model to optimize the initial recognition model. It should be noted that the expression feature weight distribution of different facial action units refers to the distribution of expression feature weights of different facial action units, for example, when the initial recognition model recognizes an expression such as "happy", the expression feature weight distribution is: the expression feature weight of the mouth corner related face action unit accounts for 31 percent, the expression feature weight of the eye corner related face action unit accounts for 42 percent, and the expression feature weight of other face action units accounts for 27 percent; for another example, when the initial recognition model recognizes an expression of "anger", the expression feature weight distribution is: the weight of the expressive features of the mouth corner related facial action units accounts for 27%, the weight of the expressive features of the eye corner related facial action units accounts for 33%, and the weight of the expressive features of the other facial action units accounts for 40%. It should be understood that the above examples are used to assist in explaining the meaning of the expression feature weight distribution, and should not be construed as limiting the technical solution, and the expression feature weight distribution that may appear in the initial recognition model is not limited to the above-mentioned embodiments.
According to some embodiments provided by the invention, in order to reasonably adjust the expression feature weight distribution of different facial action units, the expression feature weight distribution of the initial recognition model over the different facial action units needs to be determined before the initial recognition model is trained by the expression recognition model training method provided by the embodiments of the invention. In some embodiments of the present invention, obtaining the expression feature weights in the first training image set and the expression feature weights in the second training image set based on the basic feature extraction network includes: first acquiring, based on the basic feature extraction network, a first feature extraction vector describing the facial regions in the first training image set and a second feature extraction vector describing the facial regions in the second training image set, and then obtaining the expression feature weights in the first training image set and in the second training image set from the first feature extraction vector and the second feature extraction vector, respectively.
Step S104, screening the facial action units with the expression feature weights higher than a first preset threshold value in the first training image set to obtain first type action units;
It should be noted that, since the influence of a facial action unit during expression recognition is quantified by its expression feature weight, the higher the expression feature weight of a facial action unit, the greater its influence on the recognition result during expression recognition. Because the first training atlas is the training atlas on which the initial recognition model recognizes expressions incorrectly during conventional expression recognition training, the facial action units with higher expression feature weights in the first training atlas can be classified as unreliable facial action units. Here, a facial action unit that can hardly provide a correct reference during expression recognition is considered unreliable. For example, when the initial recognition model recognizes the expression "happy" and the expression feature weight distribution obtained from the incorrectly recognized training atlas is 21% for the mouth-corner related facial action units, 52% for the eye-corner related facial action units and 27% for the other facial action units, then the eye-corner related facial action units may be determined to be unreliable facial action units.
In some embodiments of the present invention, in step S104, facial action units in the first training image set whose expression feature weights are higher than a first preset threshold are screened to obtain a first type of action units, so as to further provide a sampling basis for subsequent optimization training of the initial recognition model. It should be understood that the first preset threshold is a preset expression feature weight, and is used for screening out facial action units with higher expression feature weights, that is, action units of the first class from the first training image set, where a value of the first preset threshold may be flexibly set according to an actual situation. It should be clear that the first type of action units refers to facial action units that are unreliable and unable to help distinguish the expression categories when the expression feature weights are too high.
Step S105, screening facial action units in the second training image set, wherein the expression feature weight of the facial action units is lower than a second preset threshold value, and obtaining second type action units;
It should be noted that, since the influence of a facial action unit during expression recognition is quantified by its expression feature weight, the lower the expression feature weight of a facial action unit, the smaller its influence on the recognition result during expression recognition. Because the second training atlas is the training atlas on which the initial recognition model recognizes expressions correctly during conventional expression recognition training, the facial action units with lower expression feature weights in the second training atlas can be classified as reliable facial action units. Here, a facial action unit that can provide a correct reference during expression recognition is considered reliable. For example, when the initial recognition model recognizes the expression "anger" and the expression feature weight distribution obtained from the correctly recognized training atlas is 45% for the mouth-corner related facial action units, 30% for the glabella related facial action units and 25% for the other facial action units, then the glabella related facial action units can be determined to be reliable facial action units.
In some embodiments of the present invention, in step S105, facial action units in the second training image set whose expression feature weights are lower than the second preset threshold are screened to obtain a second class of action units, so as to further provide a sampling basis for the subsequent optimization training of the initial recognition model. It should be understood that the second preset threshold is a preset expression feature weight, and is used for screening out facial action units with lower expression feature weights from the second training image set, that is, second class action units, where a value of the second preset threshold may be flexibly set according to an actual situation. It should be clear that the second category of action units refers to facial action units that are reliable and can help to distinguish between expression categories, although the expression feature weights are low.
It should be emphasized that the reference numerals of specific steps do not limit the order of the steps or the execution logic; the execution order and logic between steps should be understood and inferred with reference to the corresponding explanatory text. Steps S104 and S105 are parallel: step S104 may be executed before, after, or simultaneously with step S105.
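A minimal sketch of the two screening steps S104 and S105 follows; the weight dictionaries and the concrete threshold values are illustrative assumptions.

```python
# A minimal sketch; each weights dict maps a facial action unit to its
# expression feature weight, and the thresholds are illustrative.
def screen_action_units(first_atlas_weights, second_atlas_weights,
                        first_threshold=0.4, second_threshold=0.2):
    # first type: high-weight action units from the incorrectly recognized atlas (S104)
    first_class = [au for au, w in first_atlas_weights.items() if w > first_threshold]
    # second type: low-weight action units from the correctly recognized atlas (S105)
    second_class = [au for au, w in second_atlas_weights.items() if w < second_threshold]
    return first_class, second_class
```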
Step S106, mixing the positive sample data and the negative sample data to form an input sample set by taking the first type action unit as positive sample data and the second type action unit as negative sample data;
it should be noted that, before the initial recognition model is optimally trained, an input sample set for the optimal training needs to be obtained. The first type of action units refer to face action units which are unreliable and cannot help to distinguish expression categories when the expression feature weight is too high; and the second category of action units refers to facial action units that are reliable, although the expressive feature weights are low, and can help to distinguish the expression categories. Therefore, in some embodiments of the present invention, in step S106, the first type of action units is used as positive sample data, and the second type of action units is used as negative sample data, the positive sample data and the negative sample data are mixed to form an input sample set, and then the input sample set obtained in step S106 is used for optimization training.
It should be noted that the input sample set is essentially an image sample set, so the first type of action units and the second type of action units in the input sample set can exist either as image blocks or as whole images. When a first type action unit or a second type action unit exists as an image block, the corresponding image block carries its own label, which is derived from the whole image before that image was cut into blocks. When the first type of action units and the second type of action units exist as whole images, the input sample set comprises images from the first training image set and images from the second training image set; the first type of action units can then be obtained by processing the images from the first training image set with the basic feature extraction network of the initial recognition model, and the second type of action units can be obtained by processing the images from the second training image set with the same network.
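For illustration, forming the input sample set from image blocks (step S106) might look like the following sketch, where each block keeps the label of the whole image it was cut from; the data layout is an assumption.

```python
# A minimal sketch of step S106, assuming the image-block representation.
def build_input_sample_set(first_class_blocks, second_class_blocks):
    """first_class_blocks / second_class_blocks: lists of (image_block, own_label)."""
    positives = [(block, label, +1) for block, label in first_class_blocks]  # positive samples
    negatives = [(block, label, -1) for block, label in second_class_blocks]  # negative samples
    return positives + negatives          # mixed input sample set
```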
Step S107, performing optimization training on the initial recognition model by using the input sample set to obtain an optimized recognition model.
It should be noted that, because the input sample set contains both the first type of action units and the second type of action units, using it for optimization training optimizes the recognition capability of the initial recognition model for the second type of action units and suppresses the influence of the first type of action units on expression recognition. The initial recognition model can thereby refer to more reliable facial action units during expression recognition, which further improves the accuracy of facial expression recognition and yields the optimized recognition model after optimization training.
According to the expression recognition model training method, a basic feature extraction network and model initial data are obtained from an initial recognition model, and a first training image set and a second training image set are then obtained from the model initial data. Further, the expression feature weights in the first training image set and in the second training image set are obtained; facial action units whose expression feature weights are higher than a first preset threshold are screened from the first training image set to obtain the first type of action units, and facial action units whose expression feature weights are lower than a second preset threshold are screened from the second training image set to obtain the second type of action units. The first type of action units are taken as positive sample data and the second type of action units as negative sample data, and the two are mixed to form an input sample set. Finally, the initial recognition model is optimized with the input sample set to obtain an optimized recognition model. Because the first training atlas is the training atlas on which the initial recognition model recognizes expressions incorrectly and the second training atlas is the training atlas on which it recognizes expressions correctly, the first type of action units should not carry a large expression feature weight in expression recognition, whereas the second type of action units should. Therefore, by taking the first type of action units as positive sample data and the second type of action units as negative sample data, mixing them to form an input sample set, and performing optimization training with that set, the resulting optimized recognition model can reasonably adjust the expression feature weights of the first type and second type of action units, thereby improving accuracy in the expression recognition process.
Referring to fig. 2, according to some embodiments of the present invention, obtaining the expressive feature weights in the first training image set and the expressive feature weights in the second training image set based on the basic feature extraction network includes:
Step S201, respectively extracting the features of the face regions in the first training image set and the face regions in the second training image set;
according to some embodiments of the present invention, an intermediate layer of the basic feature extraction network in the initial recognition model is used to obtain feature extraction vectors of face regions in the input image, and the feature extraction vectors of the face regions are specifically reflected on the facial expression input, that is, facial action units corresponding to different face regions. It should be noted that the intermediate layer of the basic feature extraction network in the initial recognition model includes a lightweight sub-network, and the lightweight sub-network may be a Convolutional Neural Network (CNN) including a Convolutional layer and a pooling layer.
Step S202, acquiring a first feature extraction vector used for describing a face region in a first training image set and a second feature extraction vector used for describing the face region in a second training image set based on a basic feature extraction network;
According to some embodiments provided by the present invention, the initial recognition model may use Faster R-CNN, which is based on a convolutional neural network, as the network backbone, and use R-CNN feature extraction as a guide to enhance the model's ability to sample facial action units. R-CNN uses a Selective Search algorithm to extract possible Regions of Interest (RoI), crops each RoI, and then applies a convolutional neural network to perform bounding-box regression and Support Vector Machine (SVM) classification on the RoIs. The network structure of Faster R-CNN can be divided into the following four modules. First, the convolutional layers: Faster R-CNN first uses a group of basic conv, relu and pooling layers to obtain a feature map, which is used by the subsequent region proposal network layer and the fully connected layers. Second, the region proposal network (RPN) layer: the RPN generates region candidate boxes; it judges through a Softmax function whether a candidate region belongs to a region to be recognized or to the background, and then corrects the region candidate boxes by bounding-box regression to obtain accurate bounding boxes. Third, the region-of-interest pooling layer: this layer collects the feature map and the region candidate boxes, performs pooling on candidate boxes of different sizes, obtains the first feature extraction vector from the face regions in the first training image set and the second feature extraction vector from the face regions in the second training image set, and sends them to the subsequent fully connected layers for facial expression recognition. Fourth, the classification layer: the fully connected layers in the classification layer recognize the first feature extraction vector and the second feature extraction vector through an expression classification function to obtain recognition results, where the expression classification function is a common image classification function such as cross-entropy loss or Focal Loss, or a combination of several image classification functions.
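The region-of-interest pooling and classification stage described above could be sketched as follows, for illustration only; it is not the patent's implementation, and it assumes PyTorch with torchvision's roi_align operator, a 256-channel feature map, a backbone stride of 16 and the eight expression categories listed earlier.

```python
# A minimal sketch of RoI pooling followed by a classification layer, assuming
# PyTorch / torchvision; channel counts, stride and class list are assumptions.
import torch.nn as nn
from torchvision.ops import roi_align

EXPRESSIONS = ["Anger", "Contempt", "Disgust", "Fear",
               "Happy", "Neutral", "Sadness", "Surprise"]

class ExpressionHead(nn.Module):
    """Pools candidate face regions from a feature map and classifies them."""
    def __init__(self, in_channels=256, pooled=7, num_classes=len(EXPRESSIONS)):
        super().__init__()
        self.pooled = pooled
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pooled * pooled, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, feature_map, rois):
        # rois: list with one (N, 4) tensor of candidate boxes per image
        pooled = roi_align(feature_map, rois, output_size=self.pooled,
                           spatial_scale=1.0 / 16)   # assumed backbone stride
        return self.fc(pooled)                       # per-RoI expression logits

# cross-entropy as the expression classification function (Focal Loss could be
# substituted or combined, as the description suggests)
criterion = nn.CrossEntropyLoss()
```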
It should be noted that, in step S202, the first feature extraction vector describing the face regions in the first training image set and the second feature extraction vector describing the face regions in the second training image set are obtained based on the basic feature extraction network; this may also be implemented with model architectures such as Deep Belief Networks (DBNs), Recurrent Neural Networks (RNNs), Deep Auto-Encoders (DAEs) and Generative Adversarial Networks (GANs), and is not limited to the above-mentioned embodiments.
Step S203, obtaining the expression feature weight of the first training atlas according to the first feature extraction vector;
according to some embodiments of the present invention, the initial recognition model includes a basic feature extraction network and a classification layer, where the basic feature extraction network includes a convolutional layer structure and a pooling layer structure for extracting a first feature extraction vector and a second feature extraction vector, and the classification layer performs recognition processing on the first feature extraction vector and the second feature extraction vector through an expression classification function after obtaining the first feature extraction vector and the second feature extraction vector to obtain a recognition result. It should be noted that the first feature extraction vector extracted by the basic feature extraction network is specifically reflected to the facial expression input, that is, the first feature extraction vector corresponds to facial action units of different facial regions in the first training image set, and step S203 obtains the expression feature weight of the first training image set according to the first feature extraction vector, which is based on the expression feature weight obtained by the facial action units of different facial regions in the first training image set. It should be understood that the expression feature weights of the first training atlas may reflect the degree of influence of various facial action units on the determination of the recognition result in the process of recognizing the first training atlas by the initial recognition model.
Step S204, obtaining the expression feature weight of the second training atlas according to the second feature extraction vector.
It should be noted that the second feature extraction vector extracted by the basic feature extraction network is specifically reflected to the facial expression input, that is, the second feature extraction vector corresponds to facial action units of different facial regions in the second training image set, and step S204 determines the expression feature weight of the second training image set according to the second feature extraction vector, which is based on the expression feature weight obtained by the facial action units of different facial regions in the second training image set. It should be understood that the expression feature weights of the second training atlas may reflect the degree of influence of various facial action units on the determination of the recognition result in the process of recognizing the second training atlas by the initial recognition model.
Through the steps S201 to S204, the expression feature weight of the first training atlas and the expression feature weight of the second training atlas may be obtained, which facilitates the reasonable adjustment of the expression feature weight based on the expression feature weight of the first training atlas and the expression feature weight of the second training atlas in the subsequent execution step, thereby further improving the accuracy in the expression recognition process.
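As an illustration of how per-action-unit expression feature weights might be derived from feature extraction vectors, the following sketch averages the per-AU responses over a training atlas and normalizes them; the per-AU vector layout and the softmax normalization are assumptions, not requirements of the description.

```python
# A minimal sketch, assuming PyTorch and that each feature extraction vector
# holds one response value per facial action unit.
import torch
import torch.nn.functional as F

def expression_feature_weights(feature_vectors, au_names):
    """feature_vectors: (N, num_AUs) per-image responses from the backbone."""
    mean_response = feature_vectors.mean(dim=0)     # average over the training atlas
    weights = F.softmax(mean_response, dim=0)       # weights sum to 1
    return dict(zip(au_names, weights.tolist()))

# e.g. three action-unit responses averaged over a small atlas
w = expression_feature_weights(torch.tensor([[2.1, 0.4, 1.0],
                                             [1.8, 0.6, 0.9]]),
                               ["mouth_corner", "glabella", "eye_corner"])
```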
Referring to fig. 3, according to some embodiments of the present invention, taking the first type of action units as positive sample data and the second type of action units as negative sample data and mixing the positive sample data and the negative sample data to form an input sample set includes:
Step S301, mixing the positive sample data and the negative sample data to obtain mixed data;
It should be noted that the first type of action units are unreliable facial action units that cannot help distinguish expression categories when their expression feature weight is too high, whereas the second type of action units are facial action units that, although their expression feature weight is low, are reliable and can help distinguish expression categories. Therefore, the positive sample data in the input sample set are sampled from the first type of action units and the negative sample data are sampled from the second type of action units. When the input sample set formed by mixing in this way is used for optimization training, the training progress of the initial recognition model can be judged from the change in its expression recognition accuracy; when the expression recognition accuracy of the initial recognition model increases and converges to the first fixed value, the optimization training can be judged to be finished, and the optimized recognition model is obtained.
Step S302, the proportion of positive sample data and negative sample data in mixed data is adjusted;
according to some embodiments provided by the present invention, if the proportion of positive sample data in the mixed data is too large, the initial recognition model will carry out relatively little recognition processing on the second type of action units in the negative sample data, which is not conducive to advancing the optimization training of the initial recognition model. Since the proportion of the positive sample data in the mixed data should not be too large, in some embodiments of the present invention, before the input sample set is generated from the mixed data, the proportion of the positive sample data in the mixed data needs to be reduced and the proportion of the negative sample data needs to be increased. It should be clear that the proportion of the positive sample data in the mixed data refers to the share of the positive sample data in the total mixed data, and the proportion of the negative sample data refers to the share of the negative sample data in the total mixed data; the expression feature weight, by contrast, refers to the recognition determination weight of each facial action unit in the process of recognizing facial expressions by the expression recognition model, and the two kinds of quantities must not be confused.
Step S303, when the proportion of the positive sample data in the mixed data is smaller than that of the negative sample data in the mixed data, using the adjusted mixed data as the input sample set.
It should be noted that, when the proportion of the positive sample data in the mixed data is smaller than that of the negative sample data, the second type of action units in the negative sample data form the main part of the mixed data; using an input sample set obtained in this way for optimization training makes it easy to obtain an expression recognition accuracy curve with obvious changes, which facilitates control, analysis and judgment of the optimization training process.
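A minimal sketch of steps S301 to S303 is given below; the target positive-sample share of 0.4, the down-sampling strategy and the hypothetical action-unit identifiers are assumptions, the only constraint taken from the description being that the negative samples must end up as the larger share.

```python
import random

def build_input_sample_set(first_type_units, second_type_units,
                           positive_ratio=0.4, seed=0):
    """Mix first-type units (positive samples) with second-type units
    (negative samples) so that negative samples form the larger share.

    positive_ratio is the assumed target share of positive samples and must
    stay below 0.5 so that negative samples dominate, as required by step S303.
    """
    assert positive_ratio < 0.5, "positive samples must not dominate the mix"
    rng = random.Random(seed)
    # Down-sample the positive side to hit the requested ratio.
    n_neg = len(second_type_units)
    n_pos = min(len(first_type_units),
                int(n_neg * positive_ratio / (1 - positive_ratio)))
    mixed = ([(x, 1) for x in rng.sample(first_type_units, n_pos)] +   # positives
             [(x, 0) for x in second_type_units])                      # negatives
    rng.shuffle(mixed)
    return mixed

# Hypothetical action-unit identifiers.
sample_set = build_input_sample_set([f"AU_hi_{i}" for i in range(40)],
                                    [f"AU_lo_{i}" for i in range(60)])
print(len(sample_set), sum(lbl for _, lbl in sample_set))
```

Down-sampling the positive side, rather than up-sampling the negative side, is only one way to satisfy the proportion condition of step S303.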
Referring to fig. 4, according to some embodiments of the present invention, the performing optimization training on the initial recognition model with the input sample set to obtain an optimized recognition model includes:
step S401, loading an input sample set into a basic feature extraction network of an initial recognition model;
step S402, vectorizing an input sample set based on a basic feature extraction network to generate an input feature extraction vector;
it should be noted that the input sample set is loaded into the basic feature extraction network of the initial recognition model in order to vectorize the input sample set. It should be appreciated that, in some embodiments of the present invention, the initial recognition model is built on an encoder-decoder structure. When an image to be recognized is input into the initial recognition model, image processing operations such as patch segmentation and feature extraction are first carried out on it by the encoder module serving as the basic feature extraction network; the processed image is then vectorized to generate an input feature extraction vector, which the encoder module delivers to the classification layer of the initial recognition model. The fully connected layers of each stage in the classification layer then perform recognition processing on the input feature extraction vector through the expression classification function, mark the image to be recognized with a recognition result label, and compare the recognition result label with the label carried by the image to be recognized, so as to judge whether the expression recognition of the image by the initial recognition model is correct.
Step S403, importing the input feature extraction vector into a classification layer of the initial recognition model.
According to some embodiments of the present invention, the classification layer of the initial recognition model is configured to perform recognition processing on the input feature extraction vector through an expression classification function to obtain a recognition result, where the expression classification function is a common image classification function such as cross entropy loss and Focal loss, or a combination of various image classification functions. When the initial recognition model carries out the above operation on a plurality of images to be recognized in the input sample set, the expression recognition accuracy rate of the initial recognition model on the input sample set can be obtained. The image to be recognized may be an image block or an entire image.
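A minimal PyTorch-style sketch of steps S401 to S403 follows; only the split into a basic feature extraction network (convolution and pooling) and a classification layer, and the use of a common expression classification function such as cross-entropy, are taken from the description, while the layer sizes, the 112x112 input and the seven expression classes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionRecognizer(nn.Module):
    """Basic feature extraction network (encoder) + classification layer."""
    def __init__(self, num_classes=7, feat_dim=128):   # assumed sizes
        super().__init__()
        self.encoder = nn.Sequential(                   # convolutional + pooling structure
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)  # classification layer

    def forward(self, images):
        feature_vector = self.encoder(images)    # input feature extraction vector
        return self.classifier(feature_vector)   # recognition classification scores

model = ExpressionRecognizer()
images = torch.randn(4, 3, 112, 112)             # a tiny batch of face crops
labels = torch.tensor([0, 3, 3, 6])
logits = model(images)
loss = F.cross_entropy(logits, labels)           # expression classification function
print(logits.shape, float(loss))
```

The same forward pass applies whether the image to be recognized is an image block or an entire image.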
Referring to fig. 5, according to some embodiments of the present invention, the performing optimization training on the initial recognition model with the input sample set to obtain an optimized recognition model further includes:
s501, obtaining recognition classification scores of a classification layer, and obtaining the expression recognition accuracy of the classification prediction in the current round according to the recognition classification scores;
specifically, based on the input feature extraction vector, each fully connected layer performs expression classification judgment on the image to be recognized through the expression classification function to obtain the probability of each expression; recognition classification scores are then obtained through the expression classification function according to the probability distribution over the labels. The recognition result label of the image to be recognized is determined from the recognition classification score of the last stage of the classification layer and is compared with the label carried by the image to be recognized, so as to judge whether the expression recognition of the image by the initial recognition model is correct. The recognition classification scores, obtained through the expression classification function, express the probability distribution over the expression labels as scores (for example, a score of 80 or 75). When the initial recognition model performs the above operation on the whole input sample set, the expression recognition accuracy of the initial recognition model on the input sample set in the current round of classification prediction can be obtained.
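As a small illustration, the expression recognition accuracy of the current round could be computed from the recognition classification scores as below; taking the highest score as the recognition result label is an assumed reading of the description.

```python
import numpy as np

def round_accuracy(classification_scores, true_labels):
    """Expression recognition accuracy for the current round of prediction.

    classification_scores : (num_images, num_classes) scores from the last
                            stage of the classification layer.
    true_labels           : (num_images,) ground-truth expression labels.
    """
    predicted = classification_scores.argmax(axis=1)   # recognition result label
    return float((predicted == np.asarray(true_labels)).mean())

scores = np.array([[80, 10, 10], [5, 75, 20], [30, 30, 40]], dtype=float)
print(round_accuracy(scores, [0, 1, 0]))   # 2 of 3 correct -> 0.666...
```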
Step S502, after the expression recognition accuracy rate of the classification prediction of the current round is obtained, the expression feature weight of the first type of action unit is adjusted, and the expression feature weight of the second type of action unit is adjusted;
it should be noted that, since the purpose of performing optimization training on the initial recognition model is to improve its expression recognition accuracy, after the expression recognition accuracy of the current round of classification prediction is obtained, the expression feature weights of the first type of action units and of the second type of action units in the initial recognition model need to be adjusted to further improve the expression recognition accuracy. Specifically, the first type of action units refers to facial action units whose expression feature weights are too high even though they are unreliable and cannot help to distinguish expression categories, while the second type of action units refers to facial action units that are reliable and can help to distinguish expression categories although their expression feature weights are low. Therefore, in step S502, the expression feature weight of the first type of action units is further decreased and that of the second type of action units is further increased, so that the initial recognition model is influenced more by the reliable second type of action units, thereby further improving the expression recognition accuracy. It should also be noted that although the first type of action units is unreliable when its weight is too high and cannot help to distinguish expression categories, in some embodiments of the present invention the expression recognition accuracy also drops when its weight is too low. Therefore, according to some embodiments of the present invention, if, after the expression feature weight of the first type of action units is decreased and that of the second type of action units is increased, the expression recognition accuracy first rises and then falls, the two weights are adjusted back so that the expression recognition accuracy is maintained at a higher level.
Step S503, after adjusting the expression feature weight of the first type of action unit and the expression feature weight of the second type of action unit, performing iterative training on the initial recognition model based on the input sample set;
it should be noted that the purpose of the iterative training is to gradually improve the expression recognition accuracy through several rounds of classification prediction training. Therefore, according to some embodiments provided by the present invention, after the expression recognition accuracy of the current round of classification prediction is obtained, the expression feature weights of the first type of action units and the second type of action units in the initial recognition model are adjusted, and the initial recognition model is then iteratively trained on the input sample set. It should be understood that each round of iterative training includes, but is not limited to: performing the current round of classification prediction based on the expression feature weight distribution adjusted in the previous round, obtaining the expression recognition accuracy of the current round of classification prediction, and further adjusting the expression feature weights of the first type and second type of action units. It should be noted that these weight adjustments may be made with reference to the change in the expression recognition accuracy.
Step S504, after each iteration training, counting the change condition of the expression recognition accuracy rate;
according to some embodiments of the invention, after each round of iterative training, the expression recognition accuracy obtained in that round is recorded and the accuracies are counted; the statistics may, for example, be presented as a line graph, a histogram or another statistical tool reflecting the change in the expression recognition accuracy.
And step S505, when the expression recognition accuracy rate is converged to a first fixed value, stopping iterative training and obtaining an optimized recognition model.
Note that the first fixed value means the value to which the expression recognition accuracy of the initial recognition model converges after the expression feature weights of the first type and second type of action units have been adjusted over several rounds. In some embodiments of the present invention, the expression recognition accuracy of the initial recognition model stabilizes within the error range of a certain value, and that value is the first fixed value; for example, an accuracy fluctuating between 84% and 86% may be considered to have converged to 85%. It is to be understood that the first fixed value is not an exact constant, but a value that varies with the training situation. When the expression recognition accuracy converges to the first fixed value, it is judged that the optimization training has achieved a good effect, the iterative training is stopped, and the expression recognition model obtained after the iterative training stops is the optimized recognition model.
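The following sketch outlines one possible shape of the iterative loop of steps S501 to S505; the window-and-tolerance convergence test standing in for "converges to a first fixed value", the callback interfaces and the toy callbacks in the usage example are all assumptions.

```python
def optimisation_training(run_round, adjust_weights, max_rounds=100,
                          window=5, tolerance=0.01):
    """Iterate rounds of classification prediction until the expression
    recognition accuracy converges ("first fixed value").

    run_round(weights)            -> accuracy achieved in the current round.
    adjust_weights(weights, hist) -> expression feature weights for the next round.
    Convergence test (assumption): the last `window` accuracies lie within
    `tolerance` of each other.
    """
    weights = {"first_type": 1.0, "second_type": 1.0}   # hypothetical starting weights
    history = []
    for _ in range(max_rounds):
        history.append(run_round(weights))               # current round of prediction
        recent = history[-window:]
        if len(recent) == window and max(recent) - min(recent) <= tolerance:
            break                                        # accuracy has converged
        weights = adjust_weights(weights, history)       # adjust for the next round
    return weights, history

# Toy usage with stubbed callbacks (illustration only).
import random
random.seed(0)
final_weights, accuracy_history = optimisation_training(
    run_round=lambda w: min(0.9, 0.5 + 0.05 * w["second_type"]
                            + random.uniform(0.0, 0.005)),
    adjust_weights=lambda w, h: {"first_type": w["first_type"] * 0.9,
                                 "second_type": w["second_type"] * 1.1})
print(len(accuracy_history), round(accuracy_history[-1], 3))
```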
In some embodiments of the present invention, the expression recognition model training method provided by the present invention helps to suppress the negative effects caused by the first type of action units in the initial recognition model. For example, when a certain first-type action unit causes a sample to be mispredicted by the initial recognition model, that action unit is suppressed through a penalty function, where the penalty function includes, but is not limited to, cross-entropy loss, Focal loss and the like, or a combination of several image classification functions. The larger the number of mispredicted samples, the higher the final value of the penalty function, so that the initial recognition model is iterated in the direction of improving the expression recognition accuracy.
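For reference, a plain-numpy sketch of the Focal loss mentioned above is given below (the standard form with focusing parameter gamma = 2); its use here as the penalty function is only an illustration of how mispredicted samples contribute more to the loss value.

```python
import numpy as np

def focal_loss(logits, labels, gamma=2.0):
    """Focal loss: hard (mispredicted) samples contribute more, so action
    units that cause mispredictions are penalised harder."""
    logits = np.asarray(logits, dtype=float)
    # Softmax probabilities per sample.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    p_t = probs[np.arange(len(labels)), labels]     # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + 1e-12)))

# Two samples: one confidently correct, one badly mispredicted.
# The mispredicted sample dominates the resulting loss.
print(focal_loss([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]], [0, 2]))
```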
It should be noted that the purpose of performing optimization training on the initial recognition model is to improve its expression recognition accuracy. According to some embodiments provided by the present invention, the object of the optimization training is the initial recognition model, and the optimization training consists of performing expression recognition training on the initial recognition model with the input sample set as the training data set. It should be understood that the positive sample data of the input sample set includes the first type of action units, namely facial action units whose expression feature weights are too high even though they are unreliable and cannot help to distinguish expression categories; the negative sample data of the input sample set includes the second type of action units, namely facial action units that are reliable and can help to distinguish expression categories although their expression feature weights are low. Therefore, during the optimization training, the initial recognition model needs to adjust its expression feature weight distribution over the first type and second type of action units according to its expression recognition accuracy on the input sample set, so that the expression recognition accuracy gradually improves as the optimization training proceeds, until it converges to the first fixed value.
Referring to fig. 6, according to some embodiments of the present invention, the expression recognition model training method further includes:
step S601, when the expression recognition accuracy rate is gradually increased, the expression feature weight of the initial recognition model to the first type of action unit is reduced;
it should be noted that, since the purpose of performing optimization training on the initial recognition model is to improve its expression recognition accuracy, after the expression recognition accuracy of the current round of classification prediction is obtained, the expression feature weights of the first type and second type of action units in the initial recognition model need to be adjusted to further improve the expression recognition accuracy. Specifically, the first type of action units refers to facial action units whose expression feature weights are too high even though they are unreliable and cannot help to distinguish expression categories, while the second type of action units refers to facial action units that are reliable and can help to distinguish expression categories although their expression feature weights are low. Therefore, in some embodiments of the present invention, while the expression recognition accuracy gradually increases, step S601 continuously decreases the expression feature weight assigned by the initial recognition model to the first type of action units, so that the model is influenced more by the reliable second type of action units, thereby further improving the expression recognition accuracy.
Step S602, according to the decreasing trend of the expression feature weight of the first type action unit in the initial recognition model, the expression feature weight of the second type action unit in the initial recognition model is increased.
According to some embodiments of the present invention, since step S601 reduces the expression feature weight of the first type of action units in the initial recognition model, step S602 increases the expression feature weight of the second type of action units in line with that decreasing trend, so that the reliable second type of action units, which can help to distinguish expression categories, influence the expression recognition process of the initial recognition model more strongly and the expression recognition accuracy of the initial recognition model is improved.
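A minimal sketch of the adjustment rule of steps S601 and S602 follows; the multiplicative factors of 0.9 and 1.1 and the two-entry weight dictionary are assumptions used only to illustrate the coupled decrease and increase.

```python
def adjust_feature_weights(weights, accuracy_history,
                           down_factor=0.9, up_factor=1.1):
    """While the accuracy keeps rising, keep lowering the weight of the
    first type of action units and raising the weight of the second type."""
    if len(accuracy_history) >= 2 and accuracy_history[-1] > accuracy_history[-2]:
        weights = {"first_type": weights["first_type"] * down_factor,
                   "second_type": weights["second_type"] * up_factor}
    return weights

print(adjust_feature_weights({"first_type": 1.0, "second_type": 1.0}, [0.70, 0.74]))
```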
Referring to fig. 7, according to some embodiments of the present invention, the expression recognition model training method further includes:
step S701, counting the change situation of the loss function output value in each iteration training;
and step S702, when the expression recognition accuracy rate is converged to a first fixed value or the loss function output value is converged to a second fixed value, stopping iterative training and obtaining an optimized recognition model.
According to some embodiments of the present invention, in judging that the optimization training has achieved a good effect, the loss function output value can be used as a reference in addition to the expression recognition accuracy. Specifically, when the loss function output value converges to the second fixed value, the robustness of the initial recognition model has increased to a relatively high level, so it is judged that the optimization training has achieved a good effect and the iterative training can be stopped; the expression recognition model obtained after the iterative training stops is the optimized recognition model. It should be understood that the criterion for judging that the optimization training has achieved a good effect may be that the expression recognition accuracy converges to the first fixed value, that the loss function output value converges to the second fixed value, or a composite criterion in which the loss function output value converges to the second fixed value while the expression recognition accuracy converges to the first fixed value. It is to be understood that the second fixed value, like the first, is not an exact constant but a value that varies with the training situation. It should be emphasized that the criteria for judging that the optimization training has achieved a good effect in the embodiments of the present invention include, but are not limited to, the above examples.
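The stopping criterion of step S702 could be sketched as below; using the same window-and-tolerance notion of convergence for both the accuracy (first fixed value) and the loss output (second fixed value) is an assumption of the sketch.

```python
def has_converged(series, window=5, tolerance=0.01):
    """A series is treated as having converged to a 'fixed value' when its
    last `window` entries stay within `tolerance` of each other (assumption)."""
    recent = series[-window:]
    return len(recent) == window and max(recent) - min(recent) <= tolerance

def should_stop(accuracy_history, loss_history):
    # Stop when the accuracy reaches the first fixed value
    # or the loss function output reaches the second fixed value.
    return has_converged(accuracy_history) or has_converged(loss_history)

print(should_stop([0.850, 0.853, 0.851, 0.855, 0.852],   # accuracy has levelled off
                  [2.10, 1.70, 1.40, 1.20, 1.10]))       # loss still falling -> True overall
```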
According to some embodiments of the invention, the expression recognition model training method further comprises:
step S801, receiving gradient feedback of expression recognition accuracy according to statistics of expression recognition accuracy change conditions;
and step S802, when the expression recognition accuracy rate reflected by statistics is gradually reduced, correcting the input sample set.
It should be noted that the optimization training process may be accompanied by errors in the optimization direction. According to some embodiments of the present invention, if the expression recognition accuracy shows a decreasing trend even though the initial recognition model decreases the expression feature weight of the first type of action units and increases that of the second type of action units, the input sample set may contain an uncorrected problem and needs to be further corrected. It should be understood that the gradient feedback of the expression recognition accuracy can reflect the development trend of the optimization training through a feedback model, so that the correctness of the training direction can be confirmed in time and errors arising in the optimization training process can be eliminated.
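A small sketch of the gradient-feedback check of steps S801 and S802 follows; treating three consecutive drops in accuracy as the trigger for correcting the input sample set is an assumption.

```python
def needs_correction(accuracy_history, window=3):
    """Gradient feedback: if the accuracy has fallen for `window`
    consecutive rounds, the input sample set should be corrected."""
    recent = accuracy_history[-(window + 1):]
    return len(recent) == window + 1 and \
        all(b < a for a, b in zip(recent, recent[1:]))

print(needs_correction([0.80, 0.78, 0.75, 0.71]))   # True -> correct the sample set
```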
Fig. 9 illustrates an electronic device 900 provided by an embodiment of the invention. The electronic device 900 includes: a processor 901, a memory 902 and a computer program stored on the memory 902 and executable on the processor 901, the computer program being operative to perform the above-mentioned expression recognition model training method.
The processor 901 and the memory 902 may be connected by a bus or other means.
The memory 902, which is a non-transitory computer readable storage medium, may be used to store a non-transitory software program and a non-transitory computer executable program, such as the expression recognition model training method described in the embodiments of the present invention. The processor 901 implements the expression recognition model training method described above by running a non-transitory software program and instructions stored in the memory 902.
The memory 902 may include a program storage area and a data storage area, where the program storage area may store an operating system and the application program required for at least one function, and the data storage area may store data generated when the expression recognition model training method described above is executed. Further, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 902 may optionally include memory located remotely from the processor 901, and such remote memory may be connected to the electronic device 900 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Non-transitory software programs and instructions required to implement the expression recognition model training method described above are stored in the memory 902 and, when executed by the one or more processors 901, perform the expression recognition model training method described above, for example, performing method steps S101 to S107 in fig. 1, method steps S201 to S204 in fig. 2, method steps S301 to S303 in fig. 3, method steps S401 to S403 in fig. 4, method steps S501 to S505 in fig. 5, method steps S601 to S602 in fig. 6, method steps S701 to S702 in fig. 7, and method steps S801 to S802 in fig. 8.
The embodiment of the invention also provides a computer-readable storage medium, which stores computer-executable instructions, wherein the computer-executable instructions are used for executing the expression recognition model training method.
In one embodiment, the computer-readable storage medium stores computer-executable instructions that are executed by one or more control processors, for example, to perform method steps S101 to S107 in fig. 1, method steps S201 to S204 in fig. 2, method steps S301 to S303 in fig. 3, method steps S401 to S403 in fig. 4, method steps S501 to S505 in fig. 5, method steps S601 to S602 in fig. 6, method steps S701 to S702 in fig. 7, and method steps S801 to S802 in fig. 8.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art. It should also be appreciated that the various implementations provided by the embodiments of the present invention can be combined arbitrarily to achieve different technical effects.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.
Claims (10)
1. An expression recognition model training method is characterized by comprising the following steps:
acquiring a basic feature extraction network and model initial data according to an initial recognition model, wherein the basic feature extraction network is used for extracting facial features from the initial recognition model;
acquiring a first training atlas and a second training atlas according to the model initial data, wherein the first training atlas is a training atlas for which the expression recognition of the initial recognition model is wrong, and the second training atlas is a training atlas for which the expression recognition of the initial recognition model is correct;
acquiring expression feature weights in the first training image set and expression feature weights in the second training image set based on the basic feature extraction network, wherein the expression feature weights are recognition judgment weights of various facial action units in an expression recognition process;
screening the facial action units with the expression feature weights higher than a first preset threshold value in the first training image set to obtain first type action units;
screening the facial action units with the expression feature weights lower than a second preset threshold value in the second training image set to obtain second type action units;
mixing the positive sample data and the negative sample data to form an input sample set by taking the first type action unit as positive sample data and the second type action unit as negative sample data;
and performing optimization training on the initial recognition model by using the input sample set to obtain an optimized recognition model.
2. The method according to claim 1, wherein the obtaining of the expression feature weights in the first training image set and the expression feature weights in the second training image set based on the basic feature extraction network comprises:
respectively extracting features of the facial regions in the first training image set and the facial regions in the second training image set;
acquiring a first feature extraction vector for describing a face region in the first training image set and a second feature extraction vector for describing the face region in the second training image set based on the basic feature extraction network;
obtaining the expression feature weight of the first training atlas according to the first feature extraction vector;
and obtaining the expression feature weight of the second training atlas according to the second feature extraction vector.
3. The method according to claim 1, wherein the taking of the first type of action units as positive sample data and the second type of action units as negative sample data and mixing the positive sample data with the negative sample data to form an input sample set comprises:
mixing the positive sample data and the negative sample data to obtain mixed data;
adjusting the proportion of the positive sample data and the negative sample data in the mixed data;
and when the proportion of the positive sample data in the mixed data is smaller than that of the negative sample data in the mixed data, taking the adjusted mixed data as the input sample set.
4. The method of claim 3, wherein the performing optimization training on the initial recognition model with the input sample set to obtain an optimized recognition model comprises:
loading the input sample set into the base feature extraction network of the initial recognition model;
vectorizing the input sample set based on the basic feature extraction network to generate an input feature extraction vector;
and importing the input feature extraction vector into a classification layer of the initial recognition model.
5. The method of claim 4, wherein the performing optimization training on the initial recognition model with the input sample set to obtain an optimized recognition model, further comprises:
acquiring the recognition classification scores of the classification layers, and acquiring the expression recognition accuracy of the classification prediction in the current round according to the recognition classification scores;
after the expression recognition accuracy rate of the classification prediction of the current round is obtained, the expression feature weight of the first type of action unit is adjusted, and the expression feature weight of the second type of action unit is adjusted;
after the expression feature weights of the first type of action units and the expression feature weights of the second type of action units are adjusted, performing iterative training on the initial recognition model based on the input sample set;
after each round of iterative training, counting the change condition of the expression recognition accuracy rate;
and when the expression recognition accuracy rate is converged to a first fixed value, stopping the iterative training and obtaining the optimized recognition model.
6. The method of claim 5, wherein after obtaining the expression recognition accuracy of the current round of classification prediction, adjusting the expression feature weight of the first type of action unit and adjusting the expression feature weight of the second type of action unit comprises:
when the expression recognition accuracy rate gradually increases, reducing the expression feature weight of the initial recognition model to the first type of action units;
and according to the reduction trend of the expression feature weight of the first type of action unit in the initial recognition model, improving the expression feature weight of the second type of action unit in the initial recognition model.
7. The method of claim 5, further comprising:
counting the change condition of the loss function output value in each iteration training;
and when the expression recognition accuracy rate is converged to a first fixed value or the loss function output value is converged to a second fixed value, stopping the iterative training and obtaining the optimized recognition model.
8. The method of claim 5, further comprising:
receiving gradient feedback of the expression recognition accuracy according to statistics of the change condition of the expression recognition accuracy;
and when the expression recognition accuracy rate reflected by statistics is gradually reduced, correcting the input sample set.
9. An electronic device, comprising: a memory storing a computer program, and a processor, wherein the processor implements the expression recognition model training method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, characterized in that the storage medium stores a program executed by a processor to implement the expression recognition model training method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210439565.4A CN114724226B (en) | 2022-04-25 | 2022-04-25 | Expression recognition model training method, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114724226A true CN114724226A (en) | 2022-07-08 |
CN114724226B CN114724226B (en) | 2024-05-21 |
Family
ID=82245668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210439565.4A Active CN114724226B (en) | 2022-04-25 | 2022-04-25 | Expression recognition model training method, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114724226B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180211102A1 (en) * | 2017-01-25 | 2018-07-26 | Imam Abdulrahman Bin Faisal University | Facial expression recognition |
US20190205625A1 (en) * | 2017-12-28 | 2019-07-04 | Adobe Inc. | Facial expression recognition utilizing unsupervised learning |
CN110188615A (en) * | 2019-04-30 | 2019-08-30 | 中国科学院计算技术研究所 | A kind of facial expression recognizing method, device, medium and system |
CN113936309A (en) * | 2020-07-14 | 2022-01-14 | 南京大学 | Facial block-based expression recognition method |
CN112241715A (en) * | 2020-10-23 | 2021-01-19 | 北京百度网讯科技有限公司 | Model training method, expression recognition method, device, equipment and storage medium |
CN112580458A (en) * | 2020-12-10 | 2021-03-30 | 中国地质大学(武汉) | Facial expression recognition method, device, equipment and storage medium |
CN112906500A (en) * | 2021-01-29 | 2021-06-04 | 华南理工大学 | Facial expression recognition method and system based on deep privileged network |
CN113569615A (en) * | 2021-02-24 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Training method and device of target recognition model based on image processing |
CN113052244A (en) * | 2021-03-30 | 2021-06-29 | 歌尔股份有限公司 | Classification model training method and classification model training device |
CN114332711A (en) * | 2021-12-29 | 2022-04-12 | 科大讯飞股份有限公司 | Method, device, equipment and storage medium for facial motion recognition and model training |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117576765A (en) * | 2024-01-15 | 2024-02-20 | 华中科技大学 | Facial action unit detection model construction method based on layered feature alignment |
CN117576765B (en) * | 2024-01-15 | 2024-03-29 | 华中科技大学 | Facial action unit detection model construction method based on layered feature alignment |
Also Published As
Publication number | Publication date |
---|---|
CN114724226B (en) | 2024-05-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |