CN116343314A - Expression recognition method and device, storage medium and electronic equipment


Info

Publication number
CN116343314A
Authority
CN
China
Prior art keywords
face
emotion detection
combination
emotion
video
Prior art date
Legal status
Granted
Application number
CN202310623737.8A
Other languages
Chinese (zh)
Other versions
CN116343314B (en)
Inventor
李太豪
刘昱龙
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310623737.8A priority Critical patent/CN116343314B/en
Publication of CN116343314A publication Critical patent/CN116343314A/en
Application granted granted Critical
Publication of CN116343314B publication Critical patent/CN116343314B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

In the embodiments of this specification, face detection is first performed on the captured video to obtain face images, and an emotion detection result of each face image is then determined by an emotion detection model. The emotion detection result characterizes the degree of emotional fluctuation of the face. A target combination is then selected, based on the emotion detection results of the face images, from combinations obtained by grouping the face images in advance. The face images in the target combination are input into an expression recognition model to output expression categories, and the final expression category is determined from the output categories. In this method, the emotion detection model screens out from the video the key video segment in which facial emotion fluctuates most strongly, and the expression recognition model recognizes facial expressions only within that key segment, so expression recognition does not need to be performed on the other video segments, which improves the accuracy of expression recognition.

Description

Expression recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to an expression recognition method, an expression recognition device, a storage medium, and an electronic device.
Background
Studies have shown that, in daily face-to-face conversation, facial expressions convey the vast majority of emotional states. Expression recognition technology is therefore important in scenarios where human-computer interaction takes place through video, and is relevant to applications such as intelligent assistance systems, service robots, and driver fatigue detection.
In the prior art, a method for recognizing facial expressions in a video may include the following steps: the video is first parsed into a sequence of image frames, the face regions in the image frame sequence are extracted, and the extracted face regions are then fed directly into a deep learning network for expression recognition. The deep learning network used for expression recognition may include a non-temporal network: the face region of each frame is input into the expression recognition network to obtain the expression category corresponding to that frame, and this category is then weighted and fused with the expression categories corresponding to the face regions of the other frames to obtain the final expression category.
However, for non-temporal networks, because the image frames in which facial emotion is pronounced account for only a small portion of a dialogue video, the weighted fusion tends to classify the final expression as the neutral category (i.e., a calm expression), even though the actual expression may be smiling, angry, or another category, which reduces the accuracy of expression recognition.
Disclosure of Invention
The embodiment of the specification provides an expression recognition method, an expression recognition device, a storage medium and electronic equipment, so as to partially solve the problems existing in the prior art.
The embodiment of the specification adopts the following technical scheme:
the expression recognition method provided by the specification comprises the following steps:
in the process that a user performs man-machine interaction through videos, acquiring videos of the user in real time;
performing face detection on each image frame in the video to determine face images contained in the video;
inputting the face images into a pre-trained emotion detection model to output emotion detection results of the face images through the emotion detection model; the emotion detection result is used for representing the emotion fluctuation degree of the face in the face image;
According to each combination obtained by grouping the face images in advance and the emotion detection result of the face images, determining a comprehensive emotion detection result corresponding to each combination in each combination;
selecting a target combination from the combinations according to the comprehensive emotion detection result corresponding to each combination, wherein the target combination comprises the face image with the largest face emotion fluctuation degree;
inputting the facial image contained in the target combination into a pre-trained expression recognition model to output the expression category of the face in the facial image contained in the target combination through the expression recognition model;
and determining the final expression category of the user according to the expression category of the face in the face image contained in the target combination.
Optionally, face detection is performed on each image frame in the video, and each face image contained in the video is determined, which specifically includes:
performing frame extraction processing on the video to obtain each image frame;
and inputting each image frame into a face detection model for each image frame, carrying out face detection on the image frame through the face detection model, and extracting an image only comprising a face area from the image frame as a face image if the face is detected in the image frame.
Optionally, pre-training the emotion detection model specifically includes:
a sample video is obtained in advance, wherein the sample video comprises sample image frames with different expressions of human faces in a speaking state;
inputting the sample video into an emotion detection model to be trained, and outputting emotion detection results of each sample image frame in the sample video through the emotion detection model;
and training the emotion detection model by taking the minimization of the difference between the emotion detection result of each sample image frame in the sample video and its corresponding label as an optimization target.
Optionally, training the emotion detection model with the difference between the emotion detection result of each sample image frame in the sample video and the label corresponding to each sample image frame minimized as an optimization target specifically includes:
according to emotion detection results of each sample image frame in the sample video, determining the mean value and variance of all emotion detection results; determining the mean value and the variance of all the label values according to the label values of each sample image frame in the sample video;
determining a first loss according to the mean and variance of all the emotion detection results and the mean and variance of all the label values;
Determining a second loss according to differences between emotion detection results of each sample image frame in the sample video and the corresponding label values;
determining a comprehensive loss according to the first loss and the second loss;
and training the emotion detection model by taking the minimization of the comprehensive loss as an optimization target.
Optionally, according to each combination obtained by grouping the face images in advance and the emotion detection result of each face image, determining a comprehensive emotion detection result corresponding to each combination in each combination specifically includes:
grouping the face images according to the number of the face images contained in the combination and the time sequence of the image frames of the face images to obtain the combination;
and determining a comprehensive emotion detection result corresponding to each combination according to the emotion detection result of the face image contained in the combination aiming at each combination.
Optionally, according to each combination obtained by grouping the face images in advance and the emotion detection result of each face image, determining a comprehensive emotion detection result corresponding to each combination in each combination specifically includes:
Average filtering is carried out on emotion detection results of the face images to obtain filtered emotion detection results of the face images;
and determining a comprehensive emotion detection result corresponding to each combination in each combination according to each combination obtained by grouping the face images in advance and the emotion detection result after filtering of the face images.
Optionally, selecting a target combination from the combinations according to the comprehensive emotion detection result corresponding to each combination, specifically including:
and selecting a combination with the largest comprehensive emotion detection result from the combinations as a target combination according to the comprehensive emotion detection result corresponding to each combination.
Optionally, determining the final expression category of the user according to the expression category of the face in the face image included in the target combination specifically includes:
and determining the expression category that occurs most frequently among the faces in the face images contained in the target combination, and taking that expression category as the final expression category of the user.
The expression recognition device provided in the present specification includes:
the acquisition module is used for acquiring the video of the user in real time in the process of human-computer interaction of the user through the video;
The face detection module is used for carrying out face detection on each image frame in the video and determining face images contained in the video;
the emotion detection module is used for inputting the face images into a pre-trained emotion detection model so as to output emotion detection results of the face images through the emotion detection model; the emotion detection result is used for representing the emotion fluctuation degree of the face in the face image;
the first determining module is used for determining a comprehensive emotion detection result corresponding to each combination in each combination according to each combination obtained by grouping the face images in advance and the emotion detection result of each face image;
the selection module is used for selecting a target combination from the combinations according to the comprehensive emotion detection result corresponding to each combination, wherein the target combination comprises the face image with the largest face emotion fluctuation degree;
the expression recognition module is used for inputting the facial image contained in the target combination into a pre-trained expression recognition model so as to output the expression category of the face in the facial image contained in the target combination through the expression recognition model;
And the second determining module is used for determining the final expression category of the user according to the expression category of the face in the face image contained in the target combination.
Optionally, the face detection module is specifically configured to perform frame extraction processing on the video to obtain each image frame; and inputting each image frame into a face detection model for each image frame, carrying out face detection on the image frame through the face detection model, and extracting an image only comprising a face area from the image frame as a face image if the face is detected in the image frame.
Optionally, the emotion detection module is specifically configured to obtain a sample video in advance, where the sample video includes sample image frames in which a face presents different expressions in a speaking state; input the sample video into an emotion detection model to be trained, and output emotion detection results of each sample image frame in the sample video through the emotion detection model; and train the emotion detection model by taking the minimization of the difference between the emotion detection result of each sample image frame in the sample video and its corresponding label as an optimization target.
Optionally, the emotion detection module is specifically configured to determine, according to the emotion detection results of each sample image frame in the sample video, the mean and variance of all emotion detection results; determine the mean and variance of all label values according to the label values of each sample image frame in the sample video; determine a first loss according to the mean and variance of all the emotion detection results and the mean and variance of all the label values; determine a second loss according to the differences between the emotion detection results of each sample image frame in the sample video and their corresponding label values; determine a comprehensive loss according to the first loss and the second loss; and train the emotion detection model by taking the minimization of the comprehensive loss as an optimization target.
Optionally, the first determining module is specifically configured to group the face images according to the number of face images included in the combination and according to a time sequence of an image frame where the face images are located, so as to obtain each combination; and determining a comprehensive emotion detection result corresponding to each combination according to the emotion detection result of the face image contained in the combination aiming at each combination.
Optionally, the first determining module is specifically configured to perform mean filtering on the emotion detection result of each face image to obtain a filtered emotion detection result of each face image; and determining a comprehensive emotion detection result corresponding to each combination in each combination according to each combination obtained by grouping the face images in advance and the emotion detection result after filtering of the face images.
A computer readable storage medium is provided in the present specification, the storage medium storing a computer program, which when executed by a processor, implements the expression recognition method described above.
The electronic device provided by the specification comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the expression recognition method when executing the program.
The at least one technical solution adopted in the embodiments of this specification can achieve the following beneficial effects:
In the embodiments of this specification, after face detection is performed on the captured video to obtain face images, an emotion detection result of each face image is determined by an emotion detection model. The emotion detection result characterizes the degree of emotional fluctuation of the face. A target combination is then selected, based on the emotion detection results of the face images, from combinations obtained by grouping the face images in advance. The face images in the target combination are input into the expression recognition model to output expression categories, and the final expression category is determined from the output categories. In this method, the emotion detection model screens out from the video the key video segment in which facial emotion fluctuates most strongly, and the expression recognition model recognizes only the facial expressions in the face images associated with that key segment, so expression recognition does not need to be performed on the other video segments in the video, which improves the accuracy of expression recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
fig. 1 is a schematic flow chart of an expression recognition method according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram of grouping face images according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an expression recognition device according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the prior art, when a non-temporal network model is used for facial expression recognition, the accuracy of the recognition is also affected by the change in mouth shape while a person is speaking. For example, the open-mouth shape of a person who is speaking may be recognized as a surprised expression even though the person's actual expression is not surprise.
The expression recognition method provided in this specification aims to screen out from the video the video segment in which the facial emotion fluctuates most strongly, and then perform expression recognition on the face only in the screened video segment rather than in the other video segments, thereby improving the accuracy of expression recognition. In the other video segments, the degree of facial emotion fluctuation is small and the expression presented by the face is a calm expression. In addition, the model used for screening the video segments is trained on sample videos in which the face presents different expressions while speaking, so the model can accurately judge from a speaking face whether emotional fluctuation is present, which indirectly eliminates the interference of mouth shape on emotion recognition in the speaking state.
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a flowchart of an expression recognition method provided in an embodiment of the present disclosure. The method may be executed by an electronic device equipped with an emotion detection model and an expression recognition model, and includes:
s100: and in the process that the user performs man-machine interaction through the video, the video of the user is collected in real time.
S102: and carrying out face detection on each image frame in the video, and determining each face image contained in the video.
The expression recognition method provided in this specification can be applied to scenarios of human-computer interaction through video, which may include an intelligent robot conducting a human-machine dialogue through video, teaching a user through video, and detecting the fatigue state of a user through video. While the user interacts with the machine through video, the user's video can be captured in real time, and the captured video is processed to obtain the face-region image contained in each image frame of the video as a face image. It should be noted that, since an expression of the user during human-machine communication generally lasts 0.5-5 seconds, the duration of the video data processed at one time needs to be greater than 5 seconds.
Specifically, the video of the user can be acquired in real time through the camera. The camera can be a high-definition camera so as to ensure the definition of a human face in the video. And then, carrying out face detection on each image frame in the video to determine each face image contained in the video. The face image may refer to an image that only includes a face region in an image frame including a face in a video. The face in each face image is the face of the same user.
The face images are taken together as a face image set, expressed as:

$F = \{f_1, f_2, \dots, f_t\}$

where $F$ is the face image set containing the face images, $t$ indicates that the video contains $t$ face images in total, and $f_i$ denotes the $i$-th face image extracted in the order of the image frames in the video.
When face detection is performed on each image frame in the video to determine the face images contained in the video, the video can first be parsed to obtain its image frames. Then, for each image frame in the video, face detection is performed on the image frame to obtain a face detection result of that frame. If the face detection result indicates that the image frame contains a face, an image containing only the face region is extracted from the image frame as a face image. In this way, the face images are determined from the image frames of the video. Note that not every image frame yields a face image: some image frames do not contain a face and therefore produce no face image.
When a face image is extracted, each image frame in the video is input into a pre-trained face detection model, which performs face detection on the image frame; if a face is detected in the image frame, an image containing only the face region is extracted from the image frame and output as a face image. The face detection model may be a model with an SSD structure, for example an SSD-Lite model.
In addition, because the image content of adjacent frames in a video is highly repetitive and the facial expression changes little between them, frame extraction can be performed on the video to obtain the image frames, in order to improve the computational efficiency of face detection, emotion detection and expression recognition and to reduce the consumption of computing resources. After the frame extraction, face detection is performed on each extracted image frame.
Specifically, after the user's video is captured, frames can be extracted from the video at equal intervals to obtain the image frames. Videos of different frame rates are uniformly sampled at 15 frames per second.
After the extracted image frames are obtained, each image frame is input into the pre-trained face detection model, which performs face detection on the image frame; if a face is detected in the image frame, an image containing only the face region is extracted from the image frame and output as a face image.
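As an illustration of the face image extraction described above, the following Python sketch samples the video at a fixed rate and crops the face region of each sampled frame. The `FaceDetector` object and its `detect` method are hypothetical stand-ins for an SSD-Lite-style detector and are not part of the original disclosure.

```python
import cv2

SAMPLE_FPS = 15  # frames sampled per second, as described above

def extract_face_images(video_path, face_detector):
    """Sample the video at equal intervals and crop the face region of each sampled frame."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or SAMPLE_FPS
    step = max(int(round(native_fps / SAMPLE_FPS)), 1)  # equidistant frame extraction

    face_images = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            box = face_detector.detect(frame)  # hypothetical API: returns (x, y, w, h) or None
            if box is not None:                # frames without a face yield no face image
                x, y, w, h = box
                face_images.append(frame[y:y + h, x:x + w])
        index += 1
    cap.release()
    return face_images  # the face image set F = {f_1, ..., f_t}
```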
S104: inputting the face images into a pre-trained emotion detection model to output emotion detection results of the face images through the emotion detection model; the emotion detection result is used for representing the emotion fluctuation degree of the face in the face image.
In the embodiment of this specification, after the face images are obtained, they may be input into a pre-trained emotion detection model to output the emotion detection result of each face image through the emotion detection model. The emotion detection result characterizes the degree of emotional fluctuation of the face in the face image and can be represented by an emotion detection value. The larger the emotion detection result (emotion detection value), the greater the degree of emotional fluctuation of the face in the face image, the less calm the facial expression, and the more likely the expression belongs to a category such as happiness, surprise, or fear; the smaller the emotion detection result (emotion detection value), the smaller the degree of emotional fluctuation and the calmer the facial expression. The emotion detection model may be a resnet-18 network model: its input is a face image, and its output is an emotion detection value that falls within a fixed interval.
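A minimal sketch of such an emotion detection model in PyTorch is given below, assuming a resnet-18 backbone whose classification head is replaced by a single regression output; the sigmoid squashing to the interval (0, 1) is an assumption, since the original output interval is not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EmotionDetectionModel(nn.Module):
    """resnet-18 backbone with one scalar output used as the emotion detection value."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, face_batch):               # face_batch: (N, 3, H, W) cropped face images
        score = self.backbone(face_batch)        # (N, 1) raw scores
        return torch.sigmoid(score).squeeze(-1)  # assumed squashing to a fixed interval (0, 1)
```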
Prior to using this emotion detection model, training of the emotion detection model is required.
Specifically, a sample video is firstly obtained, wherein the sample video comprises sample image frames with different expressions of human faces in a speaking state, the sample image frames refer to image frames of the same user face contained in the sample video, and the sample image frames in the sample video are continuous. In the training process, the sample image frame corresponds to a face image.
And then, inputting the sample video into an emotion detection model to be trained so as to output emotion detection results of each sample image frame in the sample video through the emotion detection model to be trained. And finally, training the emotion detection model by taking the difference between emotion detection results of each sample image frame in the sample video and the corresponding label as an optimization target.
When the sample video is acquired, the video containing the face of the same user can be acquired from the open source data set Affwild2 and used as the sample video.
Training the emotion detection model by taking the minimization of the difference between the emotion detection results of the sample image frames in the sample video and their corresponding labels as the optimization target may include the following steps: the mean and variance of all emotion detection results are determined from the emotion detection results of the sample image frames in the sample video, and, at the same time, the mean and variance of all label values are determined from the label values of the sample image frames. The first loss is then determined from the mean and variance of all emotion detection results and the mean and variance of all label values. The second loss is determined from the differences between the emotion detection results of the sample image frames and their corresponding label values. The comprehensive loss is then determined from the first loss and the second loss. Finally, the emotion detection model is trained with the minimization of the comprehensive loss as the optimization target.
When the first loss is determined, determining the correlation degree between all emotion detection results and all label values according to the emotion detection results of each sample image frame, the average value of all emotion detection results, the label value of each sample image frame and the average value of all label values. Then, the first loss is determined according to the correlation degree between all emotion detection results and all label values, the mean and variance of all emotion detection results, and the mean and variance of all label values.
When the second loss is determined, the second loss may be determined according to differences between emotion detection results of each sample image frame in the sample video and respective corresponding tag values, and the number of sample image frames in the sample video.
When the comprehensive loss is determined, a difference between the preset value and the first loss can be determined, and the determined difference and the second loss are summed to obtain the comprehensive loss. Wherein the preset value may be 1.
It should be noted that, because the expression change of the face in the sample video has continuity, the consistency of the expression trend of the predicted value and the true value in the training process can be ensured by using the first loss, and meanwhile, the absolute error of the predicted value and the true value in the training result can be reduced by using the second loss.
Assume that the number of sample image frames input into the emotion detection model in each training batch is $Q$. The emotion detection model outputs a prediction vector

$\hat{y} = (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_Q)$

where $\hat{y}_i$ is the emotion detection value output by the emotion detection model for the face in the $i$-th sample image frame. The label vector corresponding to these sample image frames is

$y = (y_1, y_2, \dots, y_Q)$

where $y_i$ is the ground-truth expression value of the face in the $i$-th sample image frame.
The formula for calculating the first loss is:

$L_1 = \dfrac{2\rho\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + (\mu_{\hat{y}} - \mu_{y})^2}$, with $\rho = \dfrac{\sum_{i=1}^{Q}(\hat{y}_i - \mu_{\hat{y}})(y_i - \mu_{y})}{\sqrt{\sum_{i=1}^{Q}(\hat{y}_i - \mu_{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{Q}(y_i - \mu_{y})^2}}$

where $\sigma_{\hat{y}}^2$ is the variance of the vector $\hat{y}$, $\sigma_{y}^2$ is the variance of the vector $y$, $\mu_{\hat{y}}$ is the mean of the vector $\hat{y}$, $\mu_{y}$ is the mean of the vector $y$, and $\rho$ is the correlation between the vectors $\hat{y}$ and $y$.
The formula for calculating the second loss is:

$L_2 = \dfrac{1}{Q}\sum_{i=1}^{Q}\left|\hat{y}_i - y_i\right|$

The formula for calculating the comprehensive loss is:

$L = (1 - L_1) + L_2$
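A sketch of this comprehensive loss in PyTorch follows, under the reconstruction above (a concordance-style correlation term as the first loss and a mean absolute error as the second loss); since the original formula images are not reproduced, the exact form is an assumption drawn from the surrounding definitions.

```python
import torch

def comprehensive_loss(pred, target, eps=1e-8):
    """L = (1 - L1) + L2: L1 is a concordance correlation term, L2 is the mean absolute error."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - mu_p) * (target - mu_t)).mean()

    l1 = 2 * cov / (var_p + var_t + (mu_p - mu_t) ** 2 + eps)  # first loss (trend consistency)
    l2 = (pred - target).abs().mean()                          # second loss (absolute error)
    return (1 - l1) + l2
```

Minimizing this loss over each batch of Q sample image frames encourages both consistency of the predicted emotion trend with the labels and a small absolute error.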
after the trained emotion detection model is obtained through training, the trained emotion detection model can be used for detecting the degree of facial emotion fluctuation in the facial image.
Each face image obtained in step S102 is input into the trained emotion detection model to output the emotion detection result of each face image through the trained emotion detection model.
Specifically, the face image set $F = \{f_1, f_2, \dots, f_t\}$ is input into the trained emotion detection model, which outputs an emotion detection result set for the face image set:

$P = \{p_1, p_2, \dots, p_t\}$

where $p_i$ is the emotion detection result of face image $f_i$.
S106: and determining a comprehensive emotion detection result corresponding to each combination in each combination according to each combination obtained by grouping the face images in advance and the emotion detection result of each face image.
In the embodiment of the present disclosure, the face images may first be grouped, according to the preset number of face images contained in a combination and in the time order of the image frames in which the face images appear, to obtain the combinations, as shown in fig. 2. Each combination contains a portion of the face images; every combination except the last contains exactly the preset number of face images, while the last combination may contain the preset number of face images or fewer. Then, the comprehensive emotion detection result corresponding to each combination is determined according to the emotion detection results of the face images and the combinations. For each combination, the larger its comprehensive emotion detection result, the greater the degree of emotional fluctuation of the face in the face images it contains, and when the degree of emotional fluctuation is large, the expression of the face is more likely to belong to an effective expression category such as happiness, surprise, or fear; the smaller its comprehensive emotion detection result, the smaller the degree of emotional fluctuation, and the more likely the expression belongs to the calm category.
Grouping the face images according to the number of face images contained in a combination and the time order of the image frames in which they appear is equivalent to segmenting the video in time order into video segments, where each video segment contains the image frames in which the face images of one combination are located.
Assuming that the preset number of face images per combination is $M$ and the total number of face images is $t$, the face images can be divided into $S$ combinations, where the number of combinations is determined by:

$S = t/M$ if $t \bmod M = 0$, and $S = \lfloor t/M \rfloor + 1$ otherwise

with $\bmod$ denoting the modulo operation. The set of the $S$ combinations is denoted $C = \{c_1, c_2, \dots, c_S\}$. In fig. 2, for the illustrated values of $t$ and $M$, 2 combinations are determined, denoted $c_1$ and $c_2$.
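A minimal Python sketch of this grouping is given below; the variable names follow the reconstructed notation above and are not taken from the original text.

```python
def group_face_images(face_images, m):
    """Split the time-ordered face images into combinations of at most m images each."""
    t = len(face_images)
    s = t // m + (1 if t % m else 0)  # number of combinations S
    return [face_images[i * m:(i + 1) * m] for i in range(s)]  # combinations c_1, ..., c_S
```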
When the comprehensive emotion detection result corresponding to each combination is determined according to the emotion detection results of the face images and the combinations, it is determined, for each combination, from the emotion detection results of the face images that combination contains: the emotion detection results of the face images contained in the combination are accumulated to obtain the comprehensive emotion detection result corresponding to the combination.
In addition, since the emotion detection results of the face images are all computed by the model, outlier values may appear among them. To reduce the data disturbance caused by such outliers, the emotion detection results of the face images can be smoothed by filtering; a mean filtering approach is adopted here.
When the comprehensive emotion detection results corresponding to each combination are determined, average filtering can be performed on the emotion detection results of the face images to obtain filtered emotion detection results of the face images. And then, according to the filtered emotion detection results of the face images and the combinations, determining a comprehensive emotion detection result corresponding to each combination.
When mean filtering is applied to the emotion detection results of the face images, a filtering window of preset size is applied to each face image in turn: the face images falling within the preset filtering window are determined, and their emotion detection results are averaged, with the resulting mean taken as the filtered emotion detection result of that face image.
Assume that the filtering window size is $w$. The filtering process is computed as:

$\hat{p}_i = \dfrac{1}{w}\sum_{j \in W_i} p_j$

where $W_i$ denotes the indices of the face images falling within the window applied to the $i$-th face image. For the emotion detection result set $P$, the filtered set is $\hat{P} = \{\hat{p}_1, \hat{p}_2, \dots, \hat{p}_t\}$, where $\hat{p}_i$ is the filtered emotion detection result of face image $f_i$.
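A sketch of this mean filtering in Python follows; the default window size of 3 and the handling of the window at the sequence boundaries are assumptions, since the original values are not reproduced here.

```python
def mean_filter(scores, window=3):
    """Smooth the emotion detection results with a sliding mean of the given window size."""
    half = window // 2
    filtered = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)  # clip window at boundaries
        segment = scores[lo:hi]
        filtered.append(sum(segment) / len(segment))
    return filtered  # the filtered emotion detection results
```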
When the comprehensive emotion detection result corresponding to each combination is determined according to the filtered emotion detection results of the face images and the combinations, it is determined, for each combination, from the filtered emotion detection results of the face images that combination contains: the filtered emotion detection results of the face images contained in the combination are accumulated to obtain the comprehensive emotion detection result corresponding to the combination.
Assuming that the preset number of face images per combination is $M$ and the number of combinations is $S$, the combination set is $C = \{c_1, c_2, \dots, c_S\}$. For each combination $c_i$ in $C$, the set of filtered emotion detection results of the face images it contains is determined and denoted $H_i$. The formula for determining the comprehensive emotion detection result corresponding to each combination is:

$E_i = \sum_{\hat{p} \in H_i} \hat{p}$

where $E_i$ is the comprehensive emotion detection result corresponding to the $i$-th combination. Applying this formula to every combination yields the set $E = \{E_1, E_2, \dots, E_S\}$.
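A short Python sketch of this accumulation, reusing the grouping sketch above (the function names are illustrative, not from the original text):

```python
def comprehensive_scores(filtered_scores, m):
    """Sum the filtered emotion detection results within each combination of size m."""
    groups = group_face_images(filtered_scores, m)  # reuse the grouping sketch above
    return [sum(group) for group in groups]         # comprehensive results E_1, ..., E_S
```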
S108: and selecting a target combination from the combinations according to the comprehensive emotion detection result corresponding to each combination, wherein the target combination comprises the face image with the largest face emotion fluctuation degree.
In the embodiment of the present disclosure, after the comprehensive emotion detection result corresponding to each combination is obtained, in order to accurately recognize the facial expression, the combination in which the facial emotion fluctuates most strongly, that is, the combination in which the face presents an effective expression, needs to be selected from the combinations. In other words, a target combination may be selected from the combinations based on the comprehensive emotion detection results corresponding to the respective combinations, where the target combination contains the face images with the largest degree of facial emotion fluctuation.
Specifically, according to the comprehensive emotion detection result corresponding to each combination, the combination with the largest comprehensive emotion detection result is selected from the combinations and is used as the target combination.
Continuing the above example, the maximum element $E_k$ is selected from the set $E$. The filtered-result set $H_k$ corresponding to $E_k$ is then determined, and from $H_k$ the corresponding combination $c_k$ in the set $C$ is identified; $c_k$ is taken as the target combination.
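A minimal sketch of this selection in Python (the function name is illustrative):

```python
def select_target_combination(combinations, comprehensive_results):
    """Return the combination whose comprehensive emotion detection result is largest."""
    k = max(range(len(comprehensive_results)), key=lambda i: comprehensive_results[i])
    return combinations[k]  # the target combination c_k
```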
S110: and inputting the facial image contained in the target combination into a pre-trained expression recognition model so as to output the expression category of the face in the facial image contained in the target combination through the expression recognition model.
S112: and determining the final expression category of the user according to the expression category of the face in the face image contained in the target combination.
In the embodiment of the present disclosure, after determining the target combination, the expression category of the face in each face image contained in the target combination may be recognized, and the final expression category of the user may be determined from the recognition results of the face images in the target combination. The expression categories can be divided into seven categories: calm, surprise, happiness, fear, disgust, anger, and sadness. After the final expression category of the user is determined, an interaction strategy for the user may be determined based on it, so as to interact with the user through that strategy. The interaction strategy may be a speaking strategy and/or a content presentation strategy that matches the final expression category.
Specifically, the facial image contained in the target combination is input into a pre-trained expression recognition model, so that the expression category of the face in the facial image contained in the target combination is output through the expression recognition model. The expression recognition model may be a resnet-50 network model, and the target combination may include a plurality of face images.
When the target combination contains a plurality of face images, inputting the face images into a pre-trained expression recognition model aiming at each face image contained in the target combination, so as to output the expression category of the face in the face image through the expression recognition model as the expression category of the face image.
When the final expression category of the user is determined, the expression category that occurs most frequently among the face images contained in the target combination can be determined and taken as the final expression category of the user.
That is, the expression categories of the faces in the face images output by the expression recognition model are counted, and the most frequent expression category is used as the final expression category of the user, which improves the robustness of expression recognition.
Continuing the above example, the face images contained in the target combination $c_k$ are input into the expression recognition model in sequence, and the expression categories output for the face images form a set $R = \{r_1, r_2, \dots\}$, where $r_j$ is the expression category of the $j$-th face image in $c_k$.
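A sketch of this recognition and majority vote in Python; the `expression_model` object and its `predict` method are hypothetical stand-ins for the pre-trained resnet-50 expression recognition model.

```python
from collections import Counter

CATEGORIES = ["calm", "surprise", "happiness", "fear", "disgust", "anger", "sadness"]  # the seven categories above

def final_expression(target_combination, expression_model):
    """Classify each face image in the target combination and take the majority category."""
    predictions = [expression_model.predict(face) for face in target_combination]  # r_1, r_2, ...
    return Counter(predictions).most_common(1)[0][0]  # most frequent expression category
```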
Further, the training of the expression recognition model will be described.
Specifically, a sample face image set may first be obtained from the open-source data set RAF-DB. The sample face image set is then input into the expression recognition model to be trained, so as to output, through the expression recognition model to be trained, the expression category of the face in each sample face image in the set. The expression recognition model to be trained is trained with the minimization of the difference between the output expression category and the ground-truth expression category of the face in each sample face image as the optimization target. The faces in the sample face images contained in the sample face image set cover different ages, genders, races, head poses, illumination conditions, occlusions, and so on.
It should be noted that all actions of acquiring signals, information or video data in this specification are performed in compliance with the data protection laws and policies of the relevant jurisdiction and with the authorization of the owner of the corresponding device.
As can be seen from the method shown in fig. 1, after face detection is performed on the captured video to obtain face images, the emotion detection result of each face image is determined by the emotion detection model. The emotion detection result characterizes the degree of emotional fluctuation of the face. A target combination is then selected, based on the emotion detection results of the face images, from the combinations obtained by grouping the face images in advance. The face images in the target combination are input into the expression recognition model to output expression categories, and the final expression category is determined from the output categories. In this method, the emotion detection model screens out from the video the key video segment in which the facial emotion fluctuates strongly, and the expression recognition model recognizes only the facial expressions in that key segment, so expression recognition does not need to be performed on the other video segments, which improves the accuracy of expression recognition. In the key video segment, the degree of facial emotion fluctuation is large and the face is most likely to present an expression other than the calm expression, while in the other video segments the degree of facial emotion fluctuation is small and the expression presented is most likely calm. In addition, the emotion detection model is trained on sample videos in which the face presents different expressions while speaking, so it can accurately judge from a speaking face whether emotional fluctuation is present, which indirectly eliminates the interference of mouth shape on emotion recognition in the speaking state.
Based on the same idea as the expression recognition method provided above, the embodiments of this specification further provide a corresponding device, a storage medium and an electronic apparatus.
Fig. 3 is a schematic structural diagram of an expression recognition device according to an embodiment of the present disclosure, where the device includes:
the acquisition module 301 is configured to acquire a video of a user in real time during a human-computer interaction process performed by the user through the video;
the face detection module 302 is configured to perform face detection on each image frame in the video, and determine each face image included in the video;
the emotion detection module 303 is configured to input the face images into a pre-trained emotion detection model, so as to output an emotion detection result of the face images through the emotion detection model; the emotion detection result is used for representing the emotion fluctuation degree of the face in the face image;
a first determining module 304, configured to determine, according to each combination obtained by grouping the face images in advance and a emotion detection result of each face image, a comprehensive emotion detection result corresponding to each combination in each combination;
a selecting module 305, configured to select a target combination from the combinations according to a comprehensive emotion detection result corresponding to each combination, where the target combination includes a face image with a maximum degree of facial emotion fluctuation;
The expression recognition module 306 is configured to input a facial image included in the target combination into a pre-trained expression recognition model, so as to output an expression category of a face in the facial image included in the target combination through the expression recognition model;
and a second determining module 307, configured to determine a final expression category of the user according to the expression category of the face in the face image included in the target combination.
Optionally, the face detection module 302 is specifically configured to perform frame extraction processing on the video to obtain each image frame; and inputting each image frame into a face detection model for each image frame, carrying out face detection on the image frame through the face detection model, and extracting an image only comprising a face area from the image frame as a face image if the face is detected in the image frame.
Optionally, the emotion detection module 303 is specifically configured to obtain a sample video in advance, where the sample video includes sample image frames in which a face presents different expressions in a speaking state; input the sample video into an emotion detection model to be trained, and output emotion detection results of each sample image frame in the sample video through the emotion detection model; and train the emotion detection model by taking the minimization of the difference between the emotion detection result of each sample image frame in the sample video and its corresponding label as an optimization target.
Optionally, the emotion detection module 303 is specifically configured to determine, according to the emotion detection results of each sample image frame in the sample video, the mean and variance of all emotion detection results; determine the mean and variance of all label values according to the label values of each sample image frame in the sample video; determine a first loss according to the mean and variance of all the emotion detection results and the mean and variance of all the label values; determine a second loss according to the differences between the emotion detection results of each sample image frame in the sample video and their corresponding label values; determine a comprehensive loss according to the first loss and the second loss; and train the emotion detection model by taking the minimization of the comprehensive loss as an optimization target.
Optionally, the first determining module 304 is specifically configured to group the face images according to the number of face images included in the combination and according to a time sequence of an image frame where the face images are located, so as to obtain each combination; and determining a comprehensive emotion detection result corresponding to each combination according to the emotion detection result of the face image contained in the combination aiming at each combination.
Optionally, the first determining module 304 is specifically configured to perform mean filtering on the emotion detection result of each face image to obtain a filtered emotion detection result of each face image; and determining a comprehensive emotion detection result corresponding to each combination in each combination according to each combination obtained by grouping the face images in advance and the emotion detection result after filtering of the face images.
Optionally, the selecting module 305 is specifically configured to select, as the target combination, a combination with the largest comprehensive emotion detection result from the combinations according to the comprehensive emotion detection result corresponding to each combination.
Optionally, the second determining module 307 is specifically configured to determine, according to the expression categories of the faces in the face images contained in the target combination, the expression category that occurs most frequently, and take it as the final expression category of the user.
The present specification also provides a computer readable storage medium storing a computer program which when executed by a processor is operable to perform the expression recognition method provided in fig. 1 above.
Based on the expression recognition method shown in fig. 1, the embodiment of the present disclosure further provides a schematic structural diagram of the electronic device shown in fig. 4. At the hardware level, as in fig. 4, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, although it may include hardware required for other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to implement the expression recognition method described in fig. 1.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 90 s of the 20 th century, improvements to one technology could clearly be distinguished as improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) or software (improvements to the process flow). However, with the development of technology, many improvements of the current method flows can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain corresponding hardware circuit structures by programming improved method flows into hardware circuits. Therefore, an improvement of a method flow cannot be said to be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the programming of the device by a user. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate application-specific integrated circuit chips. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented by using "logic compiler" software, which is similar to the software compiler used in program development and writing, and the original code before the compiling is also written in a specific programming language, which is called hardware description language (Hardware Description Language, HDL), but not just one of the hdds, but a plurality of kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), lava, lola, myHDL, PALASM, RHDL (Ruby Hardware Description Language), etc., VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible, by logically programming the method steps, to have the controller achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when implementing the present specification, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technologies, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a(n) ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, reference may be made to the corresponding description of the method embodiments.
The foregoing is merely exemplary of the present specification and is not intended to limit the present specification. Various modifications and alterations of this specification will be apparent to those skilled in the art. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present specification shall be included within the scope of the claims of this specification.

Claims (16)

1. An expression recognition method, comprising:
in the process that a user performs man-machine interaction through video, acquiring a video of the user in real time;
performing face detection on each image frame in the video to determine face images contained in the video;
inputting the face images into a pre-trained emotion detection model to output emotion detection results of the face images through the emotion detection model; the emotion detection result is used for representing the emotion fluctuation degree of the face in the face image;
according to each combination obtained by grouping the face images in advance and the emotion detection results of the face images, determining a comprehensive emotion detection result corresponding to each combination;
selecting a target combination from the combinations according to the comprehensive emotion detection result corresponding to each combination, wherein the target combination comprises the face image with the largest face emotion fluctuation degree;
inputting the facial image contained in the target combination into a pre-trained expression recognition model to output the expression category of the face in the facial image contained in the target combination through the expression recognition model;
and determining the final expression category of the user according to the expression category of the face in the face image contained in the target combination.
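As a rough, non-authoritative sketch of the claimed flow, the following Python outline assumes three placeholder callables (detect_face, emotion_model, expression_model) and a fixed group size; none of these names or parameters appear in the patent.

```python
import numpy as np

def recognize_expression(frames, detect_face, emotion_model, expression_model,
                         group_size=8):
    # 1. Face detection: keep only frames in which a face was found, as crops.
    faces = [f for f in (detect_face(frame) for frame in frames) if f is not None]

    # 2. Emotion detection: one scalar per face image characterizing the
    #    degree of emotion fluctuation.
    scores = [float(emotion_model(face)) for face in faces]

    # 3. Group consecutive face images into combinations of group_size and
    #    compute a comprehensive score per combination (the mean is assumed here).
    combos = [list(range(i, min(i + group_size, len(faces))))
              for i in range(0, len(faces), group_size)]
    combo_scores = [float(np.mean([scores[i] for i in c])) for c in combos]

    # 4. Target combination: the one with the largest comprehensive score.
    target = combos[int(np.argmax(combo_scores))]

    # 5. Expression recognition on the target combination, then majority vote.
    labels = [expression_model(faces[i]) for i in target]
    return max(set(labels), key=labels.count)
```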
2. The method of claim 1, wherein performing face detection on each image frame in the video and determining the face images contained in the video specifically comprises:
performing frame extraction processing on the video to obtain each image frame;
and, for each image frame, inputting the image frame into a face detection model, performing face detection on the image frame through the face detection model, and, if a face is detected in the image frame, extracting an image comprising only the face region from the image frame as a face image.
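A minimal sketch of the frame extraction and face cropping in claim 2, assuming OpenCV is available; the Haar cascade merely stands in for the unspecified face detection model, and frame_step is an assumed sampling interval.

```python
import cv2

def extract_face_images(video_path, frame_step=5):
    # A stock Haar cascade is used here purely for illustration.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    faces, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:                      # frame extraction
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = detector.detectMultiScale(gray, 1.1, 5)
            for (x, y, w, h) in boxes[:1]:             # keep only the face region
                faces.append(frame[y:y + h, x:x + w])
        idx += 1
    cap.release()
    return faces
```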
3. The method of claim 1, wherein pre-training the emotion detection model specifically comprises:
obtaining a sample video in advance, wherein the sample video comprises sample image frames in which a face presents different expressions in a speaking state;
inputting the sample video into an emotion detection model to be trained, and outputting emotion detection results of each sample image frame in the sample video through the emotion detection model;
and training the emotion detection model with minimization of the difference between the emotion detection result of each sample image frame in the sample video and the corresponding label as an optimization objective.
4. The method of claim 3, wherein training the emotion detection model with minimization of the difference between the emotion detection result of each sample image frame in the sample video and the corresponding label as an optimization objective specifically comprises:
determining the mean and variance of all the emotion detection results according to the emotion detection results of the sample image frames in the sample video; determining the mean and variance of all the label values according to the label values of the sample image frames in the sample video;
determining a first loss according to the mean and variance of all the emotion detection results and the mean and variance of all the label values;
determining a second loss according to differences between emotion detection results of each sample image frame in the sample video and the corresponding label values;
determining a comprehensive loss according to the first loss and the second loss;
and training the emotion detection model by taking the minimization of the comprehensive loss as an optimization target.
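Claim 4 does not fix the exact formulas for the two losses. The PyTorch sketch below uses squared differences of the batch statistics for the first loss, mean squared error for the second, and an assumed weight alpha for the comprehensive loss; it is one possible reading, not the patent's implementation.

```python
import torch

def combined_emotion_loss(preds, labels, alpha=0.5):
    # preds, labels: 1-D tensors of per-frame emotion scores and label values.
    # First loss: compares the mean and variance of all predictions with the
    # mean and variance of all label values.
    first = (preds.mean() - labels.mean()) ** 2 + (preds.var() - labels.var()) ** 2
    # Second loss: per-frame difference between prediction and label (MSE).
    second = torch.mean((preds - labels) ** 2)
    # Comprehensive loss: an assumed weighted sum of the two.
    return alpha * first + (1.0 - alpha) * second
```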
5. The method according to claim 1, wherein determining the comprehensive emotion detection result corresponding to each of the combinations based on the combinations obtained by grouping the face images in advance and the emotion detection results of the face images, specifically includes:
grouping the face images according to the number of face images to be contained in each combination and the time sequence of the image frames in which the face images are located, so as to obtain the combinations;
and determining, for each combination, a comprehensive emotion detection result corresponding to the combination according to the emotion detection results of the face images contained in the combination.
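One way to read the grouping in claim 5 is as fixed-size, time-ordered chunks; combo_size is an assumed parameter and the averaging below is only one possible comprehensive statistic.

```python
def group_face_images(face_images, combo_size=8):
    # face_images are assumed to be ordered by the time of their source frames;
    # consecutive images form combinations of combo_size.
    return [face_images[i:i + combo_size]
            for i in range(0, len(face_images), combo_size)]

def comprehensive_score(scores):
    # Comprehensive emotion detection result of one combination, taken here
    # as the average of its per-image emotion detection results.
    return sum(scores) / len(scores)
```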
6. The method according to claim 1, wherein determining the comprehensive emotion detection result corresponding to each of the combinations based on the combinations obtained by grouping the face images in advance and the emotion detection results of the face images, specifically includes:
performing mean filtering on the emotion detection results of the face images to obtain filtered emotion detection results of the face images;
and determining, according to each combination obtained by grouping the face images in advance and the filtered emotion detection results of the face images, a comprehensive emotion detection result corresponding to each combination.
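The mean filtering in claim 6 can be sketched as a sliding-window average over the per-image emotion detection results; the window size is an assumption, since the claim only requires mean filtering.

```python
import numpy as np

def mean_filter(scores, window=3):
    # Sliding-window average; edge values are averaged over a zero-padded
    # window because of mode="same".
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(scores, dtype=float), kernel, mode="same")
```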
7. The method of claim 1, wherein selecting a target combination from the combinations based on the comprehensive emotion detection results for each combination, specifically comprises:
and selecting a combination with the largest comprehensive emotion detection result from the combinations as a target combination according to the comprehensive emotion detection result corresponding to each combination.
8. The method according to claim 1, wherein determining the final expression category of the user according to the expression category of the face in the face image included in the target combination specifically includes:
and determining, according to the expression categories of the faces in the face images contained in the target combination, the expression category that occurs most frequently, and taking that expression category as the final expression category of the user.
9. An expression recognition apparatus, characterized by comprising:
the acquisition module is used for acquiring the video of the user in real time in the process of human-computer interaction of the user through the video;
the face detection module is used for carrying out face detection on each image frame in the video and determining face images contained in the video;
the emotion detection module is used for inputting the face images into a pre-trained emotion detection model so as to output emotion detection results of the face images through the emotion detection model; the emotion detection result is used for representing the emotion fluctuation degree of the face in the face image;
the first determining module is used for determining, according to each combination obtained by grouping the face images in advance and the emotion detection results of the face images, a comprehensive emotion detection result corresponding to each combination;
the selection module is used for selecting a target combination from the combinations according to the comprehensive emotion detection result corresponding to each combination, wherein the target combination comprises the face image with the largest face emotion fluctuation degree;
the expression recognition module is used for inputting the facial image contained in the target combination into a pre-trained expression recognition model so as to output the expression category of the face in the facial image contained in the target combination through the expression recognition model;
and the second determining module is used for determining the final expression category of the user according to the expression category of the face in the face image contained in the target combination.
10. The apparatus of claim 9, wherein the face detection module is specifically configured to: perform frame extraction processing on the video to obtain image frames; and, for each image frame, input the image frame into a face detection model, perform face detection on the image frame through the face detection model, and, if a face is detected in the image frame, extract an image comprising only the face region from the image frame as a face image.
11. The apparatus of claim 9, wherein the emotion detection module is specifically configured to: obtain a sample video in advance, wherein the sample video includes sample image frames in which a face presents different expressions in a speaking state; input the sample video into an emotion detection model to be trained, and output emotion detection results of each sample image frame in the sample video through the emotion detection model; and train the emotion detection model with minimization of the difference between the emotion detection result of each sample image frame in the sample video and the corresponding label as an optimization objective.
12. The apparatus of claim 11, wherein the emotion detection module is specifically configured to: determine the mean and variance of all the emotion detection results according to the emotion detection results of the sample image frames in the sample video; determine the mean and variance of all the label values according to the label values of the sample image frames in the sample video; determine a first loss according to the mean and variance of all the emotion detection results and the mean and variance of all the label values; determine a second loss according to the differences between the emotion detection results of the sample image frames in the sample video and the corresponding label values; determine a comprehensive loss according to the first loss and the second loss; and train the emotion detection model with minimization of the comprehensive loss as an optimization objective.
13. The apparatus of claim 9, wherein the first determining module is specifically configured to: group the face images according to the number of face images to be included in each combination and the time sequence of the image frames in which the face images are located, so as to obtain the combinations; and determine, for each combination, a comprehensive emotion detection result corresponding to the combination according to the emotion detection results of the face images contained in the combination.
14. The apparatus of claim 9, wherein the first determining module is specifically configured to: perform mean filtering on the emotion detection results of the face images to obtain filtered emotion detection results of the face images; and determine, according to each combination obtained by grouping the face images in advance and the filtered emotion detection results of the face images, a comprehensive emotion detection result corresponding to each combination.
15. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-8.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-8 when executing the program.
CN202310623737.8A 2023-05-30 2023-05-30 Expression recognition method and device, storage medium and electronic equipment Active CN116343314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310623737.8A CN116343314B (en) 2023-05-30 2023-05-30 Expression recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310623737.8A CN116343314B (en) 2023-05-30 2023-05-30 Expression recognition method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116343314A true CN116343314A (en) 2023-06-27
CN116343314B CN116343314B (en) 2023-08-25

Family

ID=86882695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310623737.8A Active CN116343314B (en) 2023-05-30 2023-05-30 Expression recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116343314B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101480669B1 (en) * 2014-03-24 2015-01-26 충남대학교산학협력단 Mobile Terminal Having Emotion Recognition Application using Facial Expression and Method for Controlling thereof
CN107358169A (en) * 2017-06-21 2017-11-17 厦门中控智慧信息技术有限公司 A kind of facial expression recognizing method and expression recognition device
CN109190487A (en) * 2018-08-07 2019-01-11 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
CN109508638A (en) * 2018-10-11 2019-03-22 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
CN112990007A (en) * 2021-03-13 2021-06-18 山东大学 Facial expression recognition method and system based on regional grouping and internal association fusion
CN113837153A (en) * 2021-11-25 2021-12-24 之江实验室 Real-time emotion recognition method and system integrating pupil data and facial expressions
CN114821742A (en) * 2022-05-19 2022-07-29 北京市西城区和平门幼儿园 Method and device for identifying facial expressions of children or teenagers in real time
CN115937949A (en) * 2022-12-14 2023-04-07 上海车联天下信息技术有限公司 Expression recognition method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Thiago H.H. et al.: "Fusion of feature sets and classifiers for facial expression recognition", Expert Systems with Applications, pages 646-655 *
Hu Penghao et al.: "Research on multimodal emotion recognition based on facial expressions and physiological signals", Computer Applications (《计算机应用》), pages 113-117 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665281A (en) * 2023-06-28 2023-08-29 湖南创星科技股份有限公司 Key emotion extraction method based on doctor-patient interaction
CN116665281B (en) * 2023-06-28 2024-05-10 湖南创星科技股份有限公司 Key emotion extraction method based on doctor-patient interaction
CN117633606A (en) * 2024-01-26 2024-03-01 浙江大学医学院附属第一医院(浙江省第一医院) Consciousness detection method, equipment and medium based on olfactory stimulus and facial expression
CN117633606B (en) * 2024-01-26 2024-04-19 浙江大学医学院附属第一医院(浙江省第一医院) Consciousness detection method, equipment and medium based on olfactory stimulus and facial expression

Also Published As

Publication number Publication date
CN116343314B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN113095124B (en) Face living body detection method and device and electronic equipment
CN116343314B (en) Expression recognition method and device, storage medium and electronic equipment
CN115618964B (en) Model training method and device, storage medium and electronic equipment
CN116312480A (en) Voice recognition method, device, equipment and readable storage medium
CN117409466B (en) Three-dimensional dynamic expression generation method and device based on multi-label control
CN117197781B (en) Traffic sign recognition method and device, storage medium and electronic equipment
CN116186330B (en) Video deduplication method and device based on multi-mode learning
CN115620706B (en) Model training method, device, equipment and storage medium
CN115456114A (en) Method, device, medium and equipment for model training and business execution
CN112397073B (en) Audio data processing method and device
CN115830633A (en) Pedestrian re-identification method and system based on multitask learning residual error neural network
CN115171735A (en) Voice activity detection method, storage medium and electronic equipment
CN111242195B (en) Model, insurance wind control model training method and device and electronic equipment
CN117077013B (en) Sleep spindle wave detection method, electronic equipment and medium
CN115862675B (en) Emotion recognition method, device, equipment and storage medium
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN117037046B (en) Audio-visual event detection method and device, storage medium and electronic equipment
CN118098266A (en) Voice data processing method and device based on multi-model selection
CN117726760B (en) Training method and device for three-dimensional human body reconstruction model of video
CN117494068B (en) Network public opinion analysis method and device combining deep learning and causal inference
CN115952271B (en) Method and device for generating dialogue information, storage medium and electronic equipment
CN115862668B (en) Method and system for judging interactive object based on sound source positioning by robot
CN117406628B (en) Laboratory ventilation control system based on sensing monitoring
CN117036870A (en) Model training and image recognition method based on integral gradient diversity
CN117593789A (en) Action recognition model training method, device and equipment based on pre-training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant