CN113920568A - Face and human body posture emotion recognition method based on video image


Info

Publication number
CN113920568A
Authority
CN
China
Prior art keywords
face
emotion
human
human body
body posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111285381.9A
Other languages
Chinese (zh)
Inventor
秦瑾 (Qin Jin)
席明 (Xi Ming)
焦勇 (Jiao Yong)
秦煜婷 (Qin Yuting)
毛智勇 (Mao Zhiyong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Wanwei Information Technology Co Ltd
Original Assignee
China Telecom Wanwei Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Wanwei Information Technology Co Ltd filed Critical China Telecom Wanwei Information Technology Co Ltd
Priority to CN202111285381.9A
Publication of CN113920568A
Pending legal-status Critical Current

Classifications

    • G06F18/214 (Pattern recognition): Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 (Pattern recognition): Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance rate versus false rejection rate
    • G06F18/254 (Pattern recognition): Fusion techniques of classification results, e.g. of results related to the same input data
    • G06F18/256 (Pattern recognition): Fusion of classification results relating to different input data, e.g. multimodal recognition
    • G06N3/044 (Neural networks): Recurrent networks, e.g. Hopfield networks
    • G06N3/08 (Neural networks): Learning methods

Abstract

The invention belongs to the technical field of computer vision recognition, and specifically relates to a face and human body posture emotion recognition method based on video images. The method comprises the following steps: video image acquisition, video frame sequencing, face detection, human body posture detection, image preprocessing, face emotion feature extraction, human body posture emotion feature extraction, face emotion recognition, human body posture emotion recognition, softmax classification, weighted averaging and emotion classification. Multiple rounds of experiments show that the multi-modal method achieves clearly higher accuracy than single-modal emotion recognition, and that assigning a relatively larger weight to the face features and a relatively smaller weight to the posture features yields higher recognition accuracy; that is, face information plays the leading role in emotion recognition, while human body posture plays an auxiliary role.

Description

Face and human body posture emotion recognition method based on video image
Technical Field
The invention belongs to the technical field of computer vision recognition, and specifically relates to a face and human body posture emotion recognition method based on video images.
Background
Emotion is an intuitive reflection and high-level summary of subjective feeling, inner mental activity and external behavior, and plays an important role in everyday human interaction. Emotion recognition has broad application prospects in medicine, education and safe driving, and is one of the research hotspots in the field of computer vision.
A face image contains rich physiological feature information, such as gender, age and emotion, and is one of the main research directions in biometric recognition. Human body posture likewise carries rich physiological feature information; for example, people in different emotional states show clearly different posture characteristics, which makes it possible to recognize emotion from such information. At present, single-modal face emotion recognition suffers from low accuracy. Emotion recognition based on physiological signals is objective and yields relatively good results, but acquiring physiological signals requires dedicated equipment, is difficult to carry out and gives a poor user experience. Moreover, physiological signal acquisition requires deliberately holding a given emotion, which easily causes stiff facial expressions and body muscles. Taking photography as an example, a certain smile is usually expected to ensure photo quality, but for reasons such as emotional distress, facial tension or loss of emotional control, the face and body cannot always be controlled on demand; to work around this, the photographer asks subjects to say "eggplant" (the Chinese counterpart of "say cheese") to simulate a smiling state before pressing the shutter. Facial emotions are varied, however, and a simulated state cannot be prescribed for every expression. In practice, the most natural emotion is usually the one captured by recording or snapping video over a period of time. Yet most existing recognition techniques operate on static images, and recognition in the dynamic setting remains largely unexplored.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a face and human body posture emotion recognition method based on video images, which effectively overcomes the above defects and significantly improves recognition accuracy compared with single-modal emotion recognition methods.
The technical scheme adopted by the invention to solve these technical problems is as follows:
A face and human body posture emotion recognition method based on video images comprises the following steps:
video image acquisition, video frame sequencing, face detection, human body posture detection, image preprocessing, face emotion feature extraction, human body posture emotion feature extraction, face emotion recognition, human body posture emotion recognition, softmax classification, weighted averaging and emotion classification;
Video image acquisition uses a camera, mobile phone or similar device to capture pedestrian video, converts the continuous video into frame-sequence images, and removes frames that contain no pedestrians or only incomplete pedestrians; face detection locates faces in the pedestrian frames according to the geometric contour of the face; human body posture detection locates the human posture in the pedestrian frames, the posture comprising the head inclination angle, the arm swing amplitude and speed, and the leg swing amplitude and speed; image preprocessing enhances the image through light compensation, applies a gray-level transform to reduce computation and storage, applies geometric correction to crop and align the face and posture images respectively, and applies filtering and sharpening to highlight edge detail and remove the influence of noise; face emotion features are extracted with a ConvLSTM neural network model on the basis of facial key points and key regions, where the key points comprise the eyebrows, eyes, nose and mouth, and the key regions comprise the forehead, cheeks, eye bags, upper and lower jaw and lips; human body posture emotion features are likewise extracted with a ConvLSTM neural network model on the basis of the posture key points, covering the head inclination angle and the swing amplitude and speed of the arms and legs; the softmax classification layer converts each recognition output into a probabilistic result; and the multi-modal recognition results are combined by weighted averaging to produce the final output.
A dynamic pedestrian video is acquired through video image acquisition and converted into an image frame sequence, and frames in which the face is unclear or the human posture is incomplete are discarded; specifically, the pedestrian video is captured by a surveillance camera or a mobile phone, converted into an image frame sequence, and frames without pedestrian information or with incomplete pedestrians are filtered out.
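As a minimal sketch of this acquisition and screening step (not part of the patent text), the following Python/OpenCV fragment converts a video into a frame sequence and drops unusable frames; the sampling interval and the variance-of-Laplacian blur threshold are illustrative assumptions, since the patent does not fix concrete values.

```python
import cv2

def video_to_frames(video_path, sample_every=5, blur_threshold=100.0):
    """Convert a pedestrian video into a frame sequence, keeping every
    sample_every-th frame and dropping frames that are too blurry
    (variance-of-Laplacian test). Both parameters are assumptions."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # A low Laplacian variance indicates an unclear frame.
            if cv2.Laplacian(gray, cv2.CV_64F).var() >= blur_threshold:
                frames.append(frame)
        idx += 1
    cap.release()
    return frames
```

A pedestrian detector would then be run over the returned frames to filter out images without pedestrian information or with incomplete pedestrians.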
Face detection and human body posture detection are then performed separately on the screened image frame sequence; the detected face and posture images are preprocessed to remove the influence of noise on emotion feature extraction. Specifically, faces are detected in the screened pedestrian frames according to the geometric contour of the face, and postures according to the sequence of human posture key points; the detected face and posture images are classified and preprocessed, where the preprocessing comprises light compensation to enhance the image, a gray-level transform to reduce computation and storage, geometric correction to crop and align the face and posture images respectively, and filtering and sharpening to highlight edge detail and remove noise, in preparation for the feature extraction stage.
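The patent names these preprocessing operations but not their concrete operators; the sketch below is an assumption-laden illustration, using CLAHE as a stand-in for light compensation and unsharp masking for the filtering and sharpening step.

```python
import cv2

def preprocess(image_bgr, box, out_size=(112, 112)):
    """Illustrative preprocessing chain: geometric crop, gray-level
    transform, CLAHE as a stand-in for light compensation, and an
    unsharp mask to highlight edge detail. Operator choices and the
    output size are assumptions, not taken from the patent."""
    x, y, w, h = box
    crop = image_bgr[y:y + h, x:x + w]                  # geometric correction (crop)
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)       # reduce computation/storage
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lit = clahe.apply(gray)                             # light compensation
    blur = cv2.GaussianBlur(lit, (0, 0), sigmaX=2.0)    # smooth away noise
    sharp = cv2.addWeighted(lit, 1.5, blur, -0.5, 0)    # unsharp mask: edge detail
    return cv2.resize(sharp, out_size)
```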
Face emotion features are extracted from the preprocessed face image sequence through a ConvLSTM neural network model: the feature information is drawn from the facial key points and key regions, and a face emotion recognition classification result is obtained through the softmax classification layer.
Human body posture emotion features are extracted from the preprocessed posture images through a ConvLSTM neural network model, and a posture emotion recognition classification result is obtained at the softmax classification layer.
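The patent does not give the architecture of the ConvLSTM branches; the following Keras sketch shows one plausible layout for a single branch (face or posture) ending in a four-class softmax layer, with the sequence length and all layer sizes assumed.

```python
from tensorflow.keras import layers, models

def build_convlstm_branch(seq_len=16, height=112, width=112, n_classes=4):
    """Minimal ConvLSTM sketch for one modality branch. Input: a
    sequence of preprocessed single-channel frames. Output: softmax
    probabilities over happy/angry/sad/neutral. Layer sizes are
    illustrative assumptions, not taken from the patent."""
    return models.Sequential([
        layers.Input(shape=(seq_len, height, width, 1)),
        layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                          return_sequences=True),
        layers.BatchNormalization(),
        layers.ConvLSTM2D(64, kernel_size=3, padding="same",
                          return_sequences=False),
        layers.GlobalAveragePooling2D(),
        layers.Dense(n_classes, activation="softmax"),
    ])

face_branch = build_convlstm_branch()
face_branch.compile(optimizer="adam", loss="categorical_crossentropy",
                    metrics=["accuracy"])
```

The same construction would be trained separately on face sequences and on posture sequences to give the two pre-trained recognition models.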
The emotion recognition result obtained from the face images and the result obtained from the human body posture are combined by weighted averaging to give an overall emotion recognition result, and the emotion recognition category is output.
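This decision-level fusion reduces to a weighted average of the two softmax distributions followed by an argmax; a minimal sketch, with the face weight left as a free parameter per the ranges given below:

```python
import numpy as np

def fuse(face_probs, pose_probs, w_face=0.6):
    """Weighted-average fusion of the two branch outputs. w_face is
    the weight of the face result (20-80% per the patent, preferably
    40-60%); the posture result receives the remainder. Returns the
    fused distribution and the index of the winning emotion class."""
    face_probs = np.asarray(face_probs, dtype=float)
    pose_probs = np.asarray(pose_probs, dtype=float)
    fused = w_face * face_probs + (1.0 - w_face) * pose_probs
    return fused, int(np.argmax(fused))
```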
The weight of the face emotion recognition result and the weight of the human posture emotion recognition result sum to 100%; the face weight is 20-80%, and the remainder is the weight of the posture result.
Preferably, the weight of the face emotion recognition result is 40-60%, with the remainder assigned to the posture result.
The invention first collects a dynamic pedestrian video and converts it into a frame sequence. The face images and posture images are then processed in separate channels. From the preprocessed face images, a ConvLSTM neural network model extracts face emotion features via the key points and key regions, a pre-trained emotion recognition model recognizes the face emotion, and a face emotion recognition result is obtained through the softmax classification layer. From the preprocessed posture image sequence, a ConvLSTM neural network model extracts emotion feature information such as the head inclination, the arm swing amplitude and speed and the leg swing amplitude and speed via the posture key-point sequence; a pre-trained posture emotion recognition model recognizes the posture emotion, and a posture emotion recognition result is obtained at the softmax classification layer. The two results are then combined by weighted averaging to yield the final recognition result.
To improve emotion recognition accuracy, the invention adopts a multi-modal method combining face images and human body posture, which effectively mitigates the low recognition rate of single-modal emotion recognition. To reduce the complexity of network parameter tuning, the same ConvLSTM neural network model is used in both channels for feature extraction and emotion recognition, which extracts deep, fine-grained image features more effectively. To avoid the loss of local information caused by feature-level fusion, emotion recognition results are obtained separately from the face images and the human posture and then combined by weighted averaging; this reduces both the information loss and the complexity of designing a feature-fusion network, and effectively improves recognition accuracy. In face image feature extraction, the invention extracts face emotion features through facial key points and key regions, chiefly texture and expression information that reflects facial emotion: the key points mainly comprise the eyebrows, eyes, nose and mouth, and the key regions mainly comprise the forehead, cheeks, eye bags, upper and lower jaw and lips. In posture feature extraction, emotion feature information such as the head inclination angle and the swing amplitude and speed of the arms and legs is extracted from the posture key-point sequence.
Using a single ConvLSTM neural network model for both feature extraction and emotion recognition effectively reduces network design complexity and inconsistent parameter settings. Weighted averaging of the face-based and posture-based recognition results to obtain the final result effectively reduces the local information loss caused by feature fusion and thereby improves emotion recognition accuracy.
The emotion output of the invention is a four-way classification: happy, angry, sad and neutral. Multiple rounds of experiments show that multi-modal emotion recognition is clearly superior in accuracy to single-modal recognition, and that assigning a relatively larger weight to the face features and a relatively smaller weight to the posture features yields higher accuracy; that is, face information plays the leading role in emotion recognition, while human body posture plays an auxiliary role.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Detailed Description
A face and human body posture emotion recognition method based on video images comprises the following steps:
video image acquisition, video frame sequencing, face detection, human body posture detection, image preprocessing, face emotion feature extraction, human body posture emotion feature extraction, face emotion recognition, human body posture emotion recognition, softmax classification, weighted averaging and emotion classification;
Video image acquisition uses a camera, mobile phone or similar device to capture pedestrian video, converts the continuous video into frame-sequence images, and removes frames that contain no pedestrians or only incomplete pedestrians; face detection locates faces in the pedestrian frames according to the geometric contour of the face; human body posture detection locates the human posture in the pedestrian frames, the posture comprising the head inclination angle, the arm swing amplitude and speed, and the leg swing amplitude and speed; image preprocessing enhances the image through light compensation, applies a gray-level transform to reduce computation and storage, applies geometric correction to crop and align the face and posture images respectively, and applies filtering and sharpening to highlight edge detail and remove the influence of noise; face emotion features are extracted with a ConvLSTM neural network model on the basis of facial key points and key regions, where the key points comprise the eyebrows, eyes, nose and mouth, and the key regions comprise the forehead, cheeks, eye bags, upper and lower jaw and lips; human body posture emotion features are likewise extracted with a ConvLSTM neural network model on the basis of the posture key points, covering the head inclination angle and the swing amplitude and speed of the arms and legs; the softmax classification layer converts each recognition output into a probabilistic result; and the multi-modal recognition results are combined by weighted averaging to produce the final output.
A dynamic pedestrian video is acquired through video image acquisition and converted into an image frame sequence, and frames in which the face is unclear or the human posture is incomplete are discarded; specifically, the pedestrian video is captured by a surveillance camera or a mobile phone, converted into an image frame sequence, and frames without pedestrian information or with incomplete pedestrians are filtered out.
Face detection and human body posture detection are then performed separately on the screened image frame sequence: faces are detected according to the geometric contour of the face and postures according to the sequence of human posture key points. The detected face and posture images are classified and preprocessed, where the preprocessing comprises light compensation to enhance the image, a gray-level transform to reduce computation and storage, geometric correction to crop and align the face and posture images respectively, and filtering and sharpening to highlight edge detail and remove the influence of noise, in preparation for the feature extraction stage.
Face emotion features are extracted from the preprocessed face image sequence through a ConvLSTM neural network model: the feature information is drawn from the facial key points and key regions, and a face emotion recognition classification result is obtained through the softmax classification layer.
Human body posture emotion features are extracted from the preprocessed posture images through a ConvLSTM neural network model, and a posture emotion recognition classification result is obtained at the softmax classification layer.
The emotion recognition result obtained from the face images and the result obtained from the human body posture are combined by weighted averaging to give an overall emotion recognition result, and the emotion recognition category is output.
The weight of the face emotion recognition result and the weight of the human posture emotion recognition result sum to 100%; the face weight is 20-80%, and the remainder is the weight of the posture result.
Preferably, the weight of the face emotion recognition result is 40-60%, with the remainder assigned to the posture result.
In actual use, face emotion features are extracted from the preprocessed face image sequence through a ConvLSTM neural network model. The feature information is extracted through the facial key points and key regions: the key points mainly comprise the eyebrows, eyes, nose and mouth, and the key regions mainly comprise the forehead, cheeks, eye bags, upper and lower jaw and lips. A pre-trained face emotion recognition model performs emotion recognition on the extracted features, and a classification result is obtained through the softmax classification layer. Concretely, the preprocessed face images pass through the ConvLSTM network model for feature extraction, the pre-trained emotion estimation model predicts the emotion category, the softmax layer outputs a probability for each class, and the class with the maximum probability is taken as the prediction. For example, if a face image with a happy emotion is input, the output might be happy: 70%, neutral: 20%, angry: 8%, sad: 2%, and the overall emotion is recognized as happy.
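That worked example corresponds to taking the argmax of the face branch's softmax output; a tiny illustration (the class ordering is an assumption):

```python
import numpy as np

# Softmax output from the face branch in the example above.
labels = ["happy", "neutral", "angry", "sad"]
face_probs = np.array([0.70, 0.20, 0.08, 0.02])
print(labels[int(np.argmax(face_probs))])  # -> happy
```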
Human body posture emotion features are extracted from the preprocessed posture images through a ConvLSTM neural network model: emotion feature information such as the head inclination angle and the swing amplitude and speed of the arms and legs is extracted from the posture key-point sequence. A pre-trained posture emotion recognition model performs emotion recognition on these features, and a classification result is obtained at the softmax classification layer, with the maximum-probability class taken as the prediction. For example, for an input with a head inclination of about 45 degrees, an arm swing amplitude of about 30 cm at high speed and a leg swing amplitude of about 50 cm at high speed, the output might be angry: 35%, neutral: 33%, happy: 28%, sad: 4%, and the overall emotion is recognized as angry.
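The posture cues named here can be computed from the key-point tracks; the sketch below assumes pixel-space coordinates and a neck/head key-point pair (e.g. an OpenPose-style layout), neither of which is fixed by the patent.

```python
import numpy as np

def head_inclination_deg(neck_xy, head_xy):
    """Angle of the neck-to-head vector from the vertical, in degrees.
    Image y grows downward, so dy is negated to measure against 'up'."""
    v = np.asarray(head_xy, dtype=float) - np.asarray(neck_xy, dtype=float)
    return float(np.degrees(np.arctan2(abs(v[0]), -v[1])))

def swing_amplitude_and_speed(track_xy, fps=25.0):
    """Peak-to-peak extent (diagonal of the track's bounding box) and
    mean speed of a limb key point across the frame sequence, e.g. a
    wrist for arm swing or an ankle for leg swing."""
    pts = np.asarray(track_xy, dtype=float)
    amplitude = float(np.linalg.norm(pts.max(axis=0) - pts.min(axis=0)))
    step_lengths = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    return amplitude, float(step_lengths.mean() * fps)
```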
The face-based emotion recognition result and the posture-based result are combined by weighted averaging to obtain the overall emotion recognition result, and the emotion category is output. The two results are obtained independently from the face images and the human posture. To verify the effectiveness of weighted averaging, each method was run 10 times and the results averaged, comparing two schemes. Scheme one fuses the face features and posture features before recognition, giving a final accuracy of 86.60%. Scheme two verifies the weight ratio over 10 experiments: with 50% weight on the face features and 50% on the posture features the final accuracy is 83.20%; with 60% on the face and 40% on the posture it is 89.20%; with 40% on the face and 60% on the posture it is 78.60%.
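The weight comparison reported above can be reproduced on any labelled validation set with a simple sweep; a sketch in which the array shapes and candidate weights are assumptions:

```python
import numpy as np

def sweep_face_weight(face_probs, pose_probs, labels,
                      weights=(0.4, 0.5, 0.6)):
    """Accuracy of late fusion over candidate face weights, mirroring
    the 40%/50%/60% comparison above. face_probs and pose_probs are
    (n_samples, n_classes) softmax outputs; labels are ground-truth
    class indices of shape (n_samples,)."""
    accuracy = {}
    for w in weights:
        fused = w * face_probs + (1.0 - w) * pose_probs
        accuracy[w] = float((fused.argmax(axis=1) == labels).mean())
    return accuracy
```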
As shown in FIG. 1, the method comprises the steps of video acquisition, video frame sequencing, face detection, human posture detection, image preprocessing, face emotion feature extraction, human posture emotion feature extraction, face emotion recognition, human posture emotion recognition, softmax classification, weighted averaging and emotion classification.
The pedestrian video is captured by the video acquisition device and converted into frame-sequence images. To process the face and posture images effectively, frames containing no face or no human posture are discarded, and unclear frames are filtered out.
The screened video frames are processed simultaneously in separate channels: one channel performs face detection and the other performs human body posture detection. The detected face and posture images are preprocessed, where light compensation enhances image quality; the gray-level transform reduces storage without degrading quality; geometric correction aligns and corrects the position of the face and posture in the image; and filtering and sharpening support more accurate target localization and highlight image detail.
Face emotion features and posture emotion features are extracted from the preprocessed images in a ConvLSTM neural network model through the two channels. Face emotion feature extraction performs feature mapping from the labelled facial key points and key regions, where the key points mainly comprise the eyebrows, eyes, nose and mouth, and the key regions mainly comprise the forehead, cheeks, eye bags, upper and lower jaw and lips. Posture emotion extraction draws its features from the head inclination angle and the swing amplitude and speed of the arms and legs in the human key-point sequence.
Emotion recognition is performed on the extracted face and posture emotion features by calling the pre-trained face and posture emotion recognition models, and the recognition results are obtained through the softmax classification layer.
The face emotion recognition result and the posture emotion recognition result are combined by weighted averaging to obtain the final recognition result, and the emotion category is output.

Claims (8)

1. A face and human body posture emotion recognition method based on video images, characterized by comprising the following steps:
video image acquisition, video frame sequencing, face detection, human body posture detection, image preprocessing, face emotion feature extraction, human body posture emotion feature extraction, face emotion recognition, human body posture emotion recognition, softmax classification, weighted averaging and emotion classification;
wherein video image acquisition uses a camera, mobile phone or similar device to capture pedestrian video, converts the continuous video into frame-sequence images, and removes frames that contain no pedestrians or only incomplete pedestrians; face detection locates faces in the pedestrian frames according to the geometric contour of the face; human body posture detection locates the human posture in the pedestrian frames, the posture comprising the head inclination angle, the arm swing amplitude and speed, and the leg swing amplitude and speed; image preprocessing enhances the image through light compensation, applies a gray-level transform to reduce computation and storage, applies geometric correction to crop and align the face and posture images respectively, and applies filtering and sharpening to highlight edge detail and remove the influence of noise; face emotion features are extracted with a ConvLSTM neural network model on the basis of facial key points and key regions, where the key points comprise the eyebrows, eyes, nose and mouth, and the key regions comprise the forehead, cheeks, eye bags, upper and lower jaw and lips; human body posture emotion features are likewise extracted with a ConvLSTM neural network model on the basis of the posture key points, covering the head inclination angle and the swing amplitude and speed of the arms and legs; the softmax classification layer converts each recognition output into a probabilistic result; and the multi-modal recognition results are combined by weighted averaging to produce the final output.
2. The face and human body posture emotion recognition method based on video images according to claim 1, characterized in that: a dynamic pedestrian video is acquired through video image acquisition and converted into an image frame sequence, and frames in which the face is unclear or the human posture is incomplete are discarded; the pedestrian video is captured by a surveillance camera or a mobile phone, converted into an image frame sequence, and frames without pedestrian information or with incomplete pedestrians are filtered out.
3. The face and human body posture emotion recognition method based on video images according to claim 1, characterized in that: face detection and human body posture detection are performed separately on the screened image frame sequence, faces being detected according to the geometric contour of the face and postures according to the sequence of human posture key points; the detected face and posture images are classified and preprocessed, the preprocessing comprising light compensation to enhance the image, a gray-level transform to reduce computation and storage, geometric correction to crop and align the face and posture images respectively, and filtering and sharpening to highlight edge detail and remove the influence of noise, in preparation for the feature extraction stage.
4. The face and human body posture emotion recognition method based on video images according to claim 1, characterized in that: face emotion features are extracted from the preprocessed face image sequence through a ConvLSTM neural network model; the feature information is extracted through the facial key points and key regions, and a face emotion recognition classification result is obtained through the softmax classification layer.
5. The face and human body posture emotion recognition method based on video images according to claim 1, characterized in that: human body posture emotion features are extracted from the preprocessed posture images through a ConvLSTM neural network model, and a posture emotion recognition classification result is obtained at the softmax classification layer.
6. The face and human body posture emotion recognition method based on video images according to claim 1, characterized in that: the emotion recognition result obtained from the face images and the result obtained from the human body posture are combined by weighted averaging to give an overall emotion recognition result, and the emotion recognition category is output.
7. The face and human body posture emotion recognition method based on video images according to claim 6, characterized in that: the sum of the weight of the face emotion recognition result and the weight of the human posture emotion recognition result is 100%, the weight of the face emotion recognition result is 20-80%, and the remainder is the weight of the human posture emotion recognition result.
8. The face and human body posture emotion recognition method based on video images according to claim 7, characterized in that: the weight of the face emotion recognition result is 40-60%, and the remainder is the weight of the human posture emotion recognition result.
CN202111285381.9A 2021-11-02 2021-11-02 Face and human body posture emotion recognition method based on video image Pending CN113920568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111285381.9A CN113920568A (en) 2021-11-02 2021-11-02 Face and human body posture emotion recognition method based on video image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111285381.9A CN113920568A (en) 2021-11-02 2021-11-02 Face and human body posture emotion recognition method based on video image

Publications (1)

Publication Number Publication Date
CN113920568A (en) 2022-01-11

Family

ID=79244933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111285381.9A Pending CN113920568A (en) 2021-11-02 2021-11-02 Face and human body posture emotion recognition method based on video image

Country Status (1)

Country Link
CN (1) CN113920568A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114550088A (en) * 2022-02-22 2022-05-27 北京城建设计发展集团股份有限公司 Multi-camera fused passenger identification method and system and electronic equipment
CN115049016A (en) * 2022-07-20 2022-09-13 聚好看科技股份有限公司 Model driving method and device based on emotion recognition
CN117036877A (en) * 2023-07-18 2023-11-10 六合熙诚(北京)信息科技有限公司 Emotion recognition method and system for facial expression and gesture fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination