Emotion detection method and device based on voice recognition and image recognition
Technical Field
The invention relates to a method and a device for emotion detection based on voice recognition and image recognition.
Background
Before information processing technology was well developed, when the emotion of a speaker was to be judged from a piece of video or a piece of speech, a dedicated inspector judged the speaker's emotion from the speaker's expression, the speaker's breathing, and related keywords appearing in the speech. This manual judgment method has the following defects: (1) the detection personnel are easily affected by subjective factors, causing detection errors; (2) dedicated personnel have to be specially arranged, increasing labor cost; (3) the detection personnel can only make a judgment after watching an entire video or listening to an entire piece of speech, and can only judge one video or one piece of speech at a time, so efficiency is very low.
Disclosure of Invention
In order to solve the technical problem, the invention provides a method and a device for emotion detection based on voice recognition and image recognition.
The invention adopts the following technical scheme:
An emotion detection method based on voice recognition and image recognition comprises the following steps:
acquiring a self-shot video of a user to be detected and an actual scene corresponding to the self-shot video;
processing the self-shot video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression of the person in each image;
acquiring an expression change trend according to the facial expression in each image and the chronological order of the images;
carrying out voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user.
Preferably, fusing the expression change trend and the preliminary emotion result to obtain the final emotion result includes:
if the expression change trend is towards positive expressions and the preliminary emotion result is positive emotion, the final emotion result is positive emotion;
and if the expression change trend is towards negative expressions and the preliminary emotion result is negative emotion, the final emotion result is negative emotion.
Preferably, the process of acquiring the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring actual emotion results of all correction texts in all scenes;
inputting each correction text in each scene into an existing detection model to obtain a detected emotion result of each correction text in each scene;
acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotion, to obtain first correction texts in a first scene; and acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotion, to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the facial expression of the person in each image includes:
carrying out face recognition on the at least two images to obtain the user's face image in each image;
and performing expression recognition on the user's face image in each image to obtain the facial expression of the person in each image.
Preferably, the performing expression recognition on the facial image of the user in each image to obtain the expression of the person in each image includes:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction; inputting the feature vector output by the expression recognition encoder into a Flatten layer, which processes it into a one-dimensional feature vector; using the one-dimensional feature vector as the input of a full connection layer, which maps it to the label space; passing the result to a softmax function, which outputs the probabilities of the two expression categories; and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating a cross entropy loss between the initial expression category and the labeling data, and optimizing the parameters of the expression recognition network;
and inputting the user's face image in each image into the expression recognition network to obtain the facial expression of the user in each image.
An emotion detection apparatus based on speech recognition and image recognition, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor, when executing the computer program, implementing the following steps of the emotion detection method based on speech recognition and image recognition:
acquiring a self-shot video of a user to be detected and an actual scene corresponding to the self-shot video;
processing the self-shot video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression of the person in each image;
acquiring an expression change trend according to the facial expression in each image and the chronological order of the images;
carrying out voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user.
Preferably, fusing the expression change trend and the preliminary emotion result to obtain the final emotion result includes:
if the expression change trend is towards positive expressions and the preliminary emotion result is positive emotion, the final emotion result is positive emotion;
and if the expression change trend is towards negative expressions and the preliminary emotion result is negative emotion, the final emotion result is negative emotion.
Preferably, the process of acquiring the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring actual emotion results of all correction texts in all scenes;
inputting each correction text in each scene into an existing detection model to obtain a detected emotion result of each correction text in each scene;
acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotion, to obtain first correction texts in a first scene; and acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotion, to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the facial expression of the person in each image includes:
carrying out face recognition on the at least two images to obtain the user's face image in each image;
and performing expression recognition on the user's face image in each image to obtain the facial expression of the person in each image.
Preferably, the performing expression recognition on the facial image of the user in each image to obtain the expression of the person in each image includes:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction; inputting the feature vector output by the expression recognition encoder into a Flatten layer, which processes it into a one-dimensional feature vector; using the one-dimensional feature vector as the input of a full connection layer, which maps it to the label space; passing the result to a softmax function, which outputs the probabilities of the two expression categories; and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating a cross entropy loss between the initial expression category and the labeling data, and optimizing the parameters of the expression recognition network;
and inputting the user's face image in each image into the expression recognition network to obtain the facial expression of the user in each image.
The invention has the beneficial effects that: image processing and voice processing are respectively carried out on the self-shot video of the user. The image processing acquires the facial expression of the person in a plurality of images, and the expression change trend is obtained from these expressions and their chronological order. Because even the same text can correspond to different emotions in different scenes, and text is a carrier that reflects a person's emotion, voice recognition is performed on the voice signal of the self-shot video to acquire a text signal; the text signal and the actual scene are input into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene, so the actual scene is applied to emotion detection and the detection accuracy is improved. Finally, the expression change trend and the preliminary emotion result are fused to obtain the final emotion result of the user. Therefore, the emotion detection method based on voice recognition and image recognition is an automatic detection method: the video is processed in two respects, expression recognition is performed on the images, the voice is converted into text, an emotion result is obtained from the text signal and the actual scene, and the final emotion result is obtained by fusing the two kinds of information. No detection personnel need to be specially arranged, so labor cost is reduced; processing is fast, and once the processing equipment is set up, a plurality of self-shot videos can be processed simultaneously, so efficiency is higher.
Drawings
Fig. 1 is a flow chart of a method for emotion detection based on speech recognition and image recognition.
Detailed Description
This embodiment provides a method for emotion detection based on voice recognition and image recognition. The hardware execution body of the emotion detection method can be a computer device, a server device, an intelligent mobile terminal, or the like; this embodiment does not specifically limit the hardware execution body.
As shown in fig. 1, the emotion detection method based on speech recognition and image recognition includes:
Step S1: acquiring a self-shot video of the user to be detected and the actual scene corresponding to the self-shot video:
The user transmits the self-shot video to the hardware execution body. The length of the self-shot video is set according to actual requirements; for example, it can be a short video within 30 s, or a longer video of 2-3 minutes. The user also transmits the actual scene of the self-shot video to the hardware execution body. The actual scene refers to the environment or occasion in which the self-shot video was recorded, such as at home, at work, or in other public places such as a KTV, a supermarket, or a restaurant. The actual scene is acquired and applied to emotion detection because, for the same content in a video, the emotion may differ from scene to scene.
Step S2: processing the self-shot video to obtain an image signal and a voice signal:
The self-shot video is parsed to obtain an image signal and a voice signal. The image signal is the video data containing only images and no sound; it should be understood that, because the video is a self-shot video of the user, the image signal contains a face image of the user. The voice signal is the sound signal in the self-shot video, specifically, what the user says in the self-shot video.
Since decomposing a video file into an image signal and a sound signal is conventional technology, its description is omitted.
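As a minimal illustrative sketch of this step, the separation can be performed with an off-the-shelf library; moviepy is used below only as one possible tool, and the file paths are placeholders.

```python
# Illustrative sketch of step S2, assuming the self-shot video is stored as a file;
# moviepy is only one possible way to split the image signal and the voice signal.
from moviepy.editor import VideoFileClip

def split_video(video_path, audio_path="voice_signal.wav"):
    clip = VideoFileClip(video_path)           # load the self-shot video
    clip.audio.write_audiofile(audio_path)     # voice signal: the audio track of the video
    image_signal = clip.without_audio()        # image signal: the video with the sound removed
    return image_signal, audio_path
```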
Step S3: performing screenshot processing on the image signal according to a preset period to obtain at least two images:
Screenshot processing is performed on the image signal according to a preset period to obtain at least two images. The preset period is set according to actual needs; the longer the preset period, the fewer images are acquired. It should be understood that, since the video is a self-shot video, each obtained image includes a face image of the user.
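The periodic screenshot can be implemented, for example, by sampling frames with OpenCV; the video path and the one-second default period below are assumptions made only for illustration.

```python
# Illustrative sketch of step S3: keep one frame every `period_s` seconds of the image signal.
import cv2

def capture_frames(video_path, period_s=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back to 25 fps if the rate is unknown
    step = max(1, int(round(fps * period_s)))   # number of frames between two screenshots
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                # each kept frame contains the user's face
        index += 1
    cap.release()
    return frames                               # at least two images for videos longer than one period
```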
Step S4: performing expression recognition on the at least two images to obtain the facial expression of the person in each image:
First, face recognition is carried out on each of the at least two images to obtain the user's face image in each image.
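The following sketch uses OpenCV's bundled Haar cascade detector purely as an example; any face recognition method could be substituted, and taking the largest detected face as the user's face is an assumption.

```python
# Illustrative face-recognition sub-step of S4: crop the user's face image from a screenshot.
import cv2

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_user_face(image):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                     # no face found in this screenshot
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # assume the largest face is the user
    return image[y:y + h, x:x + w]                      # the user's face image
```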
Then, expression recognition is performed on the user's face image in each image to obtain the person's expression in each image. As a specific implementation, an expression recognition process is given as follows:
The user's face image in each image is input into an expression recognition network to obtain the user's expression in each image. The expression recognition network can be obtained through the following training process:
A first sample set and a second sample set are obtained; the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image. A positive expression sample image is a sample image in which the person's expression is positive, such as happy or joyful; a negative expression sample image is a sample image in which the person's expression is negative, such as sad, crying, or upset.
Each positive expression sample image in the first sample set is labeled to obtain a first expression category, which is positive expression; each negative expression sample image in the second sample set is labeled to obtain a second expression category, which is negative expression. That is, the labeled expression categories are divided into two classes, and different indexes can represent different categories, for example index 0 corresponds to positive expression and index 1 corresponds to negative expression; the labels can further be one-hot encoded. The first expression category and the second expression category constitute the labeling data.
The expression recognition network comprises an expression recognition encoder, a Flatten layer, a full connection layer and a softmax function.
The first sample set and the second sample set are input into the expression recognition encoder for feature extraction, and the encoder outputs a feature vector (for example, describing how open the corners of the mouth are). The feature vector is input into the Flatten layer, which processes it into a one-dimensional feature vector. The one-dimensional feature vector is used as the input of the full connection layer, which maps it to the label space; the result is then passed to the softmax function, which outputs the probabilities of the two expression categories (the two probabilities sum to 1), and the corresponding initial expression category is determined according to the output probabilities of the two expression categories.
A cross entropy loss is calculated between the obtained initial expression categories and the labeling data, and the parameters of the expression recognition network are optimized so that the output expression categories gradually approach the true values.
The user's face image in each image is input into the expression recognition network for expression recognition. Specifically, the user's face image in each image is input into the expression recognition encoder for feature extraction; the encoder outputs a feature vector, which is input into the Flatten layer and processed into a one-dimensional feature vector; the one-dimensional feature vector is used as the input of the full connection layer, which maps it to the label space; the result is then passed to the softmax function, which outputs the corresponding expression category, i.e. a positive expression or a negative expression.
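The following compact PyTorch sketch illustrates this encoder / Flatten / full connection / softmax structure and its cross-entropy training; the encoder architecture, the 64x64 input size, and the optimizer settings are assumptions not specified by the text.

```python
# Sketch of the two-class expression recognition network: encoder -> Flatten -> full connection -> softmax.
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                       # expression recognition encoder (assumed layout)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.flatten = nn.Flatten()                          # Flatten layer -> one-dimensional feature vector
        self.fc = nn.Linear(32 * 16 * 16, 2)                 # full connection layer: two expression categories

    def forward(self, x):                                    # x: (N, 3, 64, 64) user face images
        return self.fc(self.flatten(self.encoder(x)))        # raw scores for the two categories

model = ExpressionNet()
criterion = nn.CrossEntropyLoss()                            # cross entropy loss (applies softmax internally)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(images, labels):                              # labels: 0 = positive expression, 1 = negative expression
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                                          # optimize the parameters of the expression recognition network
    optimizer.step()
    return loss.item()

def predict_expression(face_image):                          # inference on one user face image (3, 64, 64)
    probs = torch.softmax(model(face_image.unsqueeze(0)), dim=1)  # two probabilities that sum to 1
    return "positive" if probs[0, 0] > probs[0, 1] else "negative"
```

Note that in PyTorch the cross entropy loss applies the softmax internally during training, so the explicit softmax is only needed when reading out the two category probabilities at inference time.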
Step S5: acquiring an expression change trend according to the facial expression in each image and the chronological order of the images:
Because each image is obtained according to the preset period, the images have a time order, namely their order during playback of the self-shot video. After the person's expression in each image is obtained, the expression change trend is obtained according to the chronological order of the images. The expression change trend is the direction toward which the expression develops, namely toward positive expressions or toward negative expressions.
Development toward positive expressions includes two cases: the expression changes to a positive expression (for example, from a negative expression to a positive expression), or the expression is always a positive expression. Similarly, development toward negative expressions includes two cases: the expression changes to a negative expression (for example, from a positive expression to a negative expression), or the expression is always a negative expression.
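Under the two-case rule above, one possible reading is that, with only two expression classes, the trend is decided by the expression the chronological sequence ends on; the sketch below assumes that reading.

```python
# Illustrative sketch of step S5 under the assumption that the direction of development
# is determined by the final expression in chronological order.
def expression_trend(expressions):
    # expressions: chronologically ordered list such as ["negative", "negative", "positive"]
    if expressions[-1] == "positive":
        return "toward positive"    # changed to a positive expression, or always positive
    return "toward negative"        # changed to a negative expression, or always negative
```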
Step S6: carrying out voice recognition on the voice signal to obtain a corresponding text signal:
Voice recognition is carried out on the voice signal to obtain the text signal corresponding to the voice signal, that is, the voice signal is converted into a text signal. Since speech recognition is a conventional algorithm, its description is omitted.
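Since only a conventional speech recognition algorithm is required here, any off-the-shelf recognizer can be plugged in; the openai-whisper package is used below purely as an illustrative assumption.

```python
# Illustrative sketch of step S6: convert the voice signal into a text signal with an
# off-the-shelf recognizer (here openai-whisper, chosen only as an example).
import whisper

def speech_to_text(audio_path):
    model = whisper.load_model("base")      # any conventional ASR model could be used instead
    result = model.transcribe(audio_path)   # recognize the speech in the voice signal
    return result["text"]                   # the corresponding text signal
```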
Step S7: inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene:
After the text signal is obtained, the text signal and the actual scene corresponding to the self-shot video are input into a preset detection model, and a preliminary emotion result of the voice signal in the actual scene is acquired.
It should be understood that the preset detection model may be a previously constructed detection model, which includes: at least two scenes, at least two texts in each scene, and an emotion result corresponding to each text in each scene. Since scene and text are independent, the preset detection model can also be said to include: at least two texts, and an emotion result corresponding to each text in each of the at least two scenes. Moreover, to improve detection accuracy, each text in the preset detection model may be a keyword rather than a complete sentence; for example, if the complete sentence is "I don't want to do this anymore", the keyword may be "don't want to do".
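In its simplest form, such a model can be viewed as a scene-indexed keyword table; the scenes, keywords, and emotion results in the sketch below are invented placeholders, not contents of any actual model.

```python
# Illustrative sketch of step S7: look up the preliminary emotion result of a text signal
# in the actual scene. All entries below are placeholder examples.
PRESET_MODEL = {
    "at work": {"don't want to do": "negative", "well done": "positive"},
    "at home": {"so tired": "negative", "really happy": "positive"},
}

def preliminary_emotion(text_signal, actual_scene):
    for keyword, emotion in PRESET_MODEL.get(actual_scene, {}).items():
        if keyword in text_signal:       # a keyword match is enough; a complete sentence is not required
            return emotion               # preliminary emotion result in this actual scene
    return None                          # no keyword of this scene appears in the text signal
```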
The detection model referred to above may be an existing detection model; as a specific implementation, the preset detection model is obtained by correcting the existing detection model, and its acquisition process is as follows:
(1) At least two correction texts in each of at least two scenes are obtained. It should be understood that, in order to improve the reliability of the correction and the accuracy of the preset detection model, the scenes acquired in this step can be made sufficiently broad, and the correction texts in each scene can likewise be made sufficiently broad.
(2) Since the correction texts are texts used to correct the existing detection model and are known, the actual emotion result of each correction text in each scene is also known; the actual emotion result of each correction text in each scene is obtained.
(3) Each correction text in each scene is input into the existing detection model to obtain the detected emotion result of each correction text in each scene.
(4) After the actual emotion result and the detected emotion result of each correction text in each scene have been obtained, the two emotion results of each correction text in each scene are compared. Specifically: the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotion are acquired, giving the first correction texts in the first scene; and the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotion are acquired, giving the second correction texts in the second scene.
(5) The existing detection model is adjusted according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model. Two adjustment methods are given below. The first: the existing detection model is disregarded, and the preset detection model is constructed directly from the first correction texts in the first scene and the second correction texts in the second scene. The second: among the texts in the existing detection model whose emotion result is positive emotion, those that do not satisfy the condition that both the actual emotion result and the detected emotion result are positive emotion are deleted; and among the texts whose emotion result is negative emotion, those that do not satisfy the condition that both the actual emotion result and the detected emotion result are negative emotion are deleted.
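A minimal sketch of the first adjustment method is shown below; `existing_model` stands in for the existing detection model and is a hypothetical callable, and the (scene, text, actual emotion) triples are assumed to be available from steps (1) and (2).

```python
# Illustrative sketch of adjustment method one: build the preset detection model directly
# from the correction texts whose actual and detected emotion results agree.
def build_preset_model(correction_data, existing_model):
    # correction_data: iterable of (scene, text, actual_emotion) triples with known actual results
    preset_model = {}
    for scene, text, actual in correction_data:
        detected = existing_model(text, scene)                   # detected emotion result of the existing model
        if detected == actual and actual in ("positive", "negative"):
            preset_model.setdefault(scene, {})[text] = actual    # keep only texts where both results agree
    return preset_model
```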
Therefore, the obtained text signal and the actual scene are input into a preset detection model, and a preliminary emotion result of the text signal in the actual scene, namely a preliminary emotion result of the corresponding voice signal in the actual scene, is obtained.
Through the correction, the detection precision of the preset detection model can be improved.
Step S8: fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user:
The obtained expression change trend and the preliminary emotion result are fused. Specifically: if the expression change trend is towards positive expressions and the preliminary emotion result is positive emotion, the final emotion result of the user is positive emotion; if the expression change trend is towards negative expressions and the preliminary emotion result is negative emotion, the final emotion result of the user is negative emotion.
As another embodiment, two weights may be set, and the final emotion result of the user is obtained by combining the expression change trend, the preliminary emotion result, and the corresponding weights.
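The sketch below combines both variants: the agreement rule follows the text, while the numeric mapping (positive = +1, negative = -1) and the example weights used when the two signals disagree are assumptions.

```python
# Illustrative sketch of step S8: fuse the expression change trend with the preliminary emotion result.
def fuse_emotion(trend, preliminary, w_trend=0.4, w_preliminary=0.6):
    if trend == "toward positive" and preliminary == "positive":
        return "positive"
    if trend == "toward negative" and preliminary == "negative":
        return "negative"
    # weighted combination for the remaining cases (weights and scoring are assumptions)
    score = w_trend * (1 if trend == "toward positive" else -1) \
          + w_preliminary * (1 if preliminary == "positive" else -1)
    return "positive" if score > 0 else "negative"
```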
The present embodiment further provides an emotion detection apparatus based on speech recognition and image recognition, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the emotion detection method based on speech recognition and image recognition provided in the present embodiment. Therefore, the emotion detection device based on voice recognition and image recognition is a software device, and the essence of the emotion detection device is still an emotion detection method based on voice recognition and image recognition.
The above-mentioned embodiments are merely illustrative of the technical solutions of the present invention in a specific embodiment, and any equivalent substitutions and modifications or partial substitutions of the present invention without departing from the spirit and scope of the present invention should be covered by the claims of the present invention.