Disclosure of Invention
In order to solve the technical problems, the invention provides an emotion detection method and device based on voice recognition and image recognition.
The invention adopts the following technical scheme:
an emotion detection method based on speech recognition and image recognition, comprising:
acquiring a selfie video of a user to be detected and an actual scene corresponding to the selfie video;
processing the selfie video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal at a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression in each image;
acquiring an expression change trend according to the facial expression in each image and the time order of the images;
performing voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user.
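The steps above can be sketched end to end as follows (a toy, self-contained Python illustration; every component here is a trivial stand-in for the patent's models, and all names and keyword lists are hypothetical):

```python
# Toy sketch of the claimed pipeline; the stand-in components below are
# hypothetical and only make the data flow of the steps concrete.

def expression_trend(expressions):
    # Trend is "positive" if the last sampled expression is positive
    # (covers "changes to positive" and "always positive"), else "negative".
    return "positive" if expressions[-1] == "positive" else "negative"

def detection_model(text_signal, actual_scene):
    # Stand-in for the preset detection model: per-scene keyword lookup.
    negative_keywords = {"at work": ["do not want to work"], "at home": ["crying"]}
    for kw in negative_keywords.get(actual_scene, []):
        if kw in text_signal:
            return "negative"
    return "positive"

def fuse(trend, preliminary):
    # The two signals must agree; otherwise no confident final result.
    return trend if trend == preliminary else "undetermined"

# Usage: expressions sampled from the image signal, text from the voice signal.
expressions = ["negative", "negative", "positive"]   # changes toward positive
result = fuse(expression_trend(expressions),
              detection_model("today went well", "at work"))
print(result)  # positive
```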
Preferably, the fusing the expression change trend and the preliminary emotion result to obtain a final emotion result includes:
if the expression change trend is toward positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is toward negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of obtaining the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring an actual emotion result of each correction text in each scene;
inputting each correction text in each scene into an existing detection model to obtain a detected emotion result of each correction text in each scene;
acquiring the correction texts in each scene whose actual emotion result and detected emotion result are both positive emotions to obtain first correction texts in a first scene, and acquiring the correction texts in each scene whose actual emotion result and detected emotion result are both negative emotions to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the facial expression in each image includes:
performing user face recognition on the at least two images to obtain a user face image of the user;
and performing expression recognition on the user face image in each image to obtain the facial expression in each image.
Preferably, the performing expression recognition on the user face image in each image to obtain the facial expression in each image includes:
obtaining a first sample set and a second sample set, the first sample set comprising at least one positive expression sample image and the second sample set comprising at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, the first expression category being a positive expression, and labeling each negative expression sample image in the second sample set to obtain a second expression category, the second expression category being a negative expression, the first expression category and the second expression category forming the annotation data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting the feature vector output by the expression recognition encoder into a flatten layer, processing the feature vector through the flatten layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as the input of a fully connected layer, mapping the one-dimensional feature vector into a feature label space through the fully connected layer, outputting the result to a softmax function, outputting the probabilities of the two expression categories through the softmax function, and determining a corresponding initial expression category according to the output probabilities of the two expression categories;
calculating a cross-entropy loss between the initial expression category and the annotation data, and optimizing the parameters of the expression recognition network;
and inputting the user face image in each image into the expression recognition network to obtain the facial expression of the user face image in each image.
An emotion detection device based on voice recognition and image recognition, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps of the emotion detection method based on voice recognition and image recognition:
acquiring a selfie video of a user to be detected and an actual scene corresponding to the selfie video;
processing the selfie video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal at a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the facial expression in each image;
acquiring an expression change trend according to the facial expression in each image and the time order of the images;
performing voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user.
Preferably, the fusing the expression change trend and the preliminary emotion result to obtain a final emotion result includes:
if the expression change trend is toward positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is toward negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of obtaining the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring an actual emotion result of each correction text in each scene;
inputting each correction text in each scene into an existing detection model to obtain a detected emotion result of each correction text in each scene;
acquiring the correction texts in each scene whose actual emotion result and detected emotion result are both positive emotions to obtain first correction texts in a first scene, and acquiring the correction texts in each scene whose actual emotion result and detected emotion result are both negative emotions to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the facial expression in each image includes:
performing user face recognition on the at least two images to obtain a user face image of the user;
and performing expression recognition on the user face image in each image to obtain the facial expression in each image.
Preferably, the performing expression recognition on the user face image in each image to obtain the facial expression in each image includes:
obtaining a first sample set and a second sample set, the first sample set comprising at least one positive expression sample image and the second sample set comprising at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, the first expression category being a positive expression, and labeling each negative expression sample image in the second sample set to obtain a second expression category, the second expression category being a negative expression, the first expression category and the second expression category forming the annotation data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting the feature vector output by the expression recognition encoder into a flatten layer, processing the feature vector through the flatten layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as the input of a fully connected layer, mapping the one-dimensional feature vector into a feature label space through the fully connected layer, outputting the result to a softmax function, outputting the probabilities of the two expression categories through the softmax function, and determining a corresponding initial expression category according to the output probabilities of the two expression categories;
calculating a cross-entropy loss between the initial expression category and the annotation data, and optimizing the parameters of the expression recognition network;
and inputting the user face image in each image into the expression recognition network to obtain the facial expression of the user face image in each image.
The beneficial effects of the invention are as follows. Image processing and voice processing are performed separately on the selfie video of the user. The image processing acquires a plurality of facial expressions, and an expression change trend is obtained according to the facial expressions and their time order. Voice recognition is performed on the voice signal of the selfie video to acquire a text signal; the text signal and the actual scene are input into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene. That is, the actual scene is applied to emotion detection, which can improve detection accuracy. Finally, the expression change trend and the preliminary emotion result are fused to acquire the final emotion result of the user. Therefore, the emotion detection method based on voice recognition and image recognition is an automatic detection method that processes two aspects of the video: expression recognition is performed on the images, voice recognition is performed on the voice, an emotion result is obtained according to the text signal and the actual scene, and the two kinds of information are fused to obtain the final emotion result. Compared with manual detection, the method is not influenced by subjective factors, so detection accuracy is improved; no dedicated detection personnel are required, so labor cost is reduced; and processing efficiency is higher, since once the processing equipment is set up, a plurality of selfie videos can be processed simultaneously.
Detailed Description
The embodiment provides an emotion detection method based on voice recognition and image recognition. The hardware execution subject of the emotion detection method may be a computer device, a server device, an intelligent mobile terminal, etc.; the embodiment does not specifically limit the hardware execution subject.
As shown in fig. 1, the emotion detection method based on voice recognition and image recognition includes:
Step S1: acquiring a selfie video of a user to be detected and an actual scene corresponding to the selfie video:
The user transmits the selfie video to the hardware execution subject; the duration of the selfie video is set according to actual requirements, for example, it may be a short video within 30 s or a longer video of 2-3 minutes. The user also transmits the actual scene of the selfie video to the hardware execution subject, where the actual scene may refer to the environment or place where the selfie video was taken, for example: at home, at work, or in other public places such as a KTV, a supermarket, or a restaurant. The actual scene is acquired and applied to emotion detection because, even when videos contain the same data, the emotion in different scenes may differ.
Step S2: processing the selfie video to obtain an image signal and a voice signal:
The selfie video is parsed to obtain an image signal and a voice signal. The image signal is video data with no sound and only images; it is understood that the image signal contains face images of the user, because it comes from the user's selfie video. The voice signal is the sound signal in the selfie video, specifically, what the user says in the selfie video.
Since decomposing a video file into an image signal and a sound signal is conventional, the process is not described in detail here.
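As one conventional option (an assumption for illustration; the patent does not mandate any particular tool), the split can be performed with `ffmpeg`. The sketch below only builds the two command lines; the output file names are hypothetical:

```python
def split_commands(video_path):
    """Build the two ffmpeg invocations that separate a selfie video into a
    silent image signal (-an drops the audio stream) and a voice-only signal
    (-vn drops the video stream).  ffmpeg is one conventional choice."""
    return (
        ["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy", "image_signal.mp4"],
        ["ffmpeg", "-y", "-i", video_path, "-vn", "voice_signal.wav"],
    )

image_cmd, voice_cmd = split_commands("selfie.mp4")
# import subprocess; subprocess.run(image_cmd, check=True)  # when ffmpeg is installed
```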
Step S3: performing screenshot processing on the image signal at a preset period to obtain at least two images:
Screenshot processing is performed on the image signal at a preset period to obtain at least two images. The preset period is set by actual requirements; the longer the preset period, the fewer images are acquired. It should be appreciated that, since the video is a selfie video, each image obtained includes a face image of the user.
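The periodic sampling can be sketched as follows (a minimal illustration; the function name and the choice of sampling from time zero are assumptions, not specified by the patent):

```python
def screenshot_times(duration_s, period_s):
    """Timestamps (seconds) at which to capture images from the image
    signal, one per preset period; a longer period yields fewer images."""
    if period_s <= 0:
        raise ValueError("period must be positive")
    times = []
    t = 0.0
    while t <= duration_s:
        times.append(round(t, 3))
        t += period_s
    return times

print(screenshot_times(10, 2.5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
```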
Step S4: performing expression recognition on the at least two images to obtain the facial expression in each image:
First, user face recognition is performed on each of the at least two images to obtain the user face image in each image.
Then, expression recognition is performed on the user face image in each image to obtain the facial expression in each image. As a specific implementation, an expression recognition process is given as follows:
The user face image in each image is input into an expression recognition network to obtain the facial expression of the user in each image. The expression recognition network can be trained by the following training process:
A first sample set including at least one positive expression sample image and a second sample set including at least one negative expression sample image are acquired. A positive expression sample image is a sample image whose facial expression is positive, such as happy or joyful; a negative expression sample image is a sample image whose facial expression is negative, such as sad, crying, or upset.
Each positive expression sample image in the first sample set is labeled to obtain a first expression category, the first expression category being a positive expression, and each negative expression sample image in the second sample set is labeled to obtain a second expression category, the second expression category being a negative expression. That is, the annotated expression categories are divided into two types, and different expression categories can be represented by different indexes: index 0 corresponds to a positive expression, index 1 corresponds to a negative expression, and the annotations can be one-hot encoded. The first expression category and the second expression category constitute the annotation data.
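The one-hot encoding of the two annotation categories can be illustrated directly (index 0 = positive, index 1 = negative, as above):

```python
def one_hot(expression):
    """One-hot encode the two annotation categories:
    index 0 = positive expression, index 1 = negative expression."""
    index = {"positive": 0, "negative": 1}[expression]
    vector = [0, 0]
    vector[index] = 1
    return vector

print(one_hot("positive"))  # [1, 0]
print(one_hot("negative"))  # [0, 1]
```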
The expression recognition network includes an expression recognition encoder, a flatten layer, a fully connected layer, and a softmax function.
The first sample set and the second sample set are input into the expression recognition encoder for feature extraction, and the encoder outputs feature vectors (describing, for example, the angle of the mouth corners). The feature vectors are input into the flatten layer, which processes them into one-dimensional feature vectors. Each one-dimensional feature vector serves as the input of the fully connected layer, which maps it into the feature label space and outputs the result to the softmax function. The softmax function outputs the probabilities of the two expression categories, which sum to 1, and the corresponding initial expression category is determined according to these two probabilities.
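The softmax stage described here can be sketched in plain Python (the logits are hypothetical values standing in for the fully connected layer's two outputs; the encoder itself is not reproduced):

```python
import math

def softmax(logits):
    """Convert the fully connected layer's two outputs into class
    probabilities; by construction the probabilities sum to 1."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for [positive, negative]:
probs = softmax([2.0, 0.5])
initial_category = "positive" if probs[0] >= probs[1] else "negative"
print(initial_category)  # positive
```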
The cross-entropy loss between the obtained initial expression category and the annotation data is calculated, and the parameters of the expression recognition network are optimized so that the output expression category gradually approaches the true value.
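The loss computation can be illustrated as follows (a sketch; the one-hot label layout follows the index-0/index-1 convention above):

```python
import math

def cross_entropy(predicted_probs, one_hot_label):
    """Cross-entropy loss between the softmax output and the one-hot
    annotation; it shrinks toward 0 as the prediction approaches the label."""
    eps = 1e-12  # avoid log(0)
    return -sum(y * math.log(p + eps)
                for y, p in zip(one_hot_label, predicted_probs))

# A confident correct prediction gives a smaller loss than a wrong one:
print(cross_entropy([0.9, 0.1], [1, 0]) < cross_entropy([0.1, 0.9], [1, 0]))  # True
```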
The user face image in each image is then input into the expression recognition network for expression recognition. Specifically, the user face image is input into the expression recognition encoder for feature extraction; the encoder outputs a feature vector, which is input into the flatten layer and processed into a one-dimensional feature vector; the one-dimensional feature vector serves as the input of the fully connected layer, which maps it into the feature label space and outputs the result to the softmax function; and the softmax function outputs the corresponding expression category, which is either a positive expression or a negative expression.
Step S5: acquiring the expression change trend according to the facial expression in each image and the time order of the images:
Because each image is obtained at a preset period, the images have a time order, namely the time order of the selfie video during playback. After the facial expression in each image is obtained, the expression change trend is acquired according to the time order of the images. The expression change trend is the direction in which the expression develops, i.e. toward a positive expression or toward a negative expression.
Development toward a positive expression includes two cases: the expression changes to a positive expression (e.g. from negative to positive), or the expression is always positive. Similarly, development toward a negative expression also includes two cases: the expression changes to a negative expression (e.g. from positive to negative), or the expression is always negative.
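The four cases above can be sketched as a small classifier (an illustration only; resolving a mixed sequence by its final expression is an assumption, since the patent only names the four cases):

```python
def expression_trend(expressions):
    """Classify the expression change trend from the time-ordered per-image
    expressions: 'toward positive' covers always-positive and
    negative-to-positive sequences; symmetrically for 'toward negative'."""
    if all(e == "positive" for e in expressions):
        return "toward positive"          # always positive
    if all(e == "negative" for e in expressions):
        return "toward negative"          # always negative
    # changed over time: take the direction given by the final expression
    return "toward positive" if expressions[-1] == "positive" else "toward negative"

print(expression_trend(["negative", "negative", "positive"]))  # toward positive
```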
Step S6: performing voice recognition on the voice signal to obtain a corresponding text signal:
Voice recognition is performed on the voice signal to obtain the text signal corresponding to the voice signal, that is, the voice signal is converted into a text signal. Since voice recognition algorithms are conventional, they are not described in detail here.
Step S7: inputting the text signal and the actual scene into a preset detection model, and obtaining a preliminary emotion result of the voice signal in the actual scene:
After the text signal is obtained, the text signal and the actual scene corresponding to the selfie video are input into a preset detection model, and a preliminary emotion result of the voice signal in the actual scene is acquired.
It should be appreciated that the preset detection model may be a detection model constructed in advance, including: at least two scenes, at least two texts in each scene, and the emotion result corresponding to each text in each scene. It should be understood that, to improve detection accuracy, the number of scenes in the detection model and the number of texts in each scene should be large enough; that is, every currently known scene, and the texts that can occur in each scene, are included in the detection model. Since scenes and texts are independent, the preset detection model can also be said to include: at least two texts, and the emotion result corresponding to each text in each of at least two scenes. Moreover, to improve detection accuracy, each text in the preset detection model may be a keyword rather than a complete sentence, for example: the complete sentence is "I do not want to work", and the keyword may be "do not want to work".
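The scene-plus-keyword lookup described here can be sketched as follows (the model contents below are hypothetical examples, far smaller than a real model would be; the KTV entry only illustrates that the same words may map to different emotions in different scenes):

```python
# Hypothetical contents for the preset detection model:
# scene -> {keyword -> emotion result}.
PRESET_MODEL = {
    "at work": {"do not want to work": "negative", "got a promotion": "positive"},
    "KTV":     {"do not want to work": "positive"},  # same words, different scene
}

def preliminary_emotion(text_signal, actual_scene, default="positive"):
    """Return the emotion of the first scene keyword found in the text."""
    for keyword, emotion in PRESET_MODEL.get(actual_scene, {}).items():
        if keyword in text_signal:
            return emotion
    return default

print(preliminary_emotion("I do not want to work today", "at work"))  # negative
```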
The detection model referred to in the previous paragraph may be an existing detection model. As a specific embodiment, the preset detection model is a detection model obtained by correcting an existing model, and its acquisition procedure is given as follows:
(1) At least two correction texts in each of at least two scenes are acquired. It should be appreciated that, in order to improve the reliability of the correction and thus the accuracy of the preset detection model, the number of scenes acquired in this step, and the number of correction texts in each scene, may be set sufficiently large.
(2) Since the correction texts are texts used to correct the existing detection model, the actual emotion result of each correction text in each scene is known, and the actual emotion result of each correction text in each scene is acquired.
(3) And inputting each correction text in each scene into the existing detection model to obtain a detection emotion result of each correction text in each scene.
(4) After the actual emotion result and the detected emotion result of each correction text in each scene are obtained, the two emotion results of each correction text in each scene are compared, specifically: the correction texts in each scene whose actual emotion result and detected emotion result are both positive emotions are acquired to obtain the first correction texts in the first scene, and the correction texts in each scene whose actual emotion result and detected emotion result are both negative emotions are acquired to obtain the second correction texts in the second scene.
(5) The existing detection model is adjusted according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model. Two adjustment methods are given below. The first: the existing detection model is discarded, and the preset detection model is built directly from each first correction text in the first scene and each second correction text in the second scene. The second: for the texts in each scene whose emotion result in the existing detection model is a positive emotion, those that do not satisfy the condition that the actual emotion result and the detected emotion result are both positive emotions are deleted; and for the texts in each scene whose emotion result in the existing detection model is a negative emotion, those that do not satisfy the condition that the actual emotion result and the detected emotion result are both negative emotions are deleted.
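The second adjustment method can be sketched as a filter (an illustration; the `(scene, text) -> emotion` dictionary shape is an assumed representation, not one given by the patent):

```python
def correct_model(existing_model, actual_results, detected_results):
    """Second adjustment method: keep only the texts whose actual and
    detected emotion results both agree with the emotion stored in the
    existing model; delete the rest.  All dicts map
    (scene, text) -> 'positive' / 'negative' (hypothetical shape)."""
    return {
        key: emotion
        for key, emotion in existing_model.items()
        if actual_results.get(key) == emotion and detected_results.get(key) == emotion
    }

existing = {("work", "do not want to work"): "negative",
            ("home", "lovely day"): "positive"}
actual   = {("work", "do not want to work"): "negative",
            ("home", "lovely day"): "negative"}   # annotation disagrees
detected = {("work", "do not want to work"): "negative",
            ("home", "lovely day"): "positive"}
print(correct_model(existing, actual, detected))
# {('work', 'do not want to work'): 'negative'}
```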
Therefore, the obtained text signal and the actual scene are input into a preset detection model, and a preliminary emotion result of the text signal in the actual scene, namely a preliminary emotion result of the corresponding voice signal in the actual scene, is obtained.
Through correction, the detection precision of a preset detection model can be improved.
Step S8: and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user:
The obtained expression change trend and preliminary emotion result are fused, specifically: if the expression change trend is toward positive expression and the preliminary emotion result is a positive emotion, the final emotion result of the user is a positive emotion; if the expression change trend is toward negative expression and the preliminary emotion result is a negative emotion, the final emotion result of the user is a negative emotion.
As another embodiment, two weights may also be set, and the final emotion result of the user may be obtained by combining the expression change trend and the preliminary emotion result with their corresponding weights.
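One way such a weighted fusion could look is sketched below (the weight values and the sign-based scoring are illustrative assumptions; the patent does not specify them):

```python
def fuse_weighted(trend, preliminary, w_trend=0.4, w_text=0.6):
    """Hypothetical weighted fusion: map each signal to +1 (positive) or
    -1 (negative), combine with the two weights, and take the sign.
    The weight values are illustrative, not given by the patent."""
    score = {"positive": 1, "negative": -1}
    total = w_trend * score[trend] + w_text * score[preliminary]
    return "positive" if total >= 0 else "negative"

# With the text signal weighted more heavily, a positive preliminary result
# outweighs a negative expression trend:
print(fuse_weighted("negative", "positive"))  # positive
```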
The embodiment also provides an emotion detection device based on voice recognition and image recognition, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the emotion detection method based on voice recognition and image recognition provided by this embodiment when executing the computer program. The emotion detection device is essentially a software device; the emotion detection method it carries out has been described in detail in the above embodiment and is not repeated here.
The foregoing examples merely illustrate the technical solution of the present invention in one specific embodiment. Any equivalent replacement, modification, or partial substitution made without departing from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.