CN112232276A - Emotion detection method and device based on voice recognition and image recognition - Google Patents

Emotion detection method and device based on voice recognition and image recognition

Info

Publication number
CN112232276A
CN112232276A (application CN202011213188.XA; granted as CN112232276B)
Authority
CN
China
Prior art keywords
expression
emotion
image
recognition
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011213188.XA
Other languages
Chinese (zh)
Other versions
CN112232276B (en)
Inventor
Zhao Zhen
Li Xiaoqiang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Enterprise Information Technology Co., Ltd.
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202011213188.XA
Publication of CN112232276A
Application granted
Publication of CN112232276B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an emotion detection method and device based on voice recognition and image recognition. A self-shot video of a user to be detected and the actual scene corresponding to that video are acquired; the video is processed to obtain an image signal and a voice signal; the image signal is processed to obtain an expression change trend; the voice signal is processed to obtain a preliminary emotion result of the voice signal in the actual scene; and finally the expression change trend and the preliminary emotion result are fused to obtain the user's final emotion result. The emotion detection method based on voice recognition and image recognition is an automatic detection method. Compared with manual detection, it is not influenced by subjective factors, so detection accuracy is improved; no dedicated inspectors need to be arranged, so labor cost is reduced; and processing is fast: once the processing equipment is set up, multiple self-shot videos can be processed simultaneously, so efficiency is higher.

Description

Emotion detection method and device based on voice recognition and image recognition
Technical Field
The invention relates to a method and a device for emotion detection based on voice recognition and image recognition.
Background
Before information processing technology was well developed, judging the emotion of a speaker from a piece of video or speech relied on a dedicated inspector, who inferred the emotion from the speaker's expression, breathing, and the relevant keywords appearing in the speech. This manual judgment method has the following defects: (1) inspectors are easily affected by subjective factors, which causes detection errors; (2) dedicated personnel must be arranged, which increases labor cost; (3) an inspector must watch an entire video or listen to an entire recording before a judgment can be made, and can only judge one video or one recording at a time, so efficiency is very low.
Disclosure of Invention
In order to solve the technical problem, the invention provides a method and a device for emotion detection based on voice recognition and image recognition.
The invention adopts the following technical scheme:
An emotion detection method based on voice recognition and image recognition comprises the following steps:
acquiring a self-shot video of a user to be detected and the actual scene corresponding to the self-shot video;
processing the self-shot video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the character expression in each image;
acquiring an expression change trend according to the character expressions in the images and the chronological order of the images;
performing voice recognition on the voice signal to obtain a corresponding character signal;
inputting the character signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.
Preferably, fusing the expression change trend and the preliminary emotion result to obtain the final emotion result includes:
if the expression change trend is towards positive expressions and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards negative expressions and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of acquiring the preset detection model includes:
acquiring at least two correction texts in each scene from at least two scenes;
acquiring actual emotion results of all correction texts in all scenes;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotions, to obtain the first correction texts in the first scene, and acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotions, to obtain the second correction texts in the second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the human expression in each image includes:
carrying out user face recognition on the at least two images to obtain a user face image of the user;
and performing expression recognition on the face image of the user in each image to obtain the expression of the character in each image.
Preferably, the performing expression recognition on the facial image of the user in each image to obtain the expression of the person in each image includes:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a Flatten layer, processing the feature vector by the Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a full connection layer, mapping the one-dimensional feature vector to a feature mark space by the full connection layer, then outputting the feature mark space to a softmax function, outputting the probabilities of two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating the initial expression category and the labeling data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the user face images in the images into the expression recognition network to obtain the character expressions of the user face images in the images.
An emotion detection apparatus based on speech recognition and image recognition comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the following steps of the emotion detection method based on speech recognition and image recognition:
acquiring a self-shot video of a user to be detected and the actual scene corresponding to the self-shot video;
processing the self-shot video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the character expression in each image;
acquiring an expression change trend according to the character expressions in the images and the chronological order of the images;
performing voice recognition on the voice signal to obtain a corresponding character signal;
inputting the character signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.
Preferably, fusing the expression change trend and the preliminary emotion result to obtain the final emotion result includes:
if the expression change trend is towards positive expressions and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards negative expressions and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of acquiring the preset detection model includes:
acquiring at least two correction texts in each scene from at least two scenes;
acquiring actual emotion results of all correction texts in all scenes;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotions, to obtain the first correction texts in the first scene, and acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotions, to obtain the second correction texts in the second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the human expression in each image includes:
carrying out user face recognition on the at least two images to obtain a user face image of the user;
and performing expression recognition on the face image of the user in each image to obtain the expression of the character in each image.
Preferably, the performing expression recognition on the facial image of the user in each image to obtain the expression of the person in each image includes:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a Flatten layer, processing the feature vector by the Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a full connection layer, mapping the one-dimensional feature vector to a feature mark space by the full connection layer, then outputting the feature mark space to a softmax function, outputting the probabilities of two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating the initial expression category and the labeling data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the user face images in the images into the expression recognition network to obtain the character expressions of the user face images in the images.
The invention has the following beneficial effects: image processing and voice processing are performed separately on the user's self-shot video. The image processing obtains a series of character expressions, from which the expression change trend is derived in chronological order. Because even the same character signal can express different emotions in different scenes, and because the character signal is a carrier that reflects a person's emotion, voice recognition is performed on the voice signal of the self-shot video to obtain the character signal, and the character signal and the actual scene are input into a preset detection model to obtain the preliminary emotion result of the voice signal in the actual scene. Applying the actual scene to emotion detection improves detection accuracy. Finally, the expression change trend and the preliminary emotion result are fused to obtain the user's final emotion result. The emotion detection method based on voice recognition and image recognition is therefore an automatic detection method: the video is processed in two ways, character expression recognition is performed on the images, character recognition is performed on the voice, an emotion result is obtained from the character signal and the actual scene, and the final emotion result is obtained by fusing the two kinds of information. No dedicated inspectors need to be arranged, so labor cost is reduced; processing is fast, and once the processing equipment is set up, multiple self-shot videos can be processed simultaneously, so efficiency is higher.
Drawings
Fig. 1 is a flow chart of a method for emotion detection based on speech recognition and image recognition.
Detailed Description
This embodiment provides an emotion detection method based on voice recognition and image recognition. The hardware entity executing the emotion detection method may be a computer device, a server device, an intelligent mobile terminal, or the like; this embodiment does not specifically limit the hardware execution entity.
As shown in fig. 1, the emotion detection method based on speech recognition and image recognition includes:
step S1: acquiring a self-shooting video of a section of user to be detected and an actual scene corresponding to the self-shooting video:
the user transmits the self-shooting video to the hardware execution main body, and the time length of the self-shooting video is set according to actual requirements, for example, the self-shooting video can be a short video within 30s, and can also be a longer video of 2-3 minutes. The user also transmits the actual scene of the self-shooting video to the hardware execution main body, and the actual scene may refer to the environment or the occasion where the self-shooting video is located, such as: at home, or at work, or in other public places, such as: KTV, supermarket, restaurant, etc. The actual scene is obtained and applied to emotion detection because the emotion may differ from scene to scene in the case where the same data is included in the video.
Step S2: processing the self-shot video to obtain an image signal and a voice signal:
The self-shot video is parsed to obtain an image signal and a voice signal. The image signal is the video data without sound, containing only images; it should be understood that, since the video is a self-shot video of the user, the image signal contains the user's face image. The voice signal is the sound signal in the self-shot video, specifically what the user says in the self-shot video.
Since decomposing a video file into an image signal and a sound signal is a conventional technique, it is not described here.
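The split can be done with any standard video tool. As a minimal sketch (not part of the original disclosure), the following Python snippet uses the ffmpeg command-line tool, assuming it is installed and the self-shot video is an ordinary MP4 file; the file names are illustrative only:

```python
import subprocess

def split_video(video_path, image_signal_path, voice_signal_path):
    """Split a self-shot video into a silent image signal and a voice signal.

    Assumes the ffmpeg command-line tool is installed and on the PATH.
    """
    # Image signal: keep the video stream unchanged (-c:v copy) and drop the audio (-an).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy", image_signal_path],
        check=True,
    )
    # Voice signal: drop the video (-vn) and export mono 16 kHz PCM audio.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", voice_signal_path],
        check=True,
    )

split_video("selfie.mp4", "image_signal.mp4", "voice_signal.wav")
```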
Step S3: performing screenshot processing on the image signal according to a preset period to obtain at least two images:
Screenshot processing is performed on the image signal according to a preset period to obtain at least two images. The preset period is set according to actual needs; the longer the preset period, the fewer images are acquired. It should be understood that, since the video is a self-shot video, each of the obtained images contains the user's face image.
Step S4: performing expression recognition on the at least two images to obtain the character expressions in each image:
First, user face recognition is performed on each of the at least two images to obtain the user's face image in each image.
Then, expression recognition is performed on the user's face image in each image to obtain the character expression in each image. As a specific implementation, an expression recognition process is given as follows:
and inputting the facial images of the user in the images into an expression recognition network to obtain the expressions of the user in the images. The expression recognition network can be obtained by adopting the following training process:
A first sample set and a second sample set are obtained; the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image. A positive expression sample image is a sample image in which the person's expression is positive, such as happiness or joy; a negative expression sample image is a sample image in which the person's expression is negative, such as sadness, crying, or distress.
Labeling each positive expression sample image in the first sample set to obtain a first expression type, wherein the first expression type is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression type, and the second expression type is negative expression. That is to say, the expression categories of the labels are divided into two categories, different indexes can be used to represent different expression categories, where the index 0 corresponds to a positive expression and the index 1 corresponds to a negative expression, and the labels can be further encoded by one-hot. The first expression category and the second expression category constitute annotation data.
The expression recognition network comprises an expression recognition encoder, a Flatten layer, a full connection layer and a softmax function.
Inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, outputting a feature vector (such as mouth angle opening degree) by the expression recognition encoder, inputting the feature vector into a Flatten layer, processing the feature vector by the Flatten layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as the input of a full connection layer, mapping the one-dimensional feature vector to a feature mark space by the full connection layer, outputting the feature vector to a softmax function, outputting the probabilities of two expression categories through the softmax function, wherein the probabilities of the two expression categories are added to be 1, and determining the corresponding initial expression category according to the output probabilities of the two expression categories.
The obtained initial expression categories and the annotation data are used to compute a cross-entropy loss, and the parameters of the expression recognition network are optimized so that the output expression categories gradually approach the true values.
The user's face image in each image is input into the trained expression recognition network, and expression recognition is performed by the network. Specifically, the user's face image in each image is input into the expression recognition encoder for feature extraction; the feature vector output by the expression recognition encoder is input into the Flatten layer, which processes it into a one-dimensional feature vector; the one-dimensional feature vector is used as the input of the fully connected layer, which maps it to the feature label space; the result is then passed to the softmax function, which outputs the corresponding expression category, namely a positive expression or a negative expression.
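A minimal sketch of such an expression recognition network is shown below, using PyTorch. The encoder architecture, layer sizes, learning rate, and the random tensors standing in for face crops and labels are all illustrative assumptions; only the overall structure (encoder, Flatten layer, fully connected layer, softmax over two categories, cross-entropy training) follows the description above:

```python
import torch
import torch.nn as nn

class ExpressionRecognitionNet(nn.Module):
    """Encoder -> Flatten layer -> fully connected layer -> softmax over two expression categories."""

    def __init__(self):
        super().__init__()
        # A small convolutional encoder stands in for the expression recognition encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.flatten = nn.Flatten()          # Flatten layer -> one-dimensional feature vector
        self.fc = nn.Linear(32 * 4 * 4, 2)   # maps features to the two expression categories

    def forward(self, x):
        # Raw class scores; the softmax is applied inside the loss during training
        # and explicitly at inference time below.
        return self.fc(self.flatten(self.encoder(x)))

model = ExpressionRecognitionNet()
criterion = nn.CrossEntropyLoss()            # cross-entropy loss over the two categories
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in training data: face crops and labels (0 = positive expression, 1 = negative expression).
faces = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 2, (8,))
for _ in range(10):                          # a few optimisation steps for illustration
    optimizer.zero_grad()
    loss = criterion(model(faces), labels)
    loss.backward()
    optimizer.step()

probs = torch.softmax(model(faces), dim=1)   # the two category probabilities sum to 1
predicted = probs.argmax(dim=1)              # 0 -> positive expression, 1 -> negative expression
```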
Step S5: acquiring expression change trends according to the character expressions in the images and the sequence time sequence of the images:
Because each image is obtained according to the preset period, the images have a time order, namely the order in which they appear as the self-shot video plays. After the character expression in each image is obtained, the expression change trend is obtained from the expressions in chronological order. The expression change trend is the direction in which the expression develops, namely towards positive expressions or towards negative expressions.
Development towards positive expressions covers two cases: the expression changes to a positive expression (e.g., from a negative expression to a positive expression), or the expression is positive throughout. Similarly, development towards negative expressions covers two cases: the expression changes to a negative expression (e.g., from a positive expression to a negative expression), or the expression is negative throughout.
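A simple sketch of deriving the trend from the per-image expression labels follows; treating the last sampled expression as the direction of development is an assumption that covers both cases described above:

```python
def expression_trend(expressions):
    """Infer the expression change trend from per-image labels in chronological order.

    `expressions` is a list of "positive"/"negative" labels, one per sampled image.
    The last label gives the direction of development: either the expression changed
    towards it, or it held throughout.
    """
    if expressions[-1] == "positive":
        return "towards_positive"
    return "towards_negative"

print(expression_trend(["negative", "negative", "positive"]))  # towards_positive
print(expression_trend(["positive", "positive", "positive"]))  # towards_positive
```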
Step S6: performing voice recognition on the voice signal to obtain the corresponding character signal:
Voice recognition is performed on the voice signal to obtain the character signal corresponding to the voice signal, i.e., the voice signal is converted into a character (text) signal. Since speech recognition is a conventional algorithm, it is not described here.
Step S7: inputting the character signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene:
After the character signal is obtained, the character signal and the actual scene corresponding to the self-shot video are input into a preset detection model to acquire the preliminary emotion result of the voice signal in the actual scene.
It should be understood that the preset detection model may be a previously constructed detection model comprising at least two scenes, with at least two texts in each scene and an emotion result corresponding to each text in each scene. Since scene and text are independent, the preset detection model can equivalently be said to comprise at least two texts and, for each text, an emotion result in each of the at least two scenes. Moreover, to improve detection accuracy, each text in the preset detection model may be a keyword rather than a complete sentence; for example, for the complete sentence "I don't want to do this", the keyword may be "don't want to do".
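The patent does not fix an internal representation for the preset detection model. One minimal sketch, assuming a plain scene-to-keyword lookup table with purely illustrative scenes, keywords, and results, could look like this:

```python
# A toy stand-in for the preset detection model: per-scene keywords mapped to an emotion result.
# The scenes, keywords and results below are purely illustrative.
PRESET_MODEL = {
    "workplace": {"don't want to do": "negative", "well done": "positive"},
    "home":      {"so tired": "negative", "great dinner": "positive"},
}

def preliminary_emotion(character_signal, actual_scene, default="neutral"):
    """Return the preliminary emotion result for the character signal in the given scene."""
    for keyword, result in PRESET_MODEL.get(actual_scene, {}).items():
        if keyword in character_signal:
            return result
    return default

print(preliminary_emotion("I really don't want to do this anymore", "workplace"))  # negative
```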
The detection model referred to above may be an existing detection model; as a specific implementation, the preset detection model is obtained by correcting that existing detection model, and the acquisition process is as follows:
(1) At least two correction texts in each of at least two scenes are obtained. It should be understood that, to improve the reliability of the correction and the accuracy of the preset detection model, the acquired scenes can be made sufficiently broad in this step, and the correction texts in each scene can likewise be made sufficiently broad.
(2) Since the correction texts are texts used to correct the existing detection model and are known in advance, the actual emotion result of each correction text in each scene is also known; the actual emotion result of each correction text in each scene is obtained.
(3) Each correction text in each scene is input into the existing detection model to obtain the detected emotion result of each correction text in each scene.
(4) After the actual emotion result and the detected emotion result of each correction text in each scene are obtained, the two emotion results of each correction text in each scene are checked against each other. Specifically: the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotions are acquired, giving the first correction texts in the first scene; and the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotions are acquired, giving the second correction texts in the second scene.
(5) The existing detection model is adjusted according to the first correction texts in the first scene and the second correction texts in the second scene to obtain the preset detection model. Two adjustment methods are given below. The first: disregard the existing detection model and construct the preset detection model directly from the first correction texts in the first scene and the second correction texts in the second scene. The second: in the existing detection model, delete every text in every scene whose emotion result is a positive emotion but which does not satisfy the condition that both the actual emotion result and the detected emotion result are positive emotions; and delete every text in every scene whose emotion result is a negative emotion but which does not satisfy the condition that both the actual emotion result and the detected emotion result are negative emotions.
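A rough sketch of the second adjustment method follows; the dictionary representation of the model and of the correction-text sets is an assumption made for illustration:

```python
def correct_model(existing_model, first_texts, second_texts):
    """Sketch of the second adjustment method: keep a positive-emotion text only if it is among
    the first correction texts of its scene, and a negative-emotion text only if it is among
    the second correction texts of its scene.

    existing_model: {scene: {text: "positive" or "negative"}}
    first_texts:    {scene: set of texts whose actual and detected results are both positive}
    second_texts:   {scene: set of texts whose actual and detected results are both negative}
    """
    corrected = {}
    for scene, entries in existing_model.items():
        kept = {}
        for text, emotion in entries.items():
            verified = first_texts if emotion == "positive" else second_texts
            if text in verified.get(scene, set()):
                kept[text] = emotion        # verified entry, keep it
        if kept:
            corrected[scene] = kept         # unverified entries are deleted
    return corrected
```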
Therefore, the obtained character signal and the actual scene are input into the preset detection model, and the preliminary emotion result of the character signal in the actual scene, namely the preliminary emotion result of the corresponding voice signal in the actual scene, is obtained.
Through the correction, the detection precision of the preset detection model can be improved.
Step S8: fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user:
The obtained expression change trend and the preliminary emotion result are fused. Specifically: if the expression change trend is towards positive expressions and the preliminary emotion result is a positive emotion, the user's final emotion result is a positive emotion; if the expression change trend is towards negative expressions and the preliminary emotion result is a negative emotion, the user's final emotion result is a negative emotion.
As another embodiment, two weights may be set, and the final emotion result of the user is obtained by combining the expression change trend, the preliminary emotion result, and the corresponding weights.
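As a sketch of this weighted variant (the weights, scores, and threshold below are assumptions; the patent only states that the two results are combined with corresponding weights):

```python
def fuse(expression_trend, preliminary_emotion, trend_weight=0.5, emotion_weight=0.5):
    """Weighted fusion of the expression change trend and the preliminary emotion result."""
    trend_score = 1.0 if expression_trend == "towards_positive" else -1.0
    emotion_score = 1.0 if preliminary_emotion == "positive" else -1.0
    score = trend_weight * trend_score + emotion_weight * emotion_score
    # With equal weights a conflict gives score 0; a tie-breaking policy would be needed there.
    return "positive" if score > 0 else "negative"

print(fuse("towards_positive", "positive"))  # positive
```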
The present embodiment further provides an emotion detection apparatus based on speech recognition and image recognition, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the emotion detection method based on speech recognition and image recognition provided in the present embodiment. Therefore, the emotion detection device based on voice recognition and image recognition is a software device, and the essence of the emotion detection device is still an emotion detection method based on voice recognition and image recognition.
The above-mentioned embodiments are merely illustrative of the technical solutions of the present invention in a specific embodiment, and any equivalent substitutions and modifications or partial substitutions of the present invention without departing from the spirit and scope of the present invention should be covered by the claims of the present invention.

Claims (10)

1. An emotion detection method based on voice recognition and image recognition, characterized by comprising the following steps:
acquiring a self-shot video of a user to be detected and the actual scene corresponding to the self-shot video;
processing the self-shot video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the character expression in each image;
acquiring an expression change trend according to the character expressions in the images and the chronological order of the images;
performing voice recognition on the voice signal to obtain a corresponding character signal;
inputting the character signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.
2. The emotion detection method based on speech recognition and image recognition as claimed in claim 1, wherein fusing the expression change trend and the preliminary emotion result to obtain the final emotion result comprises:
if the expression change trend is towards positive expressions and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards negative expressions and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
3. The emotion detection method based on speech recognition and image recognition, as recited in claim 1, wherein the obtaining process of the preset detection model comprises:
acquiring at least two correction texts in each scene from at least two scenes;
acquiring actual emotion results of all correction texts in all scenes;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotions, to obtain the first correction texts in the first scene, and acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotions, to obtain the second correction texts in the second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
4. The emotion detection method based on speech recognition and image recognition, as claimed in claim 1, wherein the performing expression recognition on the at least two images to obtain the expression of the person in each image comprises:
carrying out user face recognition on the at least two images to obtain a user face image of the user;
and performing expression recognition on the face image of the user in each image to obtain the expression of the character in each image.
5. The emotion detection method based on speech recognition and image recognition, as claimed in claim 4, wherein said performing expression recognition on the face image of the user in each image to obtain the expression of the person in each image comprises:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a Flatten layer, processing the feature vector by the Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a full connection layer, mapping the one-dimensional feature vector to a feature mark space by the full connection layer, then outputting the feature mark space to a softmax function, outputting the probabilities of two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating the initial expression category and the labeling data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the user face images in the images into the expression recognition network to obtain the character expressions of the user face images in the images.
6. An emotion detection apparatus based on speech recognition and image recognition, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when executing the computer program, the processor implements the following steps of the emotion detection method based on speech recognition and image recognition:
acquiring a self-shot video of a user to be detected and the actual scene corresponding to the self-shot video;
processing the self-shot video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
performing expression recognition on the at least two images to obtain the character expression in each image;
acquiring an expression change trend according to the character expressions in the images and the chronological order of the images;
performing voice recognition on the voice signal to obtain a corresponding character signal;
inputting the character signal and the actual scene into a preset detection model to acquire a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain the final emotion result of the user.
7. The emotion detection device based on speech recognition and image recognition as claimed in claim 6, wherein fusing the expression change trend and the preliminary emotion result to obtain the final emotion result comprises:
if the expression change trend is towards positive expressions and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards negative expressions and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
8. The emotion detection apparatus based on speech recognition and image recognition, as recited in claim 6, wherein the preset detection model is obtained by:
acquiring at least two correction texts in each scene from at least two scenes;
acquiring actual emotion results of all correction texts in all scenes;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are positive emotions, to obtain the first correction texts in the first scene, and acquiring the correction texts in each scene for which both the actual emotion result and the detected emotion result are negative emotions, to obtain the second correction texts in the second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
9. The emotion detection device based on speech recognition and image recognition, as claimed in claim 6, wherein the performing expression recognition on the at least two images to obtain the expression of the person in each image comprises:
carrying out user face recognition on the at least two images to obtain a user face image of the user;
and performing expression recognition on the face image of the user in each image to obtain the expression of the character in each image.
10. The emotion detection device based on speech recognition and image recognition, as claimed in claim 9, wherein the performing expression recognition on the facial image of the user in each image to obtain the expression of the person in each image comprises:
acquiring a first sample set and a second sample set, wherein the first sample set comprises at least one positive expression sample image, and the second sample set comprises at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a Flatten layer, processing the feature vector by the Flatten layer to obtain a one-dimensional feature vector, using the one-dimensional feature vector as the input of a full connection layer, mapping the one-dimensional feature vector to a feature mark space by the full connection layer, then outputting the feature mark space to a softmax function, outputting the probabilities of two expression categories through the softmax function, and determining the corresponding initial expression category according to the output probabilities of the two expression categories;
calculating the initial expression category and the labeling data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the user face images in the images into the expression recognition network to obtain the character expressions of the user face images in the images.
CN202011213188.XA 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition Active CN112232276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213188.XA CN112232276B (en) 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011213188.XA CN112232276B (en) 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition

Publications (2)

Publication Number Publication Date
CN112232276A true CN112232276A (en) 2021-01-15
CN112232276B CN112232276B (en) 2023-10-13

Family

ID=74121979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213188.XA Active CN112232276B (en) 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition

Country Status (1)

Country Link
CN (1) CN112232276B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN112992148A (en) * 2021-03-03 2021-06-18 中国工商银行股份有限公司 Method and device for recognizing voice in video
CN114065742A (en) * 2021-11-19 2022-02-18 马上消费金融股份有限公司 Text detection method and device
CN118428343A (en) * 2024-07-03 2024-08-02 广州讯鸿网络技术有限公司 Full-media interactive intelligent customer service interaction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020125386A1 (en) * 2018-12-18 2020-06-25 深圳壹账通智能科技有限公司 Expression recognition method and apparatus, computer device, and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111681681A (en) * 2020-05-22 2020-09-18 深圳壹账通智能科技有限公司 Voice emotion recognition method and device, electronic equipment and storage medium
CN111694959A (en) * 2020-06-08 2020-09-22 谢沛然 Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020125386A1 (en) * 2018-12-18 2020-06-25 深圳壹账通智能科技有限公司 Expression recognition method and apparatus, computer device, and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111368609A (en) * 2018-12-26 2020-07-03 深圳Tcl新技术有限公司 Voice interaction method based on emotion engine technology, intelligent terminal and storage medium
CN111681681A (en) * 2020-05-22 2020-09-18 深圳壹账通智能科技有限公司 Voice emotion recognition method and device, electronic equipment and storage medium
CN111694959A (en) * 2020-06-08 2020-09-22 谢沛然 Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WENBIN ZHOU等: "Deep Learning-Based Emotion Recognition from Real-Time Videos", 《HCII 2020: HUMAN-COMPUTER INTERACTION. MULTIMODAL AND NATURAL INTERACTION》 *
CHEN Shizhe; WANG Shuai; JIN Qin: "Multimodal Emotion Recognition in Multi-cultural Scenarios" [多文化场景下的多模态情感识别], Journal of Software (软件学报), no. 04
RAO Yuan; WU Lianwei; WANG Yiming; FENG Cong: "Research Progress on Affective Computing Based on Semantic Analysis" [基于语义分析的情感计算技术研究进展], Journal of Software (软件学报), no. 08

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992148A (en) * 2021-03-03 2021-06-18 中国工商银行股份有限公司 Method and device for recognizing voice in video
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN114065742A (en) * 2021-11-19 2022-02-18 马上消费金融股份有限公司 Text detection method and device
CN114065742B (en) * 2021-11-19 2023-08-25 马上消费金融股份有限公司 Text detection method and device
CN118428343A (en) * 2024-07-03 2024-08-02 广州讯鸿网络技术有限公司 Full-media interactive intelligent customer service interaction method and system

Also Published As

Publication number Publication date
CN112232276B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN109658923B (en) Speech quality inspection method, equipment, storage medium and device based on artificial intelligence
US10438586B2 (en) Voice dialog device and voice dialog method
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
CN112232276B (en) Emotion detection method and device based on voice recognition and image recognition
CN112686048B (en) Emotion recognition method and device based on fusion of voice, semantics and facial expressions
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN108804526B (en) Interest determination system, interest determination method, and storage medium
CN106782603B (en) Intelligent voice evaluation method and system
CN109492221B (en) Information reply method based on semantic analysis and wearable equipment
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN110418204B (en) Video recommendation method, device, equipment and storage medium based on micro expression
CN112614510B (en) Audio quality assessment method and device
US11238289B1 (en) Automatic lie detection method and apparatus for interactive scenarios, device and medium
CN117765981A (en) Emotion recognition method and system based on cross-modal fusion of voice text
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN112597889A (en) Emotion processing method and device based on artificial intelligence
CN116912663A (en) Text-image detection method based on multi-granularity decoder
CN116257816A (en) Accompanying robot emotion recognition method, device, storage medium and equipment
CN115438725A (en) State detection method, device, equipment and storage medium
CN112434953A (en) Customer service personnel assessment method and device based on computer data processing
CN112951274A (en) Voice similarity determination method and device, and program product
bin Sham et al. Voice Pathology Detection System Using Machine Learning Based on Internet of Things
KR102480722B1 (en) Apparatus for recognizing emotion aware in edge computer environment and method thereof
CN111881330B (en) Automatic home service scene restoration method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230526

Address after: No. 16-44, No. 10A-10C, 12A, 12B, 13A, 13B, 15-18, Phase II of Wuyue Plaza Project, east of Zhengyang Street and south of Haoyue Road, Lvyuan District, Changchun City, Jilin Province, 130000

Applicant after: Jilin Huayuan Network Technology Co.,Ltd.

Address before: 450000 Wenhua Road, Jinshui District, Zhengzhou City, Henan Province

Applicant before: Zhao Zhen

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230913

Address after: Room 1001, 1st floor, building B, 555 Dongchuan Road, Minhang District, Shanghai

Applicant after: Shanghai Enterprise Information Technology Co.,Ltd.

Address before: No. 16-44, No. 10A-10C, 12A, 12B, 13A, 13B, 15-18, Phase II of Wuyue Plaza Project, east of Zhengyang Street and south of Haoyue Road, Lvyuan District, Changchun City, Jilin Province, 130000

Applicant before: Jilin Huayuan Network Technology Co.,Ltd.

GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion detection method and device based on speech recognition and image recognition

Granted publication date: 20231013

Pledgee: Agricultural Bank of China Limited Shanghai Huangpu Sub branch

Pledgor: Shanghai Enterprise Information Technology Co.,Ltd.

Registration number: Y2024310000041