CN112232276B - Emotion detection method and device based on voice recognition and image recognition - Google Patents

Emotion detection method and device based on voice recognition and image recognition

Info

Publication number
CN112232276B
CN112232276B
Authority
CN
China
Prior art keywords
expression
emotion
image
recognition
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011213188.XA
Other languages
Chinese (zh)
Other versions
CN112232276A (en)
Inventor
Zhao Zhen (赵珍)
Li Xiaoqiang (李小强)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Enterprise Information Technology Co ltd
Original Assignee
Shanghai Enterprise Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Enterprise Information Technology Co ltd filed Critical Shanghai Enterprise Information Technology Co ltd
Priority to CN202011213188.XA priority Critical patent/CN112232276B/en
Publication of CN112232276A publication Critical patent/CN112232276A/en
Application granted granted Critical
Publication of CN112232276B publication Critical patent/CN112232276B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an emotion detection method and device based on voice recognition and image recognition. A section of self-timer video of a user to be detected and the actual scene corresponding to that video are acquired; the self-timer video is processed to obtain an image signal and a voice signal; the image signal is processed to obtain an expression change trend; the voice signal is processed to obtain a preliminary emotion result of the voice signal in the actual scene; and finally the expression change trend and the preliminary emotion result are fused to obtain the user's final emotion result. Compared with manual detection, this emotion detection method based on voice recognition and image recognition is automatic and is not influenced by subjective factors, so detection accuracy is improved; no dedicated inspection personnel are required, so labor cost is reduced; and processing is faster — once the processing equipment is set up, multiple self-timer videos can be handled simultaneously, so efficiency is higher.

Description

Emotion detection method and device based on voice recognition and image recognition
Technical Field
The invention relates to an emotion detection method and device based on voice recognition and image recognition.
Background
In the past, when information processing technology was not well developed, the emotion of a speaker in a piece of video or speech was judged by a dedicated inspector from the speaker's expression, tone of voice, and related keywords appearing in the speech. This manual judgment has the following defects: (1) inspectors are easily influenced by subjective factors, which causes detection errors; (2) dedicated personnel must be specially arranged, which increases labor cost; (3) an inspector can only make a judgment after watching a whole video or listening to a whole recording, and can only judge one video or one recording at a time, so efficiency is very low.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an emotion detection method and device based on voice recognition and image recognition.
The invention adopts the following technical scheme:
an emotion detection method based on speech recognition and image recognition, comprising:
acquiring a section of self-timer video of a user to be detected and an actual scene corresponding to the self-timer video;
processing the self-timer video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
carrying out expression recognition on the at least two images to obtain the character expression in each image;
acquiring expression change trends according to the character expressions in each image and the sequence of the time of each image;
performing voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user.
Preferably, the fusing the expression change trend and the preliminary emotion result to obtain a final emotion result includes:
if the expression change trend is towards a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of obtaining the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring actual emotion results of each corrected text in each scene;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are positive emotion to obtain first correction texts in a first scene, and acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are negative emotion to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the character expression in each image includes:
performing user face recognition on the at least two images to obtain a user face image of the user;
and carrying out expression recognition on the face images of the users in each image to obtain the character expression in each image.
Preferably, the performing expression recognition on the face image of the user in each image to obtain the character expression in each image includes:
obtaining a first sample set and a second sample set, the first sample set comprising at least one positive expression sample image and the second sample set comprising at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is a positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is a negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a flat layer, processing the feature vector by the flat layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as input of a full connection layer, mapping the one-dimensional feature vector into a feature mark space by the full connection layer, outputting the feature vector to a softmax function, outputting probabilities of two expression categories through the softmax function, and determining corresponding initial expression categories according to the probabilities of the two output expression categories;
calculating the initial expression category and the annotation data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the face images of the users in the images into the expression recognition network to obtain the character expressions of the face images of the users in the images.
A speech recognition and image recognition based emotion detection device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of a speech recognition and image recognition based emotion detection method as follows when executing the computer program:
acquiring a section of self-timer video of a user to be detected and an actual scene corresponding to the self-timer video;
processing the self-timer video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
carrying out expression recognition on the at least two images to obtain the character expression in each image;
acquiring expression change trends according to the character expressions in each image and the sequence of the time of each image;
performing voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user.
Preferably, the fusing the expression change trend and the preliminary emotion result to obtain a final emotion result includes:
if the expression change trend is towards a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
Preferably, the process of obtaining the preset detection model includes:
acquiring at least two correction texts in each of at least two scenes;
acquiring actual emotion results of each corrected text in each scene;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are positive emotion to obtain first correction texts in a first scene, and acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are negative emotion to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
Preferably, the performing expression recognition on the at least two images to obtain the character expression in each image includes:
performing user face recognition on the at least two images to obtain a user face image of the user;
and carrying out expression recognition on the face images of the users in each image to obtain the character expression in each image.
Preferably, the performing expression recognition on the face image of the user in each image to obtain the character expression in each image includes:
obtaining a first sample set and a second sample set, the first sample set comprising at least one positive expression sample image and the second sample set comprising at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is a positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is a negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a flat layer, processing the feature vector by the flat layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as input of a full connection layer, mapping the one-dimensional feature vector into a feature mark space by the full connection layer, outputting the feature vector to a softmax function, outputting probabilities of two expression categories through the softmax function, and determining corresponding initial expression categories according to the probabilities of the two output expression categories;
calculating the initial expression category and the annotation data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the face images of the users in the images into the expression recognition network to obtain the character expressions of the face images of the users in the images.
The beneficial effects of the invention are as follows: the user's self-timer video is subjected to image processing and voice processing separately. The image processing obtains a number of character expressions, and an expression change trend is obtained from these expressions and their order in time. The voice signal of the self-timer video is subjected to voice recognition to obtain a text signal; the text signal and the actual scene are input into a preset detection model to obtain the preliminary emotion result of the voice signal in the actual scene, so the actual scene is applied to emotion detection, which improves detection accuracy. Finally, the expression change trend and the preliminary emotion result are fused to obtain the user's final emotion result. The emotion detection method based on voice recognition and image recognition is therefore an automatic detection method that processes the video from two aspects — expression recognition on the images, and voice recognition on the speech with an emotion result obtained from the text signal and the actual scene — and then fuses the two kinds of information into a final emotion result. Compared with manual detection, it is not influenced by subjective factors, so detection accuracy is improved; no dedicated inspection personnel are required, so labor cost is reduced; and processing is faster — once the processing equipment is set up, multiple self-timer videos can be handled simultaneously, so efficiency is higher.
Drawings
Fig. 1 is a flow chart of an emotion detection method based on speech recognition and image recognition.
Detailed Description
This embodiment provides an emotion detection method based on voice recognition and image recognition. The hardware execution subject of the method may be a computer device, a server device, an intelligent mobile terminal, or the like; the embodiment does not specifically limit the hardware execution subject.
As shown in fig. 1, the emotion detection method based on voice recognition and image recognition includes:
step S1: acquiring a self-timer video of a section of user to be detected and an actual scene corresponding to the self-timer video:
the user transmits the self-timer video to the hardware execution main body, and the duration of the self-timer video is set according to the actual requirement, for example, the self-timer video can be a short video within 30s or a longer video of 2-3 minutes. The user also transmits the actual scene of the self-timer video to the hardware execution body, where the actual scene may refer to the environment or the place where the self-timer video is located, for example: at home, or in operation, or in other public places, such as: KTV, supermarket, restaurant, etc. The actual scene is acquired and applied to emotion detection because, in the case where the same data is contained in the video, the emotion in different scenes may be different.
Step S2: processing the self-timer video to obtain an image signal and a voice signal:
The self-timer video is decomposed to obtain an image signal and a voice signal. The image signal is the video data without sound, i.e. only the images; since it comes from the user's self-timer video, it is understood to contain face images of the user. The voice signal is the sound in the self-timer video, specifically what the user says in the video.
Since the process of decomposing the video file into the image signal and the sound signal belongs to the conventional technology, the description thereof will not be repeated.
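As an illustration only — a minimal sketch assuming the ffmpeg command-line tool is installed; the output file names and the 16 kHz mono PCM audio format are assumptions, not part of the invention — the separation could be performed as follows:

```python
import subprocess

def split_selfie_video(video_path: str):
    """Split a self-timer video into a silent image signal and a voice signal."""
    silent_video = "image_signal.mp4"   # video frames only, audio stripped
    audio_track = "voice_signal.wav"    # 16 kHz mono PCM, convenient for speech recognition

    # -an drops the audio stream, keeping only the picture.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy", silent_video],
                   check=True)
    # -vn drops the video stream; resample the sound to 16 kHz mono PCM.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le", audio_track],
                   check=True)
    return silent_video, audio_track
```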
Step S3: screenshot processing is carried out on the image signals according to a preset period, and at least two images are obtained:
and performing screenshot processing on the image signal according to a preset period to obtain at least two images. The preset period is set by actual requirements, and the longer the preset period is, the fewer the acquired images are. It should be appreciated that, due to the self-timer video, each image obtained includes a face image of the user.
Step S4: performing expression recognition on the at least two images to obtain the character expression in each image:
user face recognition is firstly carried out on each image in at least two images, and user face images of users in each image are obtained.
Then, expression recognition is performed on the user's face image in each image to obtain the character expression in each image. As a specific implementation, an expression recognition process is given as follows:
and inputting the face images of the users in the images into an expression recognition network to obtain the expressions of the users in the images. The expression recognition network can be trained by adopting the following training process:
a first sample set including at least one positive expression sample image and a second sample set including at least one negative expression sample image are acquired. The facial expression sample image refers to a sample image with a character expression as a facial expression, and the facial expression is particularly happy, happy and the like; the negative expression sample image refers to a sample image in which the character expression is a negative expression, and the negative expression is specifically a heart injury, crying, difficulty and the like.
Labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is a positive expression, and labeling each negative expression sample image in the second sample set to obtain a second expression category, and the second expression category is a negative expression. That is, the expression categories of the annotation are divided into two types, and different expression categories can be represented by different indexes, wherein index 0 corresponds to a positive expression, index 1 corresponds to a negative expression, and the annotation can be subjected to one-hot coding. The first expression category and the second expression category constitute annotation data.
The expression recognition network includes an expression recognition encoder, a flatten layer, a fully connected layer, and a softmax function.
The first sample set and the second sample set are input into the expression recognition encoder for feature extraction; the encoder outputs feature vectors (for example, describing the angle of the mouth corners). The feature vectors are passed to the flatten layer, which processes them into one-dimensional feature vectors. These one-dimensional feature vectors are taken as the input of the fully connected layer, which maps them into the feature label space and outputs them to the softmax function. The softmax function outputs the probabilities of the two expression categories, which sum to 1, and the corresponding initial expression category is determined from these probabilities.
A cross-entropy loss is then computed between the obtained initial expression category and the annotation data, and the parameters of the expression recognition network are optimized so that the output expression category gradually approaches the true value.
At inference time, the user face image in each image is input into the expression recognition network for expression recognition: the face image is fed into the expression recognition encoder for feature extraction, the encoder's feature vector is processed by the flatten layer into a one-dimensional feature vector, the one-dimensional feature vector is taken as the input of the fully connected layer, which maps it into the feature label space and outputs it to the softmax function, and the softmax function outputs the corresponding expression category, which is either a positive expression or a negative expression.
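A minimal sketch of such a network in PyTorch is given below for illustration; the toy convolutional encoder, layer sizes, and optimizer are assumptions, and only the encoder, flatten layer, fully connected layer, softmax, and cross-entropy training described above come from the embodiment.

```python
import torch
import torch.nn as nn

class ExpressionRecognitionNet(nn.Module):
    """Encoder -> flatten layer -> fully connected layer -> softmax over {positive, negative}."""
    def __init__(self):
        super().__init__()
        # Small convolutional encoder (illustrative; any feature extractor could stand in here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.flatten = nn.Flatten()                    # the "flatten layer"
        self.fc = nn.Linear(32 * 8 * 8, 2)             # maps features to the two expression categories

    def forward(self, x):
        return self.fc(self.flatten(self.encoder(x)))  # raw logits; softmax applied outside

net = ExpressionRecognitionNet()
criterion = nn.CrossEntropyLoss()                      # combines log-softmax with cross-entropy
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

def train_step(face_batch, labels):
    """face_batch: (N, 3, H, W) face crops; labels: LongTensor, 0 = positive, 1 = negative."""
    optimizer.zero_grad()
    loss = criterion(net(face_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_expression(face_image):
    """Return 'positive' or 'negative' plus the two softmax probabilities (which sum to 1)."""
    with torch.no_grad():
        probs = torch.softmax(net(face_image.unsqueeze(0)), dim=1)[0]
    return ("positive" if probs[0] >= probs[1] else "negative"), probs.tolist()
```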
Step S5: according to the character expressions in each image and the sequence of the time of each image, acquiring expression change trend:
because each image is obtained according to a preset period, each image has a sequence of time, and the sequence of time is the time sequence of the self-timer video in the playing process. Then, after the character expressions in each image are obtained, the expression change trend is obtained according to the sequence of the time of each image. The expression change trend is in which direction the expression is developed, i.e. towards a positive expression or towards a negative expression.
Development towards a positive expression covers two cases: the expression changes to a positive expression (e.g. from a negative expression to a positive one) or is positive throughout. Similarly, development towards a negative expression covers two cases: the expression changes to a negative expression (e.g. from a positive expression to a negative one) or is negative throughout.
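One simple reading of this rule, keyed on the label of the final screenshot, could be sketched as follows (the exact decision rule is an assumption, since the embodiment only names the two directions):

```python
def expression_change_trend(expressions):
    """expressions: per-screenshot labels in time order, e.g. ['negative', 'negative', 'positive'].

    'towards positive' covers both cases in the text: the expression turns positive over time
    or was positive throughout; 'towards negative' is the mirror image."""
    if not expressions:
        return None
    return "towards positive" if expressions[-1] == "positive" else "towards negative"
```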
Step S6: performing voice recognition on the voice signal to obtain a corresponding text signal:
and performing voice recognition on the voice signal to obtain a text signal corresponding to the voice signal, namely converting the voice signal into the text signal. Since the voice recognition algorithm belongs to the conventional algorithm, the description thereof is omitted.
Step S7: inputting the text signal and the actual scene into a preset detection model, and obtaining a preliminary emotion result of the voice signal in the actual scene:
after the text signal is obtained, inputting the text signal and an actual scene corresponding to the self-timer video into a preset detection model, and obtaining a preliminary emotion result of the voice signal in the actual scene.
It should be appreciated that the preset detection model may be a detection model constructed in advance. It includes at least two scenes, at least two texts in each scene, and the emotion result corresponding to each text in each scene. To improve detection accuracy, the number of scenes in the detection model and the number of texts in each scene should be large enough — ideally every currently known scene, and the texts that can occur or be produced in each scene, are included in the model. Since scene and text are independent, the preset detection model can equivalently be said to include at least two texts and the emotion result corresponding to each text in each of at least two scenes. Moreover, to improve detection accuracy, each text in the preset detection model may be a keyword rather than a complete sentence; for example, for the complete sentence "I don't want to do this anymore", the keyword may be "don't want to do this".
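The embodiment does not fix a data structure for the preset detection model; one minimal lookup-table sketch consistent with the description — every scene, keyword, and emotion entry below is an invented example — might be:

```python
# scene -> {keyword -> emotion result}; all entries are illustrative, not taken from the patent.
PRESET_DETECTION_MODEL = {
    "at work": {
        "don't want to do this": "negative",
        "got a promotion": "positive",
    },
    "at home": {
        "so tired": "negative",
        "great dinner": "positive",
    },
}

def preliminary_emotion(text_signal: str, actual_scene: str):
    """Return the preliminary emotion result of the text signal in the actual scene, if a keyword matches."""
    for keyword, emotion in PRESET_DETECTION_MODEL.get(actual_scene, {}).items():
        if keyword in text_signal:
            return emotion
    return None    # no keyword of this scene appears in the text
```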
The detection model referred to in the previous paragraph may be an existing detection model. As a specific embodiment, the preset detection model is obtained by correcting an existing detection model, and the acquisition procedure is as follows:
(1) At least two correction texts in each of at least two scenes are acquired. To improve the reliability of the correction, and thus the accuracy of the preset detection model, the number of scenes acquired in this step may be made sufficiently large, and likewise the number of correction texts in each scene.
(2) Since the correction texts are texts used to correct the existing detection model, the actual emotion result of each correction text in each scene is known, and these actual emotion results are acquired.
(3) And inputting each correction text in each scene into the existing detection model to obtain a detection emotion result of each correction text in each scene.
(4) Once the actual emotion result and the detected emotion result of each correction text in each scene are available, the two results are compared. Specifically, the correction texts in each scene for which both the actual emotion result and the detected emotion result are a positive emotion are collected as the first correction texts in a first scene, and the correction texts in each scene for which both results are a negative emotion are collected as the second correction texts in a second scene.
(5) The existing detection model is adjusted according to the first correction texts in the first scene and the second correction texts in the second scene to obtain the preset detection model. Two adjustment methods are given. In the first, the existing detection model is discarded and the preset detection model is built directly from the first correction texts in the first scene and the second correction texts in the second scene. In the second, among the texts whose emotion result in the existing detection model is a positive emotion, every text in every scene that does not satisfy the condition that both the actual emotion result and the detected emotion result are positive is deleted; likewise, among the texts whose emotion result in the existing detection model is a negative emotion, every text in every scene that does not satisfy the condition that both the actual and detected emotion results are negative is deleted.
Therefore, the obtained text signal and the actual scene are input into a preset detection model, and a preliminary emotion result of the text signal in the actual scene, namely a preliminary emotion result of the corresponding voice signal in the actual scene, is obtained.
Through correction, the detection precision of a preset detection model can be improved.
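Continuing the lookup-table sketch above, the second adjustment method — keeping only the (scene, text) entries whose detected emotion result agrees with the known actual emotion result — could be sketched as follows (entries without a matching correction text are dropped here for simplicity, which is a stricter reading than the text requires):

```python
def correct_detection_model(existing_model, corrections):
    """existing_model: scene -> {text -> detected emotion result}.
    corrections: iterable of (scene, correction_text, actual_emotion_result) tuples.

    Keeps a (scene, text) entry only when the existing model's detected result
    agrees with the known actual result; disagreeing entries are deleted."""
    actual = {(scene, text): emotion for scene, text, emotion in corrections}
    preset_model = {}
    for scene, texts in existing_model.items():
        for text, detected in texts.items():
            if actual.get((scene, text)) == detected:
                preset_model.setdefault(scene, {})[text] = detected
    return preset_model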
Step S8: and fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user:
The obtained expression change trend and the preliminary emotion result are fused. Specifically: if the expression change trend is towards a positive expression and the preliminary emotion result is a positive emotion, the user's final emotion result is a positive emotion; if the expression change trend is towards a negative expression and the preliminary emotion result is a negative emotion, the user's final emotion result is a negative emotion.
As other embodiments, two weights may be further set, and the final emotion result of the user may be obtained by combining the expression change trend with the preliminary emotion result and the corresponding weights.
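A sketch of this fusion rule, including the weighted variant just mentioned (the equal default weights and the handling of a tie are assumptions):

```python
def fuse(expression_trend, preliminary_result, w_image: float = 0.5, w_speech: float = 0.5):
    """Fuse the expression change trend with the preliminary emotion result.

    When both modalities agree, the final result follows them directly; otherwise the
    weighted score decides (one simple reading of the weighted embodiment)."""
    image_vote = 1 if expression_trend == "towards positive" else -1
    speech_vote = 1 if preliminary_result == "positive" else -1
    if image_vote == speech_vote:
        return "positive" if image_vote > 0 else "negative"
    score = w_image * image_vote + w_speech * speech_vote
    if score == 0:
        return None          # undecided when the weights tie exactly
    return "positive" if score > 0 else "negative"
```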
This embodiment also provides an emotion detection device based on voice recognition and image recognition, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of the emotion detection method based on voice recognition and image recognition provided by this embodiment. The substance of the device is therefore the emotion detection method based on voice recognition and image recognition itself, whose detailed description is given above and is not repeated.
The foregoing example illustrates the technical solution of the present invention with only one specific embodiment; any equivalent replacement, modification, or partial replacement that does not depart from the spirit and scope of the present invention shall be covered by the scope of the claims of the present invention.

Claims (8)

1. An emotion detection method based on speech recognition and image recognition, comprising:
acquiring a section of self-timer video of a user to be detected and an actual scene corresponding to the self-timer video;
processing the self-timer video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
carrying out expression recognition on the at least two images to obtain the character expression in each image;
acquiring expression change trends according to the character expressions in each image and the sequence of the time of each image;
performing voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user;
the acquisition process of the preset detection model comprises the following steps:
acquiring at least two correction texts in each of at least two scenes;
acquiring actual emotion results of each corrected text in each scene;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are positive emotion to obtain first correction texts in a first scene, and acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are negative emotion to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
2. The emotion detection method based on speech recognition and image recognition according to claim 1, wherein said fusing the expression change trend and the preliminary emotion result to obtain a final emotion result includes:
if the expression change trend is towards a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
3. The emotion detection method based on voice recognition and image recognition according to claim 1, wherein said performing expression recognition on the at least two images to obtain a character expression in each image comprises:
performing user face recognition on the at least two images to obtain a user face image of the user;
and carrying out expression recognition on the face images of the users in each image to obtain the character expression in each image.
4. The emotion detection method based on voice recognition and image recognition of claim 3, wherein said performing expression recognition on the face image of the user in each image to obtain the character expression in each image comprises:
obtaining a first sample set and a second sample set, the first sample set comprising at least one positive expression sample image and the second sample set comprising at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is a positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is a negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a flat layer, processing the feature vector by the flat layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as input of a full connection layer, mapping the one-dimensional feature vector into a feature mark space by the full connection layer, outputting the feature vector to a softmax function, outputting probabilities of two expression categories through the softmax function, and determining corresponding initial expression categories according to the probabilities of the two output expression categories;
calculating the initial expression category and the annotation data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the face images of the users in the images into the expression recognition network to obtain the character expressions of the face images of the users in the images.
5. A speech recognition and image recognition based emotion detection device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of a speech recognition and image recognition based emotion detection method as follows:
acquiring a section of self-timer video of a user to be detected and an actual scene corresponding to the self-timer video;
processing the self-timer video to obtain an image signal and a voice signal;
performing screenshot processing on the image signal according to a preset period to obtain at least two images;
carrying out expression recognition on the at least two images to obtain the character expression in each image;
acquiring expression change trends according to the character expressions in each image and the sequence of the time of each image;
performing voice recognition on the voice signal to obtain a corresponding text signal;
inputting the text signal and the actual scene into a preset detection model, and acquiring a preliminary emotion result of the voice signal in the actual scene;
fusing the expression change trend and the preliminary emotion result to obtain a final emotion result of the user;
the acquisition process of the preset detection model comprises the following steps:
acquiring at least two correction texts in each of at least two scenes;
acquiring actual emotion results of each corrected text in each scene;
inputting each correction text in each scene into an existing detection model to obtain a detection emotion result of each correction text in each scene;
acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are positive emotion to obtain first correction texts in a first scene, and acquiring correction texts in each scene of which the actual emotion result and the detected emotion result are negative emotion to obtain second correction texts in a second scene;
and adjusting the existing detection model according to each first correction text in the first scene and each second correction text in the second scene to obtain the preset detection model.
6. The emotion detection device based on speech recognition and image recognition of claim 5, wherein the fusing the expression change trend and the preliminary emotion result to obtain a final emotion result includes:
if the expression change trend is towards a positive expression and the preliminary emotion result is a positive emotion, the final emotion result is a positive emotion;
and if the expression change trend is towards a negative expression and the preliminary emotion result is a negative emotion, the final emotion result is a negative emotion.
7. The emotion detection device based on speech recognition and image recognition of claim 5, wherein performing expression recognition on the at least two images to obtain a character expression in each image comprises:
performing user face recognition on the at least two images to obtain a user face image of the user;
and carrying out expression recognition on the face images of the users in each image to obtain the character expression in each image.
8. The emotion detection device based on voice recognition and image recognition of claim 7, wherein performing expression recognition on the face image of the user in each image to obtain the character expression in each image comprises:
obtaining a first sample set and a second sample set, the first sample set comprising at least one positive expression sample image and the second sample set comprising at least one negative expression sample image;
labeling each positive expression sample image in the first sample set to obtain a first expression category, wherein the first expression category is a positive expression, labeling each negative expression sample image in the second sample set to obtain a second expression category, wherein the second expression category is a negative expression, and the first expression category and the second expression category form labeling data;
inputting the first sample set and the second sample set into an expression recognition encoder for feature extraction, inputting a feature vector output by the expression recognition encoder into a flat layer, processing the feature vector by the flat layer to obtain a one-dimensional feature vector, taking the one-dimensional feature vector as input of a full connection layer, mapping the one-dimensional feature vector into a feature mark space by the full connection layer, outputting the feature vector to a softmax function, outputting probabilities of two expression categories through the softmax function, and determining corresponding initial expression categories according to the probabilities of the two output expression categories;
calculating the initial expression category and the annotation data through a cross entropy loss function, and optimizing parameters in an expression recognition network;
and inputting the face images of the users in the images into the expression recognition network to obtain the character expressions of the face images of the users in the images.
CN202011213188.XA 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition Active CN112232276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213188.XA CN112232276B (en) 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011213188.XA CN112232276B (en) 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition

Publications (2)

Publication Number Publication Date
CN112232276A CN112232276A (en) 2021-01-15
CN112232276B (en) 2023-10-13

Family

ID=74121979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213188.XA Active CN112232276B (en) 2020-11-04 2020-11-04 Emotion detection method and device based on voice recognition and image recognition

Country Status (1)

Country Link
CN (1) CN112232276B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992148A (en) * 2021-03-03 2021-06-18 中国工商银行股份有限公司 Method and device for recognizing voice in video
CN112990301A (en) * 2021-03-10 2021-06-18 深圳市声扬科技有限公司 Emotion data annotation method and device, computer equipment and storage medium
CN114065742B (en) * 2021-11-19 2023-08-25 马上消费金融股份有限公司 Text detection method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020125386A1 (en) * 2018-12-18 2020-06-25 深圳壹账通智能科技有限公司 Expression recognition method and apparatus, computer device, and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111681681A (en) * 2020-05-22 2020-09-18 深圳壹账通智能科技有限公司 Voice emotion recognition method and device, electronic equipment and storage medium
CN111694959A (en) * 2020-06-08 2020-09-22 谢沛然 Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020125386A1 (en) * 2018-12-18 2020-06-25 深圳壹账通智能科技有限公司 Expression recognition method and apparatus, computer device, and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN111368609A (en) * 2018-12-26 2020-07-03 深圳Tcl新技术有限公司 Voice interaction method based on emotion engine technology, intelligent terminal and storage medium
CN111681681A (en) * 2020-05-22 2020-09-18 深圳壹账通智能科技有限公司 Voice emotion recognition method and device, electronic equipment and storage medium
CN111694959A (en) * 2020-06-08 2020-09-22 谢沛然 Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Deep Learning-Based Emotion Recognition from Real-Time Videos; Wenbin Zhou et al.; HCII 2020: Human-Computer Interaction. Multimodal and Natural Interaction; full text *
Research Progress of Affective Computing Based on Semantic Analysis (基于语义分析的情感计算技术研究进展); Rao Yuan, Wu Lianwei, Wang Yiming, Feng Cong; Journal of Software (软件学报), No. 8; full text *
Multimodal Emotion Recognition in Multi-cultural Scenarios (多文化场景下的多模态情感识别); Chen Shizhe, Wang Shuai, Jin Qin; Journal of Software (软件学报), No. 4; full text *

Also Published As

Publication number Publication date
CN112232276A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN112232276B (en) Emotion detection method and device based on voice recognition and image recognition
CN107818798B (en) Customer service quality evaluation method, device, equipment and storage medium
US10438586B2 (en) Voice dialog device and voice dialog method
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN110428820B (en) Chinese and English mixed speech recognition method and device
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN110418204B (en) Video recommendation method, device, equipment and storage medium based on micro expression
CN112614489A (en) User pronunciation accuracy evaluation method and device and electronic equipment
CN109872714A (en) A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN111413877A (en) Method and device for controlling household appliance
CN114495217A (en) Scene analysis method, device and system based on natural language and expression analysis
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN109408175B (en) Real-time interaction method and system in general high-performance deep learning calculation engine
CN110910898B (en) Voice information processing method and device
CN110827799A (en) Method, apparatus, device and medium for processing voice signal
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN112597889A (en) Emotion processing method and device based on artificial intelligence
CN112434953A (en) Customer service personnel assessment method and device based on computer data processing
CN112584238A (en) Movie and television resource matching method and device and smart television
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN112614510B (en) Audio quality assessment method and device
CN114267324A (en) Voice generation method, device, equipment and storage medium
CN114297409A (en) Model training method, information extraction method and device, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230526

Address after: No. 16-44, No. 10A-10C, 12A, 12B, 13A, 13B, 15-18, Phase II of Wuyue Plaza Project, east of Zhengyang Street and south of Haoyue Road, Lvyuan District, Changchun City, Jilin Province, 130000

Applicant after: Jilin Huayuan Network Technology Co.,Ltd.

Address before: 450000 Wenhua Road, Jinshui District, Zhengzhou City, Henan Province

Applicant before: Zhao Zhen

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230913

Address after: Room 1001, 1st floor, building B, 555 Dongchuan Road, Minhang District, Shanghai

Applicant after: Shanghai Enterprise Information Technology Co.,Ltd.

Address before: No. 16-44, No. 10A-10C, 12A, 12B, 13A, 13B, 15-18, Phase II of Wuyue Plaza Project, east of Zhengyang Street and south of Haoyue Road, Lvyuan District, Changchun City, Jilin Province, 130000

Applicant before: Jilin Huayuan Network Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An emotion detection method and device based on speech recognition and image recognition

Granted publication date: 20231013

Pledgee: Agricultural Bank of China Limited Shanghai Huangpu Sub branch

Pledgor: Shanghai Enterprise Information Technology Co.,Ltd.

Registration number: Y2024310000041

PE01 Entry into force of the registration of the contract for pledge of patent right