CN113380271B - Emotion recognition method, system, device and medium - Google Patents

Emotion recognition method, system, device and medium

Info

Publication number: CN113380271B
Application number: CN202110922781.XA
Authority: CN (China)
Prior art keywords: user, detected, voice, data, emotion
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113380271A
Inventors: 姚娟娟 (Yao Juanjuan), 钟南山 (Zhong Nanshan)
Current Assignee: Shanghai Mingping Medical Data Technology Co., Ltd.
Original Assignee: Mingpinyun (Beijing) Data Technology Co., Ltd.
Priority date / Filing date: 2021-08-12
Application filed by Mingpinyun (Beijing) Data Technology Co., Ltd.
Publication of CN113380271A: 2021-09-10
Publication of CN113380271B (application granted): 2021-12-21

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: characterised by the analysis technique using neural networks
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use for comparison or discrimination
    • G10L25/63: specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention provides an emotion recognition method, system, device and medium, comprising the following steps: acquiring voice data and face data of a user to be detected; processing the voice data with a speech recognition technology to generate text data; analyzing the text data to obtain the emotional characteristics of the user to be detected, extracting the tone characteristics of the user to be detected from the voice data, and extracting the expression characteristics of the user to be detected from the face data; constructing, based on a distilled neural network, an emotion recognition model for recognizing the emotion of the user to be detected; and inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected. Compared with existing emotion recognition based only on the user's text content, the present method performs emotion recognition from the multiple dimensions of tone, expression and emotion, which greatly improves the accuracy of emotion recognition.

Description

Emotion recognition method, system, device and medium
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an emotion recognition method, system, device and medium.
Background
With the development of social media, people can anonymously express their emotions on social platforms such as Twitter, microblogs and internet forums. The information posted on these platforms can be tracked and observed as indicators for diagnosing psychological illness, and the emotions of users can be identified through text detection.
However, conventional approaches recognize emotion from text content alone, marking positive or negative words to compute a final emotion score, and such text-only analysis cannot accurately detect the user's actual emotion. A method for recognizing emotion with high accuracy is therefore needed.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention provides a method, a system, a device and a medium for emotion recognition, which are used to solve the problem of low emotion recognition accuracy when recognizing emotion based on text content in the prior art.
To achieve the above and other related objects, a first aspect of the present invention provides an emotion recognition method, including:
acquiring voice data and face data of a user to be detected;
processing the voice data by utilizing a voice recognition technology to generate text data;
analyzing the text data to obtain emotional characteristics of the user to be detected, extracting the tone characteristics of the user to be detected in the voice data, and extracting the expression characteristics of the user to be detected in the face data;
constructing an emotion recognition model for recognizing emotion of a user to be detected based on a distilled neural network;
and inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected.
In an embodiment of the first aspect, the method further includes:
acquiring voice data of a current multi-person conversation;
processing the voice data formed by the multi-person conversation by utilizing a voice recognition technology to generate corresponding text data;
extracting time sequence information of each statement in the voice data;
detecting the voice characteristics of each speaker, and marking by combining the voice characteristics and the time sequence information to distinguish the speakers corresponding to the sentences in the text data;
recognizing the text data corresponding to the multi-person conversation by using a natural language processing technology to obtain the text content of the user to be detected; meanwhile, identifying the voice data of the user to be detected in the multi-person conversation according to the mark;
and extracting the emotional characteristics of the text data corresponding to the user to be detected, and extracting the tone characteristics in the voice data corresponding to the user to be detected.
In an embodiment of the first aspect, the step of processing the voice data formed by the multi-person conversation by using voice recognition to generate corresponding text data includes:
constructing a speech character matching model base, and training an RNN-T speech recognition model based on the model base;
and converting the voice data into text data by using the trained RNN-T voice recognition model.
In an embodiment of the first aspect, the step of extracting timing information of each statement in the speech data includes:
the method comprises the steps of obtaining voice data of a speaker and lip image data corresponding to the voice data, wherein the lip image data comprise lip image sequences of all speakers related to the voice data of the speaker, and determining time sequence information of all sentences in the voice data according to content identified by the lip image sequences.
In an embodiment of the first aspect, the method further includes:
intercepting a voice feature set to be detected with accumulated time as a preset time threshold from the voice data according to a time sequence to obtain a plurality of voice feature sets to be detected, clustering each voice feature set to be detected, scoring the clusters, and obtaining voice features of different speakers according to the scoring result;
marking and distinguishing each sentence in the text data according to the voice characteristics and the time sequence characteristics to obtain the sentence contents corresponding to different speakers in the text data;
recognizing each sentence field in the text data by using a natural language processing technology, judging the semantics of each sentence by combining context, obtaining the content of the text data of the user to be detected according to the semantics and the mark of each sentence, and obtaining the emotional characteristic of the text corresponding to the user to be detected;
and recognizing voice data corresponding to the user to be detected according to the voice characteristics and the time sequence characteristics, and extracting the tone characteristics belonging to the user to be detected in the voice data.
In an embodiment of the first aspect, the step of constructing an emotion recognition model for recognizing emotion of a user to be detected based on a distilled neural network includes:
forming a training set by the preprocessed emotional characteristics, tone characteristics and expression characteristics;
training an emotion recognition model by using the training set based on a distillation neural network, wherein parameters of the network are optimized by adopting a back propagation algorithm and a distillation loss function, and the distillation neural network is composed of a plurality of neural networks;
and combining the prediction labels of the plurality of neural networks, training the combined labels by using the one-dimensional convolutional neural network, and distributing different weights to the plurality of neural networks to obtain an integrated decision of the emotion recognition model.
In an embodiment of the first aspect, before inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition, the method further includes: preprocessing the emotional characteristics, the tone characteristics and the expression characteristics to obtain a characteristic vector with preset dimensionality.
A second aspect of the present invention provides an emotion recognition system, comprising:
the data acquisition module is used for acquiring voice data and face data of a user to be detected;
the voice recognition module is used for processing the voice data by utilizing a voice recognition technology to generate text data;
the feature extraction module is used for analyzing the text data to obtain emotional features of the user to be detected, extracting the tone features of the user to be detected in the voice data and extracting the expression features of the user to be detected in the face data;
the model construction module is used for constructing an emotion recognition model for recognizing the emotion of the user to be detected based on the distilled neural network;
and the emotion recognition module is used for inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected.
A third aspect of the present invention provides an emotion recognition apparatus comprising:
one or more processing devices;
a memory for storing one or more programs; when the one or more programs are executed by the one or more processing devices, the one or more processing devices are caused to implement the emotion recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, causes a computer to perform the emotion recognition method described above.
As described above, the emotion recognition method, system, device and medium according to the present invention have the following advantages:
According to the emotion recognition method and device of the present invention, the emotion recognition model constructed based on the distilled neural network is obtained by training on the tone features in the voice data of the user to be detected, the facial expression features in the face image and the emotional features in the text of the user to be detected; recognizing emotion from these multiple dimensions greatly improves the accuracy of emotion recognition.
Drawings
FIG. 1 shows a flow chart of a method for emotion recognition provided by the present invention;
FIG. 2 is a block diagram of an emotion recognition system provided by the present invention;
fig. 3 shows a schematic structural diagram of an emotion recognition apparatus provided in the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The invention addresses the following problem in the prior art. At the doctor's consultation site, outpatient doctors work in their own consulting rooms and, under ordinary circumstances, assess a patient's physical condition by questioning, observing and examining the patient, while family members or the patient pay close attention to the diagnosis result or treatment effect. At that moment the doctor usually focuses on the patient's condition without paying attention to the patient's mood, and many doctors, having no training in psychology, cannot recognize the current mood of the patient. The patient's emotion therefore cannot be recognized during the consultation. On the one hand, a poor emotional state not only disturbs the patient psychologically or emotionally but also hinders the patient's physical recovery; on the other hand, if the patient is in a violent emotional state or suffering from depression, extreme incidents such as self-injury, suicide or injury to medical staff can easily occur during diagnosis and treatment, which in turn can cause unnecessary medical disputes. Besides the doctor consultation scenario, the present embodiment can also be applied to other hospital settings, such as doctors and nurses visiting patients in the ward, or record-keeping in the operating room.
Referring to fig. 1, a flowchart of a method for emotion recognition provided by the present invention includes:
step S1, acquiring voice data and face data of a user to be detected;
acquiring the user's voice data and face data from a target video: for example, a camera or video device is installed at a preset site to capture a video containing the voice data and face data of the target user, and the voice data and face data of the user to be detected are then obtained by separating the audio and image streams of the captured target video.
It should be noted that, ideally, the target video contains only the voice data and face data of the user to be detected; in practical applications, however, the target video in most cases also contains other conversation participants besides the user to be detected.
Step S2, processing the voice data by using a voice recognition technology to generate text data;
in particular, the voice data is processed with a speech recognition system, for example a system comprising one or more computers programmed to receive speech data input from a user, determine a transcription of the speech data, and output the transcribed text.
Step S3, analyzing the text data to obtain the emotional characteristics of the user to be detected, extracting the tone characteristics of the user to be detected in the voice data, and extracting the expression characteristics of the user to be detected in the face data;
the emotion categories are classified into three major categories, positive (positive), negative (negative), and neutral (neutral). For example, the positive emotion represents a positive emotion of a person, and represents states of happiness, optimism, confidence, appreciation, relaxation and the like expressed in the face image; the negative emotion represents a negative emotion of a person, and psychology refers to emotions which are unfavorable for mind and body, such as anxiety, tension, anger, depression, sadness, pain and the like, and are collectively called negative emotions; neutral emotions represent categories of emotions that are not disoriented, without any emotional coloration. The emotional characteristics, the tone characteristics and the expression characteristics belong to three main emotion categories, and each characteristic can correspond to a subclass of emotions of a main class.
Optionally, the emotional characteristics, tone characteristics and expression characteristics are preprocessed into feature vectors of preset dimensions. Preprocessing the features into vectors of a preset specification greatly optimizes the training set and facilitates subsequent model training.
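A minimal sketch of this kind of preprocessing is shown below, assuming each modality arrives as a variable-length numeric array; the per-modality target dimension of 128 and the concatenation order are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def to_fixed_dim(feature, dim: int) -> np.ndarray:
    """Pad with zeros or truncate a feature vector to the preset dimension."""
    feature = np.asarray(feature, dtype=np.float32).ravel()
    if feature.size >= dim:
        return feature[:dim]
    return np.pad(feature, (0, dim - feature.size))

def build_sample(emotion_feat, tone_feat, expression_feat, dim_per_modality: int = 128) -> np.ndarray:
    """Concatenate the three modality features into one fixed-size training vector."""
    parts = [to_fixed_dim(f, dim_per_modality)
             for f in (emotion_feat, tone_feat, expression_feat)]
    return np.concatenate(parts)  # shape: (3 * dim_per_modality,)
```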
For example, a DUTIR emotion recognition model is used to extract emotion from the text data to obtain the emotional characteristics, and the voice data of the user to be detected is input into a trained neural network to obtain the tone characteristics corresponding to the voice data.
As another example, facial expression pictures are collected; a main network takes the facial expression pictures as input and preprocesses them, learns facial expression features from the preprocessed pictures to obtain facial expression feature information, and classifies the facial expressions by emotion using this feature information to obtain emotion classification information. Privileged information is obtained through a privileged network and used to perform privileged learning on the loss function, further optimizing the parameters of the main network to obtain an optimized deep privileged network. A test facial expression picture is then input into the main network model and preprocessed, and its expression features are extracted with the deep privileged network after privileged learning.
Step S4, constructing an emotion recognition model for recognizing the emotion of the user to be detected based on the distilled neural network;
besides the emotion recognition model constructed based on the distillation neural network, the emotion recognition model can be trained based on other neural networks, such as a shallow convolutional neural network and a convolutional neural network.
And step S5, inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected.
In this embodiment, the emotion recognition model constructed based on the distilled neural network is trained on the tone features in the voice data of the user to be detected, the facial expression features in the face image, and the emotional features in the text of the user to be detected. Compared with existing emotion recognition based only on the user's text content, recognizing from the multiple dimensions of tone, expression and emotion greatly improves the accuracy of emotion recognition; through such multi-dimensional training, not only can the major emotion categories be identified, but a specific emotion can also be accurately recognized.
In other embodiments, further comprising:
acquiring voice data of a current multi-person conversation;
for example, a microphone, a recording terminal, or other recording device is used for recording, and the device or equipment must ensure a controllable acquisition range to ensure the acquisition quality of voice data.
Processing the voice data formed by the multi-person conversation by utilizing voice recognition to generate corresponding text data;
extracting time sequence information of each statement in the voice data;
the lip language time sequence has time sequence information, namely, the time sequence information is mapped to the same sentence of voice recognition in the text data, and the time sequence information of each sentence is obtained through the mapping relation between the two sentences. Of course, timing information may also be formed by time critical point triggering, for example, by button triggering before each utterance by a particular user.
Detecting the voice characteristics of each speaker, and marking by combining the voice characteristics and the time sequence information to distinguish the speakers corresponding to the sentences in the text data;
the method comprises the steps of determining human voice data in voice data, then determining sliding window data contained in the human voice data, carrying out audio feature extraction on each sliding window data in each human voice data, inputting the extracted audio features into a voice classification model, and determining the probability that the sliding window data belongs to a certain human voice feature.
Recognizing the text data corresponding to the multi-person conversation by using a natural language processing technology to obtain the text content of the user to be detected; meanwhile, identifying the voice data of the user to be detected in the multi-person conversation according to the mark; extracting the emotional characteristics of the text data corresponding to the user to be detected, and extracting the tone characteristics in the voice data corresponding to the user to be detected;
the method comprises the steps of processing text data by using an NLP technology to obtain semantics of each sentence, and judging which speaker should speak each sentence according to the current conversation scene and the semantics. For example, in the case of a doctor-patient consultation session, the following terms "name", "age", "discomfort", "time to start" and some medical technical terms, etc. are used to determine that a specific conversation person is a doctor from the above-mentioned statements, and the corresponding answers to the specific conversation person may include a patient, a family member, etc., and are not listed here.
In this embodiment, by identifying, among the multiple speakers, each sentence of the text data that corresponds to the patient (i.e., the user to be detected), the sentences can conveniently be analyzed later to obtain the emotional characteristics of the user to be detected. Meanwhile, the voice data is segmented to obtain the voice data set corresponding to the user to be detected, and the voice features belonging to the user to be detected are extracted from this set. In other words, the voice features and emotional characteristics of the user to be detected can be identified from a multi-speaker conversation scene, which facilitates accurate recognition of the user's emotion later.
In addition, during speech recognition, a first candidate transcription of a first segment of the voice data can be obtained; one or more contexts associated with the first candidate transcription are determined; a respective weight is adjusted for each of the one or more contexts; and a second candidate transcription for a second segment of the voice data is determined based on the adjusted weights.
For example, in this way a doctor-patient consultation scene is confirmed from the first segment, the context weights are adjusted using that segment of voice data, and the transcription of the subsequent voice data is determined based on the adjusted weights, so that recognition performance is improved dynamically and speech recognition accuracy increases.
In other embodiments, the face image of the user to be detected is extracted from the video image, and the positions of key facial feature points, including the eyebrows, eyelids, lips and chin, are extracted from the face image as feature regions. The key feature points are graded by intensity to generate the expression features, and combining these expression features with plain text-content recognition greatly improves the accuracy of emotion recognition.
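One possible way to turn key-feature-point displacements into graded expression features is sketched below; the four regions come from the text, while the neutral reference frame, the normalized thresholds and the 0-3 intensity scale are assumptions made for illustration.

```python
import numpy as np

REGIONS = ("eyebrows", "eyelids", "lips", "chin")  # key feature regions from the text

def grade_intensity(displacement: float, thresholds=(0.02, 0.05, 0.10)) -> int:
    """Map a normalized landmark displacement to an intensity level 0-3."""
    return int(np.searchsorted(thresholds, displacement))

def expression_features(current: dict, neutral: dict) -> np.ndarray:
    """Per-region intensity grades; each dict maps a region name to a (K, 2) array
    of landmark coordinates, e.g. current['lips'] = np.array([[x1, y1], ...])."""
    grades = []
    for region in REGIONS:
        disp = np.linalg.norm(current[region] - neutral[region], axis=1).mean()
        grades.append(grade_intensity(disp))
    return np.array(grades, dtype=np.float32)
```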
In other embodiments, the step of constructing an emotion recognition model for recognizing emotion of a user to be detected based on a distilled neural network includes:
forming a training set by the preprocessed emotional characteristics, tone characteristics and expression characteristics;
and preprocessing the extracted emotional characteristics, tone characteristics and expression characteristics to obtain the emotional characteristics, tone characteristics and expression characteristics of preset specifications.
Training an emotion recognition model by using the training set based on a distillation neural network, wherein parameters of the network are optimized by adopting a back propagation algorithm and a distillation loss function, and the distillation neural network is composed of a plurality of neural networks;
wherein the distillation loss function, written here in the standard knowledge-distillation form consistent with the symbols listed below, is expressed as:
L_distill = λ · L(r, softmax(p)) + (1 − λ) · T² · L(softmax(q / T), softmax(p / T))
wherein r represents the real label of a sample, p is the network's prediction output, q is the prediction output of the preceding network used as the soft label, T is the temperature coefficient used to soften the labels, λ represents the balance term between the front and rear terms, L represents the cross-entropy loss, and softmax converts the outputs into probabilities.
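A PyTorch sketch of a loss of this shape follows; treating the preceding network's logits as the soft-label signal q and using a temperature-scaled KL-divergence for the soft term are assumptions of this sketch, not details fixed by the text.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, T: float = 2.0, lam: float = 0.5):
    """lam * hard-label cross-entropy L(r, softmax(p)) + (1 - lam) * soft term at temperature T."""
    hard = F.cross_entropy(student_logits, true_labels)          # L(r, softmax(p))
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),                # softmax(p / T)
        F.softmax(teacher_logits / T, dim=1),                    # softmax(q / T), the soft labels
        reduction="batchmean",
    ) * (T * T)                                                  # rescale the soft term for the temperature
    return lam * hard + (1.0 - lam) * soft
```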
Specifically, using a plurality of networks reduces training time, while the transverse depth of a single deep neural network is converted into longitudinal depth across the cascade, improving the discriminative power of the extracted features.
And combining the prediction labels of the plurality of neural networks, training the combined labels by using the one-dimensional convolutional neural network, and distributing different weights to different networks to obtain an integrated decision of the emotion recognition model.
Specifically, M basic neural networks are initialized. Each network consists of an LSTM encoder-decoder structure and a FrameGate feature-gating unit, which weights the hidden states of the input; in this way different input frames are given corresponding weights, and a series of high-weight regions is finally obtained through network learning. Two networks are selected at a time: the first is defined as the main network and the second as a sub-network. The main network receives an S × N × D three-dimensional tensor as input; the output of the feature-gating unit does not change the dimensionality of the original data and is still an S × N × D three-dimensional tensor, but one whose channels and frames have been reweighted and combined, so a sub-network with the same structure can naturally take it as its own input. This operation can be iterated continuously, i.e., new networks can keep being initialized and the feature matrix extracted by one network passed on to the next until the iteration terminates.
For example, M base networks with an encoding-decoding structure, feature-gating units and multi-layer classification networks are initialized. The first network (the main network) performs preliminary learning with the cross-entropy loss and then outputs its extracted features to the sub-networks; each subsequent sub-network further optimizes its objective function based on the extracted features and the prediction labels. Finally, a conv 1×1 layer is introduced and its weight variables are trained on the real labels; after the weight variables are fixed, this layer converts the aggregated label matrix P of size S × M into a probability vector, thereby giving the final decision result. An Adam optimizer is used in the model and dropout regularization is introduced; for reasons of time and performance the number of networks M is set to three, the number of hidden units of all LSTM layers is kept equal to the input dimension, and the numbers of neurons of the fully connected classification network are 32, 16 and the number of classes, respectively.
As another example, the emotion recognition model consists of three initial networks corresponding to three channels. Each initial network comprises, connected in sequence, a first LSTM network (the coding layer), a feature-gating unit, a second LSTM network (the decoding layer used for classification) and a multi-layer classification network (used for classification). The initial network of the first channel outputs a first probability through its multi-layer classification network; the input of the initial network of the second channel is the first feature vector produced by the feature-gating unit of the first channel, and its multi-layer classification network outputs a second probability; similarly, the input of the initial network of the third channel is the second feature vector produced by the feature-gating unit of the second channel, and its multi-layer classification network outputs a third probability. The first, second and third probabilities are input into the 1×1 convolutional layer and aggregated according to the weight variable of each initial network, and the aggregated label matrix P of size S × M is converted into a probability vector, i.e. the ensemble output, thereby giving the final decision result P.
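The sketch below gives one possible PyTorch reading of this cascade: a frame-level gating unit that reweights frames without changing the S × N × D shape, a base network built from an LSTM coding layer, the gate, an LSTM decoding layer and a 32-16-classes classifier, and a 1 × 1 convolution that aggregates the M per-network predictions; the exact gating formula and any sizes not stated above are assumptions.

```python
import torch
import torch.nn as nn

class FrameGate(nn.Module):
    """Frame-level gating: learn one weight per input frame and reweight the tensor
    without changing its S x N x D shape (an interpretation of the description)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                  # x: (S, N, D)
        weights = torch.softmax(self.score(x), dim=1)      # one weight per frame
        return x * weights

class BaseNet(nn.Module):
    """One of the M cascaded networks: LSTM encoder -> FrameGate -> LSTM decoder -> classifier."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        hidden = dim                                       # hidden size kept equal to the input dimension
        self.encoder = nn.LSTM(dim, hidden, batch_first=True)
        self.gate = FrameGate(hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        enc, _ = self.encoder(x)
        gated = self.gate(enc)                             # passed on as input to the next network
        dec, _ = self.decoder(gated)
        return self.classifier(dec[:, -1]), gated          # classify from the last time step

class Aggregator(nn.Module):
    """1x1 convolution over the M stacked prediction vectors -> integrated decision."""
    def __init__(self, m_networks: int):
        super().__init__()
        self.conv = nn.Conv1d(m_networks, 1, kernel_size=1)  # its weights would be trained on the real labels

    def forward(self, stacked_logits):                     # (batch, M, num_classes)
        return self.conv(stacked_logits).squeeze(1)

# Cascade of M = 3 networks: the gated tensor of one network feeds the next.
nets = [BaseNet(dim=40, num_classes=3) for _ in range(3)]
x = torch.randn(8, 20, 40)                                 # S=8 samples, N=20 frames, D=40 features
logits_list, feats = [], x
for net in nets:
    logits, feats = net(feats)
    logits_list.append(logits)
decision = Aggregator(3)(torch.stack(logits_list, dim=1))  # (8, 3) integrated decision
```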
In the embodiment, inter-frame correlation information in the data is mined through a frame-level gating unit, and concentrated abstract data representation is obtained through multi-network characteristic distillation, so that the accuracy of emotion recognition is improved.
In another embodiment, the step of processing the voice data formed by the multi-person conversation by using a voice recognition technology to generate corresponding text data comprises:
constructing a speech character matching model base, and training an RNN-T speech recognition model based on the model base;
and converting the voice data into text data by using the trained RNN-T voice recognition model.
The RNN-T (RNN-Transducer) speech recognition framework is in fact an improvement on the CTC model: it integrates the language model and the acoustic model and optimizes them jointly, and is a theoretically fairly complete model structure. In the RNN-T model, the Transcription Net corresponds to the acoustic model part (any acoustic model structure can be used), while the Prediction Net is effectively the language model (it can be built with a unidirectional recurrent neural network). The most important structure is the joint network, which can generally be modelled with a feed-forward network; its role is to combine the states of the language model and the acoustic model, either by concatenation or by direct addition, and concatenation appears more reasonable given that the language model and the acoustic model carry different weights. The RNN-T model offers end-to-end joint optimization, language modelling capability and monotonic alignment, and supports real-time online decoding; compared with the currently common GMM-HMM (hidden Markov model) and DNN-HMM (deep neural network) approaches, it trains quickly and achieves high accuracy.
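A minimal RNN-T-style sketch is given below to make the three components concrete: an acoustic encoder (Transcription Net), a unidirectional predictor (Prediction Net) and a joint network applied to their concatenated states; the layer sizes, the one-hot label input and the extra blank output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNNTSketch(nn.Module):
    """Minimal RNN-T-style model: Transcription Net + Prediction Net + joint network."""
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)      # acoustic model part
        self.predictor = nn.LSTM(vocab_size, hidden, batch_first=True)  # language model part (unidirectional)
        self.joint = nn.Sequential(                                     # joint network on concatenated states
            nn.Linear(2 * hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, vocab_size + 1),                          # +1 for the blank symbol
        )

    def forward(self, acoustic_feats, label_onehots):
        enc, _ = self.encoder(acoustic_feats)                     # (B, T, H)
        pred, _ = self.predictor(label_onehots)                   # (B, U, H)
        enc = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)   # (B, T, U, H)
        pred = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)  # (B, T, U, H)
        return self.joint(torch.cat([enc, pred], dim=-1))         # (B, T, U, vocab + 1)
```

In practice the resulting T × U output lattice would be trained with an RNN-T loss (for example torchaudio's rnnt_loss) and decoded with beam search.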
In other embodiments, the step of extracting timing information of each sentence in the speech data includes:
the method comprises the steps of obtaining voice data of a speaker and lip image data corresponding to the voice data, wherein the lip image data comprise lip image sequences of all speakers related to the voice data of the speaker, and determining time sequence information of all sentences in the voice data according to content identified by the lip image sequences.
Specifically, a video recording device or camera is used to collect a target video containing the current speaker's voice data and the corresponding lip image data. The voice data and the image sequence are first separated from the target video, and the separated voice data is taken as the target voice data; then the lip image sequence of each speaker related to the target voice data is obtained from the separated image sequence and used as the lip image data corresponding to the target voice data.
For example, a face region image of the speaker is obtained from the image and scaled to a preset first size, and a lip image of a preset second size (for example, 80 × 80) is finally cut from the scaled face region image, centred on the centre point of the speaker's lips.
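A sketch of this two-stage crop with OpenCV follows; the preset first size of 256 and the way the lip centre is passed in are assumptions, while the 80 × 80 second size comes from the example above.

```python
import cv2
import numpy as np

def crop_lip_region(face_image: np.ndarray, lip_center_xy, face_size: int = 256, lip_size: int = 80):
    """Scale the face region to a preset first size, then cut a lip patch of a preset
    second size centred on the lip centre point (given in original image coordinates)."""
    h, w = face_image.shape[:2]
    face = cv2.resize(face_image, (face_size, face_size))
    cx = int(lip_center_xy[0] * face_size / w)       # map the lip centre into the scaled image
    cy = int(lip_center_xy[1] * face_size / h)
    half = lip_size // 2
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    patch = face[y0:y0 + lip_size, x0:x0 + lip_size]
    return cv2.resize(patch, (lip_size, lip_size))   # guard against clipping at the image border
```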
It should be noted that when voice separation and recognition are performed on the target voice data, the corresponding lip image data is used as a supplement; this gives the speech recognition method of the present invention a certain robustness to noise and improves the speech recognition effect.
It should be further noted that the time sequence information of each statement in the text data is indirectly obtained through the lip language time sequence, and the text statements corresponding to each speaker can be distinguished and marked more accurately.
In other embodiments, further comprising:
intercepting a voice feature set to be detected with accumulated time as a preset time threshold from the voice data according to a time sequence to obtain a plurality of voice feature sets to be detected, clustering each voice feature set to be detected, scoring the clusters, and obtaining voice features of different speakers according to the scoring result;
the voice data is divided into voice feature sets to be detected according to the accumulated duration, for example, two seconds, three seconds or five seconds, the voice feature sets to be detected are clustered, voice feature sets corresponding to speakers with different voice features are obtained according to clustering scores, and text sentences spoken by different speakers are distinguished.
Marking and distinguishing each sentence in the text data according to the voice characteristics and the time sequence characteristics to obtain the sentence contents corresponding to different speakers in the text data;
and performing mutual verification by combining the vocal features and the time sequence features, and determining the attribution of each statement mark, for example, if there are two vocal features, "doctor and patient", then the corresponding label must be two attribute marks, "doctor" and "patient". And obtaining the sentence content corresponding to each conversant by distinguishing different sentences in the text data.
Recognizing each sentence field in the text data by using a natural language processing technology, judging the semantics of each sentence by combining context, selecting the text content spoken by a specific speaker according to the semantics and the mark of each sentence, and obtaining the emotional characteristic of the text corresponding to the user to be detected;
recognizing voice data corresponding to the user to be detected according to the voice characteristics and the time sequence characteristics, and extracting tone characteristics belonging to the user to be detected in the voice data
On this basis, sentence semantics can also provide lateral evidence of whether a sentence belongs to its assigned mark, so that the sentences spoken by a given speaker (user) are correctly selected from the text data; this improves the accuracy of the speech-to-text conversion and allows the tone features and emotional features belonging to the user to be detected to be extracted accurately.
In other embodiments, further comprising:
establishing a word library database, wherein the word library database comprises a pronoun database, a verb database and a noun database, and words and idioms which are the attributes of pronouns, verbs and nouns in the Chinese characters are respectively stored into the corresponding pronoun database, verb database and noun database;
the nouns in the noun database are further classified and stored according to different service fields, wherein the service fields comprise catering, medical treatment, shopping, sports, accommodation, transportation and the like, and the noun database in the medical field is optimized.
Establishing a semantic frame database, wherein the semantic frame database comprises stored word combination modes and Chinese meanings corresponding to the combination modes;
recognizing the voice data into Chinese sentences, and disassembling the sentences in the following form: pronouns + verbs + nouns, corresponding to a word bank database and a semantic framework database, and obtaining the semantics of each sentence in the voice data.
For example, a camera of the device is turned on, a voice recognition system is started, and voice data and a face video input by a user are collected through the voice recognition system; the system identifies the voice data as Chinese sentences, and then disassembles the Chinese sentences in the following form: pronouns + verbs + nouns, and corresponding to a word bank database and a semantic frame database, obtaining the Chinese semantic meaning of the voice instruction.
As another example, the speech semantics are matched against the lip language; if the matching result is inconsistent, the corresponding part of the text content is highlighted to remind the user. By matching the speech-semantics recognition result with the lip-language recognition result, the voice data, in particular the voice data of the user to be detected, is accurately converted into the corresponding text data; the two results verify and supplement each other, improving the accuracy of raw-data acquisition and recognition.
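To make the word-library and semantic-frame lookup described above concrete, a toy sketch follows; the vocabularies, the single frame and the use of English words (purely for readability) are hypothetical stand-ins for the real databases.

```python
# Hypothetical, tiny word library and semantic-frame database; the real databases
# would be far larger and include a medically optimized noun vocabulary.
PRONOUNS = {"I", "you", "he", "she", "we"}
VERBS = {"have", "feel", "take", "need"}
NOUNS = {"headache", "fever", "appointment"}

SEMANTIC_FRAMES = {
    ("pronoun", "verb", "noun"): "subject experiences or acts on the named object",
}

def parse_sentence(words):
    """Disassemble a sentence into pronoun + verb + noun and look up its semantic frame."""
    slots, tags = {}, []
    for w in words:
        if w in PRONOUNS:
            slots["pronoun"], tag = w, "pronoun"
        elif w in VERBS:
            slots["verb"], tag = w, "verb"
        elif w in NOUNS:
            slots["noun"], tag = w, "noun"
        else:
            continue                       # words outside the library are ignored
        tags.append(tag)
    return slots, SEMANTIC_FRAMES.get(tuple(tags))

# parse_sentence(["I", "have", "a", "headache"]) ->
#   ({'pronoun': 'I', 'verb': 'have', 'noun': 'headache'}, 'subject experiences or acts on the named object')
```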
Referring to fig. 2, a block diagram of an emotion recognition system according to the present invention is shown, and the emotion recognition system is described in detail as follows:
the data acquisition module 1 is used for acquiring voice data and face data of a user to be detected;
the voice recognition module 2 processes the voice data by using voice recognition to generate text data;
the feature extraction module 3 is configured to analyze the text data to obtain emotional features of the user to be detected, extract the mood features of the user to be detected in the voice data, and extract the expression features of the user to be detected in the face data;
the model building module 4 is used for building an emotion recognition model for recognizing the emotion of the user to be detected based on the distilled neural network;
and the emotion recognition module 5 is used for inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected.
It should be further noted that the emotion recognition method and the emotion recognition system are in a one-to-one correspondence relationship, and here, technical details and technical effects related to the emotion recognition system are the same as those of the above recognition method, and are not repeated herein, please refer to the above emotion recognition method.
The emotion recognition system can be deployed at a specific location in a hospital in the form of a terminal or system. It can accurately recognize a patient's emotion, even slight emotional fluctuations, so that the doctor understands the patient's emotional state and appropriate intervention and treatment can be carried out.
Referring now to FIG. 3, there is shown a schematic structural diagram of an emotion recognition device (for example, an electronic device or server) 600. The electronic device in the embodiments of the present disclosure may include, but is not limited to, a terminal such as a cell phone, a tablet, a laptop, a desktop computer, a kiosk, a server, a workstation, a television, a set-top box, smart glasses, a smart watch, a digital camera, an MP4 player, an MP5 player, a learning machine, a point-reading machine, an electronic book, an electronic dictionary, a vehicle-mounted terminal, a Virtual Reality (VR) player, or an Augmented Reality (AR) player, etc. The electronic device shown in FIG. 3 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 3, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM602, and the RAM603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: the method of the above-described steps S1 to S5 is performed.
In conclusion, the emotion recognition model constructed based on the distilled neural network is trained on the tone features in the voice data of the user to be detected, the facial expression features in the face image, and the emotional features in the text of the user to be detected, so that emotion is recognized from multiple dimensions and the accuracy of emotion recognition is greatly improved.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (7)

1. A method of emotion recognition, comprising:
acquiring voice data of a current multi-person conversation; processing the voice data formed by the multi-person conversation by utilizing a voice recognition technology to generate corresponding text data; extracting time sequence information of each statement in the voice data; acquiring voice data of a speaker and lip image data corresponding to the voice data, wherein the lip image data comprises a lip image sequence of each speaker related to the voice data of the speaker, and determining time sequence information of each sentence in the voice data according to content identified by the lip image sequence; detecting the voice characteristics of each speaker, and marking by combining the voice characteristics and the time sequence information to distinguish the speakers corresponding to the sentences in the text data; recognizing the text data corresponding to the multi-person conversation by using a natural language processing technology to obtain the text content of the user to be detected; meanwhile, identifying voice data of the user to be detected in the multi-person conversation according to the mark; extracting emotional characteristics of text data corresponding to the user to be detected, and extracting tone characteristics in voice data corresponding to the user to be detected;
intercepting a voice feature set to be detected with accumulated time as a preset time threshold from the voice data according to a time sequence to obtain a plurality of voice feature sets to be detected, clustering each voice feature set to be detected, scoring the clusters, and obtaining voice features of different speakers according to the scoring result; marking and distinguishing each sentence in the text data according to the voice characteristics and the time sequence characteristics to obtain the sentence contents corresponding to different speakers in the text data; recognizing each sentence field in the text data by using a natural language processing technology, judging the semantics of each sentence by combining context, obtaining the content of the text data of the user to be detected according to the semantics and the mark of each sentence, and obtaining the emotional characteristic of the text corresponding to the user to be detected; recognizing voice data corresponding to the user to be detected according to the voice characteristics and the time sequence characteristics, and extracting tone characteristics belonging to the user to be detected in the voice data;
acquiring voice data and face data of a user to be detected;
processing the voice data by utilizing a voice recognition technology to generate text data;
analyzing the text data to obtain emotional characteristics of the user to be detected, extracting the tone characteristics of the user to be detected in the voice data, and extracting the expression characteristics of the user to be detected in the face data;
constructing an emotion recognition model for recognizing emotion of a user to be detected based on a distilled neural network;
and inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected.
2. The emotion recognition method of claim 1, wherein the step of processing the speech data formed by the multi-person conversation using a speech recognition technique to generate corresponding text data includes:
constructing a speech character matching model base, and training an RNN-T speech recognition model based on the model base;
and converting the voice data into text data by using the trained RNN-T voice recognition model.
3. The emotion recognition method of claim 1, wherein the step of constructing an emotion recognition model for recognizing the emotion of the user to be detected based on the distilled neural network includes:
forming a training set by the preprocessed emotional characteristics, tone characteristics and expression characteristics;
training an emotion recognition model by using the training set based on a distillation neural network, wherein parameters of the network are optimized by adopting a back propagation algorithm and a distillation loss function, and the distillation neural network is composed of a plurality of neural networks;
and combining the prediction labels of the plurality of neural networks, training the combined labels by using the one-dimensional convolutional neural network, and distributing different weights to the plurality of neural networks to obtain an integrated decision of the emotion recognition model.
4. The emotion recognition method of claim 1, wherein before inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition, the method further comprises: preprocessing the emotional characteristics, the tone characteristics and the expression characteristics to obtain a characteristic vector with preset dimensionality.
5. An emotion recognition system using the emotion recognition method of any one of claims 1 to 4, the emotion recognition system comprising:
the data acquisition module is used for acquiring voice data and face data of a user to be detected;
the voice recognition module is used for processing the voice data by utilizing a voice recognition technology to generate text data;
the feature extraction module is used for analyzing the text data to obtain emotional features of the user to be detected, extracting the tone features of the user to be detected in the voice data and extracting the expression features of the user to be detected in the face data;
the model construction module is used for constructing an emotion recognition model for recognizing the emotion of the user to be detected based on the distilled neural network;
and the emotion recognition module is used for inputting the emotional characteristics, the tone characteristics and the expression characteristics into the emotion recognition model for recognition to obtain the emotion category of the user to be detected.
6. An emotion recognition device, characterized by comprising:
one or more processing devices;
a memory for storing one or more programs; when executed by the one or more processing devices, cause the one or more processing devices to implement the emotion recognition method of any of claims 1 to 4.
7. A computer-readable storage medium having stored thereon a computer program for causing a computer to execute the emotion recognition method according to any of claims 1 to 4.
CN202110922781.XA 2021-08-12 2021-08-12 Emotion recognition method, system, device and medium Active CN113380271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922781.XA CN113380271B (en) 2021-08-12 2021-08-12 Emotion recognition method, system, device and medium


Publications (2)

Publication Number  Publication Date
CN113380271A (en)  2021-09-10
CN113380271B (en)  2021-12-21

Family

ID=77576964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922781.XA Active CN113380271B (en) 2021-08-12 2021-08-12 Emotion recognition method, system, device and medium

Country Status (1)

Country Link
CN (1) CN113380271B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113707185B (en) * 2021-09-17 2023-09-12 卓尔智联(武汉)研究院有限公司 Emotion recognition method and device and electronic equipment
CN113743126B (en) * 2021-11-08 2022-06-14 北京博瑞彤芸科技股份有限公司 Intelligent interaction method and device based on user emotion
CN117935323A (en) * 2022-10-09 2024-04-26 马上消费金融股份有限公司 Training method of face driving model, video generation method and device
CN117038055B (en) * 2023-07-05 2024-04-02 广州市妇女儿童医疗中心 Pain assessment method, system, device and medium based on multi-expert model

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223560A (en) * 2021-04-23 2021-08-06 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10861347B2 (en) * 2017-01-06 2020-12-08 Alex B. Tavares Device and method for teaching phonics using a touch detecting interface
WO2019102884A1 (en) * 2017-11-21 2019-05-31 日本電信電話株式会社 Label generation device, model learning device, emotion recognition device, and method, program, and storage medium for said devices
CN111368609B (en) * 2018-12-26 2023-10-17 深圳Tcl新技术有限公司 Speech interaction method based on emotion engine technology, intelligent terminal and storage medium
CN110276259B (en) * 2019-05-21 2024-04-02 平安科技(深圳)有限公司 Lip language identification method, device, computer equipment and storage medium
CN112562682A (en) * 2020-12-02 2021-03-26 携程计算机技术(上海)有限公司 Identity recognition method, system, equipment and storage medium based on multi-person call
CN112989920B (en) * 2020-12-28 2023-08-11 华东理工大学 Electroencephalogram emotion classification system based on frame-level characteristic distillation neural network
CN112559835B (en) * 2021-02-23 2021-09-14 中国科学院自动化研究所 Multi-mode emotion recognition method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223560A (en) * 2021-04-23 2021-08-06 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sign language to Chinese-Tibetan bilingual emotional speech conversion incorporating facial expressions; Song Nan et al.; Technical Acoustics; 2018-08-15 (Issue 04); full text *

Also Published As

Publication number Publication date
CN113380271A (en) 2021-09-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
    Effective date of registration: 2022-07-19
    Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai
    Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd.
    Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing
    Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd.