CN113506586A - Method and system for recognizing emotion of user - Google Patents


Info

Publication number
CN113506586A
Authority
CN
China
Prior art keywords
features
feature
text
pronunciation
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110677222.7A
Other languages
Chinese (zh)
Other versions
CN113506586B (en)
Inventor
高鹏
郝少春
袁兰
吴飞
周伟华
高峰
潘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Mjoys Big Data Technology Co ltd
Original Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Mjoys Big Data Technology Co ltd
Priority to CN202110677222.7A
Publication of CN113506586A
Application granted
Publication of CN113506586B
Legal status: Active (current)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method and a system for recognizing user emotion. The method comprises: acquiring voice data and extracting voice features from the voice data; converting the voice data into text data and extracting text features from the text data; and inputting the voice features and the text features into a user emotion recognition model and outputting an emotion label of the user. In the user emotion recognition model, a convolutional neural network characterizes the voice features to obtain a first feature, a long short-term memory network characterizes the first feature to obtain a second feature, the convolutional neural network characterizes the text features to obtain a third feature, the long short-term memory network characterizes the third feature to obtain a fourth feature, and the second feature and the fourth feature are fully connected to determine the emotion label of the user.

Description

Method and system for recognizing emotion of user
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a system for recognizing user emotion.
Background
With the development of artificial intelligence technology, intelligent voice robots have matured as an industry, and more and more enterprises are beginning to pay attention to and use them.
Voice data is generated while an intelligent voice robot interacts with a user, and emotion information of the user during the communication can be obtained by performing emotion recognition on the voice data. In the related art, the emotion of the user is recognized either from the acquired voice data alone or from the text data converted from the voice data alone, and the resulting emotion recognition accuracy is not high.
No effective solution has yet been proposed for the problem of low user emotion recognition accuracy in the related art.
Disclosure of Invention
The embodiments of the present application provide a method and a system for recognizing user emotion, which help improve the accuracy of user emotion recognition.
In a first aspect, an embodiment of the present application provides a method for recognizing a user emotion, where the method includes:
acquiring voice data, and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features;
inputting the voice features and the text features into a user emotion recognition model, and outputting an emotion label of the user, wherein in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
In some embodiments, the speech features further include a pitch feature, an intonation feature, and a speech rate feature, and the extracting of the speech features includes:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
In some embodiments, the text feature further includes a pronunciation feature, and the extracting of the text feature includes:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
In some of these embodiments, before the extracting of the text features, the method includes: correcting errors in the text data, wherein the error correction process comprises:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
In a second aspect, an embodiment of the present application provides a system for recognizing a user emotion, where the system includes:
the first extraction module is used for acquiring voice data and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
the second extraction module is used for converting the voice data to obtain text data and extracting text features according to the text data, wherein the text features comprise word features and position features;
a determining module, configured to input the speech feature and the text feature into a user emotion recognition model, and output an emotion tag of the user, where in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
In some embodiments, the speech features further include a pitch feature, an intonation feature, and a speech rate feature, and the first extraction module is further configured to:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
In some embodiments, the text feature further comprises a pronunciation feature, and the second extraction module is further configured to:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
In some of these embodiments, the system further comprises:
an error correction module, configured to correct errors in the text data before the text features are extracted, where the error correction process includes:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
In a third aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for emotion recognition of a user when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for emotion recognition of a user.
Compared with the related art, the method for recognizing user emotion provided by the embodiments of the present application acquires voice data and extracts voice features from it, the voice features including Mel cepstrum coefficient features; converts the voice data into text data and extracts text features from it, the text features including word features and position features; and inputs the voice features and the text features into a user emotion recognition model, which outputs the emotion label of the user. In the user emotion recognition model, a convolutional neural network characterizes the voice features to obtain a first feature, a long short-term memory network characterizes the first feature to obtain a second feature, the convolutional neural network characterizes the text features to obtain a third feature, the long short-term memory network characterizes the third feature to obtain a fourth feature, and the second feature and the fourth feature are fully connected to determine the emotion label of the user. This solves the problem of low accuracy of user emotion recognition in the related art and helps improve the accuracy of user emotion recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of an application environment of a method for emotion recognition of a user according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of user emotion recognition according to a first embodiment of the present application;
FIG. 3 is a schematic diagram of a user emotion recognition model according to a second embodiment of the present application;
FIG. 4 is a flow chart of a process of extracting features of Mel cepstral coefficients according to a third embodiment of the present application;
FIG. 5 is a flowchart of a process of extracting word and location features according to a fourth embodiment of the present application;
FIG. 6 is a flow chart of a process of extracting speech features according to a fifth embodiment of the present application;
fig. 7 is a flowchart of an error correction process of text data according to a sixth embodiment of the present application;
fig. 8 is a flowchart of a text feature extraction process according to a seventh embodiment of the present application;
fig. 9 is a block diagram of a system for emotion recognition of a user according to an eighth embodiment of the present application;
fig. 10 is a block diagram of a system for emotion recognition of a user according to a ninth embodiment of the present application;
fig. 11 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The method for recognizing user emotion provided by the present application can be applied to the application environment shown in fig. 1. Fig. 1 is a schematic diagram of the application environment of the method for user emotion recognition according to an embodiment of the present application. As shown in fig. 1, a server 102 obtains voice data from a terminal 101 through a network and runs the method for recognizing user emotion, so as to obtain emotion information of the user contained in the voice data. The server 102 can be implemented as an independent server or as a server cluster composed of a plurality of servers, and the terminal 101 can be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device.
This embodiment provides a method for recognizing user emotion. Fig. 2 is a flowchart of the method for user emotion recognition according to the first embodiment of the present application; as shown in fig. 2, the flow includes the following steps:
step S201, acquiring voice data, and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
step S202, converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features include word features and position features; for example, Automatic Speech Recognition (ASR) technology can be used to convert the voice data into the text data;
step S203, inputting the voice characteristics and the text characteristics into the emotion recognition model of the user, and outputting the emotion label of the user.
Through steps S201 to S203, in contrast to the related art where the accuracy of user emotion recognition is not high, the present embodiment extracts voice features from the voice data, extracts text features from the text data obtained by converting the voice, inputs the voice features and the text features into the user emotion recognition model, and outputs the emotion label of the user, thereby completing multi-modal recognition of the user emotion and helping to improve the recognition accuracy.
In addition, the intelligent voice robot can adjust its responses according to the recognized emotion of the user, so that the whole conversation between the robot and the user is closer to a conversation between people, which greatly improves the user experience.
In some embodiments, fig. 3 is a schematic diagram of a user emotion recognition model according to a second embodiment of the present application. As shown in fig. 3, the voice features and the text features each form a feature matrix. In the user emotion recognition model, a Convolutional Neural Network (CNN) is used to characterize the voice features: the voice features are processed by multi-kernel convolution and pooling layers to obtain a first feature, which is a sequence feature, and a Long Short-Term Memory network (LSTM) is used to characterize the first feature to obtain a second feature. Likewise, a convolutional neural network is used to characterize the text features: the text features are processed by multi-kernel convolution and pooling layers to obtain a third feature, also a sequence feature, and an LSTM is used to characterize the third feature to obtain a fourth feature. The second feature and the fourth feature are then fully connected in a fully connected layer (FC), and the emotion Label of the user is output at the output layer.
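For illustration, the following is a minimal sketch of such a dual-branch CNN plus LSTM fusion model in PyTorch. The layer sizes, kernel width, pooling setup and the number of emotion labels are assumptions of this sketch, not values taken from the patent, which describes multi-kernel convolution and pooling in each branch.

```python
import torch
import torch.nn as nn


class EmotionRecognizer(nn.Module):
    """Dual-branch CNN + LSTM model: one branch characterizes the acoustic features,
    one characterizes the text features, and a fully connected layer fuses them."""

    def __init__(self, speech_dim=39, text_dim=512, hidden=128, num_labels=4):
        super().__init__()
        # Acoustic branch (a single kernel size is used here for brevity;
        # the patent describes multi-kernel convolution and pooling).
        self.speech_cnn = nn.Conv1d(speech_dim, hidden, kernel_size=3, padding=1)
        self.speech_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Text branch with the same structure over the token axis.
        self.text_cnn = nn.Conv1d(text_dim, hidden, kernel_size=3, padding=1)
        self.text_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Fusion: concatenate the two branch outputs and classify.
        self.fc = nn.Linear(2 * hidden, num_labels)

    def _branch(self, x, cnn, lstm):
        # x: (batch, seq_len, feat_dim); Conv1d expects (batch, feat_dim, seq_len).
        h = torch.relu(cnn(x.transpose(1, 2)))
        h = nn.functional.max_pool1d(h, kernel_size=2).transpose(1, 2)
        _, (last_hidden, _) = lstm(h)   # keep the final LSTM hidden state
        return last_hidden[-1]          # (batch, hidden)

    def forward(self, speech_feats, text_feats):
        second = self._branch(speech_feats, self.speech_cnn, self.speech_lstm)
        fourth = self._branch(text_feats, self.text_cnn, self.text_lstm)
        return self.fc(torch.cat([second, fourth], dim=-1))  # emotion logits


# Example: 2 utterances, 100 frames of 39-dim acoustic features, 20 tokens of 512-dim embeddings.
model = EmotionRecognizer()
logits = model(torch.randn(2, 100, 39), torch.randn(2, 20, 512))
print(logits.shape)  # torch.Size([2, 4])
```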
In some embodiments, fig. 4 is a flowchart of a process of extracting mel-frequency cepstrum coefficient features according to a third embodiment of the present application, and as shown in fig. 4, the process includes the following steps:
step S401, preprocessing the voice data, including: removing segments of the voice data that are completely silent for non-human reasons, for example segments with no sound at all caused by packet loss;
step S402, performing pre-emphasis processing on the pre-processed voice data, where the pre-emphasis processing is to pass a voice signal in the voice data through a high-pass filter:
H(z) = 1 - μz^(-1)    (Equation 1)
wherein the value of μ in Equation 1 is between 0.9 and 1.0, usually 0.97; pre-emphasis boosts the high-frequency part and flattens the spectrum of the signal, so that the spectrum can be computed with the same signal-to-noise ratio over the whole frequency band from low to high frequencies, and it compensates the high-frequency part of the voice signal that is suppressed by the vocal system, so as to highlight the high-frequency formants;
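As a small illustration, Equation 1 corresponds to the difference equation y[n] = x[n] - μ·x[n-1], which can be written in NumPy as follows (a sketch; the coefficient 0.97 is the typical value mentioned above):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """High-pass filter H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```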
step S403, performing framing on the pre-emphasized voice data; because the statistical characteristics of voice data are not constant over time, the continuous voice signal is divided into short segments, and the signal is assumed to be stationary within each segment; in general, N sampling points are grouped into an observation unit called a frame, where N is usually 512 and the frame length is 25 ms; to ensure a smooth transition between frames, adjacent frames usually overlap, and the time difference between the start positions of two adjacent frames is called the frame shift, which is generally 10 ms;
step S404, performing windowing on the framed voice data; the subsequent step requires a Fast Fourier Transform (FFT), and the FFT requires the signal either to extend from negative infinity to positive infinity or to be periodic, whereas each framed voice segment is aperiodic, so windowing is needed to make the signal behave periodically; a Hamming window is commonly used in the industry so that the signal shrinks toward zero at the frame boundaries; for 0 ≤ n ≤ N-1, the window function is as follows:
w(n) = 0.54 - 0.46·cos(2πn/(N-1))    (Equation 2)
In other cases:
w(n) = 0    (Equation 3)
Each frame is multiplied by the Hamming window, thereby increasing the continuity between the left and right ends of the frame;
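A sketch of the framing and Hamming-windowing steps (S403 and S404) in NumPy; the frame length and frame shift of 400 and 160 samples correspond to 25 ms and 10 ms at an assumed 16 kHz sampling rate, which is not specified in the patent:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Split a signal (at least one frame long) into overlapping frames
    and apply the Hamming window of Equation 2."""
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * np.hamming(frame_len)   # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
```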
step S405, performing fast Fourier transform processing on the windowed voice data, wherein the transform formula of the fast Fourier transform is as follows:
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1    (Equation 4)
because the characteristics of a signal are usually difficult to observe from its time-domain form, the signal is usually transformed into an energy distribution in the frequency domain for observation, and different energy distributions represent the characteristics of different voice signals; therefore, after multiplication by the Hamming window, each frame also undergoes a fast Fourier transform to obtain the spectrum of each frame, and the per-frame spectra are stacked over time to obtain the spectrogram;
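Continuing the sketch, the per-frame power spectrum of Equation 4 can be computed with NumPy's real FFT (the 512-point size matches the N of step S403):

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Magnitude-squared spectrum of each windowed frame (the per-frame periodogram)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)   # shape: (num_frames, n_fft // 2 + 1)
    return (np.abs(spectrum) ** 2) / n_fft
```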
step S406, applying a Mel filter bank to the spectrogram and taking the logarithm to obtain a Mel spectrogram, that is, converting the frequency axis of the spectrogram to the Mel scale, which is approximately linear below 1000 Hz and grows logarithmically above 1000 Hz; the conversion formula is as follows:
m = 2595·log10(1 + f/700)    (Equation 5)
because the pitch level perceived by the human ear is not linearly related to the actual frequency in Hz, and the Mel frequency is more consistent with the auditory characteristics of the human ear, the frequency axis of the spectrogram needs to be converted to the Mel scale;
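Equation 5 and its inverse in code form; the inverse is not stated in the patent but follows directly from the formula and is what one would use to place the Mel filter bank:

```python
import numpy as np

def hz_to_mel(f):
    """Equation 5: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of Equation 5."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```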
step S407, performing cepstrum analysis on the Mel spectrogram; owing to the characteristics of human speech production, the most useful information lies in the filter, namely the vocal tract, so the sound source and the filter need to be separated; cepstrum analysis of the Mel spectrogram decomposes the voice signal, and the 2nd to 14th coefficients, which represent the filter information, are taken; the 12 coefficients thus obtained are the Mel Frequency Cepstrum Coefficient (MFCC) features;
step S408, determining the first-order and second-order differences of the MFCC features to obtain the trajectory of the Mel cepstrum coefficients over time; the energy of a frame is the sum of the sample powers of that frame within a certain time period; because the voice signal is not constant from frame to frame, features related to temporal change can be added: velocity features (Delta features) and acceleration features (double-Delta features) are appended to the Mel cepstrum coefficient features and the energy feature to obtain the trajectory of the Mel cepstrum coefficients over time.
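In practice, steps S402 to S408 can be approximated with an off-the-shelf library. The sketch below uses librosa, which is an implementation choice of this illustration and is not named in the patent; 13 static coefficients plus deltas and double deltas give a common 39-dimensional acoustic feature per frame.

```python
import numpy as np
import librosa  # assumption: librosa is available; the patent does not name a library

def mfcc_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """MFCCs with first- and second-order differences (roughly steps S402-S408)."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])               # pre-emphasis, Equation 1
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=160)   # N = 512, 10 ms frame shift
    delta = librosa.feature.delta(mfcc)                      # velocity (Delta) features
    delta2 = librosa.feature.delta(mfcc, order=2)            # acceleration (double-Delta) features
    return np.vstack([mfcc, delta, delta2]).T                # shape: (num_frames, 39)
```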
In some embodiments, fig. 5 is a flowchart of a process of extracting word features and position features according to a fourth embodiment of the present application, and as shown in fig. 5, the process includes the following steps:
step S501, performing word segmentation on the text data and part-of-speech recognition to obtain word features and part-of-speech features; for example, segmentation can be performed with mixed word-and-character granularity: a language model is used to decide from the context whether the words in the vocabulary need to be further segmented, character granularity is kept for words not in the vocabulary, and the words are tagged with the part-of-speech and entity tagging information in the vocabulary; each word and its corresponding part of speech are used as the word feature and the part-of-speech feature respectively, so that the text data can be split into features from different angles;
step S502, determining the position information of the text data to obtain the position features; for example, the character and word features are embedded by means of one-hot coding with a dimensionality of 512, absolute position coding is used at the same time to add the position information, and the sum of all the embedded data is taken as the data to be input into the emotion recognition model and fed into the model.
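A compact sketch of this embedding step in PyTorch; the vocabulary size, maximum sequence length, and the use of a learned absolute position embedding are assumptions consistent with the description above rather than values given in the patent.

```python
import torch
import torch.nn as nn

class TextFeatureEmbedding(nn.Module):
    """Sum of 512-dimensional token embeddings and absolute position embeddings (step S502)."""

    def __init__(self, vocab_size=20000, max_len=128, dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)   # word / character features
        self.pos_emb = nn.Embedding(max_len, dim)        # absolute position features

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token_emb(token_ids) + self.pos_emb(positions)  # (batch, seq_len, 512)

# Example: one segmented sentence of 6 token ids.
embedded = TextFeatureEmbedding()(torch.tensor([[12, 845, 3, 77, 9, 2]]))
print(embedded.shape)  # torch.Size([1, 6, 512])
```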
Considering that the pitch, intonation and speech rate of the user can also reflect the emotion of the user, in some embodiments the speech features further include a pitch feature, an intonation feature, a speech rate feature and a pause feature; fig. 6 is a flowchart of a speech feature extraction process according to a fifth embodiment of the present application, and as shown in fig. 6, the process includes the following steps:
step S601, determining peak value, frequency, period and pause information of the waveform according to the waveform of the voice data;
step S602, determining the pitch feature, the intonation feature, the speech rate feature and the pause feature of the voice data in a one-to-one correspondence according to the peak value, the frequency, the period and the pause information;
step S603, representing pitch feature, intonation feature, speech rate feature and pause feature by using Embedding.
Through steps S601 to S603: in the related art, when the emotion of the user is recognized from the voice perspective, information related to the user's emotion such as pitch, intonation, speech rate and pauses is not considered, so the emotion recognition result is inaccurate; in this embodiment, the pitch feature, intonation feature, speech rate feature and pause feature of the voice data are determined according to the waveform of the voice data, which provides the subsequent emotion recognition model with a broader data basis for determining the emotion label of the user and thereby further improves the accuracy of user emotion recognition; at the same time, representing the pitch feature, intonation feature, speech rate feature and pause feature by Embedding allows the format of the voice features to meet the format requirements of the emotion recognition model on its input data.
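A rough NumPy sketch of steps S601 and S602: deriving peak, dominant-frequency, period and pause statistics from the waveform. How these scalars are subsequently bucketed and embedded (step S603) is not detailed in the patent, so the silence threshold and the choice of statistics below are purely illustrative assumptions.

```python
import numpy as np

def prosody_features(signal: np.ndarray, sr: int = 16000, silence_thresh: float = 0.01) -> np.ndarray:
    """Peak, dominant frequency, period and pause ratio of a waveform."""
    peak = float(np.abs(signal).max())                             # pitch-related statistic
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    dominant = float(freqs[int(np.argmax(spectrum[1:])) + 1])      # intonation-related statistic
    period = 1.0 / dominant if dominant > 0 else 0.0               # speech-rate-related statistic
    pause_ratio = float(np.mean(np.abs(signal) < silence_thresh))  # pause statistic
    return np.array([peak, dominant, period, pause_ratio], dtype=np.float32)
```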
Considering that correctly understanding the user's semantics is a prerequisite for accurately recognizing the user's emotion, and that homophones, near-homophones or non-standard pronunciation in the text can all affect the correct understanding of the user's semantics and lead to inaccurate emotion recognition results, in some embodiments the method for recognizing user emotion includes correcting errors in the text data; fig. 7 is a flowchart of an error correction process for text data according to a sixth embodiment of the present application, and as shown in fig. 7, the process includes the following steps:
step S701, inputting the text data into a Mask Language Model (MLM); the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, where the error confusion set contains the association between pronunciation error-prone words and their pronunciation-similar words; during incremental training in the model training stage, the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion to produce erroneous data, and the erroneous data together with the correct data form a set of training data, so that the mask language model is trained in a targeted manner to predict a masked word as a pronunciation-similar word of the masked word; pronunciation-similar words include words with similar sounds or words that are wrong because of different pronunciation habits, for example "Fulan" may actually be "Hunan" spoken with a regional accent, so "Fulan" is a pronunciation error-prone word and "Hunan" is its pronunciation-similar word; the error confusion set can be generated with a language model by combining initials, finals and tones, and the target proportion can be 1.5%;
step S702, the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
Through steps S701 to S702: in the related art, the mask language model randomly selects content to replace during training, so if it is applied directly to error correction of text data the error correction effect is poor; this embodiment considers that the high-frequency error-prone content in the text data consists of pronunciation error-prone words, and during incremental training in the model training stage, pronunciation-similar words in the training data are replaced with pronunciation error-prone words at a target proportion, so that the mask language model can be trained in a targeted manner to predict a masked word as a pronunciation-similar word of that word; as a result, the mask language model can more accurately find the high-frequency error-prone content in the text data during use, realizing error correction of the text data and providing a guarantee for correctly understanding the semantics of the user, thereby improving the accuracy of the recognized user emotion.
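A toy sketch of the inference side of this error correction (steps S701 and S702). It uses the Hugging Face fill-mask pipeline, with bert-base-chinese standing in for the patent's incrementally trained mask language model and a hard-coded one-entry confusion set; both are illustrative assumptions, not the patent's actual model or data.

```python
from transformers import pipeline  # assumption: Hugging Face transformers is installed

# Toy confusion set of pronunciation error-prone characters to check.
# The patent builds the real set from initials, finals and tones.
CONFUSION_SET = {"再"}

fill_mask = pipeline("fill-mask", model="bert-base-chinese")

def correct(text: str) -> str:
    """Mask each error-prone character and accept the model's top candidate if it differs."""
    for wrong in CONFUSION_SET:
        while wrong in text:
            masked = text.replace(wrong, fill_mask.tokenizer.mask_token, 1)
            best = fill_mask(masked)[0]          # candidate with the highest probability
            if best["token_str"] != wrong:
                text = text.replace(wrong, best["token_str"], 1)
            else:
                break
    return text

print(correct("我明天再家里等你"))  # likely "我明天在家里等你" (exact output depends on the model)
```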
Considering that the pinyin of each character in the text data may also reflect the semantics of the user (for example, the character 扒 read as "ba" in the first tone means to pull open, while read as "pa" in the second tone, as in 扒手, pickpocket, it means to steal another person's property), the pinyin of each character in the text data may also be used as a text feature; fig. 8 is a flowchart of a text feature extraction process according to a seventh embodiment of the present application, and as shown in fig. 8, the process includes the following steps:
step S801, determining the pronunciation of each character in the text data to obtain the pronunciation features, where the pronunciation includes the pinyin; further, considering that the tone, the initial or the final can also reflect the emotion of the user, the pronunciation may also include the tone, the initial consonant or the final;
in step S802, pronunciation characteristics are represented by Embedding.
Through steps S801 to S802: in the related art, when the emotion of the user is recognized from the text perspective, pronunciation information such as pinyin, tone, initials and finals that reflects the semantics or emotion of the user is not considered, so the emotion recognition result is inaccurate; this embodiment obtains pronunciation features by determining the pinyin, tone, initial or final of each character in the text data, which provides the subsequent emotion recognition model with a broader data basis for determining the emotion label of the user and thereby further improves the accuracy of user emotion recognition; at the same time, representing the pronunciation features of the text data by Embedding allows the format of the pronunciation features to meet the format requirements of the emotion recognition model on its input data.
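A small sketch of step S801 using the pypinyin library, an implementation choice of this illustration that the patent does not name; the embedding of step S802 would then map these strings to dense vectors.

```python
from pypinyin import pinyin, Style  # assumption: pypinyin is installed

def pronunciation_features(text: str):
    """Per-character pinyin with tone number, plus the initial and the final (step S801)."""
    toned = pinyin(text, style=Style.TONE3)                        # e.g. [['pa2'], ['shou3']]
    initials = pinyin(text, style=Style.INITIALS, strict=False)    # e.g. [['p'], ['sh']]
    finals = pinyin(text, style=Style.FINALS)                      # e.g. [['a'], ['ou']]
    return [(t[0], i[0], f[0]) for t, i, f in zip(toned, initials, finals)]

print(pronunciation_features("扒手"))  # output depends on pypinyin's pronunciation dictionary
```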
This embodiment also provides a system for recognizing user emotion; fig. 9 is a block diagram of the structure of a system for user emotion recognition according to an eighth embodiment of the present application, and as shown in fig. 9, the system includes:
a first extraction module 901, configured to obtain voice data and extract a voice feature according to the voice data, where the voice feature includes a mel-frequency cepstrum coefficient feature;
a second extraction module 902, configured to perform conversion processing on the voice data to obtain text data, and extract text features according to the text data, where the text features include word features and position features;
a determining module 903, configured to input the speech features and the text features into a user emotion recognition model and output the emotion label of the user, where in the user emotion recognition model: a convolutional neural network is used to represent the voice features to obtain a first feature, a long short-term memory network is used to represent the first feature to obtain a second feature, the convolutional neural network is used to represent the text features to obtain a third feature, the long short-term memory network is used to represent the third feature to obtain a fourth feature, and the second feature and the fourth feature are fully connected to determine the emotion label of the user.
In some embodiments, the speech features further include a pitch feature, an intonation feature, and a speech speed feature, and the first extraction module 901 is further configured to: determine the peak value, frequency and period of the waveform according to the waveform of the voice data; determine the pitch feature, intonation feature and speech speed feature of the voice data in a one-to-one correspondence according to the peak value, the frequency and the period; and represent the pitch, intonation, and speech rate features by Embedding.
In some embodiments, the text feature further comprises a pronunciation feature, and the second extraction module 902 is further configured to: determining the pronunciation of each word in the text data to obtain pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound; pronunciation features are denoted by Embedding.
In some embodiments, fig. 10 is a block diagram of a system for recognizing emotion of a user according to a ninth embodiment of the present application, and as shown in fig. 10, the system further includes:
an error correction module 1001, configured to correct errors in the text data before the text features are extracted, where the error correction process includes: inputting the text data into a mask language model, the mask language model determining and masking pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprising the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replacing pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set; and the mask language model determining the candidate word with the highest probability at the masked position and, in the case that the pronunciation error-prone word is inconsistent with the candidate word, replacing the pronunciation error-prone word with the candidate word.
In one embodiment, an electronic device is provided, which may be a server; fig. 11 is a schematic diagram of its internal structure according to an embodiment of the present application. As shown in fig. 11, the electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the electronic device is used for storing data. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program, when executed by the processor, implements a method for recognizing user emotion.
Those skilled in the art will appreciate that the structure shown in fig. 11 is a block diagram of only a portion of the structure associated with the present application and does not constitute a limitation on the electronic devices to which the present application may be applied; a particular electronic device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of emotion recognition for a user, the method comprising:
acquiring voice data, and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features;
inputting the voice features and the text features into a user emotion recognition model, and outputting an emotion label of the user, wherein in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
2. The method according to claim 1, wherein the speech features further include pitch features, intonation features and speech speed features, and the extracting process of the speech features includes:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
3. The method of claim 1, wherein the text features further comprise pronunciation features, and wherein the extracting of the text features comprises:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
4. The method of claim 3, wherein prior to said extracting the textual features, the method comprises: correcting errors in the text data, wherein the error correcting process comprises:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
5. A system for emotion recognition of a user, the system comprising:
the first extraction module is used for acquiring voice data and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
the second extraction module is used for converting the voice data to obtain text data and extracting text features according to the text data, wherein the text features comprise word features and position features;
a determining module, configured to input the speech feature and the text feature into a user emotion recognition model, and output an emotion tag of the user, where in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
6. The system of claim 5, wherein the speech features further include a pitch feature, an intonation feature, and a speech rate feature, and wherein the first extraction module is further configured to:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
7. The system of claim 5, wherein the text features further include pronunciation features, and wherein the second extraction module is further configured to:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
8. The system of claim 7, further comprising:
an error correction module, configured to correct an error of the text data before the extracting the text feature, where a process of the error correction includes:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements a method of user emotion recognition as claimed in any of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of emotion recognition of a user as claimed in any one of claims 1 to 4.
CN202110677222.7A 2021-06-18 2021-06-18 Method and system for identifying emotion of user Active CN113506586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677222.7A CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Publications (2)

Publication Number Publication Date
CN113506586A true CN113506586A (en) 2021-10-15
CN113506586B (en) 2023-06-20

Family

ID=78010436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677222.7A Active CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Country Status (1)

Country Link
CN (1) CN113506586B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114141239A (en) * 2021-11-29 2022-03-04 江南大学 Voice short instruction identification method and system based on lightweight deep learning
WO2024008215A3 (en) * 2022-07-08 2024-02-29 顺丰科技有限公司 Speech emotion recognition method and apparatus

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169008A1 (en) * 2015-12-15 2017-06-15 Le Holdings (Beijing) Co., Ltd. Method and electronic device for sentiment classification
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN110085262A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Voice mood exchange method, computer equipment and computer readable storage medium
CN110688499A (en) * 2019-08-13 2020-01-14 深圳壹账通智能科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110706690A (en) * 2019-09-16 2020-01-17 平安科技(深圳)有限公司 Speech recognition method and device
CN110765763A (en) * 2019-09-24 2020-02-07 金蝶软件(中国)有限公司 Error correction method and device for speech recognition text, computer equipment and storage medium
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment
CN112735479A (en) * 2021-03-31 2021-04-30 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113506586B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110211565B (en) Dialect identification method and device and computer readable storage medium
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN112017644B (en) Sound transformation system, method and application
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
US11881210B2 (en) Speech synthesis prosody using a BERT model
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
CN110570876B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN109714608B (en) Video data processing method, video data processing device, computer equipment and storage medium
CN113506586A (en) Method and system for recognizing emotion of user
CN114255740A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
CN113744722A (en) Off-line speech recognition matching device and method for limited sentence library
CN114550706A (en) Smart campus voice recognition method based on deep learning
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
Sharma et al. Speech and language recognition using MFCC and DELTA-MFCC
CN111370001A (en) Pronunciation correction method, intelligent terminal and storage medium
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
Kurian et al. Connected digit speech recognition system for Malayalam language
CN113539239A (en) Voice conversion method, device, storage medium and electronic equipment
CN111883106B (en) Audio processing method and device
CN113506561B (en) Text pinyin conversion method and device, storage medium and electronic equipment
Zhao et al. Multi-speaker Chinese news broadcasting system based on improved Tacotron2
CN113053409B (en) Audio evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant