CN113506586A - Method and system for recognizing emotion of user - Google Patents


Info

Publication number
CN113506586A
Authority
CN
China
Prior art keywords
features
feature
text
pronunciation
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110677222.7A
Other languages
Chinese (zh)
Other versions
CN113506586B (en)
Inventor
高鹏
郝少春
袁兰
吴飞
周伟华
高峰
潘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Mjoys Big Data Technology Co ltd
Original Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Mjoys Big Data Technology Co ltd
Priority to CN202110677222.7A
Publication of CN113506586A
Application granted
Publication of CN113506586B
Legal status: Active (current)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to a method and a system for recognizing user emotion. The method comprises: acquiring voice data and extracting voice features from the voice data; converting the voice data into text data and extracting text features from the text data; and inputting the voice features and the text features into a user emotion recognition model and outputting an emotion label of the user. In the user emotion recognition model, a convolutional neural network characterizes the voice features to obtain a first feature, a long short-term memory network characterizes the first feature to obtain a second feature, the convolutional neural network characterizes the text features to obtain a third feature, the long short-term memory network characterizes the third feature to obtain a fourth feature, and the second feature and the fourth feature are fully connected to determine the emotion label of the user.

Description

Method and system for recognizing emotion of user
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a system for recognizing user emotion.
Background
With the development of artificial intelligence technology, intelligent voice robots have matured as an industry, and more and more enterprises are beginning to pay attention to and use them.
Voice data is generated while an intelligent voice robot interacts with a user, and emotion information of the user during the communication can be obtained by performing emotion recognition on the voice data. In the related art, the emotion of the user is recognized either from the acquired voice data alone or from the text data converted from the voice data alone, and the resulting emotion recognition accuracy is not high.
No effective solution has yet been proposed for the problem of low user emotion recognition accuracy in the related art.
Disclosure of Invention
The embodiments of the present application provide a method and a system for recognizing user emotion, which help improve the accuracy of user emotion recognition.
In a first aspect, an embodiment of the present application provides a method for recognizing a user emotion, where the method includes:
acquiring voice data, and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features;
inputting the voice features and the text features into a user emotion recognition model, and outputting an emotion label of the user, wherein in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
In some embodiments, the speech features further include a pitch feature, an intonation feature, and a speech rate feature, and the extracting of the speech features includes:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
In some embodiments, the text feature further includes a pronunciation feature, and the extracting of the text feature includes:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
In some of these embodiments, before the extracting of the text features, the method includes: correcting errors in the text data, wherein the error correction process comprises:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
In a second aspect, an embodiment of the present application provides a system for recognizing a user emotion, where the system includes:
the first extraction module is used for acquiring voice data and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
the second extraction module is used for converting the voice data to obtain text data and extracting text features according to the text data, wherein the text features comprise word features and position features;
a determining module, configured to input the speech feature and the text feature into a user emotion recognition model, and output an emotion tag of the user, where in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
In some embodiments, the speech features further include a pitch feature, an intonation feature, and a speech rate feature, and the first extraction module is further configured to:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
In some embodiments, the text feature further comprises a pronunciation feature, and the second extraction module is further configured to:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
In some of these embodiments, the system further comprises:
an error correction module, configured to correct errors in the text data before the text features are extracted, where the error correction process includes:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
In a third aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for emotion recognition of a user when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for emotion recognition of a user.
Compared with the related art, the method for recognizing user emotion provided by the embodiments of the present application acquires voice data and extracts voice features from it, the voice features including Mel cepstrum coefficient features; converts the voice data into text data and extracts text features from it, the text features including word features and position features; and inputs the voice features and the text features into a user emotion recognition model, which outputs the emotion label of the user. In the user emotion recognition model, a convolutional neural network characterizes the voice features to obtain a first feature, a long short-term memory network characterizes the first feature to obtain a second feature, the convolutional neural network characterizes the text features to obtain a third feature, the long short-term memory network characterizes the third feature to obtain a fourth feature, and the second feature and the fourth feature are fully connected to determine the emotion label of the user. This solves the problem of low accuracy of user emotion recognition in the related art and helps improve the accuracy of user emotion recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram of an application environment of a method for emotion recognition of a user according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of user emotion recognition according to a first embodiment of the present application;
FIG. 3 is a schematic diagram of a user emotion recognition model according to a second embodiment of the present application;
FIG. 4 is a flow chart of a process of extracting features of Mel cepstral coefficients according to a third embodiment of the present application;
FIG. 5 is a flowchart of a process of extracting word and location features according to a fourth embodiment of the present application;
FIG. 6 is a flow chart of a process of extracting speech features according to a fifth embodiment of the present application;
fig. 7 is a flowchart of an error correction process of text data according to a sixth embodiment of the present application;
fig. 8 is a flowchart of a text feature extraction process according to a seventh embodiment of the present application;
fig. 9 is a block diagram of a system for emotion recognition of a user according to an eighth embodiment of the present application;
fig. 10 is a block diagram of a system for emotion recognition of a user according to a ninth embodiment of the present application;
fig. 11 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
The method for recognizing user emotion provided by the present application can be applied to the application environment shown in fig. 1. Fig. 1 is a schematic diagram of the application environment of the method for user emotion recognition according to an embodiment of the present application. As shown in fig. 1, a server 102 obtains voice data from a terminal 101 through a network and runs the method for recognizing user emotion, so as to obtain emotion information of the user contained in the voice data. The server 102 can be implemented as an independent server or as a server cluster composed of a plurality of servers, and the terminal 101 can be, but is not limited to, a personal computer, a notebook computer, a smart phone, a tablet computer, or a portable wearable device.
This embodiment provides a method for recognizing user emotion. Fig. 2 is a flowchart of the method for user emotion recognition according to the first embodiment of the present application; as shown in fig. 2, the flow includes the following steps:
step S201, acquiring voice data, and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
step S202, converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features include word features and position features; for example, Automatic Speech Recognition (ASR) technology can be used to convert the voice data into the text data;
step S203, inputting the voice characteristics and the text characteristics into the emotion recognition model of the user, and outputting the emotion label of the user.
Through steps S201 to S203, in contrast to the related art where the accuracy of user emotion recognition is not high, the present embodiment extracts voice features from the voice data, extracts text features from the text data obtained by converting the voice, inputs the voice features and the text features into the user emotion recognition model, and outputs the emotion label of the user, thereby completing multi-modal recognition of the user emotion and helping to improve the recognition accuracy.
In addition, the intelligent voice robot can adjust its responses according to the recognized emotion of the user, so that the whole conversation between the robot and the user is closer to a conversation between people, which greatly improves the user experience.
In some embodiments, fig. 3 is a schematic diagram of a user emotion recognition model according to a second embodiment of the present application. As shown in fig. 3, the voice features and the text features each form a feature matrix. In the user emotion recognition model, a Convolutional Neural Network (CNN) is used to characterize the voice features: the voice features are processed by multi-kernel convolution and pooling layers to obtain a first feature, which is a sequence feature, and a Long Short-Term Memory network (LSTM) is used to characterize the first feature to obtain a second feature. Likewise, a convolutional neural network is used to characterize the text features: the text features are processed by multi-kernel convolution and pooling layers to obtain a third feature, also a sequence feature, and an LSTM is used to characterize the third feature to obtain a fourth feature. The second feature and the fourth feature are then fully connected in a fully connected layer (FC), and the emotion Label of the user is output at the output layer.
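For illustration, the following is a minimal sketch of such a dual-branch CNN plus LSTM fusion model in PyTorch. The layer sizes, kernel width, pooling setup and the number of emotion labels are assumptions of this sketch, not values taken from the patent, which describes multi-kernel convolution and pooling in each branch.

```python
import torch
import torch.nn as nn


class EmotionRecognizer(nn.Module):
    """Dual-branch CNN + LSTM model: one branch characterizes the acoustic features,
    one characterizes the text features, and a fully connected layer fuses them."""

    def __init__(self, speech_dim=39, text_dim=512, hidden=128, num_labels=4):
        super().__init__()
        # Acoustic branch (a single kernel size is used here for brevity;
        # the patent describes multi-kernel convolution and pooling).
        self.speech_cnn = nn.Conv1d(speech_dim, hidden, kernel_size=3, padding=1)
        self.speech_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Text branch with the same structure over the token axis.
        self.text_cnn = nn.Conv1d(text_dim, hidden, kernel_size=3, padding=1)
        self.text_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Fusion: concatenate the two branch outputs and classify.
        self.fc = nn.Linear(2 * hidden, num_labels)

    def _branch(self, x, cnn, lstm):
        # x: (batch, seq_len, feat_dim); Conv1d expects (batch, feat_dim, seq_len).
        h = torch.relu(cnn(x.transpose(1, 2)))
        h = nn.functional.max_pool1d(h, kernel_size=2).transpose(1, 2)
        _, (last_hidden, _) = lstm(h)   # keep the final LSTM hidden state
        return last_hidden[-1]          # (batch, hidden)

    def forward(self, speech_feats, text_feats):
        second = self._branch(speech_feats, self.speech_cnn, self.speech_lstm)
        fourth = self._branch(text_feats, self.text_cnn, self.text_lstm)
        return self.fc(torch.cat([second, fourth], dim=-1))  # emotion logits


# Example: 2 utterances, 100 frames of 39-dim acoustic features, 20 tokens of 512-dim embeddings.
model = EmotionRecognizer()
logits = model(torch.randn(2, 100, 39), torch.randn(2, 20, 512))
print(logits.shape)  # torch.Size([2, 4])
```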
In some embodiments, fig. 4 is a flowchart of a process of extracting mel-frequency cepstrum coefficient features according to a third embodiment of the present application, and as shown in fig. 4, the process includes the following steps:
step S401, preprocessing the voice data, including: removing segments of the voice data that are completely silent for non-human reasons, for example segments with no sound at all caused by packet loss;
step S402, performing pre-emphasis processing on the pre-processed voice data, where the pre-emphasis processing is to pass a voice signal in the voice data through a high-pass filter:
H(z) = 1 - μz^(-1)    (Equation 1)
wherein the value of μ in Equation 1 is between 0.9 and 1.0, usually 0.97; pre-emphasis boosts the high-frequency part and flattens the spectrum of the signal, so that the spectrum can be computed with the same signal-to-noise ratio over the whole frequency band from low to high frequencies, and it compensates the high-frequency part of the voice signal that is suppressed by the vocal system, so as to highlight the high-frequency formants;
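As a small illustration, Equation 1 corresponds to the difference equation y[n] = x[n] - μ·x[n-1], which can be written in NumPy as follows (a sketch; the coefficient 0.97 is the typical value mentioned above):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """High-pass filter H(z) = 1 - mu * z^-1, i.e. y[n] = x[n] - mu * x[n-1]."""
    return np.append(signal[0], signal[1:] - mu * signal[:-1])
```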
step S403, performing framing on the pre-emphasized voice data; because the statistical characteristics of voice data are not constant over time, the continuous voice signal is divided into short segments, and the signal is assumed to be stationary within each segment; in general, N sampling points are grouped into an observation unit called a frame, where N is usually 512 and the frame length is 25 ms; to ensure a smooth transition between frames, adjacent frames usually overlap, and the time difference between the start positions of two adjacent frames is called the frame shift, which is generally 10 ms;
step S404, performing windowing on the framed voice data; the subsequent step requires a Fast Fourier Transform (FFT), and the FFT requires the signal either to extend from negative infinity to positive infinity or to be periodic, whereas each framed voice segment is aperiodic, so windowing is needed to make the signal behave periodically; a Hamming window is commonly used in the industry so that the signal shrinks toward zero at the frame boundaries; for 0 ≤ n ≤ N-1, the window function is as follows:
w(n) = 0.54 - 0.46·cos(2πn/(N-1))    (Equation 2)
In other cases:
w(n) = 0    (Equation 3)
Each frame is multiplied by the Hamming window, thereby increasing the continuity between the left and right ends of the frame;
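A sketch of the framing and Hamming-windowing steps (S403 and S404) in NumPy; the frame length and frame shift of 400 and 160 samples correspond to 25 ms and 10 ms at an assumed 16 kHz sampling rate, which is not specified in the patent:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    """Split a signal (at least one frame long) into overlapping frames
    and apply the Hamming window of Equation 2."""
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(num_frames)])
    return frames * np.hamming(frame_len)   # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
```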
step S405, performing fast Fourier transform processing on the windowed voice data, wherein the transform formula of the fast Fourier transform is as follows:
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N), 0 ≤ k ≤ N-1    (Equation 4)
because the characteristics of a signal are usually difficult to observe from its time-domain form, the signal is usually transformed into an energy distribution in the frequency domain for observation, and different energy distributions represent the characteristics of different voice signals; therefore, after multiplication by the Hamming window, each frame also undergoes a fast Fourier transform to obtain the spectrum of each frame, and the per-frame spectra are stacked over time to obtain the spectrogram;
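Continuing the sketch, the per-frame power spectrum of Equation 4 can be computed with NumPy's real FFT (the 512-point size matches the N of step S403):

```python
import numpy as np

def power_spectrum(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Magnitude-squared spectrum of each windowed frame (the per-frame periodogram)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)   # shape: (num_frames, n_fft // 2 + 1)
    return (np.abs(spectrum) ** 2) / n_fft
```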
step S406, applying a Mel filter bank to the spectrogram and taking the logarithm to obtain a Mel spectrogram, that is, converting the frequency axis of the spectrogram to the Mel scale, which is approximately linear below 1000 Hz and grows logarithmically above 1000 Hz; the conversion formula is as follows:
m = 2595·log10(1 + f/700)    (Equation 5)
because the pitch level perceived by the human ear is not linearly related to the actual frequency in Hz, and the Mel frequency is more consistent with the auditory characteristics of the human ear, the frequency axis of the spectrogram needs to be converted to the Mel scale;
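Equation 5 and its inverse in code form; the inverse is not stated in the patent but follows directly from the formula and is what one would use to place the Mel filter bank:

```python
import numpy as np

def hz_to_mel(f):
    """Equation 5: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of Equation 5."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```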
step S407, performing cepstrum analysis on the Mel spectrogram; owing to the characteristics of human speech production, the most useful information lies in the filter, namely the vocal tract, so the sound source and the filter need to be separated; cepstrum analysis of the Mel spectrogram decomposes the voice signal, and the 2nd to 14th coefficients, which represent the filter information, are taken; the 12 coefficients thus obtained are the Mel Frequency Cepstrum Coefficient (MFCC) features;
step S408, determining the first-order and second-order differences of the MFCC features to obtain the trajectory of the Mel cepstrum coefficients over time; the energy of a frame is the sum of the sample powers of that frame within a certain time period; because the voice signal is not constant from frame to frame, features related to temporal change can be added: velocity features (Delta features) and acceleration features (double-Delta features) are appended to the Mel cepstrum coefficient features and the energy feature to obtain the trajectory of the Mel cepstrum coefficients over time.
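In practice, steps S402 to S408 can be approximated with an off-the-shelf library. The sketch below uses librosa, which is an implementation choice of this illustration and is not named in the patent; 13 static coefficients plus deltas and double deltas give a common 39-dimensional acoustic feature per frame.

```python
import numpy as np
import librosa  # assumption: librosa is available; the patent does not name a library

def mfcc_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """MFCCs with first- and second-order differences (roughly steps S402-S408)."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])               # pre-emphasis, Equation 1
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=160)   # N = 512, 10 ms frame shift
    delta = librosa.feature.delta(mfcc)                      # velocity (Delta) features
    delta2 = librosa.feature.delta(mfcc, order=2)            # acceleration (double-Delta) features
    return np.vstack([mfcc, delta, delta2]).T                # shape: (num_frames, 39)
```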
In some embodiments, fig. 5 is a flowchart of a process of extracting word features and position features according to a fourth embodiment of the present application, and as shown in fig. 5, the process includes the following steps:
step S501, performing word segmentation on the text data and part-of-speech recognition to obtain word features and part-of-speech features; for example, segmentation can be performed with mixed word-and-character granularity: a language model is used to decide from the context whether the words in the vocabulary need to be further segmented, character granularity is kept for words not in the vocabulary, and the words are tagged with the part-of-speech and entity tagging information in the vocabulary; each word and its corresponding part of speech are used as the word feature and the part-of-speech feature respectively, so that the text data can be split into features from different angles;
step S502, determining the position information of the text data to obtain the position features; for example, the character and word features are embedded by means of one-hot coding with a dimensionality of 512, absolute position coding is used at the same time to add the position information, and the sum of all the embedded data is taken as the data to be input into the emotion recognition model and fed into the model.
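A compact sketch of this embedding step in PyTorch; the vocabulary size, maximum sequence length, and the use of a learned absolute position embedding are assumptions consistent with the description above rather than values given in the patent.

```python
import torch
import torch.nn as nn

class TextFeatureEmbedding(nn.Module):
    """Sum of 512-dimensional token embeddings and absolute position embeddings (step S502)."""

    def __init__(self, vocab_size=20000, max_len=128, dim=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)   # word / character features
        self.pos_emb = nn.Embedding(max_len, dim)        # absolute position features

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token_emb(token_ids) + self.pos_emb(positions)  # (batch, seq_len, 512)

# Example: one segmented sentence of 6 token ids.
embedded = TextFeatureEmbedding()(torch.tensor([[12, 845, 3, 77, 9, 2]]))
print(embedded.shape)  # torch.Size([1, 6, 512])
```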
Considering that the pitch, intonation and speech rate of the user can also reflect the emotion of the user, in some embodiments the speech features further include a pitch feature, an intonation feature, a speech rate feature and a pause feature; fig. 6 is a flowchart of a speech feature extraction process according to a fifth embodiment of the present application, and as shown in fig. 6, the process includes the following steps:
step S601, determining peak value, frequency, period and pause information of the waveform according to the waveform of the voice data;
step S602, determining the pitch feature, the intonation feature, the speech rate feature and the pause feature of the voice data in a one-to-one correspondence according to the peak value, the frequency, the period and the pause information;
step S603, representing pitch feature, intonation feature, speech rate feature and pause feature by using Embedding.
Through steps S601 to S603: in the related art, when the emotion of the user is recognized from the voice perspective, information related to the user's emotion such as pitch, intonation, speech rate and pauses is not considered, so the emotion recognition result is inaccurate; in this embodiment, the pitch feature, intonation feature, speech rate feature and pause feature of the voice data are determined according to the waveform of the voice data, which provides the subsequent emotion recognition model with a broader data basis for determining the emotion label of the user and thereby further improves the accuracy of user emotion recognition; at the same time, representing the pitch feature, intonation feature, speech rate feature and pause feature by Embedding allows the format of the voice features to meet the format requirements of the emotion recognition model on its input data.
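A rough NumPy sketch of steps S601 and S602: deriving peak, dominant-frequency, period and pause statistics from the waveform. How these scalars are subsequently bucketed and embedded (step S603) is not detailed in the patent, so the silence threshold and the choice of statistics below are purely illustrative assumptions.

```python
import numpy as np

def prosody_features(signal: np.ndarray, sr: int = 16000, silence_thresh: float = 0.01) -> np.ndarray:
    """Peak, dominant frequency, period and pause ratio of a waveform."""
    peak = float(np.abs(signal).max())                             # pitch-related statistic
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    dominant = float(freqs[int(np.argmax(spectrum[1:])) + 1])      # intonation-related statistic
    period = 1.0 / dominant if dominant > 0 else 0.0               # speech-rate-related statistic
    pause_ratio = float(np.mean(np.abs(signal) < silence_thresh))  # pause statistic
    return np.array([peak, dominant, period, pause_ratio], dtype=np.float32)
```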
Considering that correctly understanding the user's semantics is a prerequisite for accurately recognizing the user's emotion, and that homophones, near-homophones or non-standard pronunciation in the text can all affect the correct understanding of the user's semantics and lead to inaccurate emotion recognition results, in some embodiments the method for recognizing user emotion includes correcting errors in the text data; fig. 7 is a flowchart of an error correction process for text data according to a sixth embodiment of the present application, and as shown in fig. 7, the process includes the following steps:
step S701, inputting the text data into a Mask Language Model (MLM); the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, where the error confusion set contains the association between pronunciation error-prone words and their pronunciation-similar words; during incremental training in the model training stage, the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion to produce erroneous data, and the erroneous data together with the correct data form a set of training data, so that the mask language model is trained in a targeted manner to predict a masked word as a pronunciation-similar word of the masked word; pronunciation-similar words include words with similar sounds or words that are wrong because of different pronunciation habits, for example "Fulan" may actually be "Hunan" spoken with a regional accent, so "Fulan" is a pronunciation error-prone word and "Hunan" is its pronunciation-similar word; the error confusion set can be generated with a language model by combining initials, finals and tones, and the target proportion can be 1.5%;
step S702, the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
Through steps S701 to S702: in the related art, the mask language model randomly selects content to replace during training, so if it is applied directly to error correction of text data the error correction effect is poor; this embodiment considers that the high-frequency error-prone content in the text data consists of pronunciation error-prone words, and during incremental training in the model training stage, pronunciation-similar words in the training data are replaced with pronunciation error-prone words at a target proportion, so that the mask language model can be trained in a targeted manner to predict a masked word as a pronunciation-similar word of that word; as a result, the mask language model can more accurately find the high-frequency error-prone content in the text data during use, realizing error correction of the text data and providing a guarantee for correctly understanding the semantics of the user, thereby improving the accuracy of the recognized user emotion.
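A toy sketch of the inference side of this error correction (steps S701 and S702). It uses the Hugging Face fill-mask pipeline, with bert-base-chinese standing in for the patent's incrementally trained mask language model and a hard-coded one-entry confusion set; both are illustrative assumptions, not the patent's actual model or data.

```python
from transformers import pipeline  # assumption: Hugging Face transformers is installed

# Toy confusion set of pronunciation error-prone characters to check.
# The patent builds the real set from initials, finals and tones.
CONFUSION_SET = {"再"}

fill_mask = pipeline("fill-mask", model="bert-base-chinese")

def correct(text: str) -> str:
    """Mask each error-prone character and accept the model's top candidate if it differs."""
    for wrong in CONFUSION_SET:
        while wrong in text:
            masked = text.replace(wrong, fill_mask.tokenizer.mask_token, 1)
            best = fill_mask(masked)[0]          # candidate with the highest probability
            if best["token_str"] != wrong:
                text = text.replace(wrong, best["token_str"], 1)
            else:
                break
    return text

print(correct("我明天再家里等你"))  # likely "我明天在家里等你" (exact output depends on the model)
```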
Considering that the pinyin of each character in the text data may also reflect the semantics of the user (for example, the character 扒 read as "ba" in the first tone means to pull open, while read as "pa" in the second tone, as in 扒手, pickpocket, it means to steal another person's property), the pinyin of each character in the text data may also be used as a text feature; fig. 8 is a flowchart of a text feature extraction process according to a seventh embodiment of the present application, and as shown in fig. 8, the process includes the following steps:
step S801, determining the pronunciation of each character in the text data to obtain the pronunciation features, where the pronunciation includes the pinyin; further, considering that the tone, the initial or the final can also reflect the emotion of the user, the pronunciation may also include the tone, the initial consonant or the final;
in step S802, pronunciation characteristics are represented by Embedding.
Through steps S801 to S802: in the related art, when the emotion of the user is recognized from the text perspective, pronunciation information such as pinyin, tone, initials and finals that reflects the semantics or emotion of the user is not considered, so the emotion recognition result is inaccurate; this embodiment obtains pronunciation features by determining the pinyin, tone, initial or final of each character in the text data, which provides the subsequent emotion recognition model with a broader data basis for determining the emotion label of the user and thereby further improves the accuracy of user emotion recognition; at the same time, representing the pronunciation features of the text data by Embedding allows the format of the pronunciation features to meet the format requirements of the emotion recognition model on its input data.
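A small sketch of step S801 using the pypinyin library, an implementation choice of this illustration that the patent does not name; the embedding of step S802 would then map these strings to dense vectors.

```python
from pypinyin import pinyin, Style  # assumption: pypinyin is installed

def pronunciation_features(text: str):
    """Per-character pinyin with tone number, plus the initial and the final (step S801)."""
    toned = pinyin(text, style=Style.TONE3)                        # e.g. [['pa2'], ['shou3']]
    initials = pinyin(text, style=Style.INITIALS, strict=False)    # e.g. [['p'], ['sh']]
    finals = pinyin(text, style=Style.FINALS)                      # e.g. [['a'], ['ou']]
    return [(t[0], i[0], f[0]) for t, i, f in zip(toned, initials, finals)]

print(pronunciation_features("扒手"))  # output depends on pypinyin's pronunciation dictionary
```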
This embodiment also provides a system for recognizing user emotion; fig. 9 is a block diagram of the structure of a system for user emotion recognition according to an eighth embodiment of the present application, and as shown in fig. 9, the system includes:
a first extraction module 901, configured to obtain voice data and extract a voice feature according to the voice data, where the voice feature includes a mel-frequency cepstrum coefficient feature;
a second extraction module 902, configured to perform conversion processing on the voice data to obtain text data, and extract text features according to the text data, where the text features include word features and position features;
a determining module 903, configured to input the speech features and the text features into a user emotion recognition model and output the emotion label of the user, where in the user emotion recognition model: a convolutional neural network is used to represent the voice features to obtain a first feature, a long short-term memory network is used to represent the first feature to obtain a second feature, the convolutional neural network is used to represent the text features to obtain a third feature, the long short-term memory network is used to represent the third feature to obtain a fourth feature, and the second feature and the fourth feature are fully connected to determine the emotion label of the user.
In some embodiments, the speech features further include a pitch feature, an intonation feature, and a speech speed feature, and the first extraction module 901 is further configured to: determine the peak value, frequency and period of the waveform according to the waveform of the voice data; determine the pitch feature, intonation feature and speech speed feature of the voice data in a one-to-one correspondence according to the peak value, the frequency and the period; and represent the pitch, intonation, and speech rate features by Embedding.
In some embodiments, the text feature further comprises a pronunciation feature, and the second extraction module 902 is further configured to: determining the pronunciation of each word in the text data to obtain pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound; pronunciation features are denoted by Embedding.
In some embodiments, fig. 10 is a block diagram of a system for recognizing emotion of a user according to a ninth embodiment of the present application, and as shown in fig. 10, the system further includes:
an error correction module 1001, configured to correct errors in the text data before the text features are extracted, where the error correction process includes: inputting the text data into a mask language model, the mask language model determining and masking pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprising the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replacing pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set; and the mask language model determining the candidate word with the highest probability at the masked position and, in the case that the pronunciation error-prone word is inconsistent with the candidate word, replacing the pronunciation error-prone word with the candidate word.
In one embodiment, an electronic device is provided, which may be a server; fig. 11 is a schematic diagram of its internal structure according to an embodiment of the present application. As shown in fig. 11, the electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the electronic device is used for storing data. The network interface of the electronic device is used for connecting and communicating with an external terminal through a network. The computer program, when executed by the processor, implements a method for recognizing user emotion.
Those skilled in the art will appreciate that the structure shown in fig. 11 is a block diagram of only a portion of the structure associated with the present application and does not constitute a limitation on the electronic devices to which the present application may be applied; a particular electronic device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method of emotion recognition for a user, the method comprising:
acquiring voice data, and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features;
inputting the voice features and the text features into a user emotion recognition model, and outputting an emotion label of the user, wherein in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
2. The method according to claim 1, wherein the speech features further include pitch features, intonation features and speech speed features, and the extracting process of the speech features includes:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
3. The method of claim 1, wherein the text features further comprise pronunciation features, and wherein the extracting of the text features comprises:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
4. The method of claim 3, wherein prior to said extracting the textual features, the method comprises: correcting errors in the text data, wherein the error correcting process comprises:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
5. A system for emotion recognition of a user, the system comprising:
the first extraction module is used for acquiring voice data and extracting voice features according to the voice data, wherein the voice features comprise Mel cepstrum coefficient features;
the second extraction module is used for converting the voice data to obtain text data and extracting text features according to the text data, wherein the text features comprise word features and position features;
a determining module, configured to input the speech feature and the text feature into a user emotion recognition model, and output an emotion tag of the user, where in the user emotion recognition model:
using a convolutional neural network to represent the voice characteristics to obtain a first characteristic, using a long short-term memory network to represent the first characteristic to obtain a second characteristic,
using a convolutional neural network to represent the text characteristics to obtain a third characteristic, using a long short-term memory network to represent the third characteristic to obtain a fourth characteristic,
and fully connecting the second characteristic and the fourth characteristic to determine the emotion label of the user.
6. The system of claim 5, wherein the speech features further include a pitch feature, an intonation feature, and a speech rate feature, and wherein the first extraction module is further configured to:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the voice data in a one-to-one correspondence manner according to the peak value, the frequency and the period;
and representing the pitch feature, the intonation feature and the speech speed feature by Embedding.
7. The system of claim 5, wherein the text features further include pronunciation features, and wherein the second extraction module is further configured to:
determining the pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initial consonant or final sound;
the pronunciation characteristics are represented by Embedding.
8. The system of claim 7, further comprising:
an error correction module, configured to correct an error of the text data before the extracting the text feature, where a process of the error correction includes:
inputting the text data into a mask language model, wherein the mask language model determines and masks pronunciation error-prone words in the text data according to an error confusion set, the error confusion set comprises the association relationship between pronunciation error-prone words and their pronunciation-similar words, and during incremental training in the model training stage the mask language model replaces pronunciation-similar words in the training data with the corresponding pronunciation error-prone words at a target proportion according to the error confusion set;
the mask language model determines the candidate word with the highest probability at the masked position, and in the case that the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements a method of user emotion recognition as claimed in any of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of emotion recognition of a user as claimed in any one of claims 1 to 4.
CN202110677222.7A 2021-06-18 2021-06-18 Method and system for identifying emotion of user Active CN113506586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677222.7A CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Publications (2)

Publication Number Publication Date
CN113506586A true CN113506586A (en) 2021-10-15
CN113506586B (en) 2023-06-20

Family

ID=78010436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677222.7A Active CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Country Status (1)

Country Link
CN (1) CN113506586B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114141239A (en) * 2021-11-29 2022-03-04 江南大学 Voice short instruction identification method and system based on lightweight deep learning
WO2024008215A3 (en) * 2022-07-08 2024-02-29 顺丰科技有限公司 Speech emotion recognition method and apparatus

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169008A1 (en) * 2015-12-15 2017-06-15 Le Holdings (Beijing) Co., Ltd. Method and electronic device for sentiment classification
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN110085262A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Voice mood exchange method, computer equipment and computer readable storage medium
CN110688499A (en) * 2019-08-13 2020-01-14 深圳壹账通智能科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110706690A (en) * 2019-09-16 2020-01-17 平安科技(深圳)有限公司 Speech recognition method and device
CN110765763A (en) * 2019-09-24 2020-02-07 金蝶软件(中国)有限公司 Error correction method and device for speech recognition text, computer equipment and storage medium
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment
CN112735479A (en) * 2021-03-31 2021-04-30 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113506586B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110211565B (en) Dialect identification method and device and computer readable storage medium
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN112017644B (en) Sound transformation system, method and application
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
US11881210B2 (en) Speech synthesis prosody using a BERT model
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
CN110570876B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN109714608B (en) Video data processing method, video data processing device, computer equipment and storage medium
CN113506586A (en) Method and system for recognizing emotion of user
CN114255740A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
CN113744722A (en) Off-line speech recognition matching device and method for limited sentence library
CN114550706A (en) Smart campus voice recognition method based on deep learning
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
Sharma et al. Speech and language recognition using MFCC and DELTA-MFCC
CN111370001A (en) Pronunciation correction method, intelligent terminal and storage medium
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
Kurian et al. Connected digit speech recognition system for Malayalam language
CN113539239A (en) Voice conversion method, device, storage medium and electronic equipment
CN111883106B (en) Audio processing method and device
CN113506561B (en) Text pinyin conversion method and device, storage medium and electronic equipment
Zhao et al. Multi-speaker Chinese news broadcasting system based on improved Tacotron2
CN113053409B (en) Audio evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant