CN113506586B - Method and system for identifying emotion of user - Google Patents

Method and system for identifying emotion of user

Info

Publication number
CN113506586B
Authority
CN
China
Prior art keywords
features
feature
pronunciation
text
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110677222.7A
Other languages
Chinese (zh)
Other versions
CN113506586A (en)
Inventor
高鹏
郝少春
袁兰
吴飞
周伟华
高峰
潘晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Mjoys Big Data Technology Co ltd
Original Assignee
Hangzhou Mjoys Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Mjoys Big Data Technology Co ltd filed Critical Hangzhou Mjoys Big Data Technology Co ltd
Priority to CN202110677222.7A priority Critical patent/CN113506586B/en
Publication of CN113506586A publication Critical patent/CN113506586A/en
Application granted granted Critical
Publication of CN113506586B publication Critical patent/CN113506586B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a method and a system for identifying user emotion. The method for identifying user emotion comprises the following steps: acquiring voice data and extracting voice features from the voice data; converting the voice data to obtain text data and extracting text features from the text data; and inputting the voice features and the text features into a user emotion recognition model, which outputs an emotion label of the user. In the user emotion recognition model, a convolutional neural network characterizes the voice features to obtain first features, a long short-term memory network characterizes the first features to obtain second features, the convolutional neural network characterizes the text features to obtain third features, the long short-term memory network characterizes the third features to obtain fourth features, and the second features and the fourth features are fully connected to determine the emotion label of the user.

Description

Method and system for identifying emotion of user
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a system for identifying emotion of a user.
Background
With the development of artificial intelligence technology, intelligent voice robots have gradually matured in industry, and an increasing number of enterprises have begun to pay attention to them and use them.
Voice data are generated during interaction between an intelligent voice robot and a user, and emotion information of the user during the communication can be obtained by performing emotion recognition on the voice data. In the related art, the emotion of the user is recognized either from the perspective of the acquired voice data alone or from the perspective of the text data converted from the voice data alone, and the accuracy of user emotion recognition is therefore not high.
No effective solution has yet been proposed for the problem of low accuracy of user emotion recognition in the related art.
Disclosure of Invention
The embodiment of the application provides a method and a system for identifying the emotion of a user, which are beneficial to improving the accuracy of identifying the emotion of the user.
In a first aspect, an embodiment of the present application provides a method for identifying emotion of a user, where the method includes:
acquiring voice data, and extracting voice characteristics according to the voice data, wherein the voice characteristics comprise mel-frequency cepstrum coefficient characteristics;
converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features;
inputting the voice features and the text features into a user emotion recognition model, and outputting emotion tags of the user, wherein in the user emotion recognition model:
characterizing the speech features using a convolutional neural network, obtaining a first feature, characterizing the first feature using a long-short-term memory network, obtaining a second feature,
characterizing the text features using a convolutional neural network, obtaining a third feature, characterizing the third feature using a long-short-term memory network, obtaining a fourth feature,
and fully connecting the second feature and the fourth feature, and determining the emotion label of the user.
In some embodiments, the voice features further include a pitch feature, an intonation feature, and a speech rate feature, and the extraction process of the voice features includes:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the speech data according to the peak value, the frequency and the period in a one-to-one correspondence;
and representing the pitch characteristic, the intonation characteristic and the speech speed characteristic by using Embedding.
In some embodiments, the text feature further includes a pronunciation feature, and the extracting process of the text feature includes:
determining pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initials or finals;
the pronunciation characteristics are represented by Embedding.
In some of these embodiments, prior to extracting the text features, the method includes: correcting the text data, wherein the error correction process comprises the following steps:
inputting the text data into a mask language model, determining and hiding the pronunciation error-prone words in the text data according to an error confusion set, wherein the error confusion set contains the association relation between the pronunciation error-prone words and the pronunciation similar words, and replacing the pronunciation similar words in the training data with the pronunciation error-prone words according to a target proportion in the process of performing incremental training in a model training stage by the mask language model according to the error confusion set;
the mask language model determines a candidate word having the highest probability at the hidden location, and replaces the pronunciation error-prone word with the candidate word if the pronunciation error-prone word is inconsistent with the candidate word.
In a second aspect, embodiments of the present application provide a system for emotion recognition of a user, the system comprising:
the first extraction module is used for acquiring voice data and extracting voice characteristics according to the voice data, wherein the voice characteristics comprise mel cepstrum coefficient characteristics;
the second extraction module is used for converting the voice data to obtain text data and extracting text features according to the text data, wherein the text features comprise word features and position features;
the determining module is used for inputting the voice features and the text features into a user emotion recognition model and outputting emotion tags of the user, wherein in the user emotion recognition model:
characterizing the speech features using a convolutional neural network, obtaining a first feature, characterizing the first feature using a long-short-term memory network, obtaining a second feature,
characterizing the text features using a convolutional neural network, obtaining a third feature, characterizing the third feature using a long-short-term memory network, obtaining a fourth feature,
and fully connecting the second feature and the fourth feature, and determining the emotion label of the user.
In some of these embodiments, the speech features further include pitch features, intonation features, and pace features, and the first extraction module is further configured to:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the speech data according to the peak value, the frequency and the period in a one-to-one correspondence;
and representing the pitch characteristic, the intonation characteristic and the speech speed characteristic by using Embedding.
In some of these embodiments, the text feature further comprises a pronunciation feature, and the second extraction module is further configured to:
determining pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initials or finals;
the pronunciation characteristics are represented by Embedding.
In some of these embodiments, the system further comprises:
and the error correction module is used for correcting errors of the text data before the text features are extracted, wherein the error correction process comprises the following steps:
inputting the text data into a mask language model, determining and hiding the pronunciation error-prone words in the text data according to an error confusion set, wherein the error confusion set contains the association relation between the pronunciation error-prone words and the pronunciation similar words, and replacing the pronunciation similar words in the training data with the pronunciation error-prone words according to a target proportion in the process of performing incremental training in a model training stage by the mask language model according to the error confusion set;
the mask language model determines a candidate word having the highest probability at the hidden location, and replaces the pronunciation error-prone word with the candidate word if the pronunciation error-prone word is inconsistent with the candidate word.
In a third aspect, embodiments of the present application provide a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing a method for identifying emotion of the user when the computer program is executed.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for emotion recognition of a user.
Compared with the related art, the method for recognizing user emotion provided by the embodiments of the present application acquires voice data and extracts voice features from the voice data, wherein the voice features include mel-frequency cepstrum coefficient features; converts the voice data into text data and extracts text features from the text data, wherein the text features include word features and position features; and inputs the voice features and the text features into a user emotion recognition model, which outputs an emotion label of the user. In the user emotion recognition model, the voice features are characterized by a convolutional neural network to obtain first features, the first features are characterized by a long short-term memory network to obtain second features, the text features are characterized by the convolutional neural network to obtain third features, the third features are characterized by the long short-term memory network to obtain fourth features, and the second features and the fourth features are fully connected to determine the emotion label of the user. This solves the problem of low accuracy of user emotion recognition in the related art and improves the accuracy of user emotion recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a schematic view of an application environment of a method of user emotion recognition according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of user emotion recognition according to a first embodiment of the present application;
FIG. 3 is a schematic diagram of a user emotion recognition model according to a second embodiment of the present application;
fig. 4 is a flowchart of a process of extracting mel-frequency spectral features according to a third embodiment of the present application;
FIG. 5 is a flow chart of a process of extracting word features and location features according to a fourth embodiment of the present application;
fig. 6 is a flowchart of a process of extracting speech features according to a fifth embodiment of the present application;
fig. 7 is a flowchart of an error correction process of text data according to a sixth embodiment of the present application;
fig. 8 is a flowchart of a text feature extraction process according to a seventh embodiment of the present application;
FIG. 9 is a block diagram of a system for user emotion recognition according to an eighth embodiment of the present application;
FIG. 10 is a block diagram of a system for user emotion recognition according to a ninth embodiment of the present application;
fig. 11 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described and illustrated below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden on the person of ordinary skill in the art based on the embodiments provided herein, are intended to be within the scope of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and it is possible for those of ordinary skill in the art to apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that while such a development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as indicating that this disclosure is insufficient.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The method for identifying user emotion provided by the application can be applied to the application environment shown in fig. 1. Fig. 1 is a schematic diagram of an application environment of the method of user emotion recognition according to an embodiment of the present application. As shown in fig. 1, a server 102 obtains voice data from a terminal 101 through a network and runs the method of user emotion recognition to obtain emotion information of the user contained in the voice data. The server 102 can be implemented by an independent server or a server cluster formed by a plurality of servers, and the terminal 101 can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
The present embodiment provides a method for identifying user emotion, fig. 2 is a flowchart of a method for identifying user emotion according to a first embodiment of the present application, as shown in fig. 2, the flowchart includes the following steps:
step S201, obtaining voice data, and extracting voice characteristics according to the voice data, wherein the voice characteristics comprise mel cepstrum coefficient characteristics;
step S202, converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features, for example, an automatic voice recognition technology (Automatic Speech Recognition, ASR for short) can be used for converting the voice data into the text data;
step S203, inputting the voice features and the text features into the user emotion recognition model, and outputting the emotion labels of the user.
Through steps S201 to S203, in this embodiment, voice features are extracted from the voice data and text features are extracted from the text data obtained by converting the voice data, and both are input into the user emotion recognition model, which outputs the emotion label of the user. This completes multi-modal recognition of the user emotion and addresses the problem of low accuracy of user emotion recognition in the related art.
In addition, the intelligent voice robot can adjust its answers according to the recognized emotion of the user, so that the whole conversation between the intelligent voice robot and the user becomes closer to a conversation between two people, which greatly improves the user experience.
In some embodiments, fig. 3 is a schematic diagram of a user emotion recognition model according to a second embodiment of the present application. As shown in fig. 3, the speech features and the text features each form a feature matrix. In the user emotion recognition model, a convolutional neural network (Convolutional Neural Networks, abbreviated as CNN) characterizes the speech features: the speech features are processed by multi-kernel convolution and pooling layers to obtain first features (the first features are sequence features), and a Long Short-Term Memory network (LSTM) then characterizes the first features to obtain second features. Likewise, a convolutional neural network characterizes the text features: the text features are processed by multi-kernel convolution and pooling layers to obtain third features (the third features are sequence features), and a long short-term memory network then characterizes the third features to obtain fourth features. The second features and the fourth features are joined at a fully connected layer (fully connected layers, abbreviated as FC), and the emotion label (Label) of the user is output at the output layer.
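A minimal PyTorch sketch of this two-branch structure is given below for illustration; the feature dimensions, kernel sizes, hidden sizes and the number of emotion labels are assumptions of this sketch, since the patent does not fix them.

import torch
import torch.nn as nn

class UserEmotionModel(nn.Module):
    # Two CNN+LSTM branches (speech and text) joined by a fully connected output layer.
    def __init__(self, speech_dim=39, text_dim=512, hidden=128, num_labels=4):
        super().__init__()
        # Convolution and pooling over the speech feature matrix (first features)
        self.speech_cnn = nn.Sequential(
            nn.Conv1d(speech_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.speech_lstm = nn.LSTM(hidden, hidden, batch_first=True)   # second features
        # Convolution and pooling over the text feature matrix (third features)
        self.text_cnn = nn.Sequential(
            nn.Conv1d(text_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.text_lstm = nn.LSTM(hidden, hidden, batch_first=True)     # fourth features
        self.fc = nn.Linear(hidden * 2, num_labels)                    # fully connected layer

    def forward(self, speech, text):
        # speech: (batch, frames, speech_dim); text: (batch, tokens, text_dim)
        s = self.speech_cnn(speech.transpose(1, 2)).transpose(1, 2)
        _, (s_h, _) = self.speech_lstm(s)
        t = self.text_cnn(text.transpose(1, 2)).transpose(1, 2)
        _, (t_h, _) = self.text_lstm(t)
        joint = torch.cat([s_h[-1], t_h[-1]], dim=-1)
        return self.fc(joint)                                          # emotion label scores

For a speech feature matrix of shape (batch, frames, 39) and a text feature matrix of shape (batch, tokens, 512), the forward pass returns one score per emotion label, and the label with the highest score is taken as the emotion tag of the user.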
In some of these embodiments, fig. 4 is a flowchart of a process for extracting mel-frequency spectral features according to a third embodiment of the present application; as shown in fig. 4, the process includes the following steps:
Step S401, preprocessing the voice data, including removing segments that contain no speech at all for non-human reasons, for example segments of speech lost because of packet loss;
step S402, pre-emphasis processing is performed on the pre-processed voice data, wherein the pre-emphasis processing is to pass the voice signal in the voice data through a high-pass filter:
H(z) = 1 - μz⁻¹    (Equation 1)
where the value of μ in Equation 1 is between 0.9 and 1.0, usually 0.97. The purpose of pre-emphasis is to boost the high-frequency part and flatten the spectrum of the signal, so that the spectrum can be calculated with the same signal-to-noise ratio over the whole frequency band from low frequency to high frequency; at the same time, it compensates the high-frequency part of the speech signal that is suppressed by the vocal system, so as to highlight the high-frequency formants;
Step S403, performing framing processing on the pre-emphasized voice data. Because voice data is not statistically constant over time, the continuous voice signal is divided into short segments, and the voice signal is assumed to be constant within each segment. N sampling points are generally grouped into one observation unit, called a frame; normally N is 512 and the frame length is 25 ms. To ensure a smooth transition between frames, adjacent frames generally overlap; the time difference between the starting positions of two adjacent frames is called the frame shift, which is normally 10 ms;
Step S404, performing windowing processing on the framed voice data. The following step requires a fast Fourier transform (fast Fourier transform, FFT for short), which requires the speech signal to be either defined from minus infinity to plus infinity or periodic; since the framed speech signal is aperiodic, windowing is needed to give it periodicity. Hamming windows are commonly used in industry to shrink the signal to zero at the frame boundaries. For 0 ≤ n ≤ N-1, the window function is as follows:
w(n) = 0.54 - 0.46·cos(2πn/(N-1))    (Equation 2)
In other cases:
w(n) = 0    (Equation 3)
Each frame is multiplied by the Hamming window to increase the continuity of the left and right ends of the frame;
step S405, performing a fast fourier transform process on the windowed voice data, where a transform formula of the fast fourier transform is:
X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πkn/N), 0 ≤ k ≤ N-1    (Equation 4)
Because the characteristics of a signal are usually difficult to see from its time-domain representation, the signal is generally converted into an energy distribution in the frequency domain for observation, and different energy distributions represent the characteristics of different voice signals. Therefore, after being multiplied by the Hamming window, each frame is also transformed with the fast Fourier transform to obtain its spectrum, and the per-frame spectra are stacked in time to obtain a spectrogram;
Step S406, applying a Mel filter to the spectrogram and then taking the logarithm to obtain a Mel spectrogram; this converts the frequency axis of the spectrogram to the Mel scale, i.e. a linear distribution below 1000 Hz and logarithmic growth above 1000 Hz, with the following conversion formula:
m = 2595·log₁₀(1 + f/700)    (Equation 5)
Because the pitch perceived by the human ear is not linearly related to the actual frequency in Hz, and the Mel scale better matches the auditory characteristics of the human ear, the frequency axis of the spectrogram needs to be converted to the Mel scale;
Step S407, performing cepstrum analysis on the Mel spectrogram. Most of the useful information lies in the filter, i.e. the vocal cavity, so the sound source and the filter need to be separated; cepstrum analysis of the Mel spectrogram decomposes the speech signal, and the 2nd to 14th coefficients, which represent the filter information, are taken as the Mel Frequency Cepstrum Coefficient (MFCC) features;
Step S408, determining the first-order difference and the second-order difference of the MFCC features to obtain the trajectory of the Mel cepstrum coefficients over time. The energy of a frame is the sum of the powers of its samples over a certain period, and since the speech signal is not constant from one frame to the next, features related to this temporal variation can be added: a speed feature (Delta feature) and an acceleration feature (double-Delta feature) are added to the Mel cepstrum coefficient features and the energy feature, yielding the trajectory of the Mel cepstrum coefficients over time.
In some of these embodiments, fig. 5 is a flowchart of a process of extracting word features and location features according to a fourth embodiment of the present application; as shown in fig. 5, the process includes the following steps:
Step S501, performing word segmentation on the text data and identifying parts of speech to obtain word features and part-of-speech features. For example, segmentation can be performed at word granularity: for words in the vocabulary, a language model decides from the context whether a word needs to be further segmented, while word granularity is retained for words not in the vocabulary. Part-of-speech tagging of the words is performed using the part-of-speech and entity annotation information in the vocabulary, and each word and its corresponding part of speech serve as the word feature and the part-of-speech feature respectively, so that the text data is split into features from different angles;
Step S502, confirming the position information of the text data to obtain position features. For example, an Embedding operation with dimension 512 is applied to the word features and the part-of-speech features using one-hot coding, position information is added using absolute position encoding, the outputs of all Embedding operations are summed, and the sum is fed into the emotion recognition model as input data.
Considering that the pitch, intonation and speech rate of the user also reflect the user's emotion, in some embodiments the speech features further include a pitch feature, an intonation feature, a speech rate feature and a pause feature. Fig. 6 is a flowchart of a process for extracting speech features according to a fifth embodiment of the present application; as shown in fig. 6, the process includes the following steps:
step S601, determining peak value, frequency, period and pause information of the waveform according to the waveform of the voice data;
step S602, according to peak value, frequency, period and pause information, determining pitch characteristics, intonation characteristics, speech speed characteristics and pause characteristics of the voice data in a one-to-one correspondence manner;
step S603, using Embedding to represent pitch feature, intonation feature, speech speed feature and pause feature.
In the related art, when user emotion is recognized from the speech perspective, information such as pitch, intonation, speech rate and pauses that reflects the user's emotion is not considered, which makes the emotion recognition result inaccurate; steps S601 to S603 take this information into account and thereby address that problem.
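As a rough illustration of step S601, the peak value, a dominant frequency and its period, and pause information can be estimated from the waveform with numpy as sketched below; the energy threshold, the frame size and the FFT-based frequency estimate are assumptions of this sketch, and the mapping of these values to pitch, intonation, speech-rate and pause Embeddings (steps S602 and S603) is omitted.

import numpy as np

def waveform_statistics(signal: np.ndarray, sr: int = 16000):
    # Peak value of the waveform
    peak = float(np.max(np.abs(signal)))
    # Dominant frequency and its period, estimated from the magnitude spectrum
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    dominant = float(freqs[np.argmax(spectrum[1:]) + 1])        # skip the DC component
    period = 1.0 / dominant if dominant > 0 else 0.0
    # Pause information: fraction of 25 ms frames whose energy falls below a threshold
    frame = int(0.025 * sr)
    energies = np.array([np.sum(signal[i:i + frame] ** 2)
                         for i in range(0, len(signal) - frame, frame)])
    pause_ratio = float(np.mean(energies < 1e-4 * energies.max())) if energies.size else 0.0
    return peak, dominant, period, pause_ratio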
Accurately recognizing the user's emotion presupposes accurately understanding the user's semantics, and problems such as homophones, near-homophones or nonstandard pronunciation that may exist in the text affect the accurate understanding of the user's semantics and lead to inaccurate emotion recognition results. Considering this, in some embodiments, before the text features are extracted, the method for identifying user emotion includes correcting errors in the text data. Fig. 7 is a flowchart of an error correction process of text data according to a sixth embodiment of the present application; as shown in fig. 7, the process includes the following steps:
Step S701, inputting the text data into a mask language model (Masked Language Model, abbreviated as MLM). The mask language model determines and masks the pronunciation error-prone words in the text data according to an error confusion set, where the error confusion set contains the association relationship between pronunciation error-prone words and similar-pronunciation words. During incremental training in the model training stage, the mask language model replaces, according to a target proportion, similar-pronunciation words in the training data with the corresponding pronunciation error-prone words as error data; the error data and the correct data form a set of training data, so that the mask language model is specifically trained to predict a masked word as the similar-pronunciation word of that masked word. Here, similar-pronunciation words include near-homophones, homophones, and words that may be transcribed incorrectly because of different pronunciation habits; a transcribed word whose intended word is such a similar-sounding one is a pronunciation error-prone word. The target proportion may be 1.5%, and the error confusion set may be generated by combining initial consonants with a pronunciation model;
Step S702, the mask language model determines the candidate word with the highest probability at the hidden position, and in the case where the pronunciation error-prone word is inconsistent with the candidate word, the mask language model replaces the pronunciation error-prone word with the candidate word.
During ordinary training, a mask language model masks randomly selected words, so if it is applied directly to error correction of text data the correction effect is poor. In the method above, because the mask language model replaces similar-pronunciation words in the training data with pronunciation error-prone words according to the target proportion during incremental training in the model training stage, the model is specifically trained to predict a masked word as its similar-pronunciation word. This makes it convenient, when the mask language model is used, to accurately find the high-frequency error-prone content in the text data and correct the text data, which provides a guarantee for accurately understanding the user's semantics and improves the accuracy of the recognized user emotion.
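For the inference part of this procedure (steps S701 and S702), a minimal sketch with a BERT-style masked language model from the transformers library might look as follows; the pretrained model name, the toy confusion set and the single-character masking are assumptions of this sketch, and the incremental training on replaced similar-pronunciation words described above is not shown.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed base model
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

# Toy error confusion set: error-prone characters and their similar-sounding alternatives
confusion_set = {"在": ["再"], "哪": ["那"]}

def correct(text: str) -> str:
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch not in confusion_set:
            continue
        masked = chars.copy()
        masked[i] = tokenizer.mask_token                         # hide the error-prone character
        inputs = tokenizer("".join(masked), return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        candidate = tokenizer.decode([int(logits[0, mask_pos].argmax())]).strip()
        if candidate != ch:                                      # replace only when they differ
            chars[i] = candidate
    return "".join(chars)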
Considering that the pinyin of each word in the text data may also reflect the user's semantics (for example, when the pinyin is 'ba' in the first tone the meaning is to pull, while when the pinyin is 'pa' in the second tone the meaning is to steal property from another person), the pinyin of each word in the text data may also be used as a text feature. Fig. 8 is a flowchart of the text feature extraction process according to the seventh embodiment of the present application; as shown in fig. 8, the process includes the following steps:
Step S801, determining the pronunciation of each word in the text data to obtain pronunciation features, where the pronunciation includes pinyin; further, considering that the tone, the initial or the final can also reflect the user's emotion, the pronunciation may also include the tone, the initial or the final;
step S802, representing pronunciation characteristics by using an Embedding.
In the related art, when user emotion is recognized from the text perspective, pronunciation information that reflects the user's semantics or emotion, such as pinyin, tone, initials and finals, is not considered, which makes the emotion recognition result inaccurate. In steps S801 to S802, the pronunciation features are obtained by determining the pinyin, tone, initial or final of each word in the text data, which provides the subsequent emotion recognition model with more data on which to determine the user's emotion label and thus further improves the accuracy of user emotion recognition. Meanwhile, representing the pronunciation features of the text data with Embedding makes the format of the pronunciation features conform to the input format required by the emotion recognition model.
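A small sketch of obtaining the pinyin, tone, initial and final of each character (step S801) with the pypinyin library is given below; the example sentence is illustrative, and mapping the resulting strings to ids for the Embedding of step S802 is omitted.

from pypinyin import Style, pinyin

text = "今天天气不错"
syllables = [p[0] for p in pinyin(text, style=Style.TONE3)]                   # e.g. "jin1"
initials = [p[0] for p in pinyin(text, style=Style.INITIALS, strict=False)]   # e.g. "j"
finals = [p[0] for p in pinyin(text, style=Style.FINALS_TONE3, strict=False)] # e.g. "in1"
tones = [s[-1] if s and s[-1].isdigit() else "0" for s in syllables]          # neutral tone as "0"
print(list(zip(syllables, initials, finals, tones)))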
The present embodiment also provides a system for identifying user emotion, fig. 9 is a block diagram of a system for identifying user emotion according to an eighth embodiment of the present application, as shown in fig. 9, including:
the first extraction module 901 is configured to obtain voice data, and extract voice features according to the voice data, where the voice features include mel-frequency cepstrum coefficient features;
the second extraction module 902 is configured to perform conversion processing on the voice data to obtain text data, and extract text features according to the text data, where the text features include word features and position features;
a determining module 903, configured to input the speech feature and the text feature into a user emotion recognition model, and output an emotion tag of the user, where in the user emotion recognition model: the voice features are represented by using a convolutional neural network to obtain first features, the first features are represented by using a long-short-time memory network to obtain second features, the text features are represented by using the convolutional neural network to obtain third features, the third features are represented by using the long-short-time memory network to obtain fourth features, the second features and the fourth features are fully connected, and the emotion labels of the users are determined.
In some of these embodiments, the speech features further include pitch features, intonation features, and pace features, and the first extraction module 901 is further configured to: determining the peak value, frequency and period of the waveform according to the waveform of the voice data; according to the peak value, the frequency and the period, determining the pitch characteristic, the intonation characteristic and the speech speed characteristic of the voice data in a one-to-one correspondence manner; pitch features, intonation features, and speech rate features are represented by Embedding.
In some of these embodiments, the text features further include pronunciation features, and the second extraction module 902 is further configured to: determining pronunciation of each word in the text data to obtain pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initials or finals; pronunciation characteristics are represented by Embedding.
In some of these embodiments, fig. 10 is a block diagram of a system for user emotion recognition according to a ninth embodiment of the present application, as shown in fig. 10, the system further comprising:
the error correction module 1001 is configured to correct errors of text data before extracting text features, where the error correction process includes: inputting text data into a mask language model, determining and hiding pronunciation error-prone words in the text data by the mask language model according to an error confusion set, wherein the error confusion set contains association relations between the pronunciation error-prone words and pronunciation similar words, and replacing the pronunciation similar words in training data with the pronunciation error-prone words according to a target proportion by the mask language model in the process of incremental training in a model training stage according to the error confusion set; the masking language model determines a candidate word having the highest probability at the hidden location, and in the event that the pronunciation error prone word does not coincide with the candidate word, the masking language model replaces the pronunciation error prone word with the candidate word.
In one embodiment, an electronic device is provided, which may be a server; fig. 11 is a schematic diagram of the internal structure of an electronic device according to an embodiment of the present application. The electronic device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The database of the electronic device is used for storing data. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of user emotion recognition.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It should be understood by those skilled in the art that the technical features of the above-described embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above-described embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method of user emotion recognition, the method comprising:
obtaining voice data, and extracting voice features according to the voice data, wherein the voice features comprise mel-frequency cepstrum coefficient features, and extracting the mel-frequency cepstrum coefficient features comprises: preprocessing the voice data; pre-emphasis processing is carried out on the voice data after the pretreatment; carrying out framing treatment on the pre-emphasized voice data; windowing is carried out on the voice data after framing; performing fast Fourier transform processing on the windowed voice data to obtain a spectrogram; after applying a Mel filter to the spectrogram, taking the logarithm to obtain a Mel spectrogram; carrying out cepstrum analysis on the Mel spectrogram to obtain Mel cepstrum coefficient characteristics; determining a first-order difference and a second-order difference of the characteristic of the mel-frequency spectrum coefficient to obtain a change track of the mel-frequency spectrum coefficient along with time;
converting the voice data to obtain text data, and extracting text features according to the text data, wherein the text features comprise word features and position features;
inputting the voice features and the text features into a user emotion recognition model, and outputting emotion tags of the user, wherein in the user emotion recognition model:
characterizing the speech features using a convolutional neural network, obtaining a first feature, characterizing the first feature using a long-short-term memory network, obtaining a second feature,
characterizing the text features using a convolutional neural network, obtaining a third feature, characterizing the third feature using a long-short-term memory network, obtaining a fourth feature,
and fully connecting the second feature and the fourth feature, and determining the emotion label of the user.
2. The method of claim 1, wherein the speech features further comprise pitch features, intonation features, and speech speed features, and wherein the extracting of the speech features comprises:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the speech data according to the peak value, the frequency and the period in a one-to-one correspondence;
and representing the pitch characteristic, the intonation characteristic and the speech speed characteristic by using Embedding.
3. The method of claim 1, wherein the text feature further comprises a pronunciation feature, and wherein the text feature extraction process comprises:
determining pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initials or finals;
the pronunciation characteristics are represented by Embedding.
4. A method according to claim 3, wherein prior to said extracting the text feature, the method comprises: correcting the text data, wherein the correcting process comprises the following steps:
inputting the text data into a mask language model, determining and hiding the pronunciation error-prone words in the text data according to an error confusion set, wherein the error confusion set contains the association relation between the pronunciation error-prone words and the pronunciation similar words, and replacing the pronunciation similar words in the training data with the pronunciation error-prone words according to a target proportion in the process of performing incremental training in a model training stage by the mask language model according to the error confusion set;
the mask language model determines a candidate word with the highest probability at a hidden position, and replaces the pronunciation error-prone word with the candidate word when the pronunciation error-prone word is inconsistent with the candidate word.
5. A system for emotion recognition of a user, the system comprising:
the first extraction module is used for acquiring voice data and extracting voice features according to the voice data, wherein the voice features comprise mel-frequency cepstrum coefficient features, and the extracting of the mel-frequency cepstrum coefficient features comprises: preprocessing the voice data; pre-emphasis processing is carried out on the voice data after the pretreatment; carrying out framing treatment on the pre-emphasized voice data; windowing is carried out on the voice data after framing; performing fast Fourier transform processing on the windowed voice data to obtain a spectrogram; after applying a Mel filter to the spectrogram, taking the logarithm to obtain a Mel spectrogram; carrying out cepstrum analysis on the Mel spectrogram to obtain Mel cepstrum coefficient characteristics; determining a first-order difference and a second-order difference of the characteristic of the mel-frequency spectrum coefficient to obtain a change track of the mel-frequency spectrum coefficient along with time;
the second extraction module is used for converting the voice data to obtain text data and extracting text features according to the text data, wherein the text features comprise word features and position features;
the determining module is used for inputting the voice features and the text features into a user emotion recognition model and outputting emotion tags of the user, wherein in the user emotion recognition model:
characterizing the speech features using a convolutional neural network, obtaining a first feature, characterizing the first feature using a long-short-term memory network, obtaining a second feature,
characterizing the text features using a convolutional neural network, obtaining a third feature, characterizing the third feature using a long-short-term memory network, obtaining a fourth feature,
and fully connecting the second feature and the fourth feature, and determining the emotion label of the user.
6. The system of claim 5, wherein the speech features further comprise pitch features, intonation features, and pace features, the first extraction module further configured to:
determining the peak value, the frequency and the period of the waveform according to the waveform of the voice data;
determining the pitch feature, the intonation feature and the speech speed feature of the speech data according to the peak value, the frequency and the period in a one-to-one correspondence;
and representing the pitch characteristic, the intonation characteristic and the speech speed characteristic by using Embedding.
7. The system of claim 5, wherein the text features further comprise pronunciation features, the second extraction module further configured to:
determining pronunciation of each word in the text data to obtain the pronunciation characteristics, wherein the pronunciation comprises pinyin, tone, initials or finals;
the pronunciation characteristics are represented by Embedding.
8. The system of claim 7, wherein the system further comprises:
and the error correction module is used for correcting errors of the text data before the text features are extracted, wherein the error correction process comprises the following steps:
inputting the text data into a mask language model, determining and hiding the pronunciation error-prone words in the text data according to an error confusion set, wherein the error confusion set contains the association relation between the pronunciation error-prone words and the pronunciation similar words, and replacing the pronunciation similar words in the training data with the pronunciation error-prone words according to a target proportion in the process of performing incremental training in a model training stage by the mask language model according to the error confusion set;
the mask language model determines a candidate word with the highest probability at a hidden position, and replaces the pronunciation error-prone word with the candidate word when the pronunciation error-prone word is inconsistent with the candidate word.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of user emotion recognition according to any of claims 1 to 4 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a method of user emotion recognition as claimed in any of claims 1 to 4.
CN202110677222.7A 2021-06-18 2021-06-18 Method and system for identifying emotion of user Active CN113506586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110677222.7A CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110677222.7A CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Publications (2)

Publication Number Publication Date
CN113506586A CN113506586A (en) 2021-10-15
CN113506586B true CN113506586B (en) 2023-06-20

Family

ID=78010436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110677222.7A Active CN113506586B (en) 2021-06-18 2021-06-18 Method and system for identifying emotion of user

Country Status (1)

Country Link
CN (1) CN113506586B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114141239A (en) * 2021-11-29 2022-03-04 江南大学 Voice short instruction identification method and system based on lightweight deep learning
CN117409818A (en) * 2022-07-08 2024-01-16 顺丰科技有限公司 Speech emotion recognition method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170169008A1 (en) * 2015-12-15 2017-06-15 Le Holdings (Beijing) Co., Ltd. Method and electronic device for sentiment classification
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
CN110085262A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Voice mood exchange method, computer equipment and computer readable storage medium
CN110688499A (en) * 2019-08-13 2020-01-14 深圳壹账通智能科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110765763B (en) * 2019-09-24 2023-12-12 金蝶软件(中国)有限公司 Error correction method and device for voice recognition text, computer equipment and storage medium
CN111028827B (en) * 2019-12-10 2023-01-24 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium
CN112185389B (en) * 2020-09-22 2024-06-18 北京小米松果电子有限公司 Voice generation method, device, storage medium and electronic equipment
CN112735479B (en) * 2021-03-31 2021-07-06 南方电网数字电网研究院有限公司 Speech emotion recognition method and device, computer equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243490A (en) * 2018-10-11 2019-01-18 平安科技(深圳)有限公司 Driver's Emotion identification method and terminal device
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system

Also Published As

Publication number Publication date
CN113506586A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN110287283B (en) Intention model training method, intention recognition method, device, equipment and medium
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
CN109599093B (en) Intelligent quality inspection keyword detection method, device and equipment and readable storage medium
CN109243491B (en) Method, system and storage medium for emotion recognition of speech in frequency spectrum
CN112017644B (en) Sound transformation system, method and application
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
Lung et al. Fuzzy phoneme classification using multi-speaker vocal tract length normalization
US8306819B2 (en) Enhanced automatic speech recognition using mapping between unsupervised and supervised speech model parameters trained on same acoustic training data
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
CN113506586B (en) Method and system for identifying emotion of user
CN112712813B (en) Voice processing method, device, equipment and storage medium
WO2021027029A1 (en) Data processing method and device, computer apparatus, and storage medium
US11158302B1 (en) Accent detection method and accent detection device, and non-transitory storage medium
CN110570876A (en) Singing voice synthesis method and device, computer equipment and storage medium
CN112802461B (en) Speech recognition method and device, server and computer readable storage medium
CN113707125A (en) Training method and device for multi-language voice synthesis model
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114255740A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN111883106A (en) Audio processing method and device
Savchenko Method for reduction of speech signal autoregression model for speech transmission systems on low-speed communication channels
CN112735479B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN112686041A (en) Pinyin marking method and device
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
CN113053409B (en) Audio evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant