CN111081280A

CN111081280A - Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method

Info

Publication number: CN111081280A
Application number: CN201911394741.1A
Authority: CN
Inventors: 张艳; 黄厚军; 钱彦旻
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2020-04-28
Anticipated expiration: 2039-12-30
Also published as: CN111081280B

Abstract

The invention discloses a speech emotion recognition method irrelevant to a text, which comprises the steps of acquiring real-time audio data, recognizing the real-time audio data through a trained algorithm model for speech emotion recognition, and determining the emotion state of a speaker, wherein the end point detection is carried out on the acquired real-time audio data by using a phoneme alignment processing mode; extracting characteristic values of the real-time audio data after the endpoint detection; and inputting the extracted characteristic value into an algorithm model, and determining the emotional state of the speaker according to the emotional state prediction probability output by the algorithm model. The invention also discloses a speech emotion recognition device irrelevant to the text and a model training method. The scheme disclosed by the invention can effectively ensure the accuracy of identification.

Description

Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method

Technical Field

The invention relates to the technical field of intelligent voice, in particular to a voice emotion recognition method and device irrelevant to a text and a generation method of an algorithm model for recognizing emotion.

Background

In recent years, with the development of intelligent voice technology, emotion-based voice recognition is becoming a new direction and hot spot for research. At present, most emotion recognition in the market is text-related, for example, Huashi, Korea news and Baidu emotion recognition are text-based, and a speech emotion recognition scheme irrelevant to the text is still in a research stage, and an effective scheme capable of ensuring a high recognition rate is not provided.

Disclosure of Invention

In order to overcome the defects of the existing scheme, the inventor makes a great deal of attempts and researches on algorithm selection and model training, and finally provides a speech emotion recognition solution which can efficiently recognize the emotion of the user and is irrelevant to the text.

According to one aspect of the invention, a method for generating an algorithm model for speech emotion recognition is provided, which comprises

Recording emotion recognition voice data for preprocessing, and determining a training data set;

training the selected neural network model by using the training data set to determine an algorithm model for speech emotion recognition;

wherein, recording emotion recognition voice data for preprocessing, and determining the training data set to realize comprises:

extracting a characteristic value of the recorded emotion recognition voice data;

and setting emotion labels for the extracted characteristic values to form a training data set.

According to a second aspect of the present invention, there is provided a method for text-independent speech emotion recognition, comprising

Acquiring real-time audio data, identifying the real-time audio data through a trained algorithm model for speech emotion identification, and determining the emotional state of a speaker, wherein the implementation comprises

Carrying out endpoint detection on the acquired real-time audio data by using a phoneme alignment processing mode;

extracting characteristic values of the real-time audio data after the endpoint detection;

inputting the extracted characteristic value into the algorithm model, and determining the emotional state of the speaker according to the emotional state prediction probability output by the algorithm model;

the network model is an algorithm model for speech emotion recognition, which is generated by training through the method.

According to a third aspect of the present invention, there is provided a text-independent speech emotion recognition apparatus, comprising

An algorithm model for speech emotion recognition, wherein the algorithm model is generated by training through the method; and

the identification module is used for identifying the real-time audio data through the algorithm model when the real-time audio data are obtained, and determining the emotional state of a speaker, and the realization of the identification module comprises

The mute processing unit is used for carrying out end point detection on the acquired real-time audio data by using a phoneme alignment processing mode;

the feature extraction unit is used for extracting feature values of the real-time audio data subjected to the endpoint detection processing;

and the emotion determining unit is used for inputting the extracted characteristic values into the algorithm model and determining the emotional state of the speaker according to the emotional state prediction probability output by the algorithm model.

According to a fourth aspect of the present invention, there is provided an electronic apparatus comprising: the computer-readable medium includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the steps of the above-described method.

According to a fifth aspect of the present invention, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.

According to the scheme of the embodiment of the invention, the neural network model is selected for training, and the recorded audio data of the special voice emotion database is used as the original audio in the training process, so that the richness of the linguistic data is ensured, the universality of the model is ensured in the model training process, and the accuracy of emotion recognition is ensured. In addition, in practical application, the recognition method of the embodiment of the invention carries out endpoint detection through phoneme alignment, effectively removes silence, ensures that the trained and tested audios have voice information, and further ensures the accuracy of emotion recognition.

Drawings

FIG. 1 is a flowchart of a method for text-independent speech emotion recognition according to an embodiment of the present invention;

FIG. 2 is a method diagram of a model training process in the embodiment shown in FIG. 1;

FIG. 3 is a block diagram of a speech emotion recognition apparatus independent of text according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, a component can be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers and can be run by various computer-readable media. The components may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, e.g., signals from data interacting with another component in a local system, distributed system, and/or across a network of the internet with other systems by way of the signal.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The scheme of speech emotion recognition irrelevant to the text, which is related to the embodiment of the invention, can be applied to any intelligent equipment with an intelligent speech interaction function, so that speech products borne by the intelligent equipment have the emotion recognition function, such as a mobile phone, a watch, an earphone, a Personal Computer (PC) and the like, but the application range of the invention is not limited to the scheme. By the scheme provided by the embodiment of the invention, the emotional state of the speaker can be accurately and efficiently determined, the recognition rate and the accuracy are greatly improved, and the user experience during interaction is ensured.

The present invention will be described in further detail with reference to the accompanying drawings.

Fig. 1 and fig. 2 schematically show a method flow of a text-independent speech emotion recognition method according to an embodiment of the present invention, wherein fig. 1 is a flow of the text-independent speech emotion recognition method, and fig. 2 is a flow of a method for training an algorithm model for speech emotion recognition. As shown in the figure, the method of the present embodiment includes the following steps:

step S101: and constructing a neural network model for speech emotion recognition, training the constructed neural network model, and determining an algorithm model for speech emotion recognition. The embodiment of the invention selects the neural network model to carry out the speech emotion recognition which is irrelevant to the text. Fig. 2 shows a method process for training the selected neural network model, as shown in fig. 2, which includes:

step S201: and recording emotion recognition voice data for preprocessing, and determining a training data set. The embodiment of the invention adopts the recorded emotion recognition voice data as the original audio. Illustratively, a special emotion recognition voice library is recorded, and 400 speakers (200 men and 200 women) are used, so that 700 sentences of audio are recorded and stored for each speaker with four emotions of angry (angry), happy (happy), sad (sad) and normal (neutral). Wherein, the devices used for recording can also be set to be different. Therefore, the richness of the linguistic data can be ensured, and the universality of the model can be ensured when the model is trained. The recorded emotion recognition speech data is then preprocessed to determine a training data set. Specifically, the preprocessing may be implemented to include only the feature value extraction and the setting of the emotion label. As a preferred implementation, the preprocessing of the recorded emotion recognition voice data according to the embodiment of the present invention includes: data enhancement processing, end point detection processing, characteristic value extraction and emotion label setting for the extracted characteristic values. The data enhancement processing can be realized by multi-scene simulation and near-field far-field separate sound pickup, for example, near-field and far-field sound pickup is performed in multiple scenes such as home, vehicle, market, roadside, and office. The end point detection processing can be realized by performing end point detection on the audio data after the data enhancement processing based on the phoneme alignment mode, that is, performing end point detection processing on the enhanced audio data through an alignment technology, wherein the alignment is an end point detection method based on text alignment, and the alignment method has a better alignment effect compared with other end point detection technologies such as vad technology, and is beneficial to improving the identification accuracy. The feature value extraction may be implemented as: and respectively extracting fbank features and pitch features of the audio data after the end point detection, performing feature fusion processing on the extracted fbank features and pitch features, extracting first-order and second-order differences from the fused features, and taking the extracted feature values as input parts of training data. Finally, corresponding emotion labels (exemplarily, four emotion labels as described above) are set for the extracted feature values, thereby forming a training data set. In the preferred embodiment, the fbank feature extracted is a 40dim (40-dimensional) fbank feature, and the pitch feature extracted is a 3dim (three-dimensional) fbank feature. And obtaining 43-dimensional fbank + pitch characteristics through the fusion of the characteristics of the two, and extracting first-order second-order difference to obtain 129-dimensional characteristic values serving as input parts of training data.

The reality of the trained algorithm model can be ensured through multi-scene voice data enhancement processing. The pitch feature is related to the fundamental frequency (F0) of the sound, and reflects pitch information, i.e. tone, while emotion and pitch have a certain relation, so that the performance of the model can be improved well by using the 43-dimensional fbank + pitch feature and the first-order second-order difference of the 43-dimensional fbank + pitch feature, which are 129-dimensional.

The end point detection processing, the feature value extraction, the feature value fusion and the extraction of the first-order and second-order differences based on the phoneme alignment can be realized by adopting the prior art.

Step S202: and training the selected neural network model by using the training data set.

Specifically, the complete process of training the neural network model at one time generally includes three steps of forward propagation, backward propagation and weight updating. During training, training data are obtained from the training data set, characteristic values in the training data are used as input of the neural network model, emotion labels corresponding to the training data are used as output matching targets of the neural network model, and therefore the weight coefficients of the model are trained, and the trained neural network model is obtained. In the embodiment of the invention, 40dim fbank features and 3dim pitch features are respectively extracted from the enhanced audio, fused features of the fbank and pitch features are fused, and finally, a first-order and second-order difference is extracted from the fused features, and the extracted feature values are used as input values and are sent to a neural network model for training, namely, a training data set formed in step S201 is a set of data pairs of the feature values and emotion labels.

Illustratively, the selected neural network model is a tdnn (time delay neural network) model with a structure of 7 × (conv + relu6+ batchnorm) and a loss of softmax (). The delay neural network tdnn model is equivalent to delay of a filter, and has the following advantages: 1. the network is multi-layer, and each layer has stronger abstract capability to the characteristics; 2. ability to express temporal relationships of speech features; 3. has time invariance; 4. precise time positioning of the learned markers is not required in the learning process; 5. by sharing the weight, the learning is convenient. Therefore, the identification rate can be further improved by selecting the time delay neural network model.

The process of training the tdnn model includes: the method comprises the steps of randomly acquiring batch training data from a training data set, using a characteristic value in the training data as input of a tdnn model, using an emotion label corresponding to each training data as an output matching target of the tdnn model, and repeatedly training the tdnn model through forward propagation, backward propagation and weight updating until the weight of the tdnn model converges to a preset range, such as a range approaching 1 (0.999-1). In a specific practice process, the training can be carried out for multiple times to finally determine a tdnn model weight coefficient which is more suitable for the actual situation, and in order to reduce the memory in the training process, a mini-batch mode is used, and small batches of training data are randomly taken out each time for training. Firstly, in the forward propagation process, feature values are input into a trunk part of a tdnn model (7 CNN layers form a trunk), embeddings are extracted and are respectively input into a classification layer, the prediction probability of each label category is calculated, and the loss function calculates the loss by using the prediction probability and a target emotion label. Then, in the back propagation process, the gradient values of the loss function for the tdnn model weights are calculated using a back propagation algorithm. And finally, updating the network weight by using the gradient value of the tdnn model weight by adopting a random gradient descent method. After the weights of the tdnn model are trained, the latest tdnn model weight and the network structure can be used as an algorithm model to identify the emotional state of the speaker.

Step S203: and performing fixed-point processing on the trained neural network model to determine an algorithm model for speech emotion recognition. In order to reduce the size of the model, reduce the memory to a certain extent and improve the speed of gender recognition, the embodiment of the invention trains the neural network model, then performs fixed-point processing on the model, and takes the algorithm model after the fixed-point processing as the model for speech emotion state recognition. Preferably, the spotting processing is 8-bit spotting processing. The 8-bit fixed-point processing may be implemented by using the prior art, for example, by using Tencent's Tenson, Tensorflow, and Nvidia's TensorRT, which is not limited in this embodiment of the present invention.

Step S102: acquiring real-time audio data, and identifying the real-time audio data through a trained algorithm model for speech emotion identification to determine the emotional state of a speaker. In a specific application, the algorithm model obtained by training in step S101 may be used to perform speech emotion recognition independent of text. The implementation can be as follows: carrying out endpoint detection on the acquired real-time audio data by using a phoneme alignment processing mode; extracting characteristic values of the real-time audio data after the endpoint detection; and inputting the extracted characteristic value into the trained algorithm model, and determining the emotional state of the speaker according to the emotional state prediction probability output by the algorithm model. In the embodiment of the present invention, a feature value of 129 dimensions is used as an input feature, where 43 dimensions fbank + pitch feature and its first-order second-order difference are used.

According to the embodiment, the time delay neural network model is selected for training, the recorded audio data of the special voice emotion database is used as original audio in the training process, end point detection is carried out through phoneme alignment, silence is effectively removed, the fact that the audio for training and testing has voice information is guaranteed, and therefore accuracy of emotion recognition is guaranteed. In addition, the scheme of the embodiment of the invention uses fbank + pitch and the first-order and second-order difference thereof as the characteristic value, thereby further improving the accuracy of emotion recognition. More preferably, the embodiment uses a scheme combining batchnorm and 8bit fixed-point, so that the size of the trained model is reduced, the memory is reduced to a certain extent, the speed rate of emotion recognition is improved, and emotion states including anger, happy feeling, sadness, normality and the like can be recognized quickly.

In other embodiments, the neural network model selected may be a tdnn model with a structure of 7 × and loss of 5 frames extending around softmax (). Compared with the above embodiment, the tdnn network structure uses a model trained by 7 × (conv + relu ()), and the difference lies in that 8-bit fixed-point processing is not needed after training, so that the recognition rate is 94% which is 0.6% higher than that of the method provided by the above embodiment, but the recognition speed is 4 times slower, therefore, in order to ensure the optimal recognition efficiency (i.e. ensure the recognition rate and the recognition rate at the same time), the above embodiment adopts a network structure added with a batch norm, so as to conveniently perform 8-bit fixed-point processing, and improve the recognition rate.

In the actual application stage, the feature value extracted from the acquired real-time audio is kept consistent with the feature value used in the training stage.

In other specific embodiments, the tdnn network structure may be activated by relu (), where relu () activation is an activation function used by most neural networks, and the gender recognition rate of relu () activation is 93.4%, which is 0.2 point higher than relu6(), but is not beneficial to fix the point.

In another embodiment, the feature usage may also select the fbank feature, or a fusion of fbank and pitch features. However, the experimental results show that neither fbank nor fbank + pitch has good identification effect on fbank + pitch and the first-order and second-order difference thereof.

FIG. 3 is a schematic diagram of a text-independent speech emotion recognition apparatus according to an embodiment of the present invention, which includes

An algorithm model 30 for speech emotion recognition, wherein the algorithm model is generated by training through the method; and

the identification module 31 is configured to, when the real-time audio data is acquired, perform identification processing on the real-time audio data through an algorithm model, and determine an emotional state of a speaker.

Wherein, as shown in FIG. 3, the identification module 31 is implemented to include

A mute processing unit 31A configured to perform end point detection on the acquired real-time audio data by using a phoneme alignment processing manner;

a feature extraction unit 31B, configured to perform feature value extraction on the real-time audio data after the endpoint detection processing;

and an emotion determining unit 31C for inputting the extracted feature value into the algorithm model, and determining the emotional state of the speaker based on the emotional state prediction probability output from the algorithm model.

Preferably, the feature value extracted by the feature extraction unit 31B is a 129-dimensional feature value including 43-dimensional fbank + pitch features and the first-order second-order difference thereof.

The specific implementation of each module and unit in the device embodiment may be described with reference to the method portion, and other implementations mentioned in the method portion may also be applied to the device embodiment, so that details are not described here.

By the method and the device, the emotional state of the speaker can be accurately and quickly identified, and the speaker can be recorded by hundreds of different men and women by adopting different devices, so that the corpus is rich, the universality of the model can be well ensured, and the higher identification rate can be ensured in various scenes. And the characteristic value of 129 dimensions is obtained by using the 43-dimensional fbank + pitch characteristic and the first-order second-order difference, so that the performance of the model can be well improved.

In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform the above-mentioned text-independent speech emotion recognition method of the present invention.

In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the above-mentioned text-independent speech emotion recognition method.

In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the text-independent speech emotion recognition method.

In some embodiments, the present invention further provides a storage medium, on which a computer program is stored, which when executed by a processor is capable of executing the above-mentioned text-independent speech emotion recognition method.

The text-independent speech emotion recognition device according to the embodiment of the present invention can be used to execute the text-independent speech emotion recognition method according to the embodiment of the present invention, and accordingly achieves the technical effects achieved by the text-independent speech emotion recognition method according to the embodiment of the present invention, which are not described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).

Fig. 4 is a schematic hardware structure diagram of an electronic device for performing a text-independent speech emotion recognition method according to another embodiment of the present application, and as shown in fig. 4, the electronic device includes:

one or more processors 510 and memory 520, with one processor 510 being an example in fig. 4.

The apparatus for performing the text-independent speech emotion recognition method may further include: an input device 530 and an output device 540.

The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, such as by a bus connection in fig. 4.

Memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the text-independent speech emotion recognition method in the embodiments of the present application. The processor 510 executes various functional applications and data processing of the server by executing the nonvolatile software programs, instructions and modules stored in the memory 520, namely, implements the text-independent speech emotion recognition method in the above method embodiments.

The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the text-independent speech emotion recognition apparatus, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 520 optionally includes memory located remotely from processor 510, which may be connected to a text-independent speech emotion recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the text-independent speech emotion recognition method. The output device 540 may include a display device such as a display screen.

The one or more modules are stored in the memory 520 and when executed by the one or more processors 510 perform the text-independent speech emotion recognition method of any of the above method embodiments.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.

(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.

(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.

(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.

(5) And other electronic devices with data interaction functions.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions substantially or otherwise contributing to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. The method for generating the algorithm model for speech emotion recognition is characterized by comprising

2. The method of claim 1, wherein extracting feature values from the recorded emotion-recognized speech data comprises:

respectively extracting fbank characteristics and pitch characteristics of the voice data;

performing feature fusion processing on the extracted fbank features and pitch features;

and extracting first-order and second-order differences of the fused features.

3. The method according to claim 1 or 2, wherein the selected neural network model is a tdnn model with a structure of 7 x (conv + relu ()), and a loss of 5 frames extending around softmax ().

4. The method according to claim 1 or 2, wherein the selected neural network model is a tdnn model with a structure of 7 × (conv + relu6+ batchnorm) and a loss of softmax (), and after the training of the neural network model, the trained neural network model is further subjected to a fixed-point processing to generate an algorithm model for speech emotion recognition.

5. The method of claim 4, wherein the emotion recognition speech data includes speech data for four emotions, angry, happy, sad and normal.

6. The method for recognizing the speech emotion irrelevant with the text is characterized by comprising

wherein the network model is an algorithm model for speech emotion recognition generated by training with the method of any one of claims 1 to 5.

7. The method of claim 6, wherein the extracted feature values are 129 dimensional feature values comprising 43 dimensional fbank + pitch features and first order second order differences thereof.

8. The device for speech emotion recognition independent of text is characterized by comprising

An algorithm model for speech emotion recognition, wherein the algorithm model is generated by training the method of any one of claims 1 to 5; and the identification module is used for identifying the real-time audio data through the algorithm model when the real-time audio data are acquired, and determining the emotional state of the speaker, wherein the identification module is realized by comprising

9. The apparatus of claim 8, wherein the extracted feature value is a 129-dimensional feature value comprising a 43-dimensional fbank + pitch feature and its first-order second-order difference.

10. Storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of claim 6 or 7.