CN112489688A - Neural network-based emotion recognition method, device and medium - Google Patents
- Publication number: CN112489688A (application number CN202011239769.0A)
- Authority: CN (China)
- Prior art keywords: voice, text, recognized, recognition, emotion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/63—Speech or voice analysis specially adapted for estimating an emotional state
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
- G10L25/03—Speech or voice analysis characterised by the type of extracted parameters
- G10L25/09—Extracted parameters being zero crossing rates
- G10L25/15—Extracted parameters being formant information
- G10L25/24—Extracted parameters being the cepstrum
- G10L25/30—Analysis technique using neural networks
Abstract
The application discloses a neural network-based emotion recognition method, device, and medium. The method comprises the following steps: determining a speech to be recognized corresponding to a user; performing emotion recognition on the speech to be recognized through a pre-trained voice recognition model to obtain a voice recognition result; converting the speech to be recognized into text and performing emotion recognition on the text to obtain a text recognition result; and fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the speech to be recognized. When emotion is recognized from the user's speech, the user's emotion is judged through two modalities rather than through voice or text alone; the recognition effect is far better than that of a single modality, which guarantees the effectiveness of bimodal fusion emotion recognition. Compared with single-modality information, the fused bimodal information covers the voice-variation information and the semantic information in the speech more broadly, which favors cross-judgment in model training and decision-making and yields the best emotion recognition result.
Description
Technical Field
The application relates to the field of emotion recognition, in particular to an emotion recognition method, equipment and medium based on a neural network.
Background
With the development of multimedia technology, today's big-data environment provides an important data source for affective computing.
Generally, affective computing performs recognition processing on different kinds of collected data, such as image data, voice data, and text data. Emotion recognition on voice data mainly models the speech signal using the acoustic and prosodic characteristics of the voice. However, conventional speech emotion recognition analyzes only the sound signal of the speech and ignores the rich content information expressed in it, so the emotional expression cannot be well described. This makes existing emotion recognition results for speech inaccurate.
Disclosure of Invention
In order to solve the above problem, the present application provides an emotion recognition method based on a neural network, including: determining a voice to be recognized corresponding to a user; performing emotion recognition on the voice to be recognized through a pre-trained voice recognition model to obtain a voice recognition result; converting the voice to be recognized into a text, and carrying out emotion recognition on the text to obtain a text recognition result; and fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized.
In one example, performing emotion recognition on the speech to be recognized through a pre-trained voice recognition model to obtain a voice recognition result includes: performing noise-reduction preprocessing on the speech to be recognized; extracting the spectral features and prosodic features of the speech to be recognized; coupling the spectral features and the prosodic features to obtain the sound features of the speech to be recognized; and performing emotion recognition on the sound features through the pre-trained voice recognition model to obtain the voice recognition result.
In one example, the noise reduction preprocessing is performed on the speech to be recognized, and comprises the following steps: carrying out normalization processing on the voice to be recognized; carrying out frame-by-frame detection on the voice to be recognized, and calculating the zero crossing rate and the short-time energy of each frame of voice; and dividing the voice to be recognized into a plurality of voice sections through endpoint detection so as to perform noise reduction preprocessing on the voice to be recognized.
In one example, the voice to be recognized is divided into a plurality of voice segments through endpoint detection, including: if the zero crossing rate of the corresponding frame is higher than a preset zero crossing rate threshold value and the short-time energy is higher than a preset short-time energy threshold value, taking the corresponding frame as an initial frame; if the zero crossing rate of a plurality of continuous voice frames is not higher than the zero crossing rate threshold value and the short-time energy is not higher than a preset short-time energy threshold value after the initial frame, taking the last frame of the plurality of continuous voice frames as an end frame; and taking the part between the starting frame and the ending frame as a speech segment.
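The endpoint detection described in the example above can be sketched as follows. This is a minimal illustration under assumed values, not the patented implementation: the thresholds and the five-frame silence window are chosen only for the example, and the frame-level zero-crossing rate and short-time energy use textbook definitions.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum(frame[i] * frame[i + 1] < 0 for i in range(len(frame) - 1)) / len(frame)

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def endpoint_detect(frames_zcr, frames_energy, zcr_thr=0.1, energy_thr=0.05, min_silence=5):
    """Split frame-level features into (start, end) speech segments.

    A frame whose zero-crossing rate AND short-time energy exceed their
    thresholds opens a segment; `min_silence` consecutive sub-threshold
    frames close it, with the last silent frame taken as the end frame.
    """
    segments, start, silent = [], None, 0
    for i, (z, e) in enumerate(zip(frames_zcr, frames_energy)):
        active = z > zcr_thr and e > energy_thr
        if start is None:
            if active:
                start, silent = i, 0
        else:
            silent = 0 if active else silent + 1
            if silent >= min_silence:
                segments.append((start, i))  # i is the last of the silent frames
                start, silent = None, 0
    if start is not None:  # speech ran to the end of the recording
        segments.append((start, len(frames_zcr) - 1))
    return segments
```

Silence and steady high-frequency noise fall below one of the two thresholds, so only the frames between a detected start and end survive into the speech segments.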
In one example, the spectral features include: Mel-frequency cepstral coefficients (MFCC); and the prosodic features include: at least one of speech rate, amplitude characteristics, pitch period, and formants.
In one example, performing emotion recognition on the text to obtain a text recognition result includes: performing word segmentation on the text to obtain a plurality of words; extracting text features of the words, and performing emotion recognition on the text features through a pre-trained text recognition model to obtain a first text recognition result; and performing emotion recognition on the words through a preset emotion dictionary to obtain a second text recognition result.
In one example, extracting text features of the plurality of words includes: extracting text features of the words based on at least one of document frequency (DF), mutual information (MI), and chi-square statistics (CHI).
In one example, performing emotion recognition on the plurality of words through a preset emotion dictionary to obtain a second text recognition result includes: performing emotion recognition on the words through a preset emotion dictionary and weights corresponding to different preset emotions to obtain the second text recognition result.
On the other hand, the application also provides emotion recognition equipment based on the neural network, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of the examples above.
In another aspect, the present application further provides a non-volatile computer storage medium for neural-network-based emotion recognition, storing computer-executable instructions configured to perform the method of any of the preceding examples.
The emotion recognition method based on the neural network can bring the following beneficial effects:
When emotion is recognized from the user's speech, voice and text are fused rather than used separately, and the user's emotion is judged through both modalities. The recognition effect is far better than that of a single modality, which guarantees the effectiveness of bimodal fusion emotion recognition. Compared with single-modality information, the fused bimodal information covers the voice-variation information and the semantic information in the speech more broadly, which favors cross-judgment in model training and decision-making and yields the best emotion recognition result.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of an emotion recognition method based on a neural network in an embodiment of the present application;
FIG. 2 is a flow chart of an emotion recognition method based on a neural network in the embodiment of the present application;
FIG. 3 is a block diagram illustrating a flow chart corresponding to a voice recognition result in an embodiment of the present application;
FIG. 4 is a block diagram of a flow corresponding to a text recognition result in an embodiment of the present application;
FIG. 5 is a flowchart of sound feature extraction according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a training and recognition process of a voice recognition model according to an embodiment of the present application;
FIG. 7 is a graph showing the effect of the experiment in the example of the present application;
fig. 8 is a schematic diagram of an emotion recognition apparatus based on a neural network in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that existing emotion data have no uniform recording standard or labeling format; the EMO-DB emotion database and the Chinese Academy of Sciences speech emotion database (CASIA) are among the most commonly used. The CASIA Chinese emotion corpus, recorded by the Institute of Automation, Chinese Academy of Sciences, contains 9600 utterances spoken by four different speakers and covers six basic emotion categories (anger, happiness, surprise, fear, sadness, and calmness), which serve as the emotion classes in the embodiments of the present application.
As shown in fig. 1 and fig. 2, an embodiment of the present application provides an emotion recognition method based on a neural network, including:
s101, determining the voice to be recognized corresponding to the user.
To recognize the user's emotion from voice, the user's speech must first be acquired; it is referred to herein as the speech to be recognized. The speech to be recognized may be one or more segments and is obtained through corresponding software and equipment; how it is obtained is not limited here.
S102, emotion recognition is carried out on the voice to be recognized through a pre-trained voice recognition model, and a voice recognition result is obtained.
As shown in fig. 3, to recognize the emotion of the speech to be recognized effectively, a corresponding voice recognition model may be trained in advance; emotion recognition is then performed on the speech to be recognized through this model to obtain a voice-related emotion recognition result (referred to herein as the voice recognition result). When training the voice recognition model, the voice data can be preprocessed and features extracted, the resulting single-modality features used for model training, and both shallow and deep learning models used for classification learning; the model with the best recognition result is kept as the voice recognition model.
Specifically, during the preprocessing and noise-reduction stage before training, in order to retain the useful portions of the voice data and locate the speech that yields a recognizable text result, a threshold-based endpoint detection algorithm may be used to mark the start and end points of all speech segments in the voice. The implementation can be as follows. A zero-crossing-rate threshold and a short-time-energy threshold are preset. The voice data is input and normalized, then detected frame by frame, and the zero-crossing rate and short-time energy of each frame are calculated. If both the zero-crossing rate and the short-time energy of a frame exceed their thresholds, that is, the zero-crossing rate is higher than the zero-crossing-rate threshold and the short-time energy is higher than the short-time-energy threshold, the frame is marked as the start of a speech segment. After the start frame, if the zero-crossing rate and short-time energy of several consecutive frames do not exceed their thresholds, the last of those consecutive frames is marked as the segment end point, thereby producing a speech segment. The remaining voice data is scanned in the same way, so the speech to be recognized is divided into several speech segments. Endpoint detection effectively removes noise such as silence and high-frequency noise and delimits complete speech segments.
As shown in fig. 5, in the feature extraction process, a high-pass filter and a Hamming window may be used to process the speech to be recognized; a Mel filter bank and a discrete cosine transform then yield the static, first-order, and second-order Mel parameters, giving the final MFCC parameters, which serve as the spectral features of the speech to be recognized. At the same time, prosodic features of the speech are extracted, which may include speech rate, amplitude characteristics, pitch period, formants, etc. The prosodic features and the overall speech features are then coupled; the coupled features are referred to as the sound features. Apart from the speech-rate feature, the other feature parameters have many dimensions, so their statistical parameters are extracted. The specific multi-dimensional speech feature parameters can be shown in the following table; mainly the MFCC parameters and the statistics of some prosodic features are selected.
Of course, when training the voice recognition model, as shown in fig. 6, a training set and a test set may be split from a corpus and preprocessed with the noise-reduction approach of the above embodiment; the neural network is then trained on the training set, and its parameters are adjusted by comparing the classifier output with the training labels. The trained model is then evaluated on the test set against the test labels to obtain recognition results and measure the model's accuracy. In model construction, an LSTM network may be used for the final model. The preprocessing, noise-reduction, and feature-extraction steps during training are essentially the same as when the voice recognition model processes the speech to be recognized in the above embodiments, and are not repeated here.
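The pre-emphasis (high-pass) filtering, Hamming windowing, and spectral/prosodic feature coupling from the feature-extraction step above can be sketched as follows. This is a minimal NumPy illustration with assumed frame sizes and a simple statistics-based coupling (mean/std/max/min per dimension), not the patent's exact parameter table.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, frame_len=400, hop=160):
    """Split into overlapping frames and apply a Hamming window
    (400/160 samples correspond to 25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def couple_features(mfcc_frames, prosody_frames):
    """Couple spectral and prosodic features into one utterance-level
    vector by concatenating per-dimension statistics of each stream."""
    stats = lambda m: np.concatenate([m.mean(0), m.std(0), m.max(0), m.min(0)])
    return np.concatenate([stats(mfcc_frames), stats(prosody_frames)])
```

The coupled vector produced by `couple_features` is what would then be fed to the classifier (e.g. the LSTM-based model mentioned above) during training and recognition.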
S103, converting the voice to be recognized into a text, and performing emotion recognition on the text to obtain a text recognition result.
As shown in fig. 4, besides recognizing the sound of the speech to be recognized, it is also possible to perform emotion recognition on the text corresponding to the speech to be recognized. Namely, firstly, voice to be recognized is converted into a text through voice recognition, and then emotion recognition is carried out on the text to obtain a corresponding text recognition result.
Specifically, the text may first be segmented into a plurality of words. The corresponding text features are then extracted from these words, and the trained text recognition model performs emotion recognition on the text features to obtain the corresponding text recognition result (referred to as the first text recognition result).
In the text feature extraction process, feature selection can be performed on the words in the text using document frequency (DF), mutual information (MI), and chi-square statistics (CHI). The quality of feature-word selection affects the representational capability of the text vector, so different groups of experiments can be run to select the best combination. However, the statistical results usually include some stop words and uncommon low-frequency words, so the influence of stop words can be removed using a rule-based stop-word list. Meanwhile, for the comparison experiments, 3000-dimensional feature vectors can be constructed for all feature selection methods; the details are shown in the following table.
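A minimal sketch of the chi-square (CHI) statistic used in this kind of feature selection, computed per term and class from a 2x2 contingency table; the function names and the top-k helper are illustrative, not from the patent.

```python
def chi_square(n11, n10, n01, n00):
    """CHI statistic for one (term, class) pair.

    n11 = docs in the class containing the term, n10 = docs outside the
    class containing the term, n01 = docs in the class without the term,
    n00 = docs outside the class without the term.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def select_top_k(term_scores, k=3000):
    """Keep the k highest-scoring terms as the feature vocabulary
    (k=3000 matching the 3000-dimensional vectors in the experiments)."""
    return [t for t, _ in sorted(term_scores.items(), key=lambda kv: -kv[1])[:k]]
```

A term distributed independently of the class scores zero, while a term concentrated in one class scores high and is kept in the vocabulary.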
On the other hand, after a plurality of words are obtained, emotion recognition is performed by using an emotion dictionary in addition to text features, and a recognition result recognized by the emotion dictionary may be referred to as a second text recognition result.
Currently published emotion dictionaries include the HowNet dictionary, the National Taiwan University sentiment dictionary (NTUSD), and the Dalian University of Technology emotion vocabulary ontology, among others. Any of them can be introduced as a base dictionary and improved upon, or a dictionary can be created from scratch; the specific content of the emotion dictionary is not limited here. When such a dictionary is used, low-frequency words that appear rarely or not at all in the training corpus are treated as uncommon words. TF-IDF is then used for word-frequency statistics on the experimental corpus, and some low-frequency emotion words are discarded through statistically weighted processing.
Text emotion classification based on an emotion dictionary is the simplest simulation of human memory and judgment. When learning with an emotion dictionary, some basic words are first memorized: negation words such as "not", joy words such as "like" and "love", anger words such as "hate", and so on, forming a basic corpus. The training input sentence is then split directly, the dictionary is checked for matching words, and the emotion is judged from the category of each word. Because dictionary-based classification rules are mechanical, different weights can be attached to different emotions, represented in one-hot form, under the assumption that emotion values satisfy the linear superposition principle. The sentence is then segmented; if the word vector of the segmented sentence contains a matching word, the corresponding signed weight is added, and the sentence's emotion is finally judged from the sign of the total weight.
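The dictionary-based scoring just described, with signed weights, negation flipping, and linear superposition of emotion values, can be sketched as follows. The toy lexicon and weights are invented for illustration; real work would use HowNet, NTUSD, or a similar dictionary.

```python
# Toy lexicon with signed weights; illustrative only.
POSITIVE = {"like": 1.0, "love": 2.0}
NEGATIVE = {"hate": -2.0}
NEGATORS = {"not", "never"}

def lexicon_score(tokens):
    """Sum signed word weights over the tokens; a negator flips the sign
    of the next sentiment word (emotion values assumed to superpose
    linearly). Positive total => positive emotion, negative => negative."""
    score, flip = 0.0, 1
    for tok in tokens:
        if tok in NEGATORS:
            flip = -1
            continue
        w = POSITIVE.get(tok, 0.0) + NEGATIVE.get(tok, 0.0)
        score += flip * w
        if w:  # the negation has been consumed by a sentiment word
            flip = 1
    return score
```

For example, "i love this" scores +2.0 while "i do not love this" scores -2.0, so the negation reverses the judged polarity.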
It should be noted that there is no strict sequence between step S102 and step S103, and the emotion recognition related to the sound in step S102 may be performed first, or the emotion recognition related to the text in step S103 may be performed first, which is not limited herein.
In addition, before training the models and extracting features, a knowledge base needs to be prepared in advance for model training. This includes selecting a voice data source for emotion recognition and determining the database capacity; obtaining the text content of the speech through a speech recognition interface to build a text library; and performing emotion labeling on the voice library and the text library so they can serve as training data. A labeling scheme and labeling criteria are formulated, an emotion labeling tool is developed to simplify the labeling process, and the labeled data is stored in correspondence with the speech and text.
And S104, fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized.
When the voice recognition result and the text recognition result are obtained, the voice recognition result and the text recognition result can be fused to obtain a final result corresponding to the voice to be recognized, and the final result can represent the emotional condition of the user expressed by the voice to be recognized. The fusion method may assign different weights according to different situations, for example, assigning weights according to emotional conditions corresponding to the voice recognition result and the text recognition result. The fusion method may also be to select the result with the most votes as the final result by voting.
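Both fusion strategies mentioned here, weighted combination and voting, can be sketched as follows. The class order and the 0.4/0.6 weights are assumed for illustration; the patent only states that weights may be assigned according to the emotional conditions of the two results.

```python
EMOTIONS = ["anger", "fear", "happy", "neutral", "sad", "surprise"]

def weighted_fusion(voice_probs, text_probs, w_voice=0.4, w_text=0.6):
    """Late fusion: weighted sum of the two per-class score vectors,
    then argmax over the fused scores."""
    fused = [w_voice * v + w_text * t for v, t in zip(voice_probs, text_probs)]
    return EMOTIONS[fused.index(max(fused))]

def majority_vote(*labels):
    """Alternative fusion: pick the label with the most votes."""
    return max(set(labels), key=labels.count)
```

With a voice result split between anger and happy and a text result favoring happy, the weighted fusion settles on happy, illustrating how the text modality can disambiguate acoustically confusable emotions.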
In one example, experiments were carried out according to the methods described in the embodiments of the present application. The following table shows the recognition accuracy and misrecognition rates for each emotion when only voice is considered. Five experiments were run per emotion and the arithmetic mean taken. Anger had the highest recognition rate, reaching 79.58%. Surprise had the lowest, only 45.90%, with a high probability of being misrecognized as happy, because surprise and happiness are similar in vocal emotional expression and easily confused. The average recognition rate over the six emotions was 64.50%.
(actual \ recognized) | anger | fear | happy | neutral | sad | surprise
---|---|---|---|---|---|---
anger | 79.58% | 4.75% | 8.98% | 1.93% | 2.47% | 2.59%
fear | 7.35% | 68.67% | 4.20% | 6.36% | 7.40% | 6.02%
happy | 16.09% | 5.17% | 52.57% | 2.61% | 6.08% | 17.48%
neutral | 7.59% | 10.92% | 9.73% | 64.61% | 2.41% | 4.74%
sad | 1.15% | 6.85% | 8.59% | 4.27% | 75.65% | 3.49%
surprise | 9.82% | 7.21% | 22.94% | 9.18% | 4.93% | 45.90%

Average recognition rate: 64.50%
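As a quick arithmetic check, the reported 64.50% average can be reproduced as the mean of the diagonal (correct-recognition) entries of the confusion table above:

```python
# Mean of the diagonal (correct-recognition) rates from the table above.
diagonal = [79.58, 68.67, 52.57, 64.61, 75.65, 45.90]
average = sum(diagonal) / len(diagonal)
print(round(average, 2))  # → 64.5
```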
The following table shows the recognition accuracy and misrecognition rate for each emotion when only the text is considered. Five experiments were performed for each emotion and the arithmetic mean was taken. The recognition rate for anger was again the highest, reaching 93.2%, while the recognition rate for surprise was the lowest, only 81.3%. The average recognition rate over the six emotions was 87.50%.
When emotion recognition is performed through both modalities, voice and text, the recognition results of the two modalities are fused; the final emotion recognition results are shown in the following table.
Emotion classification | Accuracy
---|---
anger | 0.92
fear | 0.94
happy | 0.90
sad | 0.89
surprise | 0.90
Average | 0.91
As can be observed from the table above and fig. 7, the differences between the recognition results for the individual emotions are reduced in the bimodal fusion results, so the recognition performance is stable across emotions. The emotion recognition performance of bimodal information fusion is clearly better than that of either single modality, which verifies the validity of bimodal fused emotion recognition. Compared with a single modality, the fused bimodal information covers both the acoustic variation and the semantic content of the voice more comprehensively, which benefits cross-judgment during model training and decision making and yields the best emotion recognition result.
As shown in fig. 8, an embodiment of the present application further provides an emotion recognition apparatus based on a neural network, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method according to any one of the embodiments described above.
The embodiment of the present application further provides a non-volatile computer storage medium for emotion recognition based on a neural network, storing computer-executable instructions configured to perform the method of any one of the embodiments described above.
The embodiments in the present application are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant points, reference may be made to the description of the method embodiments.
The device and the medium provided by the embodiments of the application correspond one to one with the method, so they share the beneficial technical effects of the corresponding method. Since those effects have been explained in detail above, they are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (10)
1. A neural network-based emotion recognition method is characterized by comprising the following steps:
determining a voice to be recognized corresponding to a user;
performing emotion recognition on the voice to be recognized through a pre-trained voice recognition model to obtain a voice recognition result;
converting the voice to be recognized into a text, and carrying out emotion recognition on the text to obtain a text recognition result;
and fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized.
2. The method of claim 1, wherein performing emotion recognition on the voice to be recognized through a pre-trained voice recognition model to obtain a voice recognition result comprises:
carrying out noise reduction pretreatment on the voice to be recognized;
extracting the spectral feature and the prosody feature of the speech to be recognized;
coupling the spectral features and the prosodic features to obtain the sound features of the speech to be recognized;
and performing emotion recognition on the voice features through a pre-trained voice recognition model to obtain a voice recognition result.
3. The method of claim 2, wherein performing noise reduction preprocessing on the speech to be recognized comprises:
carrying out normalization processing on the voice to be recognized;
carrying out frame-by-frame detection on the voice to be recognized, and calculating the zero crossing rate and the short-time energy of each frame of voice;
and dividing the voice to be recognized into a plurality of voice sections through endpoint detection so as to perform noise reduction preprocessing on the voice to be recognized.
4. The method according to claim 3, wherein the dividing of the speech to be recognized into speech segments by endpoint detection comprises:
if the zero crossing rate of the corresponding frame is higher than a preset zero crossing rate threshold value and the short-time energy is higher than a preset short-time energy threshold value, taking the corresponding frame as an initial frame;
if the zero crossing rate of a plurality of continuous voice frames is not higher than the zero crossing rate threshold value and the short-time energy is not higher than a preset short-time energy threshold value after the initial frame, taking the last frame of the plurality of continuous voice frames as an end frame;
and taking the part between the starting frame and the ending frame as a speech segment.
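The endpoint detection rule of claims 3-4 (frame-wise zero-crossing rate and short-time energy compared against preset thresholds) can be sketched as below. This is a simplified illustration: a segment starts at the first frame where both measures exceed their thresholds and, in this sketch, ends at the last such frame before activity drops; the thresholds and frame contents are illustrative assumptions.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frame)
    return float(np.mean(signs[:-1] != signs[1:]))

def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def endpoint_segments(frames, zcr_thresh, energy_thresh):
    """Return (start, end) frame-index pairs of detected speech segments."""
    segments, start = [], None
    for i, frame in enumerate(frames):
        active = (zero_crossing_rate(frame) > zcr_thresh
                  and short_time_energy(frame) > energy_thresh)
        if active and start is None:
            start = i                        # start frame: both thresholds exceeded
        elif not active and start is not None:
            segments.append((start, i - 1))  # segment ends at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frames) - 1))
    return segments

# Two bursts of a high-energy alternating signal separated by silence.
active = np.array([1.0, -1.0] * 4)
silent = np.zeros(8)
print(endpoint_segments([silent, active, active, silent, active, silent],
                        zcr_thresh=0.3, energy_thresh=1.0))  # → [(1, 2), (4, 4)]
```

The parts between each start and end frame are the speech segments kept for further feature extraction; frames outside them are treated as silence or noise.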
5. The method of claim 2, wherein the spectral features comprise: Mel-frequency cepstrum coefficients MFCC; the prosodic features include: at least one of speech rate, amplitude characteristics, pitch period and formants.
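Of the prosodic features in claim 5, the pitch period (which the machine translation renders as "gene period") is commonly estimated by autocorrelation. The sketch below shows one standard way to do so; it is not the patent's specific method, and the lag search range is an illustrative assumption.

```python
import numpy as np

def pitch_period(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the fundamental (pitch) period of a voiced frame, in
    seconds, by picking the autocorrelation peak inside a plausible
    pitch-lag range [fs/fmax, fs/fmin)."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # np.correlate in "full" mode; keep only non-negative lags 0..N-1.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return lag / fs

# A 200 Hz sine sampled at 8 kHz has a 5 ms fundamental period.
fs = 8000
t = np.arange(fs // 10) / fs
frame = np.sin(2 * np.pi * 200.0 * t)
print(round(pitch_period(frame, fs) * 1000, 2))  # period in ms, ~5.0
```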
6. The method of claim 1, wherein performing emotion recognition on the text to obtain a text recognition result comprises:
performing word segmentation on the text to obtain a plurality of words;
extracting text features of the vocabularies, and performing emotion recognition on the text features through a pre-trained text recognition model to obtain a first text recognition result;
and performing emotion recognition on the plurality of words through a preset emotion dictionary to obtain a second text recognition result.
7. The method of claim 6, wherein extracting text features of the plurality of words comprises:
and extracting text features of the vocabularies based on at least one of document frequency DF, mutual information MI and the chi-square statistic CHI.
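Of the three feature-selection statistics named in claim 7, the chi-square statistic CHI for a term/emotion-category pair can be computed from a 2x2 contingency table of document counts. A minimal sketch follows; the counts in the example are invented for illustration.

```python
def chi_square(A, B, C, D):
    """CHI statistic for a term/category pair.
    A: in-category docs containing the term, B: out-of-category docs
    containing it, C: in-category docs without it, D: out-of-category
    docs without it."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# A term that appears mostly inside the category scores high.
print(chi_square(40, 10, 10, 40))  # → 36.0
```

Terms are ranked by their CHI score per emotion category, and the top-ranked terms are kept as text features.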
8. The method of claim 6, wherein performing emotion recognition on the vocabularies through a preset emotion dictionary to obtain a second text recognition result, comprising:
and performing emotion recognition on the plurality of words through a preset emotion dictionary and weights corresponding to different preset emotions to obtain a second text recognition result.
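The dictionary-based branch of claim 8 (looking each word up in an emotion dictionary and accumulating the preset per-emotion weights) might look like the sketch below. The tiny dictionary, its weights, and the neutral fallback are illustrative assumptions, not the patent's actual lexicon.

```python
# Illustrative emotion dictionary: word -> {emotion: preset weight}.
EMOTION_DICT = {
    "great": {"happy": 1.0},
    "terrible": {"sad": 0.8, "anger": 0.4},
    "wow": {"surprise": 1.0, "happy": 0.3},
}

def dictionary_recognize(words):
    """Accumulate per-emotion weights over the words and return the
    highest-scoring emotion (neutral if no word is in the dictionary)."""
    scores = {}
    for w in words:
        for emotion, weight in EMOTION_DICT.get(w, {}).items():
            scores[emotion] = scores.get(emotion, 0.0) + weight
    return max(scores, key=scores.get) if scores else "neutral"

print(dictionary_recognize(["wow", "great"]))  # → happy (1.0 + 0.3)
print(dictionary_recognize(["hello"]))         # → neutral
```

The resulting label is the second text recognition result, which can then be fused with the model-based first result.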
9. An emotion recognition apparatus based on a neural network, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A non-transitory computer storage medium for neural network based emotion recognition, storing computer-executable instructions configured to perform the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011239769.0A CN112489688A (en) | 2020-11-09 | 2020-11-09 | Neural network-based emotion recognition method, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112489688A true CN112489688A (en) | 2021-03-12 |
Family
ID=74929235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011239769.0A Pending CN112489688A (en) | 2020-11-09 | 2020-11-09 | Neural network-based emotion recognition method, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112489688A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506550A (en) * | 2021-07-29 | 2021-10-15 | 北京花兰德科技咨询服务有限公司 | Artificial intelligent reading display and display method |
CN113689885A (en) * | 2021-04-09 | 2021-11-23 | 电子科技大学 | Intelligent auxiliary guide system based on voice signal processing |
CN113852524A (en) * | 2021-07-16 | 2021-12-28 | 天翼智慧家庭科技有限公司 | Intelligent household equipment control system and method based on emotional characteristic fusion |
CN114065742A (en) * | 2021-11-19 | 2022-02-18 | 马上消费金融股份有限公司 | Text detection method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102737629A (en) * | 2011-11-11 | 2012-10-17 | 东南大学 | Embedded type speech emotion recognition method and device |
CN104331506A (en) * | 2014-11-20 | 2015-02-04 | 北京理工大学 | Multiclass emotion analyzing method and system facing bilingual microblog text |
CN108108433A (en) * | 2017-12-19 | 2018-06-01 | 杭州电子科技大学 | A kind of rule-based and the data network integration sentiment analysis method |
US20190295533A1 (en) * | 2018-01-26 | 2019-09-26 | Shanghai Xiaoi Robot Technology Co., Ltd. | Intelligent interactive method and apparatus, computer device and computer readable storage medium |
CN110473571A (en) * | 2019-07-26 | 2019-11-19 | 北京影谱科技股份有限公司 | Emotion identification method and device based on short video speech |
CN110675859A (en) * | 2019-09-05 | 2020-01-10 | 华南理工大学 | Multi-emotion recognition method, system, medium, and apparatus combining speech and text |
CN111898670A (en) * | 2020-07-24 | 2020-11-06 | 深圳市声希科技有限公司 | Multi-mode emotion recognition method, device, equipment and storage medium |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113689885A (en) * | 2021-04-09 | 2021-11-23 | 电子科技大学 | Intelligent auxiliary guide system based on voice signal processing |
CN113852524A (en) * | 2021-07-16 | 2021-12-28 | 天翼智慧家庭科技有限公司 | Intelligent household equipment control system and method based on emotional characteristic fusion |
CN113506550A (en) * | 2021-07-29 | 2021-10-15 | 北京花兰德科技咨询服务有限公司 | Artificial intelligent reading display and display method |
CN113506550B (en) * | 2021-07-29 | 2022-07-05 | 北京花兰德科技咨询服务有限公司 | Artificial intelligent reading display and display method |
CN114065742A (en) * | 2021-11-19 | 2022-02-18 | 马上消费金融股份有限公司 | Text detection method and device |
CN114065742B (en) * | 2021-11-19 | 2023-08-25 | 马上消费金融股份有限公司 | Text detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111933129B (en) | Audio processing method, language model training method and device and computer equipment | |
CN107315737B (en) | Semantic logic processing method and system | |
CN112489688A (en) | Neural network-based emotion recognition method, device and medium | |
US8494853B1 (en) | Methods and systems for providing speech recognition systems based on speech recordings logs | |
CN110414004B (en) | Method and system for extracting core information | |
CN105654940B (en) | Speech synthesis method and device | |
US20240153509A1 (en) | Speaker separation based on real-time latent speaker state characterization | |
CN111785275A (en) | Voice recognition method and device | |
CN110322895B (en) | Voice evaluation method and computer storage medium | |
CN112233680A (en) | Speaker role identification method and device, electronic equipment and storage medium | |
JP2018072697A (en) | Phoneme collapse detection model learning apparatus, phoneme collapse section detection apparatus, phoneme collapse detection model learning method, phoneme collapse section detection method, program | |
CN112259083A (en) | Audio processing method and device | |
CN112015872A (en) | Question recognition method and device | |
CN115312030A (en) | Display control method and device of virtual role and electronic equipment | |
CN110738061A (en) | Ancient poetry generation method, device and equipment and storage medium | |
CN112908315B (en) | Question and answer intention judging method based on sound characteristics and voice recognition | |
CN112885335A (en) | Speech recognition method and related device | |
Elbarougy | Speech emotion recognition based on voiced emotion unit | |
KR20130126570A (en) | Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof | |
Al-Talabani et al. | Kurdish dialects and neighbor languages automatic recognition | |
Jiang et al. | Comparing feature dimension reduction algorithms for GMM-SVM based speech emotion recognition | |
CN116052655A (en) | Audio processing method, device, electronic equipment and readable storage medium | |
CN112489646B (en) | Speech recognition method and device thereof | |
CN114999463A (en) | Voice recognition method, device, equipment and medium | |
CN111782779B (en) | Voice question-answering method, system, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210312 |