CN112489688A - Neural network-based emotion recognition method, device and medium - Google Patents

Neural network-based emotion recognition method, device and medium Download PDF

Info

Publication number
CN112489688A
Authority
CN
China
Prior art keywords
voice
text
recognized
recognition
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011239769.0A
Other languages
Chinese (zh)
Inventor
周文铠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur General Software Co Ltd
Original Assignee
Inspur General Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur General Software Co Ltd filed Critical Inspur General Software Co Ltd
Priority to CN202011239769.0A priority Critical patent/CN112489688A/en
Publication of CN112489688A publication Critical patent/CN112489688A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/09 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being zero crossing rates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a neural network-based emotion recognition method, device and medium, wherein the method comprises the following steps: determining a voice to be recognized corresponding to a user; performing emotion recognition on the voice to be recognized through a pre-trained voice recognition model to obtain a voice recognition result; converting the voice to be recognized into a text, and performing emotion recognition on the text to obtain a text recognition result; and fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized. When emotion is recognized from the user's voice, the user's emotion is judged through two modalities rather than through voice or text alone, so the recognition effect is far better than that of a single modality and the effectiveness of bimodal fusion emotion recognition is guaranteed. Compared with single-modal information, the fused bimodal information covers the acoustic variation information and the semantic information in the voice more completely, which facilitates cross judgment in model training and decision making and yields the optimal emotion recognition result.

Description

Neural network-based emotion recognition method, device and medium
Technical Field
The application relates to the field of emotion recognition, in particular to an emotion recognition method, equipment and medium based on a neural network.
Background
With the development of multimedia technology, today's big-data environment provides an important data source for emotion computing.
Generally, emotion computing performs corresponding recognition processing on different types of collected data, such as image data, voice data and text data. Emotion recognition on voice data mainly models the speech signal using its acoustic and prosodic characteristics. However, conventional speech emotion recognition analyzes only the sound signal of the speech and ignores the rich content information expressed in it, so the emotional expression cannot be described well. As a result, existing emotion recognition results for speech are inaccurate.
Disclosure of Invention
In order to solve the above problem, the present application provides an emotion recognition method based on a neural network, including: determining a voice to be recognized corresponding to a user; performing emotion recognition on the voice to be recognized through a pre-trained voice recognition model to obtain a voice recognition result; converting the voice to be recognized into a text, and carrying out emotion recognition on the text to obtain a text recognition result; and fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized.
In one example, performing emotion recognition on the speech to be recognized through a pre-trained voice recognition model to obtain a voice recognition result, including: carrying out noise reduction pretreatment on the voice to be recognized; extracting the spectral feature and the prosody feature of the speech to be recognized; coupling the spectral features and the prosodic features to obtain the sound features of the speech to be recognized; and performing emotion recognition on the voice features through a pre-trained voice recognition model to obtain a voice recognition result.
In one example, the noise reduction preprocessing is performed on the speech to be recognized, and comprises the following steps: carrying out normalization processing on the voice to be recognized; carrying out frame-by-frame detection on the voice to be recognized, and calculating the zero crossing rate and the short-time energy of each frame of voice; and dividing the voice to be recognized into a plurality of voice sections through endpoint detection so as to perform noise reduction preprocessing on the voice to be recognized.
In one example, the voice to be recognized is divided into a plurality of voice segments through endpoint detection, including: if the zero crossing rate of the corresponding frame is higher than a preset zero crossing rate threshold value and the short-time energy is higher than a preset short-time energy threshold value, taking the corresponding frame as an initial frame; if the zero crossing rate of a plurality of continuous voice frames is not higher than the zero crossing rate threshold value and the short-time energy is not higher than a preset short-time energy threshold value after the initial frame, taking the last frame of the plurality of continuous voice frames as an end frame; and taking the part between the starting frame and the ending frame as a speech segment.
In one example, the spectral features include: Mel-frequency cepstrum coefficients MFCC; the prosodic features include: at least one of speech rate, amplitude characteristics, pitch period and formants.
In one example, performing emotion recognition on the text to obtain a text recognition result, including: performing word segmentation on the text to obtain a plurality of words; extracting text features of the vocabularies, and performing emotion recognition on the text features through a pre-trained text recognition model to obtain a first text recognition result; and performing emotion recognition on the plurality of words through a preset emotion dictionary to obtain a second text recognition result.
In one example, extracting textual features of the number of words includes: and extracting text features of the vocabularies based on at least one of document frequency DF, mutual information MI and CHI-square statistics CHI.
In one example, the emotion recognition is performed on the vocabularies through a preset emotion dictionary to obtain a second text recognition result, and the method includes: and performing emotion recognition on the plurality of words through a preset emotion dictionary and weights corresponding to different preset emotions to obtain a second text recognition result.
On the other hand, the application also provides emotion recognition equipment based on the neural network, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of the examples above.
In another aspect, the present application further provides a non-volatile computer storage medium for neural network based emotion recognition, storing computer-executable instructions configured to: a method as in any preceding example.
The emotion recognition method based on the neural network can bring the following beneficial effects:
when emotion is recognized from the user's voice, the voice and the text are fused, and the user's emotion is judged through two modalities; the recognition effect is far better than that of a single modality, and the effectiveness of bimodal fusion emotion recognition is guaranteed. Compared with single-modal information, the fused bimodal information covers the acoustic variation information and the semantic information in the voice more completely, which facilitates cross judgment in model training and decision making, so that the optimal emotion recognition result can be obtained.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of an emotion recognition method based on a neural network in an embodiment of the present application;
FIG. 2 is a flow chart of an emotion recognition method based on a neural network in the embodiment of the present application;
FIG. 3 is a block diagram illustrating a flow chart corresponding to a voice recognition result in an embodiment of the present application;
FIG. 4 is a block diagram of a flow corresponding to a text recognition result in an embodiment of the present application;
FIG. 5 is a flowchart of sound feature extraction according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a training and recognition process of a voice recognition model according to an embodiment of the present application;
FIG. 7 is a graph showing the effect of the experiment in the example of the present application;
fig. 8 is a schematic diagram of an emotion recognition apparatus based on a neural network in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that there is no uniform recording standard or labeling format for existing emotion data; the EMO-DB emotion database, the Chinese Academy of Sciences speech emotion database (CASIA), etc. are commonly used. The CASIA Chinese emotion corpus was recorded by the Institute of Automation, Chinese Academy of Sciences; it contains 9600 utterances spoken by four different speakers and covers six basic emotion categories: anger, happiness, surprise, fear, sadness and calm. These six emotions can be used as the emotion classes in the embodiments of the present application.
As shown in fig. 1 and fig. 2, an embodiment of the present application provides an emotion recognition method based on a neural network, including:
s101, determining the voice to be recognized corresponding to the user.
In order to recognize the emotion of the user from voice, the voice of the user needs to be acquired first, and this voice may be referred to herein as the voice to be recognized. The voice to be recognized may be one or more segments and may be obtained through corresponding software and devices; how the voice to be recognized is obtained is not limited here.
S102, emotion recognition is carried out on the voice to be recognized through a pre-trained voice recognition model, and a voice recognition result is obtained.
As shown in fig. 3, in order to effectively recognize the emotion of the voice to be recognized, a corresponding voice recognition model may be trained in advance, and the voice to be recognized may then be subjected to emotion recognition by this model to obtain the voice-related emotion recognition result (referred to herein as the voice recognition result). When training the voice recognition model, the voice data can be preprocessed and subjected to feature extraction, the obtained single-modal features of the voice data are used for model training, shallow learning models and deep learning models are used for classification learning, and the model with the best recognition result is taken as the voice recognition model.
Specifically, in the preprocessing and noise-reduction stage, in order to retain the useful part of the voice data and locate the speech that yields a recognizable text result, a threshold-based endpoint detection algorithm may be used to mark the start and end points of all speech segments in the voice. The implementation can be as follows: a zero-crossing-rate threshold and a short-time-energy threshold are preset. The voice data is input and normalized, then detected frame by frame, and the zero-crossing rate and short-time energy of each frame are calculated. If both the zero-crossing rate and the short-time energy of a frame exceed the corresponding thresholds, that is, the zero-crossing rate is higher than the zero-crossing-rate threshold and the short-time energy is higher than the short-time-energy threshold, the frame is marked as the start point of a speech segment. After the start frame, if the zero-crossing rate and short-time energy of several consecutive frames do not exceed the corresponding thresholds, the last frame of these consecutive frames is marked as the end point of the speech segment, and a speech segment is thereby generated. The remaining voice data is scanned in the same way, so the voice to be recognized can be divided into several speech segments. Noise such as silence and high-frequency noise can be effectively removed through endpoint detection, and complete speech segments are obtained.
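By way of non-limiting illustration only (this code is not part of the original disclosure), the threshold-based endpoint detection described above can be sketched roughly as follows in Python; the frame length, hop size, thresholds and minimum silence run are assumed values chosen for the example, not values given in this application.

import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Split the normalized 1-D signal into frames (25 ms / 10 ms at 16 kHz assumed).
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return [x[i * hop: i * hop + frame_len] for i in range(n_frames)]

def zero_crossing_rate(frame):
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

def short_time_energy(frame):
    return float(np.sum(np.asarray(frame, dtype=np.float64) ** 2))

def detect_segments(x, zcr_thr=0.1, energy_thr=1e-3, min_silence=5):
    # A segment starts at a frame whose ZCR and short-time energy both exceed
    # their thresholds, and ends at the last frame of a run of consecutive
    # frames in which both values stay at or below the thresholds.
    x = np.asarray(x, dtype=np.float64)
    x = x / (np.max(np.abs(x)) + 1e-12)          # normalization step from the text
    frames = frame_signal(x)
    segments, start, silent_run = [], None, 0
    for i, f in enumerate(frames):
        z, e = zero_crossing_rate(f), short_time_energy(f)
        if start is None:
            if z > zcr_thr and e > energy_thr:
                start, silent_run = i, 0         # start frame of a speech segment
        else:
            if z <= zcr_thr and e <= energy_thr:
                silent_run += 1
                if silent_run >= min_silence:    # several consecutive "silent" frames
                    segments.append((start, i))  # end frame closes the segment
                    start, silent_run = None, 0
            else:
                silent_run = 0
    if start is not None:
        segments.append((start, len(frames) - 1))
    return segments

The returned frame indices can then be mapped back to sample positions to cut the voice to be recognized into speech segments.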
As shown in fig. 5, in the feature extraction process, a high-pass filter and a Hamming window may be used to process the voice to be recognized; a Mel filter bank and a discrete cosine transform are then used to obtain the static, first-order and second-order Mel parameters, and finally the MFCC parameters are obtained and used as the spectral features of the voice to be recognized. At the same time, prosodic features of the voice to be recognized are extracted, which may include: speech rate, amplitude characteristics, pitch period, formants, etc. The prosodic features are then coupled with the spectral features, and the coupled features are referred to as sound features. Except for the speech-rate feature, the other feature parameters have many dimensions, so statistical parameters of these features are extracted. Specific parameters of the multi-dimensional speech features are shown in the following table; mainly the MFCC parameters and statistics of some prosodic features are selected.
[Table: specific parameters of the multi-dimensional speech features; provided as images in the original publication, content not reproduced here.]
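As a non-limiting illustration of the feature extraction described above, the following sketch assumes that the librosa library is used; the MFCC order, the pitch-tracking range and the chosen statistics are assumptions of the example rather than values specified in this application.

import numpy as np
import librosa

def extract_sound_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # simple pre-emphasis (high-pass)

    # Spectral part: static MFCC plus first- and second-order differences.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    spec = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),
                      librosa.feature.delta(mfcc, order=2)])
    spectral_stats = np.concatenate([spec.mean(axis=1), spec.std(axis=1)])

    # Prosodic part: pitch (fundamental frequency) and amplitude statistics.
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    prosodic_stats = np.array([f0.mean(), f0.std(), rms.mean(), rms.std(), rms.max()])

    # "Coupling" is modelled here as concatenation of the two statistic vectors.
    return np.concatenate([spectral_stats, prosodic_stats])

Formant and speech-rate features are omitted from this sketch for brevity.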
Of course, when training the voice recognition model, as shown in fig. 6, a training set and a test set may be divided from the corpus and preprocessed with the noise-reduction approach of the above embodiments. The neural network is then trained on the training set, and the parameters of the neural network are adjusted by comparing the classifier output with the training labels. The trained model is then evaluated on the test set by comparing its output with the test labels, so as to obtain the recognition result and verify the accuracy of the voice recognition model. In the model building process, an LSTM network may be used for the final model. The preprocessing, noise-reduction and feature-extraction steps during training are essentially the same as those used when processing the voice to be recognized with the voice recognition model in the above embodiments, and are not repeated here.
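The LSTM network mentioned above could, for example, be built as in the following sketch; the layer sizes, optimizer and loss function are illustrative assumptions, and frame-level feature sequences of dimension feat_dim (zero-padded to a common length) are assumed as input.

import tensorflow as tf

def build_sound_model(feat_dim, n_classes=6):
    # Sequence of frame-level features -> utterance-level emotion distribution.
    model = tf.keras.Sequential([
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, feat_dim)),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training and evaluation on the divided sets, as described above:
# model.fit(train_x, train_y, epochs=30, validation_data=(test_x, test_y))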
S103, converting the voice to be recognized into a text, and performing emotion recognition on the text to obtain a text recognition result.
As shown in fig. 4, in addition to recognizing the sound of the voice to be recognized, emotion recognition can also be performed on the text corresponding to the voice to be recognized. That is, the voice to be recognized is first converted into text through speech recognition, and emotion recognition is then performed on the text to obtain the corresponding text recognition result.
Specifically, the text may be first subjected to word segmentation to obtain a plurality of words. Then, for each vocabulary, the corresponding text features are extracted. And then performing emotion recognition on the text features by using the trained text recognition model to obtain a corresponding text recognition result (referred to as a first text recognition result).
In the process of extracting text features, feature selection may be performed on the words in the text using document frequency (DF), mutual information (MI) and chi-square statistics (CHI). The quality of the selected feature words affects how well the text vector characterizes the text, so different groups of experiments can be run to select the best combination. However, the statistical results usually contain some stop words and uncommon low-frequency words, so the influence of stop words can be removed with a rule-based stop-word list. Meanwhile, for the comparison experiments, 3000-dimensional feature vectors are constructed for all feature selection methods; the details are shown in the following table.
[Table: feature selection settings for the 3000-dimensional text vectors; provided as images in the original publication, content not reproduced here.]
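As a non-limiting illustration, CHI-based selection of a 3000-dimensional text feature vector could be sketched with scikit-learn as follows; the use of the jieba tokenizer and the stop-word handling are assumptions of the example.

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def build_text_features(texts, labels, stop_words, k=3000):
    # Word segmentation + counting, with a rule-based stop-word list applied.
    vectorizer = CountVectorizer(tokenizer=jieba.lcut, stop_words=list(stop_words))
    counts = vectorizer.fit_transform(texts)
    # Keep the k words with the highest chi-square score against the emotion labels.
    selector = SelectKBest(chi2, k=min(k, counts.shape[1])).fit(counts, labels)
    return selector.transform(counts), vectorizer, selector

DF-based or MI-based selection can be obtained in the same way by swapping the scoring function (e.g. mutual_info_classif for MI).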
On the other hand, after a plurality of words are obtained, emotion recognition is performed by using an emotion dictionary in addition to text features, and a recognition result recognized by the emotion dictionary may be referred to as a second text recognition result.
The currently available emotion dictionaries include the HowNet dictionary, the National Taiwan University sentiment dictionary (NTUSD), the Dalian University of Technology emotion vocabulary ontology, etc. They can be introduced as a basic dictionary and improved upon, or a dictionary can be built from scratch; the specific content of the emotion dictionary is not limited here. When such a dictionary is used, low-frequency words appear rarely or not at all in the training corpus and are not common words. Therefore, TF-IDF is used to perform word-frequency statistics on the experimental corpus, and some low-frequency emotion words are discarded through statistical weighting.
Text emotion classification based on an emotion dictionary is the simplest simulation of how humans memorize and judge. When learning with an emotion dictionary, some basic words are first memorized, for example negation words such as 'not', words expressing joy such as 'like' and 'love', and words expressing anger such as 'hate', forming a basic corpus. The input sentence is then split in the most direct way, the dictionary is checked for the corresponding words, and the emotion is judged according to the category of each word. Because dictionary-based text emotion classification rules are mechanical, different weights can be assigned to different emotions, represented in one-hot form, under the assumption that emotion values satisfy the principle of linear superposition. The sentence is then segmented; if the word vector of the segmented sentence contains a corresponding word, the associated weight is added, and the emotion of the sentence is finally judged from the sign of the total weight.
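A minimal sketch of the dictionary-plus-weights rule described above is given below; the word lists, weights and function names are illustrative placeholders and are not taken from the emotion dictionaries cited above.

import jieba

# Each emotion keeps a word set and a weight; emotion values are assumed to add linearly.
EMOTION_LEXICON = {
    "happy": ({"like", "love"}, 1.0),
    "angry": ({"hate"}, 1.0),
    # ... remaining emotions and their weighted word lists
}

def dictionary_recognize(sentence):
    words = jieba.lcut(sentence)
    scores = {emotion: weight * sum(1 for w in words if w in lexicon)
              for emotion, (lexicon, weight) in EMOTION_LEXICON.items()}
    # The emotion with the largest accumulated weight is taken as the second text recognition result.
    return max(scores, key=scores.get), scores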
It should be noted that there is no strict sequence between step S102 and step S103, and the emotion recognition related to the sound in step S102 may be performed first, or the emotion recognition related to the text in step S103 may be performed first, which is not limited herein.
In addition, before training the models and extracting features, a knowledge base needs to be prepared in advance for model training. This includes selecting a voice data source for emotion recognition and determining the capacity of the database; obtaining the text content expressed in the voice through a speech recognition interface to form a text library; performing emotion labeling on the voice library and the text library so that they can be used as training data during model training; formulating a labeling scheme and labeling criteria; developing an emotion labeling tool to simplify the labeling process; and storing the labeled data in correspondence with the voice and text.
And S104, fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized.
After the voice recognition result and the text recognition result are obtained, they can be fused to obtain a final result corresponding to the voice to be recognized, and this final result represents the emotional state of the user expressed by the voice to be recognized. The fusion method may assign different weights according to different situations, for example assigning weights according to the emotional conditions corresponding to the voice recognition result and the text recognition result. The fusion method may also select, by voting, the result with the most votes as the final result.
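By way of illustration, the decision-level fusion step could be sketched as follows, assuming that each modality outputs one score (e.g. a probability) per emotion; the fusion weights and the emotion ordering are assumptions of the example.

import numpy as np
from collections import Counter

EMOTIONS = ["anger", "fear", "happy", "neutral", "sad", "surprise"]

def fuse_weighted(voice_scores, text_scores, w_voice=0.4, w_text=0.6):
    # Weighted combination of the two modality scores; argmax gives the final emotion.
    fused = w_voice * np.asarray(voice_scores) + w_text * np.asarray(text_scores)
    return EMOTIONS[int(np.argmax(fused))], fused

def fuse_by_voting(*predicted_labels):
    # Alternative: majority vote over the per-modality predicted labels.
    return Counter(predicted_labels).most_common(1)[0][0]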
In one example, corresponding experiments were performed according to the methods described in the embodiments of the present application. The following table shows the recognition accuracy and misrecognition rates of each emotion when only the sound is considered. Five experiments were performed for each emotion and the arithmetic mean was taken. The recognition rate of anger was the highest, reaching 79.58%. The recognition rate of surprise was the lowest, only 45.90%, and it was frequently misrecognized as happy, because surprise and happiness are similar in vocal emotional expression and easily confused. The average recognition rate over the six emotions was 64.50%.
Actual \ Recognized   anger     fear      happy     neutral   sad       surprise
anger                 79.58%    4.75%     8.98%     1.93%     2.47%     2.59%
fear                  7.35%     68.67%    4.20%     6.36%     7.40%     6.02%
happy                 16.09%    5.17%     52.57%    2.61%     6.08%     17.48%
neutral               7.59%     10.92%    9.73%     64.61%    2.41%     4.74%
sad                   1.15%     6.85%     8.59%     4.27%     75.65%    3.49%
surprise              9.82%     7.21%     22.94%    9.18%     4.93%     45.90%
Average recognition rate: 64.50%
The following table shows the recognition accuracy and misrecognition rates of each emotion when only the text is considered. Five experiments were performed for each emotion and the arithmetic mean was taken. The recognition rate of anger was again the highest, reaching 93.2%. The recognition rate of surprise was the lowest, only 81.3%, and the average recognition rate over the six emotions was 87.50%.
[Table: text-only recognition accuracy and misrecognition rates for each emotion; provided as an image in the original publication, content not reproduced here.]
When emotion recognition is performed through both modalities, i.e. the sound modality and the text modality, and the recognition results of the two modalities are fused, the final emotion recognition results are as shown in the following table.
Emotion classification   Accuracy
angry                    0.92
fear                     0.94
happy                    0.90
sad                      0.89
surprise                 0.90
Average                  0.91
It can be clearly observed from the above table and fig. 7 that the differences between the recognition results of different emotions are reduced in the bimodal fusion result, so the recognition effect for each emotion is stable. The emotion recognition effect of bimodal information fusion is far better than that of a single modality, which verifies the effectiveness of bimodal fusion emotion recognition. Compared with single-modal information, the fused bimodal information covers the acoustic variation information and the semantic information in the voice more completely, which facilitates cross judgment in model training and decision making, so that the optimal emotion recognition result can be obtained.
As shown in fig. 8, an embodiment of the present application further provides an emotion recognition apparatus based on a neural network, including:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method according to any one of the embodiments described above.
The embodiment of the present application further provides a non-volatile computer storage medium for emotion recognition based on a neural network, in which computer-executable instructions are stored, and the computer-executable instructions are set to: a method as in any preceding embodiment.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and media embodiments, the description is relatively simple as it is substantially similar to the method embodiments, and reference may be made to some descriptions of the method embodiments for relevant points.
The device and the medium provided by the embodiment of the application correspond to the method one to one, so the device and the medium also have the similar beneficial technical effects as the corresponding method, and the beneficial technical effects of the method are explained in detail above, so the beneficial technical effects of the device and the medium are not repeated herein.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A neural network-based emotion recognition method is characterized by comprising the following steps:
determining a voice to be recognized corresponding to a user;
performing emotion recognition on the voice to be recognized through a pre-trained voice recognition model to obtain a voice recognition result;
converting the voice to be recognized into a text, and carrying out emotion recognition on the text to obtain a text recognition result;
and fusing the voice recognition result and the text recognition result to obtain a final result corresponding to the voice to be recognized.
2. The method of claim 1, wherein performing emotion recognition on the speech to be recognized through a pre-trained voice recognition model to obtain a voice recognition result, the method comprising:
carrying out noise reduction pretreatment on the voice to be recognized;
extracting the spectral feature and the prosody feature of the speech to be recognized;
coupling the spectral features and the prosodic features to obtain the sound features of the speech to be recognized;
and performing emotion recognition on the voice features through a pre-trained voice recognition model to obtain a voice recognition result.
3. The method of claim 2, wherein performing noise reduction preprocessing on the speech to be recognized comprises:
carrying out normalization processing on the voice to be recognized;
carrying out frame-by-frame detection on the voice to be recognized, and calculating the zero crossing rate and the short-time energy of each frame of voice;
and dividing the voice to be recognized into a plurality of voice sections through endpoint detection so as to perform noise reduction preprocessing on the voice to be recognized.
4. The method according to claim 3, wherein the dividing of the speech to be recognized into speech segments by endpoint detection comprises:
if the zero crossing rate of the corresponding frame is higher than a preset zero crossing rate threshold value and the short-time energy is higher than a preset short-time energy threshold value, taking the corresponding frame as an initial frame;
if the zero crossing rate of a plurality of continuous voice frames is not higher than the zero crossing rate threshold value and the short-time energy is not higher than a preset short-time energy threshold value after the initial frame, taking the last frame of the plurality of continuous voice frames as an end frame;
and taking the part between the starting frame and the ending frame as a speech segment.
5. The method of claim 2, wherein the spectral features comprise: Mel-frequency cepstrum coefficients MFCC; the prosodic features include: at least one of speech rate, amplitude characteristics, pitch period and formants.
6. The method of claim 1, wherein performing emotion recognition on the text to obtain a text recognition result comprises:
performing word segmentation on the text to obtain a plurality of words;
extracting text features of the vocabularies, and performing emotion recognition on the text features through a pre-trained text recognition model to obtain a first text recognition result;
and performing emotion recognition on the plurality of words through a preset emotion dictionary to obtain a second text recognition result.
7. The method of claim 6, wherein extracting text features of the plurality of words comprises:
and extracting text features of the vocabularies based on at least one of document frequency DF, mutual information MI and CHI-square statistics CHI.
8. The method of claim 6, wherein performing emotion recognition on the vocabularies through a preset emotion dictionary to obtain a second text recognition result, comprising:
and performing emotion recognition on the plurality of words through a preset emotion dictionary and weights corresponding to different preset emotions to obtain a second text recognition result.
9. An emotion recognition apparatus based on a neural network, comprising:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A non-transitory computer storage medium for neural network based emotion recognition, storing computer-executable instructions, the computer-executable instructions configured to: the method of any one of claims 1-8.
CN202011239769.0A 2020-11-09 2020-11-09 Neural network-based emotion recognition method, device and medium Pending CN112489688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011239769.0A CN112489688A (en) 2020-11-09 2020-11-09 Neural network-based emotion recognition method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011239769.0A CN112489688A (en) 2020-11-09 2020-11-09 Neural network-based emotion recognition method, device and medium

Publications (1)

Publication Number Publication Date
CN112489688A true CN112489688A (en) 2021-03-12

Family

ID=74929235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011239769.0A Pending CN112489688A (en) 2020-11-09 2020-11-09 Neural network-based emotion recognition method, device and medium

Country Status (1)

Country Link
CN (1) CN112489688A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506550A (en) * 2021-07-29 2021-10-15 北京花兰德科技咨询服务有限公司 Artificial intelligent reading display and display method
CN113689885A (en) * 2021-04-09 2021-11-23 电子科技大学 Intelligent auxiliary guide system based on voice signal processing
CN113852524A (en) * 2021-07-16 2021-12-28 天翼智慧家庭科技有限公司 Intelligent household equipment control system and method based on emotional characteristic fusion
CN114065742A (en) * 2021-11-19 2022-02-18 马上消费金融股份有限公司 Text detection method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737629A (en) * 2011-11-11 2012-10-17 东南大学 Embedded type speech emotion recognition method and device
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
CN108108433A (en) * 2017-12-19 2018-06-01 杭州电子科技大学 A kind of rule-based and the data network integration sentiment analysis method
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN110473571A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Emotion identification method and device based on short video speech
CN110675859A (en) * 2019-09-05 2020-01-10 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN111898670A (en) * 2020-07-24 2020-11-06 深圳市声希科技有限公司 Multi-mode emotion recognition method, device, equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737629A (en) * 2011-11-11 2012-10-17 东南大学 Embedded type speech emotion recognition method and device
CN104331506A (en) * 2014-11-20 2015-02-04 北京理工大学 Multiclass emotion analyzing method and system facing bilingual microblog text
CN108108433A (en) * 2017-12-19 2018-06-01 杭州电子科技大学 A kind of rule-based and the data network integration sentiment analysis method
US20190295533A1 (en) * 2018-01-26 2019-09-26 Shanghai Xiaoi Robot Technology Co., Ltd. Intelligent interactive method and apparatus, computer device and computer readable storage medium
CN110473571A (en) * 2019-07-26 2019-11-19 北京影谱科技股份有限公司 Emotion identification method and device based on short video speech
CN110675859A (en) * 2019-09-05 2020-01-10 华南理工大学 Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN111898670A (en) * 2020-07-24 2020-11-06 深圳市声希科技有限公司 Multi-mode emotion recognition method, device, equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689885A (en) * 2021-04-09 2021-11-23 电子科技大学 Intelligent auxiliary guide system based on voice signal processing
CN113852524A (en) * 2021-07-16 2021-12-28 天翼智慧家庭科技有限公司 Intelligent household equipment control system and method based on emotional characteristic fusion
CN113506550A (en) * 2021-07-29 2021-10-15 北京花兰德科技咨询服务有限公司 Artificial intelligent reading display and display method
CN113506550B (en) * 2021-07-29 2022-07-05 北京花兰德科技咨询服务有限公司 Artificial intelligent reading display and display method
CN114065742A (en) * 2021-11-19 2022-02-18 马上消费金融股份有限公司 Text detection method and device
CN114065742B (en) * 2021-11-19 2023-08-25 马上消费金融股份有限公司 Text detection method and device

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN107315737B (en) Semantic logic processing method and system
CN112489688A (en) Neural network-based emotion recognition method, device and medium
US8494853B1 (en) Methods and systems for providing speech recognition systems based on speech recordings logs
CN110414004B (en) Method and system for extracting core information
CN105654940B (en) Speech synthesis method and device
US20240153509A1 (en) Speaker separation based on real-time latent speaker state characterization
CN111785275A (en) Voice recognition method and device
CN110322895B (en) Voice evaluation method and computer storage medium
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
JP2018072697A (en) Phoneme collapse detection model learning apparatus, phoneme collapse section detection apparatus, phoneme collapse detection model learning method, phoneme collapse section detection method, program
CN112259083A (en) Audio processing method and device
CN112015872A (en) Question recognition method and device
CN115312030A (en) Display control method and device of virtual role and electronic equipment
CN110738061A (en) Ancient poetry generation method, device and equipment and storage medium
CN112908315B (en) Question and answer intention judging method based on sound characteristics and voice recognition
CN112885335A (en) Speech recognition method and related device
Elbarougy Speech emotion recognition based on voiced emotion unit
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
Al-Talabani et al. Kurdish dialects and neighbor languages automatic recognition
Jiang et al. Comparing feature dimension reduction algorithms for GMM-SVM based speech emotion recognition
CN116052655A (en) Audio processing method, device, electronic equipment and readable storage medium
CN112489646B (en) Speech recognition method and device thereof
CN114999463A (en) Voice recognition method, device, equipment and medium
CN111782779B (en) Voice question-answering method, system, mobile terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210312

RJ01 Rejection of invention patent application after publication