CN108091323B - Method and apparatus for emotion recognition from speech - Google Patents

Method and apparatus for emotion recognition from speech

Info

Publication number
CN108091323B
CN108091323B (application CN201711378503.2A)
Authority
CN
China
Prior art keywords
feature matrix
machine learning
audio signal
length
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711378503.2A
Other languages
Chinese (zh)
Other versions
CN108091323A (en)
Inventor
C·C·多斯曼
B·N·利亚纳盖
T·J·M·厄斯特勒姆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangxiang Technology Beijing Co ltd
Original Assignee
Xiangxiang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangxiang Technology Beijing Co ltd filed Critical Xiangxiang Technology Beijing Co ltd
Priority to CN201711378503.2A priority Critical patent/CN108091323B/en
Publication of CN108091323A publication Critical patent/CN108091323A/en
Application granted granted Critical
Publication of CN108091323B publication Critical patent/CN108091323B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The present application relates to a method and apparatus for recognizing emotion from speech. A method for recognizing emotion from speech according to an embodiment of the present application may include: receiving an audio signal, performing data cleaning on the received audio signal, segmenting the cleaned audio signal into at least one segment, performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment, performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold, and performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal. Embodiments of the present application can handle audio signals of almost any size and can identify the emotion of whole speech in real time.

Description

Method and apparatus for emotion recognition from speech
Technical Field
The present application relates to emotion recognition technology, and more particularly, to a method and apparatus for recognizing emotion from speech.
Background
Voice communication between humans is complex and subtle: it conveys not only information in the form of words but also the speaker's current mental state. Emotion recognition, i.e., understanding the mental state of a speaker, is important and advantageous for many applications, including games, human-machine interactive interfaces, virtual agents, and the like. Psychologists have studied emotion recognition for many years and have developed many theories. Machine learning researchers have also explored this area and have reached a consensus that emotional states are encoded in speech.
Most existing speech systems can effectively process studio-recorded, neutral speech, but perform poorly when processing emotional speech. Current state-of-the-art emotion detectors achieve only about 40-50% accuracy when identifying four to five different types of primary emotion. The limited ability of speech recognition methods and systems to process emotional speech can thus be attributed to the difficulty of modeling and characterizing the emotion present in speech.
Improvements to speech recognition that allow the emotional state of a speaker to be recognized effectively and accurately therefore remain important and urgent.
Disclosure of Invention
One of the objects of the present application is to provide a method and apparatus for recognizing emotion from speech.
According to an embodiment of the present application, a method for recognizing emotion from speech may include: receiving an audio signal, performing data cleaning on the received audio signal, segmenting the cleaned audio signal into at least one segment, performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment, performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold, and performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
In an embodiment of the application, performing data cleaning on the received audio signal further comprises at least one of: removing noise in the audio signal, removing silence at the beginning and end of the audio signal based on a silence threshold, and removing sound fragments of the audio signal shorter than a predefined threshold. The silence threshold may be -50 dB and the predefined threshold may be 1/4 second. In another embodiment of the present application, performing data cleaning on the received audio signal may further include performing band-pass filtering on the received audio signal to limit the frequency of the audio signal to 100-400 kHz.
According to an embodiment of the present application, performing feature extraction on the at least one segment may further include extracting at least one of speaker gender, loudness, normalized spectral envelope, power spectral analysis, perceptual bandwidth, emotion blocks, and pitch coefficients from the audio signal. The size of the window for extracting mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of the at least one segment may be between 10-500 ms.
In another embodiment of the present application, the length threshold is not less than 1 second. Performing feature filling may further include: determining whether the length of the feature matrix reaches the length threshold; when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix to reach the length threshold; and, based on the calculated amount of data, filling features extracted from subsequent segments into the feature matrix to expand the feature matrix. According to an embodiment of the application, when the length of the feature matrix does not reach the length threshold, the valid features in the feature matrix are copied, based on the calculated amount of data, to expand the feature matrix. Moreover, the method may further comprise skipping the performing feature filling when the length of the feature matrix reaches the length threshold.
According to an embodiment of the present application, performing machine learning inference on the feature matrix may further include normalizing and scaling the feature matrix. Moreover, performing machine learning inference on the feature matrix may further include feeding the feature matrix to a machine learning model. The machine learning model may be a neural network. In another embodiment of the present application, the method may further include training a machine learning model to perform the machine learning inference. According to an embodiment of the present application, training a machine learning model may include: optimizing a number of model hyper-parameters, selecting a set of model hyper-parameters from the optimized model hyper-parameters, and measuring performance of the machine learning model using the selected set of model hyper-parameters. Optimizing a number of model hyper-parameters may further include: generating the number of hyper-parameters, training the machine learning model on sampled data using the number of hyper-parameters, and finding an optimal machine learning model during training of the machine learning model. The model hyper-parameter may be a model shape.
In an embodiment of the present application, performing machine learning inference on the feature matrix may further include generating an emotion score for at least one of arousal, temperament, and valence. The generated emotion scores may be combined.
Another embodiment of the present application provides an apparatus for emotion recognition from speech comprising a processor and a memory. Wherein computer programmable instructions for implementing a method of recognizing emotion from speech are stored in the memory and the processor is configured to execute the computer programmable instructions to implement the method of recognizing emotion from speech. The method for recognizing emotion from speech may be the method described above or other methods according to embodiments of the present application.
Yet another embodiment of the present application provides a non-transitory, computer-readable storage medium having computer programmable instructions stored therein. Wherein the computer programmable instructions are programmed to implement the foregoing or other methods of recognizing emotion from speech according to embodiments of the application.
Embodiments of the present application can handle audio signals of almost any size and can identify the emotion of whole speech in real time. In addition, by training the machine learning model, embodiments of the present application benefit from improvements in efficiency and accuracy.
Drawings
In order to describe the manner in which the advantages and features of the application can be obtained, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. These drawings depict only exemplary embodiments of the application and are not therefore to be considered to limit the scope of the application.
FIG. 1 is a block diagram illustrating a system for emotion recognition from speech according to an embodiment of the present application
FIG. 2 is a flow chart demonstrating a method for emotion recognition from speech according to an embodiment of the present application
FIG. 3 is a flow chart demonstrating a method for populating features into a feature matrix according to an embodiment of the present application
FIG. 4 is a flow chart demonstrating a method for training a machine learning model according to an embodiment of the present application
Detailed Description
The detailed description of the drawings is intended as a description of the presently preferred embodiments of the application and is not intended to represent the only forms in which the present application may be practiced. It is to be understood that the same or equivalent functions may be accomplished by different embodiments that are intended to be encompassed within the spirit and scope of the application.
Speech is a complex signal containing information such as the message, the speaker, the language, emotion, and so on. Understanding the emotion of a speaker is useful for many applications, including call centers, virtual agents, and other natural user interfaces. Current speech systems can achieve performance equivalent to humans only when they deal effectively with the underlying emotions. Complex speech systems should not be limited to pure message processing, but should also understand the underlying tendencies of the speaker by detecting expressions in the speech. Accordingly, emotion recognition from speech has become an increasingly important area in recent years.
According to one embodiment of the present application, affective information can be stored in the form of acoustic waves, which vary over time. A single acoustic wave may be formed by combining several different frequencies. Using a Fourier transform, a single acoustic wave can be decomposed back into its component frequencies. The information indicated by the component frequencies comprises the specific frequencies and their powers relative to each other. Embodiments of the present application can improve the efficiency and accuracy of emotion recognition from speech. At the same time, the method and apparatus for recognizing emotion from speech according to embodiments of the present application can process real-time, noisy speech robustly enough to recognize emotion.
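For illustration only (not part of the claimed method), the decomposition described above can be reproduced with a short NumPy sketch; the sample rate, tone frequencies, and amplitudes below are arbitrary assumptions.

```python
import numpy as np

# A minimal sketch: decompose a synthetic two-tone wave into its
# component frequencies and their powers relative to each other.
sr = 16000                                 # sample rate in Hz (illustrative)
t = np.arange(0, 1.0, 1.0 / sr)            # one second of samples
wave = 1.0 * np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

spectrum = np.fft.rfft(wave)               # Fourier transform of the acoustic wave
freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
power = np.abs(spectrum) ** 2

# Report the two strongest components and their relative powers.
top = np.argsort(power)[-2:][::-1]
for i in top:
    print(f"{freqs[i]:.0f} Hz, relative power {power[i] / power[top[0]]:.2f}")
```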
According to an embodiment of the present application, the basic stages of a method for emotion recognition from speech may include: receiving an audio signal, performing data cleaning on the received audio signal, dividing the cleaned audio signal into at least one segment, performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment, performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold, and performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
Further details of embodiments of the present application will be further demonstrated below with reference to the accompanying drawings.
FIG. 1 is a block diagram illustrating a system 100 for recognizing emotion from speech according to one embodiment of the present application.
As shown in FIG. 1, the system 100 for recognizing emotion from speech may include at least one hardware device 12 for receiving and recording the speech and a device 14 for recognizing emotion from speech in accordance with embodiments of the present application. The at least one hardware device 12 and the device 14 for recognizing emotion from speech may be connected via the internet 16 or a local area network. In another embodiment of the present application, the at least one hardware device 12 and the device 14 for recognizing emotion from speech may be directly connected by optical fiber, cable, or the like. The at least one hardware device 12 may be a call center, a human-machine interface, a virtual agent, or the like. In one embodiment of the present application, the at least one hardware device 12 may include a processor 120 and a number of peripherals. The number of peripherals may include a microphone 121; at least one computer memory or other non-transitory storage medium, such as RAM (Random Access Memory) 123 and internal storage 124; a network adapter 125; a display 127; and a speaker 129. Speech may be captured by microphone 121, recorded, digitized, and stored in RAM 123 as an audio signal. The audio signal may be transmitted from the at least one hardware device 12 to the device 14 for emotion recognition from speech via the internet 16, where the audio signal may be queued in a processing queue to await processing by the device 14 for emotion recognition from speech.
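Purely as a sketch of how recorded speech might be staged for the device 14, the snippet below reads a digitized recording and places it on a processing queue; the file name, the queue object, and the helper function are hypothetical and not taken from the patent.

```python
import queue
import wave

import numpy as np

# Hypothetical stand-in for the processing queue described above.
processing_queue = queue.Queue()

def enqueue_recording(path: str) -> None:
    """Read a 16-bit PCM WAV recording and queue it as a floating-point audio signal."""
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    signal = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    processing_queue.put(signal)

# Usage (assumes a recording captured by the hardware device 12 exists on disk):
# enqueue_recording("utterance.wav")
```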
In one embodiment of the present application, the apparatus 14 for recognizing emotion from speech may include a processor and a memory. Computer programmable instructions for implementing a method for emotion recognition from speech are stored in the memory and executable by the processor.
FIG. 2 is a flow diagram illustrating a method for emotion recognition from speech according to an embodiment of the present application.
As shown in FIG. 2, the method for emotion recognition from speech may receive an audio signal in step 200, such as from the processing queue shown in FIG. 1.
In step 202, data cleaning may be performed on the received audio signal. According to an embodiment of the present application, performing data cleaning on the received audio signal may further comprise at least one of: removing noise in the audio signal, removing silence at the beginning and end of the audio signal based on a silence threshold, and removing sound fragments of the audio signal shorter than a predefined threshold. For example, performing data cleaning on the received audio signal may further include performing band-pass filtering on the received audio signal to limit the frequency of the audio signal to 100-400 kHz in order to remove high-frequency and low-frequency noise from the audio signal. In one embodiment of the present application, the silence threshold may be -50 dB. In other words, sound fragments with loudness below -50 dB are considered silent and are removed from the audio signal. According to an embodiment of the present application, the predefined threshold may be 1/4 second. In other words, sound fragments shorter than 1/4 second in length are considered too short to be retained in the audio signal. In this way, data cleaning improves the efficiency and accuracy of the method for recognizing emotion from speech.
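A minimal sketch of the silence-trimming and short-fragment removal described above is given below, assuming a mono floating-point signal in the range [-1, 1] and an illustrative 20 ms analysis frame; the thresholds follow the text, everything else is an assumption. The band-pass filtering step could be added with, for example, a Butterworth filter from scipy.signal, but is omitted here for brevity.

```python
import numpy as np

def clean_signal(signal: np.ndarray, sr: int,
                 silence_db: float = -50.0, min_len_s: float = 0.25) -> np.ndarray:
    """Trim silence and drop sound fragments shorter than min_len_s (sketch only)."""
    frame = int(0.02 * sr)                           # 20 ms analysis frames (illustrative)
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    loud = 20 * np.log10(rms) > silence_db           # frames above the -50 dB silence threshold

    kept, start = [], None
    for i, is_loud in enumerate(loud):
        if is_loud and start is None:
            start = i
        elif not is_loud and start is not None:
            if (i - start) * frame >= min_len_s * sr:    # keep only fragments >= 1/4 second
                kept.append(frames[start:i].reshape(-1))
            start = None
    if start is not None and (n_frames - start) * frame >= min_len_s * sr:
        kept.append(frames[start:].reshape(-1))
    return np.concatenate(kept) if kept else np.array([], dtype=signal.dtype)
```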
According to an embodiment of the present application, in step 204, the cleaned audio signal may be divided into at least one segment; features are then extracted from the at least one segment in step 206, which may be implemented by a Fast Fourier Transform (FFT).
Extracting appropriate features is an important decision when developing any content from speech. Features are selected so as to represent the intended information. Those skilled in the art recognize three important classes of speech features, namely: excitation source features, vocal tract system features, and prosodic features. According to an embodiment of the application, mel-frequency cepstral coefficients and bark-frequency cepstral coefficients are extracted from the at least one segment. The size of the window for extracting mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of the at least one segment may be between 10-500 ms. Both mel-frequency cepstral coefficients and bark-frequency cepstral coefficients are prosodic features. For example, mel-frequency cepstral coefficients are the coefficients that collectively make up a mel-frequency cepstrum (MFC), which represents the short-term power spectrum of a sound. The mel-frequency cepstral coefficients may be based on a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency.
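As a sketch under stated assumptions, the MFCC part of this extraction can be computed with librosa using a window inside the 10-500 ms range given above; the window length, hop length, and number of coefficients are illustrative, and librosa has no built-in bark-frequency cepstral coefficients, so those are left as a placeholder comment.

```python
import librosa
import numpy as np

def extract_cepstral_features(segment: np.ndarray, sr: int,
                              win_s: float = 0.025, hop_s: float = 0.010) -> np.ndarray:
    """Return an (n_frames, n_coeffs) matrix of MFCCs for one segment (sketch only)."""
    n_fft = int(win_s * sr)        # 25 ms window, within the 10-500 ms range in the text
    hop = int(hop_s * sr)
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    # Bark-frequency cepstral coefficients would be stacked here as well; a separate
    # implementation or library would be needed, since librosa does not provide BFCCs.
    return mfcc.T                  # frames as rows, cepstral coefficients as columns
```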
In addition to the mel-frequency cepstral coefficients and bark-frequency cepstral coefficients, at least one other prosodic feature, such as speaker gender, loudness, normalized spectral envelope, power spectral analysis, perceptual bandwidth, emotion blocks, and pitch coefficients, can be extracted from the audio signal to further improve the result. In an embodiment of the present application, at least one of the excitation source features and the vocal tract system features may also be extracted.
The extracted features may be populated into a feature matrix based on a length threshold in step 208. In other words, after the extracted features are filled into the feature matrix, it is determined whether the length of the feature matrix reaches the length threshold. When the length of the feature matrix reaches the length threshold, the method for recognizing emotion from speech jumps from the step of performing feature filling to the subsequent step. Otherwise, the method for recognizing emotion from speech continues to fill features into the feature matrix to expand it until it reaches the length threshold. The length threshold may be no less than 1 second. In an embodiment of the application, the extracted mel-frequency cepstral coefficients and bark-frequency cepstral coefficients are populated into the feature matrix based on a length threshold of, for example, 1 second. Filling features into the feature matrix based on the length threshold enables real-time emotion recognition and allows emotion to be monitored throughout the duration of the conversation. According to embodiments of the present application, the length threshold may be any value greater than 1 second. In other words, embodiments of the present application can also process audio signals of any size greater than 1 second, a capability missing in conventional methods and apparatuses for recognizing emotion from speech.
In particular, FIG. 3 is a flow chart demonstrating a method for populating features into a feature matrix according to an embodiment of the present application.
As shown in FIG. 3, according to an embodiment of the present application, performing feature filling may further include: in step 300, determining whether the length of the feature matrix reaches the length threshold. When the length of the feature matrix reaches the length threshold, the method for recognizing emotion from speech jumps from the step of performing feature filling to a subsequent step, such as step 210 shown in FIG. 2. When the length of the feature matrix does not reach the length threshold, the amount of data that needs to be added to the feature matrix to reach the length threshold is calculated in step 302. In step 304, based on the calculated amount of data, features extracted from subsequent segments may be populated into the feature matrix, or valid features already in the feature matrix may be copied, to expand the feature matrix until it reaches the length threshold.
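One possible reading of the filling logic of FIG. 3 is sketched below, under the assumptions that the feature matrix has one row per frame and that the length threshold is expressed as a number of frames; both assumptions, along with the function name, are illustrative only.

```python
from typing import Optional

import numpy as np

def fill_feature_matrix(matrix: np.ndarray, threshold_frames: int,
                        next_segment_features: Optional[np.ndarray] = None) -> np.ndarray:
    """Expand the feature matrix until it spans at least threshold_frames rows (sketch)."""
    if matrix.shape[0] >= threshold_frames:
        return matrix                                  # threshold reached: skip feature filling
    needed = threshold_frames - matrix.shape[0]        # amount of data still required
    if next_segment_features is not None and len(next_segment_features) > 0:
        extra = next_segment_features[:needed]         # fill with features from a subsequent segment
    else:
        reps = int(np.ceil(needed / matrix.shape[0]))
        extra = np.tile(matrix, (reps, 1))[:needed]    # or copy the valid features already present
    return np.vstack([matrix, extra])
```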
Returning to FIG. 2, in an embodiment of the present application, the method for recognizing emotion from speech may further include performing machine learning inference on the feature matrix to identify the emotion indicated in the audio signal in step 210. Specifically, performing machine learning inference on the feature matrix may further include feeding the feature matrix to a machine learning model. In other words, an appropriate model is identified, together with the features, to capture emotion-specific information from the extracted features. In an embodiment of the present application, performing machine learning inference on the feature matrix may further include normalizing and scaling the feature matrix so that the machine learning model performing the inference can converge to a solution. In an embodiment of the present application, performing machine learning inference on the feature matrix may further include generating and outputting emotion scores for at least one of arousal, temperament, and valence, respectively. Each score may be in the range of 0-1. In one embodiment of the present application, emotion scores for at least one of arousal, temperament, and valence are generated and output separately, or may be combined into a single output score. Identifying emotion along arousal, temperament, and valence allows the application to gain more insight into the emotion conveyed by the audio signal. According to an embodiment of the present application, these three aspects of emotion can be further mapped into discrete categories. For example, temperament may be categorized as happy, angry, and so on. The emotion of the speaker indicated in the speech may be classified into one of these categories. A soft decision process can also be used when the emotion of the speaker is a mixture of the above categories at a given time, for example showing, at a certain moment, both the degree to which a person is happy and the degree to which that person is sad.
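The patent does not specify a network architecture; purely as an illustrative sketch, a small Keras model is shown below that takes a fixed-size, normalized feature matrix and outputs three scores in the 0-1 range, one each for arousal, temperament, and valence. Every shape and layer size here is an assumption.

```python
import numpy as np
import tensorflow as tf

N_FRAMES, N_COEFFS = 100, 13                 # illustrative fixed feature-matrix shape

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_COEFFS)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="sigmoid"),   # arousal, temperament, valence in 0-1
])

def infer_emotion_scores(feature_matrix: np.ndarray) -> dict:
    """Normalize and scale the feature matrix, then run it through the model (sketch only)."""
    scaled = (feature_matrix - feature_matrix.mean()) / (feature_matrix.std() + 1e-8)
    scores = model.predict(scaled[np.newaxis, ...], verbose=0)[0]
    return dict(zip(("arousal", "temperament", "valence"), scores.tolist()))
```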
In an embodiment of the present application, the method for recognizing emotion from speech may further include training a machine learning model to perform the machine learning inference. The machine learning model may be a neural network or another model training mechanism for training the model and learning the mapping between the final features and emotion classifications, for example, to find combinations of auditory gist features and their corresponding emotion classifications, such as anger, happiness, sadness, and the like. The training of these models may be performed in a separate training operation using input sound signals associated with one or more emotion categories. The resulting trained model can then be used, in normal operation, to recognize emotion from speech by passing auditory features derived from the speech signal through the trained model. The training steps can be repeated over and over again, allowing continual improvement in performing machine learning inference on the feature matrix: the more the model is trained, the better it becomes.
FIG. 4 is a flow diagram demonstrating a method for training a machine learning model according to an embodiment of the present application.
As shown in fig. 4, a method for training a machine learning model according to an embodiment of the present application may include: optimizing a number of model hyper-parameters in step 400, selecting a set of model hyper-parameters from the optimized model hyper-parameters in step 402; and measuring performance of the machine learning model using the selected set of model hyper-parameters in step 404. The model hyper-parameter may be a model shape.
According to an embodiment of the present application, optimizing a number of model hyper-parameters may further include: generating the number of model hyper-parameters; training the machine learning model on sampled data using the number of model hyper-parameters; and finding an optimal machine learning model during training of the machine learning model. By training the machine learning model, the method and apparatus can greatly improve efficiency and accuracy.
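One way to realize the three steps above is a simple random search over candidate hyper-parameters; the sketch below assumes a hypothetical build_and_train(params, X, y) helper that trains one model and returns a validation score together with the trained model, and the parameter ranges are arbitrary.

```python
import random

def optimize_hyperparameters(X_sample, y_sample, n_trials: int = 20):
    """Generate candidate hyper-parameters, train on sampled data, keep the best model."""
    best_score, best_params, best_model = float("-inf"), None, None
    for _ in range(n_trials):
        params = {                                   # generate a number of model hyper-parameters
            "hidden_units": random.choice([32, 64, 128]),
            "layers": random.choice([1, 2, 3]),      # a crude stand-in for "model shape"
            "learning_rate": 10 ** random.uniform(-4, -2),
        }
        score, model = build_and_train(params, X_sample, y_sample)  # hypothetical helper
        if score > best_score:                       # track the optimal model found during training
            best_score, best_params, best_model = score, params, model
    return best_params, best_model

# Performance of the selected model would then be measured on held-out data,
# e.g. evaluate(best_model, X_test, y_test) with another hypothetical helper.
```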
In one embodiment of the present application, the prior processing of emotion recognition, such as feature extraction and filling, may be performed separately from training the machine learning model, and may accordingly be performed separately on different devices.
The method according to embodiments of the present application may also be implemented on a programmed processor. However, the controller, flow diagrams, modules, and the like may also be implemented in a general purpose or special purpose computer; a programmed microprocessor or microcontroller and peripheral integrated circuit elements; an integrated circuit; and hardwired electronic or logic circuitry such as discrete element circuitry, programmable logic devices, and the like. In general, any device having a state machine disposed therein that is capable of implementing the flowcharts shown in the figures may be used to implement the processor functions of the present application. For example, an embodiment of the present application provides an apparatus for recognizing emotion from speech, comprising a processor and a memory. Wherein computer programmable instructions for implementing a method of recognizing emotion from speech are stored in the memory and the processor is configured to execute the computer programmable instructions to implement the method of recognizing emotion from speech. The method for recognizing emotion from speech may be the method described above or other methods according to embodiments of the present application.
A preferred alternative embodiment of the present application is to implement the methods of the present application on a non-transitory, computer-readable storage medium having stored thereon computer-programmable instructions. The instructions are preferably executed by a computer-executable component, which is preferably integrated into a network security system. The non-transitory, computer-readable storage medium may be stored on any suitable computer-readable medium, such as RAMs, ROMs, flash memory, EEPROMs, optical storage devices (CD or DVD), hard drives, floppy drives, or any other suitable device. The computer-executable components are preferably processors, but the instructions may alternatively or additionally be executed by any suitable dedicated hardware device. For example, one embodiment of the present application provides a non-transitory, computer-readable storage medium having computer-programmable instructions stored therein. Wherein the computer programmable instructions are programmed to implement the foregoing or other methods of recognizing emotion from speech according to embodiments of the application.
Although the present application has been described with respect to specific embodiments, it is evident that many alternatives, modifications, and variations will be apparent to those skilled in the art; for example, various components and steps of the embodiments may be interchanged, added, or substituted in other embodiments, and not all of the components and steps of the various figures may be required for a disclosed embodiment. Those skilled in the art may make various substitutions and modifications based on the teachings and disclosures herein without departing from the spirit of the present application, such as implementing only the elements of the independent claims. Therefore, the protection scope of the present application should not be limited to the disclosed embodiments, but should include various alternatives and modifications without departing from the scope of the present application as covered by the claims of the present patent application.

Claims (36)

1. A method for recognizing emotion from speech, the method comprising:
receiving an audio signal;
performing data cleaning on the received audio signal;
dividing the cleaned audio signal into at least one segment;
performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment;
performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold of the feature matrix; and
performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
2. The method of claim 1, wherein the performing data cleansing on a received audio signal further comprises at least one of:
removing noise in the audio signal;
removing silence at the beginning and end of the audio signal based on a silence threshold; and
removing sound fragments in the audio signal that are shorter than a predefined threshold.
3. The method of claim 2, wherein the silence threshold is -50 dB.
4. The method of claim 2, wherein the predefined threshold is 1/4 seconds.
5. The method as claimed in claim 1, wherein the performing data cleaning on the received audio signal further comprises performing band-pass filtering on the received audio signal to control the frequency of the audio signal to be 100-400 kHz.
6. The method of claim 1, wherein the performing feature extraction on the at least one segment further comprises extracting at least one of speaker gender, loudness, normalized spectral envelope, power spectral analysis, perceptual half-width, emotion blocks, and pitch coefficients from the audio signal.
7. The method according to claim 1, wherein a size of a window used for extracting mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of said at least one segment is between 10-500 ms.
8. The method of claim 1, wherein the length threshold is not less than 1 second.
9. The method of claim 1, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, filling features extracted from subsequent segments into the feature matrix to expand the feature matrix.
10. The method of claim 1, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, the valid features in the feature matrix are replicated to expand the feature matrix.
11. The method of claim 9 or 10, further comprising skipping the performing feature filling when the length of the feature matrix reaches the length threshold.
12. The method of claim 1, wherein the performing machine learning inference on the feature matrix further comprises normalizing and scaling the feature matrix.
13. The method of claim 1, wherein the performing machine learning inference on the feature matrix further comprises feeding the feature matrix to a machine learning model.
14. The method of claim 13, wherein the machine learning model is a neural network.
15. The method of claim 1, further comprising training a machine learning model to perform the machine learning inference.
16. The method of claim 15, wherein the training a machine learning model comprises:
optimizing a plurality of model hyper-parameters;
selecting a set of model hyper-parameters from the optimized model hyper-parameters; and
measuring performance of the machine learning model using the selected set of model hyper-parameters.
17. The method of claim 16, wherein the optimizing a number of model hyper-parameters further comprises:
generating the plurality of model hyper-parameters;
training the machine learning model on sampled data using the number of model hyper-parameters; and
finding an optimal machine learning model during training of the machine learning model.
18. The method of claim 16, wherein the model hyper-parameter is a model shape.
19. The method of claim 1, wherein the performing machine learning inference on the feature matrix further comprises generating an emotion score for at least one of arousal, temperament, and valence.
20. The method of claim 19, wherein the performing machine learning inference on the feature matrix further comprises combining generated sentiment scores.
21. A device for recognizing emotion from speech, comprising:
a processor; and
a memory;
wherein computer programmable instructions for implementing a method of recognizing emotion from speech are stored in the memory, and the processor is configured to execute the computer programmable instructions to:
receiving an audio signal;
performing data cleaning on the received audio signal;
dividing the cleaned audio signal into at least one segment;
performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment;
performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold of the feature matrix; and
performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
22. The device of claim 21, wherein the performing data cleansing on a received audio signal further comprises at least one of:
removing noise in the audio signal;
removing silence at the beginning and end of the audio signal based on a silence threshold; and
removing sound fragments in the audio signal that are shorter than a predefined threshold.
23. The device of claim 22, wherein the silence threshold is -50 dB.
24. The device of claim 22, wherein the predefined threshold is 1/4 seconds.
25. The device of claim 21, wherein a size of a window used to extract mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of the at least one segment is between 10-500 ms.
26. The device of claim 21, wherein the length threshold is not less than 1 second.
27. The device of claim 21, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, filling features extracted from subsequent segments into the feature matrix to expand the feature matrix.
28. The device of claim 21, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, the valid features in the feature matrix are replicated to expand the feature matrix.
29. The device of claim 27 or 28, wherein the performing feature filling is skipped when the length of the feature matrix reaches the length threshold.
30. The device of claim 21, wherein the performing machine learning inference on the feature matrix further comprises normalizing and scaling the feature matrix.
31. The device of claim 21, wherein the performing machine learning inference on the feature matrix further comprises feeding the feature matrix to a machine learning model.
32. The device of claim 21, wherein the processor is further configured to train a machine learning model to perform the machine learning inference.
33. The device of claim 32, wherein the training machine learning model includes:
optimizing a plurality of model hyper-parameters;
selecting a set of model hyper-parameters from the optimized model hyper-parameters; and
measuring performance of the machine learning model using the selected set of model hyper-parameters.
34. The device of claim 33, wherein the optimizing a number of model hyper-parameters further comprises:
generating the plurality of model hyper-parameters;
training the machine learning model on sampled data using the number of model hyper-parameters; and
finding an optimal machine learning model during training of the machine learning model.
35. The device of claim 21, wherein the performing machine learning inference on the feature matrix further comprises generating an emotion score for at least one of arousal, temperament, and valence.
36. A non-transitory, computer-readable storage medium having stored therein computer programmable instructions, wherein the computer programmable instructions are programmed to implement the method of recognizing emotion from speech according to claim 1.
CN201711378503.2A 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech Active CN108091323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711378503.2A CN108091323B (en) 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711378503.2A CN108091323B (en) 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech

Publications (2)

Publication Number Publication Date
CN108091323A CN108091323A (en) 2018-05-29
CN108091323B true CN108091323B (en) 2020-10-13

Family

ID=62177341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711378503.2A Active CN108091323B (en) 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech

Country Status (1)

Country Link
CN (1) CN108091323B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3729419A1 (en) * 2017-12-19 2020-10-28 Wonder Group Technologies Ltd. Method and apparatus for emotion recognition from speech
CN108806716A (en) * 2018-06-15 2018-11-13 想象科技(北京)有限公司 For the matched method and apparatus of computerization based on emotion frame
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
TWI807203B (en) * 2020-07-28 2023-07-01 華碩電腦股份有限公司 Voice recognition method and electronic device using the same
CN113192537B (en) * 2021-04-27 2024-04-09 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree acquisition method
CN113531806B (en) * 2021-06-28 2022-08-19 青岛海尔空调器有限总公司 Method and device for controlling air conditioner and air conditioner

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178897A (en) * 2007-12-05 2008-05-14 浙江大学 Speaking man recognizing method using base frequency envelope to eliminate emotion voice
CN101547261A (en) * 2008-03-27 2009-09-30 富士通株式会社 Association apparatus, association method, and recording medium
CN101599271A (en) * 2009-07-07 2009-12-09 华中科技大学 A kind of recognition methods of digital music emotion
CN103021406A (en) * 2012-12-18 2013-04-03 台州学院 Robust speech emotion recognition method based on compressive sensing
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106503646A (en) * 2016-10-19 2017-03-15 竹间智能科技(上海)有限公司 Multi-modal emotion identification system and method
WO2017100334A1 (en) * 2015-12-07 2017-06-15 Sri International Vpa with integrated object recognition and facial expression recognition
CN106919251A (en) * 2017-01-09 2017-07-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10056076B2 (en) * 2015-09-06 2018-08-21 International Business Machines Corporation Covariance matrix estimation with structural-based priors for speech processing


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Extraction of novel features for emotion recognition; LI Xiang et al.; J Shanghai Univ (Engl Ed); 2011-12-31; Vol. 15, No. 5; pp. 479-486 *

Also Published As

Publication number Publication date
CN108091323A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108091323B (en) Method and apparatus for emotion recognition from speech
Bhavan et al. Bagged support vector machines for emotion recognition from speech
US20210118464A1 (en) Method and apparatus for emotion recognition from speech
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN111161752A (en) Echo cancellation method and device
US11842722B2 (en) Speech synthesis method and system
WO2022046526A1 (en) Synthesized data augmentation using voice conversion and speech recognition models
Javidi et al. Speech emotion recognition by using combinations of C5. 0, neural network (NN), and support vector machines (SVM) classification methods
Chenchah et al. A bio-inspired emotion recognition system under real-life conditions
Subhashree et al. Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling
CN113782032B (en) Voiceprint recognition method and related device
Revathy et al. Performance comparison of speaker and emotion recognition
JP6373621B2 (en) Speech evaluation device, speech evaluation method, program
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Zouhir et al. A bio-inspired feature extraction for robust speech recognition
Koolagudi et al. Recognition of emotions from speech using excitation source features
CN113112992A (en) Voice recognition method and device, storage medium and server
WO2021152566A1 (en) System and method for shielding speaker voice print in audio signals
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
JP2016186516A (en) Pseudo-sound signal generation device, acoustic model application device, pseudo-sound signal generation method, and program
KR102415519B1 (en) Computing Detection Device for AI Voice
KR20230148048A (en) Method and system for synthesizing emotional speech based on emotion prediction
CN110895941A (en) Voiceprint recognition method and device and storage device
CN113129926A (en) Voice emotion recognition model training method, voice emotion recognition method and device
Fahmeeda et al. Voice Based Gender Recognition Using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant