CN108091323B - Method and apparatus for emotion recognition from speech - Google Patents

Method and apparatus for emotion recognition from speech

Info

Publication number
CN108091323B
CN108091323B (application CN201711378503.2A)
Authority
CN
China
Prior art keywords
feature matrix
machine learning
audio signal
length
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711378503.2A
Other languages
Chinese (zh)
Other versions
CN108091323A (en)
Inventor
C·C·多斯曼
B·N·利亚纳盖
T·J·M·厄斯特勒姆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangxiang Technology Beijing Co ltd
Original Assignee
Xiangxiang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangxiang Technology Beijing Co ltd filed Critical Xiangxiang Technology Beijing Co ltd
Priority to CN201711378503.2A priority Critical patent/CN108091323B/en
Publication of CN108091323A publication Critical patent/CN108091323A/en
Application granted granted Critical
Publication of CN108091323B publication Critical patent/CN108091323B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The present application relates to a method and apparatus for recognizing emotion from speech. A method for recognizing emotion from speech according to an embodiment of the present application may include: receiving an audio signal, performing data cleaning on the received audio signal, segmenting the cleaned audio signal into at least one segment, performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment, performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold, and performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal. Embodiments of the present application can handle audio signals of almost any size and can identify the emotion of whole speech in real time.

Description

Method and apparatus for emotion recognition from speech
Technical Field
The present application relates to emotion recognition technology, and more particularly, to a method and apparatus for recognizing emotion from speech.
Background
Voice communication between humans is complex and subtle: it conveys not only information in the form of words but also the speaker's current mental state. Emotion recognition, i.e., understanding the mental state of a speaker, is important and advantageous for many applications, including games, human-machine interactive interfaces, virtual agents, and the like. Psychologists have studied emotion recognition for many years and have developed many theories. Machine learning researchers have also explored this area and have reached a consensus that emotional states are encoded in speech.
Most existing speech systems can effectively process studio-recorded, neutral speech, but perform poorly when processing emotional speech. Current state-of-the-art emotion detectors achieve only about 40-50% accuracy when identifying four to five different types of primary emotion. The limited ability of speech recognition methods and systems to process emotional speech can thus be attributed to the difficulty of modeling and characterizing the emotion present in speech.
Improvements to speech recognition that allow the emotional state of a speaker to be recognized effectively and accurately therefore remain important and urgent.
Disclosure of Invention
One of the objects of the present application is to provide a method and apparatus for recognizing emotion from speech.
According to an embodiment of the present application, a method for recognizing emotion from speech may include: receiving an audio signal, performing data cleaning on the received audio signal, segmenting the cleaned audio signal into at least one segment, performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment, performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold, and performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
In an embodiment of the application, performing data cleaning on the received audio signal further comprises at least one of: removing noise in the audio signal, removing silence at the beginning and end of the audio signal based on a silence threshold, and removing sound fragments of the audio signal shorter than a predefined threshold. The silence threshold may be -50 dB and the predefined threshold may be 1/4 second. In another embodiment of the present application, performing data cleaning on the received audio signal may further include performing band-pass filtering on the received audio signal to limit the frequency of the audio signal to 100-400 kHz.
According to an embodiment of the present application, performing feature extraction on the at least one segment may further include extracting at least one of speaker gender, loudness, normalized spectral envelope, power spectral analysis, perceptual bandwidth, emotion blocks, and pitch coefficients from the audio signal. The size of the window for extracting mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of the at least one segment may be between 10-500 ms.
In another embodiment of the present application, the length threshold is not less than 1 second. Performing feature filling may further include: determining whether the length of the feature matrix reaches the length threshold; when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix to reach the length threshold; and, based on the calculated amount of data, filling features extracted from subsequent segments into the feature matrix to expand the feature matrix. According to an embodiment of the application, when the length of the feature matrix does not reach the length threshold, the valid features in the feature matrix are copied, based on the calculated amount of data, to expand the feature matrix. Moreover, the method may further comprise skipping the performing feature filling when the length of the feature matrix reaches the length threshold.
According to an embodiment of the present application, performing machine learning inference on the feature matrix may further include normalizing and scaling the feature matrix. Moreover, performing machine learning inference on the feature matrix may further include feeding the feature matrix to a machine learning model. The machine learning model may be a neural network. In another embodiment of the present application, the method may further include training a machine learning model to perform the machine learning inference. According to an embodiment of the present application, training a machine learning model may include: optimizing a number of model hyper-parameters, selecting a set of model hyper-parameters from the optimized model hyper-parameters, and measuring performance of the machine learning model using the selected set of model hyper-parameters. Optimizing a number of model hyper-parameters may further include: generating the number of hyper-parameters, training the machine learning model on sampled data using the number of hyper-parameters, and finding an optimal machine learning model during training of the machine learning model. The model hyper-parameter may be a model shape.
In an embodiment of the present application, performing machine learning inference on the feature matrix may further include generating an emotion score for at least one of arousal, temperament, and valence. The generated emotion scores may be combined.
Another embodiment of the present application provides an apparatus for emotion recognition from speech comprising a processor and a memory. Wherein computer programmable instructions for implementing a method of recognizing emotion from speech are stored in the memory and the processor is configured to execute the computer programmable instructions to implement the method of recognizing emotion from speech. The method for recognizing emotion from speech may be the method described above or other methods according to embodiments of the present application.
Yet another embodiment of the present application provides a non-transitory, computer-readable storage medium having computer programmable instructions stored therein. Wherein the computer programmable instructions are programmed to implement the foregoing or other methods of recognizing emotion from speech according to embodiments of the application.
Embodiments of the present application can handle audio signals of almost any size and can identify the emotion of whole speech in real time. In addition, by training the machine learning model, embodiments of the present application benefit from improvements in efficiency and accuracy.
Drawings
In order to describe the manner in which the advantages and features of the application can be obtained, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. These drawings depict only exemplary embodiments of the application and are not therefore to be considered to limit the scope of the application.
FIG. 1 is a block diagram illustrating a system for emotion recognition from speech according to an embodiment of the present application
FIG. 2 is a flow chart demonstrating a method for emotion recognition from speech according to an embodiment of the present application
FIG. 3 is a flow chart demonstrating a method for populating features into a feature matrix according to an embodiment of the present application
FIG. 4 is a flow chart demonstrating a method for training a machine learning model according to an embodiment of the present application
Detailed Description
The detailed description of the drawings is intended as a description of the presently preferred embodiments of the application and is not intended to represent the only forms in which the present application may be practiced. It is to be understood that the same or equivalent functions may be accomplished by different embodiments that are intended to be encompassed within the spirit and scope of the application.
Speech is a complex signal containing information such as the message, the speaker, the language, emotion, and so on. Understanding the emotion of a speaker is useful for many applications, including call centers, virtual agents, and other natural user interfaces. Current speech systems can achieve performance equivalent to humans only when they deal effectively with the underlying emotions. Complex speech systems should not be limited to pure message processing, but should also understand the underlying tendencies of the speaker by detecting expressions in the speech. Accordingly, emotion recognition from speech has become an increasingly important area in recent years.
According to one embodiment of the present application, affective information can be stored in the form of acoustic waves, which vary over time. A single acoustic wave may be formed by combining several different frequencies. Using a Fourier transform, a single acoustic wave can be decomposed back into its component frequencies. The information indicated by the component frequencies comprises the specific frequencies and their powers relative to each other. Embodiments of the present application can improve the efficiency and accuracy of emotion recognition from speech. At the same time, the method and apparatus for recognizing emotion from speech according to embodiments of the present application can process real-time, noisy speech robustly enough to recognize emotion.
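For illustration only (not part of the claimed method), the decomposition described above can be reproduced with a short NumPy sketch; the sample rate, tone frequencies, and amplitudes below are arbitrary assumptions.

```python
import numpy as np

# A minimal sketch: decompose a synthetic two-tone wave into its
# component frequencies and their powers relative to each other.
sr = 16000                                 # sample rate in Hz (illustrative)
t = np.arange(0, 1.0, 1.0 / sr)            # one second of samples
wave = 1.0 * np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

spectrum = np.fft.rfft(wave)               # Fourier transform of the acoustic wave
freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
power = np.abs(spectrum) ** 2

# Report the two strongest components and their relative powers.
top = np.argsort(power)[-2:][::-1]
for i in top:
    print(f"{freqs[i]:.0f} Hz, relative power {power[i] / power[top[0]]:.2f}")
```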
According to an embodiment of the present application, the basic stages of a method for emotion recognition from speech may include: receiving an audio signal, performing data cleaning on the received audio signal, dividing the cleaned audio signal into at least one segment, performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment, performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold, and performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
Further details of embodiments of the present application will be further demonstrated below with reference to the accompanying drawings.
FIG. 1 is a block diagram illustrating a system 100 for recognizing emotion from speech according to one embodiment of the present application.
As shown in FIG. 1, the system 100 for recognizing emotion from speech may include at least one hardware device 12 for receiving and recording the speech and a device 14 for recognizing emotion from speech in accordance with embodiments of the present application. The at least one hardware device 12 and the device 14 for recognizing emotion from speech may be connected via the internet 16 or a local area network. In another embodiment of the present application, the at least one hardware device 12 and the device 14 for recognizing emotion from speech may be directly connected by optical fiber, cable, or the like. The at least one hardware device 12 may be a call center, a human-machine interface, a virtual agent, or the like. In one embodiment of the present application, the at least one hardware device 12 may include a processor 120 and a number of peripherals. The number of peripherals may include a microphone 121; at least one computer memory or other non-transitory storage medium, such as RAM (Random Access Memory) 123 and internal storage 124; a network adapter 125; a display 127; and a speaker 129. Speech may be captured by microphone 121, recorded, digitized, and stored in RAM 123 as an audio signal. The audio signal may be transmitted from the at least one hardware device 12 to the device 14 for emotion recognition from speech via the internet 16, where the audio signal may be queued in a processing queue to await processing by the device 14 for emotion recognition from speech.
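Purely as a sketch of how recorded speech might be staged for the device 14, the snippet below reads a digitized recording and places it on a processing queue; the file name, the queue object, and the helper function are hypothetical and not taken from the patent.

```python
import queue
import wave

import numpy as np

# Hypothetical stand-in for the processing queue described above.
processing_queue = queue.Queue()

def enqueue_recording(path: str) -> None:
    """Read a 16-bit PCM WAV recording and queue it as a floating-point audio signal."""
    with wave.open(path, "rb") as wf:
        raw = wf.readframes(wf.getnframes())
    signal = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    processing_queue.put(signal)

# Usage (assumes a recording captured by the hardware device 12 exists on disk):
# enqueue_recording("utterance.wav")
```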
In one embodiment of the present application, the apparatus 14 for recognizing emotion from speech may include a processor and a memory. Computer programmable instructions for implementing a method for emotion recognition from speech are stored in the memory and executable by the processor.
FIG. 2 is a flow diagram illustrating a method for emotion recognition from speech according to an embodiment of the present application.
As shown in FIG. 2, the method for emotion recognition from speech may receive an audio signal in step 200, such as from the processing queue shown in FIG. 1.
In step 202, data cleaning may be performed on the received audio signal. According to an embodiment of the present application, performing data cleaning on the received audio signal may further comprise at least one of: removing noise in the audio signal, removing silence at the beginning and end of the audio signal based on a silence threshold, and removing sound fragments of the audio signal shorter than a predefined threshold. For example, performing data cleaning on the received audio signal may further include performing band-pass filtering on the received audio signal to limit the frequency of the audio signal to 100-400 kHz in order to remove high-frequency and low-frequency noise from the audio signal. In one embodiment of the present application, the silence threshold may be -50 dB. In other words, sound fragments with loudness below -50 dB are considered silent and are removed from the audio signal. According to an embodiment of the present application, the predefined threshold may be 1/4 second. In other words, sound fragments shorter than 1/4 second in length are considered too short to be retained in the audio signal. In this way, data cleaning improves the efficiency and accuracy of the method for recognizing emotion from speech.
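A minimal sketch of the silence-trimming and short-fragment removal described above is given below, assuming a mono floating-point signal in the range [-1, 1] and an illustrative 20 ms analysis frame; the thresholds follow the text, everything else is an assumption. The band-pass filtering step could be added with, for example, a Butterworth filter from scipy.signal, but is omitted here for brevity.

```python
import numpy as np

def clean_signal(signal: np.ndarray, sr: int,
                 silence_db: float = -50.0, min_len_s: float = 0.25) -> np.ndarray:
    """Trim silence and drop sound fragments shorter than min_len_s (sketch only)."""
    frame = int(0.02 * sr)                           # 20 ms analysis frames (illustrative)
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    loud = 20 * np.log10(rms) > silence_db           # frames above the -50 dB silence threshold

    kept, start = [], None
    for i, is_loud in enumerate(loud):
        if is_loud and start is None:
            start = i
        elif not is_loud and start is not None:
            if (i - start) * frame >= min_len_s * sr:    # keep only fragments >= 1/4 second
                kept.append(frames[start:i].reshape(-1))
            start = None
    if start is not None and (n_frames - start) * frame >= min_len_s * sr:
        kept.append(frames[start:].reshape(-1))
    return np.concatenate(kept) if kept else np.array([], dtype=signal.dtype)
```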
According to an embodiment of the present application, in step 204, the cleaned audio signal may be divided into at least one segment; features are then extracted from the at least one segment in step 206, which may be implemented by a Fast Fourier Transform (FFT).
Extracting appropriate features is an important decision when developing any content from speech. Features are selected so as to represent the intended information. Those skilled in the art recognize three important classes of speech features, namely: excitation source features, vocal tract system features, and prosodic features. According to an embodiment of the application, mel-frequency cepstral coefficients and bark-frequency cepstral coefficients are extracted from the at least one segment. The size of the window for extracting mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of the at least one segment may be between 10-500 ms. Both mel-frequency cepstral coefficients and bark-frequency cepstral coefficients are prosodic features. For example, mel-frequency cepstral coefficients are the coefficients that collectively make up a mel-frequency cepstrum (MFC), which represents the short-term power spectrum of a sound. The mel-frequency cepstral coefficients may be based on a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency.
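As a sketch under stated assumptions, the MFCC part of this extraction can be computed with librosa using a window inside the 10-500 ms range given above; the window length, hop length, and number of coefficients are illustrative, and librosa has no built-in bark-frequency cepstral coefficients, so those are left as a placeholder comment.

```python
import librosa
import numpy as np

def extract_cepstral_features(segment: np.ndarray, sr: int,
                              win_s: float = 0.025, hop_s: float = 0.010) -> np.ndarray:
    """Return an (n_frames, n_coeffs) matrix of MFCCs for one segment (sketch only)."""
    n_fft = int(win_s * sr)        # 25 ms window, within the 10-500 ms range in the text
    hop = int(hop_s * sr)
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    # Bark-frequency cepstral coefficients would be stacked here as well; a separate
    # implementation or library would be needed, since librosa does not provide BFCCs.
    return mfcc.T                  # frames as rows, cepstral coefficients as columns
```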
In addition to the mel-frequency cepstral coefficients and bark-frequency cepstral coefficients, at least one other prosodic feature, such as speaker gender, loudness, normalized spectral envelope, power spectral analysis, perceptual bandwidth, emotion blocks, and pitch coefficients, can be extracted from the audio signal to further improve the result. In an embodiment of the present application, at least one of the excitation source features and the vocal tract system features may also be extracted.
The extracted features may be populated into a feature matrix based on a length threshold in step 208. In other words, after the extracted features are filled into the feature matrix, it is determined whether the length of the feature matrix reaches the length threshold. When the length of the feature matrix reaches the length threshold, the method for recognizing emotion from speech jumps from the step of performing feature filling to the subsequent step. Otherwise, the method for recognizing emotion from speech continues to fill features into the feature matrix to expand it until it reaches the length threshold. The length threshold may be no less than 1 second. In an embodiment of the application, the extracted mel-frequency cepstral coefficients and bark-frequency cepstral coefficients are populated into the feature matrix based on a length threshold of, for example, 1 second. Filling features into the feature matrix based on the length threshold enables real-time emotion recognition and allows emotion to be monitored throughout the duration of the conversation. According to embodiments of the present application, the length threshold may be any value greater than 1 second. In other words, embodiments of the present application can also process audio signals of any size greater than 1 second, a capability missing in conventional methods and apparatuses for recognizing emotion from speech.
In particular, FIG. 3 is a flow chart demonstrating a method for populating features into a feature matrix according to an embodiment of the present application.
As shown in FIG. 3, according to an embodiment of the present application, performing feature filling may further include: in step 300, determining whether the length of the feature matrix reaches the length threshold. When the length of the feature matrix reaches the length threshold, the method for recognizing emotion from speech jumps from the step of performing feature filling to a subsequent step, such as step 210 shown in FIG. 2. When the length of the feature matrix does not reach the length threshold, the amount of data that needs to be added to the feature matrix to reach the length threshold is calculated in step 302. In step 304, based on the calculated amount of data, features extracted from subsequent segments may be populated into the feature matrix, or valid features already in the feature matrix may be copied, to expand the feature matrix until it reaches the length threshold.
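One possible reading of the filling logic of FIG. 3 is sketched below, under the assumptions that the feature matrix has one row per frame and that the length threshold is expressed as a number of frames; both assumptions, along with the function name, are illustrative only.

```python
from typing import Optional

import numpy as np

def fill_feature_matrix(matrix: np.ndarray, threshold_frames: int,
                        next_segment_features: Optional[np.ndarray] = None) -> np.ndarray:
    """Expand the feature matrix until it spans at least threshold_frames rows (sketch)."""
    if matrix.shape[0] >= threshold_frames:
        return matrix                                  # threshold reached: skip feature filling
    needed = threshold_frames - matrix.shape[0]        # amount of data still required
    if next_segment_features is not None and len(next_segment_features) > 0:
        extra = next_segment_features[:needed]         # fill with features from a subsequent segment
    else:
        reps = int(np.ceil(needed / matrix.shape[0]))
        extra = np.tile(matrix, (reps, 1))[:needed]    # or copy the valid features already present
    return np.vstack([matrix, extra])
```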
Returning to FIG. 2, in an embodiment of the present application, the method for recognizing emotion from speech may further include performing machine learning inference on the feature matrix to identify the emotion indicated in the audio signal in step 210. Specifically, performing machine learning inference on the feature matrix may further include feeding the feature matrix to a machine learning model. In other words, an appropriate model is identified, together with the features, to capture emotion-specific information from the extracted features. In an embodiment of the present application, performing machine learning inference on the feature matrix may further include normalizing and scaling the feature matrix so that the machine learning model performing the inference can converge to a solution. In an embodiment of the present application, performing machine learning inference on the feature matrix may further include generating and outputting emotion scores for at least one of arousal, temperament, and valence, respectively. Each score may be in the range of 0-1. In one embodiment of the present application, emotion scores for at least one of arousal, temperament, and valence are generated and output separately, or may be combined into a single output score. Identifying emotion along arousal, temperament, and valence allows the application to gain more insight into the emotion conveyed by the audio signal. According to an embodiment of the present application, these three aspects of emotion can be further mapped into discrete categories. For example, temperament may be categorized as happy, angry, and so on. The emotion of the speaker indicated in the speech may be classified into one of these categories. A soft decision process can also be used when the emotion of the speaker is a mixture of the above categories at a given time, for example showing, at a certain moment, both the degree to which a person is happy and the degree to which that person is sad.
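The patent does not specify a network architecture; purely as an illustrative sketch, a small Keras model is shown below that takes a fixed-size, normalized feature matrix and outputs three scores in the 0-1 range, one each for arousal, temperament, and valence. Every shape and layer size here is an assumption.

```python
import numpy as np
import tensorflow as tf

N_FRAMES, N_COEFFS = 100, 13                 # illustrative fixed feature-matrix shape

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_COEFFS)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="sigmoid"),   # arousal, temperament, valence in 0-1
])

def infer_emotion_scores(feature_matrix: np.ndarray) -> dict:
    """Normalize and scale the feature matrix, then run it through the model (sketch only)."""
    scaled = (feature_matrix - feature_matrix.mean()) / (feature_matrix.std() + 1e-8)
    scores = model.predict(scaled[np.newaxis, ...], verbose=0)[0]
    return dict(zip(("arousal", "temperament", "valence"), scores.tolist()))
```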
In an embodiment of the present application, the method for recognizing emotion from speech may further include training a machine learning model to perform the machine learning inference. The machine learning model may be a neural network or another model training mechanism for training the model and learning the mapping between the final features and emotion classifications, for example, to find combinations of auditory gist features and their corresponding emotion classifications, such as anger, happiness, sadness, and the like. The training of these models may be performed in a separate training operation using input sound signals associated with one or more emotion categories. The resulting trained model can then be used, in normal operation, to recognize emotion from speech by passing auditory features derived from the speech signal through the trained model. The training steps can be repeated over and over again, allowing continual improvement in performing machine learning inference on the feature matrix: the more the model is trained, the better it becomes.
FIG. 4 is a flow diagram demonstrating a method for training a machine learning model according to an embodiment of the present application.
As shown in fig. 4, a method for training a machine learning model according to an embodiment of the present application may include: optimizing a number of model hyper-parameters in step 400, selecting a set of model hyper-parameters from the optimized model hyper-parameters in step 402; and measuring performance of the machine learning model using the selected set of model hyper-parameters in step 404. The model hyper-parameter may be a model shape.
According to an embodiment of the present application, optimizing a number of model hyper-parameters may further include: generating the number of model hyper-parameters; training the machine learning model on sampled data using the number of model hyper-parameters; and finding an optimal machine learning model during training of the machine learning model. By training the machine learning model, the method and apparatus can greatly improve efficiency and accuracy.
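One way to realize the three steps above is a simple random search over candidate hyper-parameters; the sketch below assumes a hypothetical build_and_train(params, X, y) helper that trains one model and returns a validation score together with the trained model, and the parameter ranges are arbitrary.

```python
import random

def optimize_hyperparameters(X_sample, y_sample, n_trials: int = 20):
    """Generate candidate hyper-parameters, train on sampled data, keep the best model."""
    best_score, best_params, best_model = float("-inf"), None, None
    for _ in range(n_trials):
        params = {                                   # generate a number of model hyper-parameters
            "hidden_units": random.choice([32, 64, 128]),
            "layers": random.choice([1, 2, 3]),      # a crude stand-in for "model shape"
            "learning_rate": 10 ** random.uniform(-4, -2),
        }
        score, model = build_and_train(params, X_sample, y_sample)  # hypothetical helper
        if score > best_score:                       # track the optimal model found during training
            best_score, best_params, best_model = score, params, model
    return best_params, best_model

# Performance of the selected model would then be measured on held-out data,
# e.g. evaluate(best_model, X_test, y_test) with another hypothetical helper.
```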
In one embodiment of the present application, the prior processing of emotion recognition, such as feature extraction and filling, may be performed separately from training the machine learning model, and may accordingly be performed separately on different devices.
The method according to embodiments of the present application may also be implemented on a programmed processor. However, the controller, flow diagrams, modules, and the like may also be implemented in a general purpose or special purpose computer; a programmed microprocessor or microcontroller and peripheral integrated circuit elements; an integrated circuit; and hardwired electronic or logic circuitry such as discrete element circuitry, programmable logic devices, and the like. In general, any device having a state machine disposed therein that is capable of implementing the flowcharts shown in the figures may be used to implement the processor functions of the present application. For example, an embodiment of the present application provides an apparatus for recognizing emotion from speech, comprising a processor and a memory. Wherein computer programmable instructions for implementing a method of recognizing emotion from speech are stored in the memory and the processor is configured to execute the computer programmable instructions to implement the method of recognizing emotion from speech. The method for recognizing emotion from speech may be the method described above or other methods according to embodiments of the present application.
A preferred alternative embodiment of the present application is to implement the methods of the present application on a non-transitory, computer-readable storage medium having stored thereon computer-programmable instructions. The instructions are preferably executed by a computer-executable component, which is preferably integrated into a network security system. The non-transitory, computer-readable storage medium may be stored on any suitable computer-readable medium, such as RAMs, ROMs, flash memory, EEPROMs, optical storage devices (CD or DVD), hard drives, floppy drives, or any other suitable device. The computer-executable components are preferably processors, but the instructions may alternatively or additionally be executed by any suitable dedicated hardware device. For example, one embodiment of the present application provides a non-transitory, computer-readable storage medium having computer-programmable instructions stored therein. Wherein the computer programmable instructions are programmed to implement the foregoing or other methods of recognizing emotion from speech according to embodiments of the application.
Although the present application has been described with respect to specific embodiments, it is evident that many alternatives, modifications, and variations will be apparent to those skilled in the art; for example, various components and steps of the embodiments may be interchanged, added, or substituted in other embodiments, and not all of the components and steps of the various figures may be required for a disclosed embodiment. Those skilled in the art may make various substitutions and modifications based on the teachings and disclosures herein without departing from the spirit of the present application, such as implementing only the elements of the independent claims. Therefore, the protection scope of the present application should not be limited to the disclosed embodiments, but should include various alternatives and modifications without departing from the scope of the present application as covered by the claims of the present patent application.

Claims (36)

1. A method for recognizing emotion from speech, the method comprising:
receiving an audio signal;
performing data cleaning on the received audio signal;
dividing the cleaned audio signal into at least one segment;
performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment;
performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold of the feature matrix; and
performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
2. The method of claim 1, wherein the performing data cleansing on a received audio signal further comprises at least one of:
removing noise in the audio signal;
removing silence at the beginning and end of the audio signal based on a silence threshold; and
removing sound fragments in the audio signal that are shorter than a predefined threshold.
3. The method of claim 2, wherein the silence threshold is -50 dB.
4. The method of claim 2, wherein the predefined threshold is 1/4 seconds.
5. The method as claimed in claim 1, wherein the performing data cleaning on the received audio signal further comprises performing band-pass filtering on the received audio signal to control the frequency of the audio signal to be 100-400 kHz.
6. The method of claim 1, wherein the performing feature extraction on the at least one segment further comprises extracting at least one of speaker gender, loudness, normalized spectral envelope, power spectral analysis, perceptual half-width, emotion blocks, and pitch coefficients from the audio signal.
7. The method according to claim 1, wherein a size of a window used for extracting mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of said at least one segment is between 10-500 ms.
8. The method of claim 1, wherein the length threshold is not less than 1 second.
9. The method of claim 1, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, filling features extracted from subsequent segments into the feature matrix to expand the feature matrix.
10. The method of claim 1, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, the valid features in the feature matrix are replicated to expand the feature matrix.
11. The method of claim 9 or 10, further comprising skipping the performing feature filling when the length of the feature matrix reaches the length threshold.
12. The method of claim 1, wherein the performing machine learning inference on the feature matrix further comprises normalizing and scaling the feature matrix.
13. The method of claim 1, wherein the performing machine learning inference on the feature matrix further comprises feeding the feature matrix to a machine learning model.
14. The method of claim 13, wherein the machine learning model is a neural network.
15. The method of claim 1, further comprising training a machine learning model to perform the machine learning inference.
16. The method of claim 15, wherein the training a machine learning model comprises:
optimizing a plurality of model hyper-parameters;
selecting a set of model hyper-parameters from the optimized model hyper-parameters; and
measuring performance of the machine learning model using the selected set of model hyper-parameters.
17. The method of claim 16, wherein the optimizing a number of model hyper-parameters further comprises:
generating the plurality of model hyper-parameters;
training the machine learning model on sampled data using the number of model hyper-parameters; and
finding an optimal machine learning model during training of the machine learning model.
18. The method of claim 16, wherein the model hyper-parameter is a model shape.
19. The method of claim 1, wherein the performing machine learning inference on the feature matrix further comprises generating an emotion score for at least one of arousal, temperament, and valence.
20. The method of claim 19, wherein the performing machine learning inference on the feature matrix further comprises combining generated sentiment scores.
21. A device for recognizing emotion from speech, comprising:
a processor; and
a memory;
wherein computer programmable instructions for implementing a method of recognizing emotion from speech are stored in the memory, and the processor is configured to execute the computer programmable instructions to:
receiving an audio signal;
performing data cleaning on the received audio signal;
dividing the cleaned audio signal into at least one segment;
performing feature extraction on the at least one segment to extract a number of mel-frequency cepstral coefficients and a number of bark-frequency cepstral coefficients from the at least one segment;
performing feature filling to fill the number of mel-frequency cepstral coefficients and the number of bark-frequency cepstral coefficients into a feature matrix based on a length threshold of the feature matrix; and
performing machine learning inference on the feature matrix to identify an emotion indicated in the audio signal.
22. The device of claim 21, wherein the performing data cleansing on a received audio signal further comprises at least one of:
removing noise in the audio signal;
removing silence at the beginning and end of the audio signal based on a silence threshold; and
removing sound fragments in the audio signal that are shorter than a predefined threshold.
23. The device of claim 22, wherein the silence threshold is -50 dB.
24. The device of claim 22, wherein the predefined threshold is 1/4 seconds.
25. The device of claim 21, wherein a size of a window used to extract mel-frequency cepstral coefficients and bark-frequency cepstral coefficients from each of the at least one segment is between 10-500 ms.
26. The device of claim 21, wherein the length threshold is not less than 1 second.
27. The device of claim 21, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, filling features extracted from subsequent segments into the feature matrix to expand the feature matrix.
28. The device of claim 21, wherein the performing feature filling further comprises:
determining whether the length of the feature matrix reaches the length threshold;
when the length of the feature matrix does not reach the length threshold, calculating the amount of data that needs to be added to the feature matrix for the length of the feature matrix to reach the length threshold; and
based on the calculated amount of data, the valid features in the feature matrix are replicated to expand the feature matrix.
29. The device of claim 27 or 28, wherein the performing feature filling is skipped when the length of the feature matrix reaches the length threshold.
30. The device of claim 21, wherein the performing machine learning inference on the feature matrix further comprises normalizing and scaling the feature matrix.
31. The device of claim 21, wherein the performing machine learning inference on the feature matrix further comprises feeding the feature matrix to a machine learning model.
32. The device of claim 21, wherein the processor is further configured to train a machine learning model to perform the machine learning inference.
33. The device of claim 32, wherein the training machine learning model includes:
optimizing a plurality of model hyper-parameters;
selecting a set of model hyper-parameters from the optimized model hyper-parameters; and
measuring performance of the machine learning model using the selected set of model hyper-parameters.
34. The device of claim 33, wherein the optimizing a number of model hyper-parameters further comprises:
generating the plurality of model hyper-parameters;
training the machine learning model on sampled data using the number of model hyper-parameters; and
finding an optimal machine learning model during training of the machine learning model.
35. The device of claim 21, wherein the performing machine learning inference on the feature matrix further comprises generating an emotion score for at least one of arousal, temperament, and valence.
36. A non-transitory, computer-readable storage medium having stored therein computer programmable instructions, wherein the computer programmable instructions are programmed to implement the method of recognizing emotion from speech according to claim 1.
CN201711378503.2A 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech Active CN108091323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711378503.2A CN108091323B (en) 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711378503.2A CN108091323B (en) 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech

Publications (2)

Publication Number Publication Date
CN108091323A CN108091323A (en) 2018-05-29
CN108091323B true CN108091323B (en) 2020-10-13

Family

ID=62177341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711378503.2A Active CN108091323B (en) 2017-12-19 2017-12-19 Method and apparatus for emotion recognition from speech

Country Status (1)

Country Link
CN (1) CN108091323B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3729419A1 (en) * 2017-12-19 2020-10-28 Wonder Group Technologies Ltd. Method and apparatus for emotion recognition from speech
CN108806716A (en) * 2018-06-15 2018-11-13 想象科技(北京)有限公司 For the matched method and apparatus of computerization based on emotion frame
CN111143551A (en) * 2019-12-04 2020-05-12 支付宝(杭州)信息技术有限公司 Text preprocessing method, classification method, device and equipment
TWI807203B (en) * 2020-07-28 2023-07-01 華碩電腦股份有限公司 Voice recognition method and electronic device using the same
CN113192537B (en) * 2021-04-27 2024-04-09 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree acquisition method
CN113531806B (en) * 2021-06-28 2022-08-19 青岛海尔空调器有限总公司 Method and device for controlling air conditioner and air conditioner

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178897A (en) * 2007-12-05 2008-05-14 浙江大学 Speaking man recognizing method using base frequency envelope to eliminate emotion voice
CN101547261A (en) * 2008-03-27 2009-09-30 富士通株式会社 Association apparatus, association method, and recording medium
CN101599271A (en) * 2009-07-07 2009-12-09 华中科技大学 A kind of recognition methods of digital music emotion
CN103021406A (en) * 2012-12-18 2013-04-03 台州学院 Robust speech emotion recognition method based on compressive sensing
CN103258537A (en) * 2013-05-24 2013-08-21 安宁 Method utilizing characteristic combination to identify speech emotions and device thereof
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106503646A (en) * 2016-10-19 2017-03-15 竹间智能科技(上海)有限公司 Multi-modal emotion identification system and method
WO2017100334A1 (en) * 2015-12-07 2017-06-15 Sri International Vpa with integrated object recognition and facial expression recognition
CN106919251A (en) * 2017-01-09 2017-07-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10056076B2 (en) * 2015-09-06 2018-08-21 International Business Machines Corporation Covariance matrix estimation with structural-based priors for speech processing


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Extraction of novel features for emotion recognition; LI Xiang et al.; J Shanghai Univ (Engl Ed); 2011-12-31; Vol. 15, No. 5; pp. 479-486 *

Also Published As

Publication number Publication date
CN108091323A (en) 2018-05-29

Similar Documents

Publication Publication Date Title
CN108091323B (en) Method and apparatus for emotion recognition from speech
Bhavan et al. Bagged support vector machines for emotion recognition from speech
US20210118464A1 (en) Method and apparatus for emotion recognition from speech
Aloufi et al. Emotionless: Privacy-preserving speech analysis for voice assistants
CN111161752A (en) Echo cancellation method and device
US11842722B2 (en) Speech synthesis method and system
WO2022046526A1 (en) Synthesized data augmentation using voice conversion and speech recognition models
Javidi et al. Speech emotion recognition by using combinations of C5. 0, neural network (NN), and support vector machines (SVM) classification methods
Chenchah et al. A bio-inspired emotion recognition system under real-life conditions
Subhashree et al. Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling
CN113782032B (en) Voiceprint recognition method and related device
Revathy et al. Performance comparison of speaker and emotion recognition
JP6373621B2 (en) Speech evaluation device, speech evaluation method, program
CN113539243A (en) Training method of voice classification model, voice classification method and related device
Zouhir et al. A bio-inspired feature extraction for robust speech recognition
Koolagudi et al. Recognition of emotions from speech using excitation source features
CN113112992A (en) Voice recognition method and device, storage medium and server
WO2021152566A1 (en) System and method for shielding speaker voice print in audio signals
Chen et al. CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application
JP2016186516A (en) Pseudo-sound signal generation device, acoustic model application device, pseudo-sound signal generation method, and program
KR102415519B1 (en) Computing Detection Device for AI Voice
KR20230148048A (en) Method and system for synthesizing emotional speech based on emotion prediction
CN110895941A (en) Voiceprint recognition method and device and storage device
CN113129926A (en) Voice emotion recognition model training method, voice emotion recognition method and device
Fahmeeda et al. Voice Based Gender Recognition Using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant