CN109256150B - Speech emotion recognition system and method based on machine learning - Google Patents

Speech emotion recognition system and method based on machine learning

Info

Publication number
CN109256150B
CN109256150B (application CN201811186572.8A)
Authority
CN
China
Prior art keywords
module
different
model
segments
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811186572.8A
Other languages
Chinese (zh)
Other versions
CN109256150A (en)
Inventor
徐心
胡宇澄
王麒铭
饶鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tranzvision Consulting Co ltd
Original Assignee
Tranzvision Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tranzvision Consulting Co ltd filed Critical Tranzvision Consulting Co ltd
Priority to CN201811186572.8A priority Critical patent/CN109256150B/en
Publication of CN109256150A publication Critical patent/CN109256150A/en
Application granted granted Critical
Publication of CN109256150B publication Critical patent/CN109256150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition system and method based on machine learning. The system comprises a recording noise reduction module; a sentence segmentation module for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features; a speaker recognition module for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification; a feature extraction module for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing; and an emotion recognition module for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm. The beneficial effect of the invention is that it effectively achieves good performance in the Chinese language environment and in the actual production environment of customer-service telephone calls.

Description

Speech emotion recognition system and method based on machine learning
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition system and method based on machine learning.
Background
Work in the field of speech recognition can be roughly divided into two tasks: one converts the content expressed in speech audio into text, and the other recognizes the emotion contained in the audio (such as anger or calmness). Speech emotion recognition has been addressed in foreign literature, but with significant limitations: most existing work targets languages other than Chinese and cannot be applied directly to speech in a Chinese-language environment, and most of it uses a single algorithm to recognize emotion, so the recognition results are close to laboratory data but unsatisfactory in actual production and cannot meet the requirements of a Chinese production environment.
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
In view of the technical problems in the related art, the invention provides a speech emotion recognition system and method based on machine learning that solve the problem of emotion recognition from Chinese audio recordings, in scenarios including but not limited to incoming and outgoing customer-service calls.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
A speech emotion recognition system based on machine learning comprises a recording noise reduction module, a sentence segmentation module, a speaker recognition module, a feature extraction module and an emotion recognition module, wherein
the recording noise reduction module is used for acquiring recording data and performing noise-reduction preprocessing on the recording data with a related algorithm;
the sentence segmentation module is used for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features;
the speaker recognition module is used for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification;
the feature extraction module is used for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
and the emotion recognition module is used for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm.
Further, the noise-reduction preprocessing in the recording noise reduction module comprises:
the training and learning module, used for inputting lossy data and training and learning on the lossy data;
the output module, used for taking the corresponding lossless data as the output of the deep learning algorithm;
and the first processing module, used for processing other lossy data with the trained model.
Further, the speaker recognition module comprises:
the classification module, used for dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
and the first integration module, used for integrating the classification results of the models and labelling the speech segments of the different speakers in real time and in batches.
Further, the feature extraction module comprises:
the extraction module, used for extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
the second processing module, used for processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
and the conversion module, used for extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
Further, the emotion recognition module comprises:
the input module, used for feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
the second integration module, used for integrating the different prediction results obtained after each model has been trained;
and the prediction module, used for obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and for predicting the emotion of other unknown segments with the final overall model.
In another aspect of the present invention, a speech emotion recognition method based on machine learning is provided, which comprises the following steps:
S1, acquiring recording data, and performing noise-reduction preprocessing on the recording data with a related algorithm;
S2, receiving the recording data transmitted by the recording noise reduction module, and cutting the recording data into segments according to phonetic features;
S3, receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm, and recognizing the speakers according to the classification;
S4, receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
S5, receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms, and integrating the prediction results of each model with an ensemble algorithm.
Further, the noise-reduction preprocessing in step S1 comprises:
S11, inputting lossy data, and training and learning on the lossy data;
S12, taking the corresponding lossless data as the output of the deep learning algorithm;
S13, processing other lossy data with the trained model.
Further, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification in step S3 comprises:
S31, dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
S32, integrating the classification results of the models.
Further, deriving segment features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41, extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
S42, processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
S43, extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
Further, training the emotion prediction models and integrating the prediction results of each model with an ensemble algorithm in step S5 comprises:
S51, feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
S52, integrating the different prediction results obtained after each model has been trained;
S53, obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and predicting the emotion of other unknown segments with the final overall model.
The beneficial effects of the invention are as follows: the system is designed and built for a Chinese language environment and an actual production environment, so it effectively achieves good performance in the Chinese language environment and in the actual production environment of customer-service calls; the required recordings can be retrieved efficiently and quickly according to emotion, which improves the efficiency of work including, but not limited to, telephone quality inspection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a speech emotion recognition system based on machine learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speech emotion recognition method based on machine learning according to an embodiment of the present invention;
FIG. 3 is a second flowchart of the speech emotion recognition method based on machine learning according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in FIG. 1, the speech emotion recognition system based on machine learning according to an embodiment of the present invention includes a recording noise reduction module, a sentence segmentation module, a speaker recognition module, a feature extraction module and an emotion recognition module. The recording noise reduction module, the sentence segmentation module and the feature extraction module perform data preprocessing: they provide the basis for prediction, improve the accuracy and stability of the prediction process, and supply the features used for prediction. The speaker recognition module and the emotion recognition module perform prediction: they use the segments and features obtained by the data preprocessing to predict the speaker and the emotion of each segment.
The recording noise reduction module is used for acquiring recording data and performing noise-reduction preprocessing on the recording data with a related algorithm.
Specifically, the recording noise reduction module processes the recording with a related algorithm to handle different types of noise such as environmental noise, so as to reduce or eliminate the influence of noise on speech recognition and enhance the performance of the other modules of the system; the denoised recording is then transmitted to the sentence segmentation module.
The sentence segmentation module is used for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features. The sentence segmentation module includes:
a second classification module, used for dividing the different speech frames into two classes, human voice and non-human voice, with a clustering method from machine learning;
and an aggregation module, used for aggregating the classified speech frames into human-voice and non-human-voice segments with a rule-based method.
Specifically, the short speech frames are clustered according to the phonetic differences between human voice and non-human voice; the classified speech frames are then aggregated by rules, and similar speech frames that are consecutive in time are grouped into the same segment, which ensures that one utterance of one speaker is contained in a single segment and that different speakers fall into different segments. Finally, the generated segments are transmitted to the speaker recognition module and the feature extraction module, respectively.
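The patent provides no code for this step; the following is a minimal illustrative sketch of the idea, assuming librosa for frame-level features and scikit-learn KMeans for the two-way voice/non-voice clustering. The function name, feature choice and thresholds are assumptions, not part of the original disclosure.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def split_into_segments(wav_path, frame_len=2048, hop=512, min_frames=10):
    """Cluster short frames into voice / non-voice, then merge runs of
    consecutive voice frames into sentence-like segments (rule-based step)."""
    y, sr = librosa.load(wav_path, sr=None)
    # Per-frame features: RMS energy and spectral centroid roughly separate
    # speech from silence/background noise in call recordings.
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_len, hop_length=hop)[0]
    feats = np.stack([rms, centroid], axis=1)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    # Assume the cluster with higher mean energy corresponds to human voice.
    voice_label = int(np.argmax([rms[labels == k].mean() for k in (0, 1)]))

    segments, start = [], None
    for i, lab in enumerate(labels):
        if lab == voice_label and start is None:
            start = i
        elif lab != voice_label and start is not None:
            if i - start >= min_frames:          # rule: drop very short blips
                segments.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, len(labels) * hop / sr))
    return segments  # list of (start_sec, end_sec) voice segments
```

In practice the rule-based aggregation could also merge voice runs separated by very short pauses, so that one utterance is never split across segments.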
The speaker recognition module is used for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification.
Specifically, a machine learning algorithm is used to divide all segments cut from one recording into two or more categories; two or more speakers are identified from these categories, and each sentence-length segment is assigned to one of the speakers.
The feature extraction module is used for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing.
Specifically, the feature extraction module is a preliminary step of emotion recognition. For each short segment it extracts indexes commonly used in the field of speech recognition (audio to text), namely Mel-Frequency Cepstral Coefficients (MFCCs), and performs related processing on them, for example by deriving further features or adding other features. Because the MFCCs still contain a large amount of feature information, statistical features such as the mean and standard deviation can be further extracted from them; these statistical features supplement the MFCCs and improve the performance of the emotion prediction model. The features extracted in this way represent a segment and are transmitted to the emotion recognition module.
The emotion recognition module is used for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm.
Specifically, first, emotion prediction models are trained with the features extracted by the feature extraction module and related machine learning and deep learning algorithms, such as Convolutional Neural Networks (CNN) and Support Vector Machines (SVM), using speech segments close to the Chinese language environment and a training method oriented towards production-level application. Second, the trained models predict the features of unknown segments to obtain the emotion each segment represents; the predictions of the different models on the same recording segment may not be identical. Finally, the prediction results of the models are integrated with an ensemble algorithm to obtain the final emotion of each recording segment.
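A minimal sketch of training several heterogeneous emotion classifiers on the segment features and integrating their predictions by voting is given below, using scikit-learn. The SVM, logistic regression and random forest estimators stand in for the mix of models mentioned above (a CNN is omitted for brevity); the label set and data names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_emotion_ensemble():
    """Several different classifiers trained on the same segment features,
    integrated by soft voting over their predicted class probabilities."""
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    return VotingClassifier(
        estimators=[("svm", svm), ("logreg", logreg), ("rf", forest)],
        voting="soft",
    )

# X_train: (n_segments, n_features) segment features; y_train: emotion labels
# ("angry", "calm", ...).  X_new holds features of segments with unknown emotion.
# ensemble = build_emotion_ensemble().fit(X_train, y_train)
# predicted_emotions = ensemble.predict(X_new)
```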
In an embodiment of the present invention, the noise-reduction preprocessing in the recording noise reduction module comprises:
the training and learning module, used for inputting lossy data and training and learning on the lossy data;
the output module, used for taking the corresponding lossless data as the output of the deep learning algorithm;
and the first processing module, used for processing other lossy data with the trained model.
Specifically, the noise reduction method uses a DAE (Denoising AutoEncoder), a deep learning algorithm, whose principle is as follows: lossy data (which can take various forms) is given as input, training and learning are carried out on the lossy data with the corresponding lossless data as the target output of the deep learning algorithm, and other lossy data is then processed with the trained model.
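A minimal sketch of such a denoising autoencoder is shown below in PyTorch, trained to map noisy feature frames (for example magnitude-spectrum slices) to their clean counterparts. The network size, feature dimension and training loop are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps a lossy (noisy) feature frame to its lossless (clean) counterpart;
    trained on paired noisy/clean frames as described above."""
    def __init__(self, n_features=257, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_features), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, noisy, clean, epochs=50, lr=1e-3):
    """noisy/clean: float tensors of shape (n_frames, n_features)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(noisy), clean)   # lossy input, lossless target
        loss.backward()
        opt.step()
    return model

# dae = train_dae(DenoisingAutoencoder(), noisy_frames, clean_frames)
# denoised = dae(other_noisy_frames)        # process other lossy data
```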
In one embodiment of the present invention, the speaker recognition module comprises:
the classification module, used for dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
and the first integration module, used for integrating the classification results of the models and labelling the speech segments of the different speakers in real time and in batches.
Specifically, because different speakers differ in their voice characteristics and indexes, the different segments and speech frames in one recording are divided into two or more classes with different modeling methods such as clustering, Gaussian mixture models and CNN (convolutional neural networks), and the classification results of the models are integrated, so that the speech segments of the different speakers are labelled in real time and in batches. For example, in an insurance call the agent and the customer can be distinguished by this method. In addition, for some application scenarios the system can sample and model the speech of a known speaker in advance, and applying the pre-trained speaker model further improves the accuracy of speaker recognition; for example, a model pre-trained on samples of the customer-service agent's voice improves the recognition of the unknown customer.
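One of the modeling methods named above, the Gaussian mixture model, is sketched below with scikit-learn: segment-level feature vectors (for example the MFCC means/stds from the earlier sketch) are clustered into two speakers. The function and argument names are illustrative, not from the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def assign_speakers(segment_feats, n_speakers=2):
    """segment_feats: (n_segments, n_features) array of per-segment features.
    Returns one speaker id (0 .. n_speakers-1) per segment."""
    gmm = GaussianMixture(n_components=n_speakers,
                          covariance_type="diag", random_state=0)
    return gmm.fit_predict(segment_feats)

# Example: decide which of two speakers produced each segment of one call.
# speaker_ids = assign_speakers(np.vstack(per_segment_feature_vectors))
```

With pre-collected agent samples, a separate GaussianMixture could instead be fitted on the agent's segments alone and used to score unknown segments, in the spirit of the pre-trained speaker model mentioned above.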
In a specific embodiment of the present invention, the feature extraction module includes:
the extraction module, used for extracting a variety of different feature indexes for each model according to the modeling requirements of that model, the feature indexes being Mel-frequency cepstral coefficients;
the second processing module, used for processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
and the conversion module, used for extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
Specifically, the models of the system select and extract different feature indexes according to their modeling requirements: the classification models adopt the widely used MFCC extraction method and further process the different dimensional indexes of the MFCCs to generate more effective recognition features, while models such as CNN extract graphical features from the spectrogram generated by converting the original speech signal.
For example, the feature extraction module performs pre-emphasis on the original recording signal, divides the complete recording signal into frames (each frame representing a short stretch of the original recording), multiplies each frame by a Hamming window, applies a fast Fourier transform to obtain the energy spectrum of each frame, passes the energy spectrum through a triangular filter bank, and applies a Discrete Cosine Transform (DCT) to the logarithmic energy output by each triangular filter to obtain features of the speech segment such as the MFCC coefficients.
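The following sketch mirrors these steps (pre-emphasis, framing, Hamming window, FFT, triangular mel filterbank, log, DCT) in NumPy/SciPy, borrowing only the mel filterbank from librosa. Frame length, hop and coefficient counts are assumed values, not taken from the patent.

```python
import numpy as np
import librosa
from scipy.fft import dct

def mfcc_from_signal(y, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_mfcc=13):
    """Pre-emphasis -> framing -> Hamming window -> FFT power spectrum ->
    triangular mel filterbank -> log -> DCT, yielding MFCCs per frame."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                    # pre-emphasis
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)                       # framing + Hamming window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft       # energy spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # triangular filter bank
    log_energy = np.log(power @ mel_fb.T + 1e-10)                 # log filterbank energies
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # DCT -> MFCC
```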
In a specific embodiment of the present invention, the emotion recognition module includes:
the input module, used for feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
the second integration module, used for integrating the different prediction results obtained after each model has been trained;
and the prediction module, used for obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and for predicting the emotion of other unknown segments with the final overall model.
Specifically, in the emotion recognition module, emotion labels of historical samples serve as training samples. The various features extracted in the previous steps are fed into different machine learning algorithm models for training and learning. Different algorithm models learn and recognize different features, and their recognition and prediction performance also differs, so the system integrates the prediction results obtained after the models have been trained. By adjusting the integration weights, the target evaluation function and other settings of the different models, an overall model with the best prediction performance, in terms of metrics such as accuracy and recall, is obtained for the given application scenario, and this final model is then used to predict the emotion of other unknown segments.
For example, the speech segments in different calls represent the emotions of different customers and agents. The speech segments are processed according to the method of the feature extraction module, the segment samples are fed into machine learning algorithm models (for example, support vector machines) for training and integration, and the resulting model is used to predict other speech segments whose emotion is unknown.
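A sketch of choosing the integration weights against a target evaluation function is shown below: the per-model class probabilities on a validation set are mixed with a coarse grid of weights, and the combination that maximizes recall of one target emotion is kept. All names, the grid step and the choice of recall as the target metric are assumptions for illustration.

```python
import itertools
import numpy as np
from sklearn.metrics import recall_score

def tune_ensemble_weights(val_probas, y_val, classes, target_class="angry", step=0.1):
    """val_probas: list of (n_samples, n_classes) probability matrices, one per
    trained model. Returns the convex weight combination that maximises recall
    of the target emotion on the validation set."""
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_recall = None, -1.0
    for w in itertools.product(grid, repeat=len(val_probas)):
        if abs(sum(w) - 1.0) > 1e-6:
            continue                                    # only convex combinations
        mixed = sum(wi * p for wi, p in zip(w, val_probas))
        pred = np.array(classes)[mixed.argmax(axis=1)]
        r = recall_score(y_val, pred, labels=[target_class], average="macro")
        if r > best_recall:
            best_w, best_recall = w, r
    return best_w, best_recall
```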
As shown in FIG. 2 and FIG. 3, in another aspect of the present invention, a speech emotion recognition method based on machine learning is provided, which comprises the following steps:
S1, acquiring recording data, and performing noise-reduction preprocessing on the recording data with a related algorithm;
S2, receiving the recording data transmitted by the recording noise reduction module, and cutting the recording data into segments according to phonetic features, wherein cutting the recording data into segments according to phonetic features further comprises the steps of: S21, dividing the different speech frames into two classes, human voice and non-human voice, with a clustering method from machine learning; S22, aggregating the classified speech frames into human-voice and non-human-voice segments with a rule-based method.
Specifically, the short speech frames are clustered according to the phonetic differences between human voice and non-human voice; the classified speech frames are then aggregated by rules, and similar speech frames that are consecutive in time are grouped into the same segment, which ensures that one utterance of one speaker is contained in a single segment and that different speakers fall into different segments. Finally, the generated segments are transmitted to the speaker recognition module and the feature extraction module, respectively.
S3, receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm, and recognizing the speakers according to the classification;
S4, receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
S5, receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms, and integrating the prediction results of each model with an ensemble algorithm.
In an embodiment of the present invention, the noise-reduction preprocessing in step S1 comprises:
S11, inputting lossy data, and training and learning on the lossy data;
S12, taking the corresponding lossless data as the output of the deep learning algorithm;
S13, processing other lossy data with the trained model.
In an embodiment of the present invention, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification in step S3 comprises:
S31, dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
S32, integrating the classification results of the models.
In an embodiment of the invention, deriving segment features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41, extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
S42, processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
S43, extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
In an embodiment of the present invention, training the emotion prediction models and integrating the prediction results of each model with the ensemble algorithm in step S5 comprises:
S51, feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
S52, integrating the different prediction results obtained after each model has been trained;
S53, obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and predicting the emotion of other unknown segments with the final overall model.
In order to facilitate understanding of the above technical solution of the present invention, the technical solution is described in detail below from the perspective of specific use.
In actual use, the machine learning based speech emotion recognition system applies noise-reduction preprocessing to real telephone recording data with a related algorithm and then, according to the clear difference between the human-voice and non-human-voice parts, cuts the speech into several sentence-length segments. Because the segments in real recording data contain two or more speakers, the speakers are distinguished by the different voice characteristics of each person. On the basis of the sentence segmentation module, each segment is represented by its Mel-frequency cepstral coefficients, which are sent to the emotion recognition module, and the emotion represented by each segment is then recognized from these coefficients.
As the model is deployed and continuously applied, the system keeps accumulating new emotion samples that were predicted correctly or incorrectly during actual use, and it continuously optimizes the parameters, structure and other aspects of the model through methods such as incremental learning, so that the model keeps improving and achieves a better application effect.
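As an illustrative sketch of the incremental-learning idea (the patent does not name a specific algorithm), an estimator supporting partial_fit, here scikit-learn's SGDClassifier, is used as a stand-in for the deployed emotion model; the label set and function names are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A linear classifier that supports incremental updates via partial_fit,
# standing in for the deployed emotion model.
emotion_classes = np.array(["angry", "calm", "happy"])   # illustrative label set
model = SGDClassifier(random_state=0)

def update_with_new_samples(model, new_feats, new_labels):
    """Fold newly verified production samples (correctly and incorrectly
    predicted, after review) back into the model without full retraining."""
    model.partial_fit(new_feats, new_labels, classes=emotion_classes)
    return model
```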
In conclusion, with the technical solution of the present invention, the system is designed and built for the Chinese language environment and the actual production environment, so it effectively achieves good performance in the Chinese language environment and in the actual production environment of customer-service calls; the required recordings can be retrieved efficiently and quickly according to emotion, which improves the efficiency of work including, but not limited to, telephone quality inspection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (2)

1. A speech emotion recognition system based on machine learning, characterized by comprising a recording noise reduction module, a sentence segmentation module, a speaker recognition module, a feature extraction module and an emotion recognition module, wherein
the recording noise reduction module is used for acquiring recording data and performing noise-reduction preprocessing on the recording data with a related algorithm;
the sentence segmentation module is used for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features;
the speaker recognition module is used for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification;
the feature extraction module is used for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
the emotion recognition module is used for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm;
the sentence segmentation module comprises: a second classification module for dividing the different speech frames into two classes, human voice and non-human voice, with a clustering method from machine learning; and an aggregation module for aggregating the classified speech frames into human-voice and non-human-voice segments with a rule-based method;
the noise-reduction preprocessing in the recording noise reduction module comprises:
the training and learning module, used for inputting lossy data and training and learning on the lossy data;
the output module, used for taking the corresponding lossless data as the output of the deep learning algorithm;
the first processing module, used for processing other lossy data with the trained model;
the speaker recognition module comprises:
the classification module, used for dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
the first integration module, used for integrating the classification results of the models and labelling the speech segments of the different speakers in real time and in batches;
the feature extraction module comprises:
the extraction module, used for extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
the second processing module, used for processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
the conversion module, used for extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it;
the emotion recognition module comprises:
the input module, used for feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
the second integration module, used for integrating the different prediction results obtained after each model has been trained;
and the prediction module, used for obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and for predicting the emotion of other unknown segments with the final overall model.
2. A speech emotion recognition method based on machine learning, characterized by comprising the following steps:
S1, acquiring recording data, and performing noise-reduction preprocessing on the recording data with a related algorithm;
S2, receiving the recording data transmitted by the recording noise reduction module, and cutting the recording data into segments according to phonetic features;
S3, receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm, and recognizing the speakers according to the classification;
S4, receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
S5, receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms, and integrating the prediction results of each model with an ensemble algorithm,
wherein the noise-reduction preprocessing in step S1 comprises:
S11, inputting lossy data, and training and learning on the lossy data;
S12, taking the corresponding lossless data as the output of the deep learning algorithm;
S13, processing other lossy data with the trained model;
classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification in step S3 comprises:
S31, dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
S32, integrating the classification results of the models;
deriving segment features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41, extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
S42, processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
S43, extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it;
training the emotion prediction models and integrating the prediction results of each model with the ensemble algorithm in step S5 comprises:
S51, feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
S52, integrating the different prediction results obtained after each model has been trained;
S53, obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and predicting the emotion of other unknown segments with the final overall model.
CN201811186572.8A 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning Active CN109256150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Publications (2)

Publication Number Publication Date
CN109256150A CN109256150A (en) 2019-01-22
CN109256150B true CN109256150B (en) 2021-11-30

Family

ID=65045954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811186572.8A Active CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Country Status (1)

Country Link
CN (1) CN109256150B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697982A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of speaker speech recognition system in instruction scene
CN110211563B (en) * 2019-06-19 2024-05-24 平安科技(深圳)有限公司 Chinese speech synthesis method, device and storage medium for scenes and emotion
CN112151042A (en) * 2019-06-27 2020-12-29 中国电信股份有限公司 Voiceprint recognition method, device and system and computer readable storage medium
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110517694A (en) * 2019-09-06 2019-11-29 北京清帆科技有限公司 A kind of teaching scene voice conversion detection system
CN110933235B (en) * 2019-11-06 2021-07-27 杭州哲信信息技术有限公司 Noise identification method in intelligent calling system based on machine learning
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
US11670286B2 (en) * 2019-12-31 2023-06-06 Beijing Didi Infinity Technology And Development Co., Ltd. Training mechanism of verbal harassment detection systems
US20210201893A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Pattern-based adaptation model for detecting contact information requests in a vehicle
US11664043B2 (en) * 2019-12-31 2023-05-30 Beijing Didi Infinity Technology And Development Co., Ltd. Real-time verbal harassment detection system
CN111583967A (en) * 2020-05-14 2020-08-25 西安医学院 Mental health emotion recognition device based on utterance model and operation method thereof
CN111681681A (en) * 2020-05-22 2020-09-18 深圳壹账通智能科技有限公司 Voice emotion recognition method and device, electronic equipment and storage medium
CN111833653A (en) * 2020-07-13 2020-10-27 江苏理工学院 Driving assistance system, method, device, and storage medium using ambient noise
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN108074576A (en) * 2017-12-14 2018-05-25 讯飞智元信息科技有限公司 Inquest the speaker role's separation method and system under scene
CN108259686A (en) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 A kind of customer service system based on speech analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180018969A1 (en) * 2016-07-15 2018-01-18 Circle River, Inc. Call Forwarding to Unavailable Party Based on Artificial Intelligence

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN108074576A (en) * 2017-12-14 2018-05-25 讯飞智元信息科技有限公司 Inquest the speaker role's separation method and system under scene
CN108259686A (en) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 A kind of customer service system based on speech analysis

Also Published As

Publication number Publication date
CN109256150A (en) 2019-01-22

Similar Documents

Publication Publication Date Title
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN108900725B (en) Voiceprint recognition method and device, terminal equipment and storage medium
CN105632501B (en) A kind of automatic accent classification method and device based on depth learning technology
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
CN105810205A (en) Speech processing method and device
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Joshi et al. Speech emotion recognition: a review
CN116665676B (en) Semantic recognition method for intelligent voice outbound system
CN114818649A (en) Service consultation processing method and device based on intelligent voice interaction technology
Gupta et al. Speech feature extraction and recognition using genetic algorithm
CN113889090A (en) Multi-language recognition model construction and training method based on multi-task learning
Reddy et al. Audio compression with multi-algorithm fusion and its impact in speech emotion recognition
CN114420169B (en) Emotion recognition method and device and robot
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
CN113744742A (en) Role identification method, device and system in conversation scene
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN110299133A (en) The method for determining illegally to broadcast based on keyword
Singh et al. A comparative study on feature extraction techniques for language identification
Camarena-Ibarrola et al. Speaker identification using entropygrams and convolutional neural networks
Bora et al. Speaker identification for biometric access control using hybrid features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant