CN109256150B - Speech emotion recognition system and method based on machine learning - Google Patents

Speech emotion recognition system and method based on machine learning

Info

Publication number
CN109256150B
CN109256150B (application CN201811186572.8A)
Authority
CN
China
Prior art keywords
module
different
model
segments
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811186572.8A
Other languages
Chinese (zh)
Other versions
CN109256150A (en)
Inventor
徐心
胡宇澄
王麒铭
饶鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tranzvision Consulting Co ltd
Original Assignee
Tranzvision Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tranzvision Consulting Co ltd filed Critical Tranzvision Consulting Co ltd
Priority to CN201811186572.8A priority Critical patent/CN109256150B/en
Publication of CN109256150A publication Critical patent/CN109256150A/en
Application granted granted Critical
Publication of CN109256150B publication Critical patent/CN109256150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition system and method based on machine learning. The system comprises a recording noise reduction module; a sentence segmentation module for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features; a speaker recognition module for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification; a feature extraction module for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing; and an emotion recognition module for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm. The beneficial effect of the invention is that it effectively achieves good performance in the Chinese language environment and in the actual production environment of customer-service telephone calls.

Description

Speech emotion recognition system and method based on machine learning
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition system and method based on machine learning.
Background
Work in the field of speech recognition can be roughly divided into two tasks: one converts the content expressed in speech audio into text, and the other recognizes the emotion contained in the audio (such as anger or calmness). Speech emotion recognition has been addressed in foreign literature, but with significant limitations: most existing work targets languages other than Chinese and cannot be applied directly to speech in a Chinese-language environment, and most of it uses a single algorithm to recognize emotion, so the recognition results are close to laboratory data but unsatisfactory in actual production and cannot meet the requirements of a Chinese production environment.
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
In view of the technical problems in the related art, the invention provides a speech emotion recognition system and method based on machine learning that solve the problem of emotion recognition from Chinese audio recordings, in scenarios including but not limited to incoming and outgoing customer-service calls.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
A speech emotion recognition system based on machine learning comprises a recording noise reduction module, a sentence segmentation module, a speaker recognition module, a feature extraction module and an emotion recognition module, wherein
the recording noise reduction module is used for acquiring recording data and performing noise-reduction preprocessing on the recording data with a related algorithm;
the sentence segmentation module is used for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features;
the speaker recognition module is used for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification;
the feature extraction module is used for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
and the emotion recognition module is used for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm.
Further, the noise-reduction preprocessing in the recording noise reduction module comprises:
the training and learning module, used for inputting lossy data and training and learning on the lossy data;
the output module, used for taking the corresponding lossless data as the output of the deep learning algorithm;
and the first processing module, used for processing other lossy data with the trained model.
Further, the speaker recognition module comprises:
the classification module, used for dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
and the first integration module, used for integrating the classification results of the models and labelling the speech segments of the different speakers in real time and in batches.
Further, the feature extraction module comprises:
the extraction module, used for extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
the second processing module, used for processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
and the conversion module, used for extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
Further, the emotion recognition module comprises:
the input module, used for feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
the second integration module, used for integrating the different prediction results obtained after each model has been trained;
and the prediction module, used for obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and for predicting the emotion of other unknown segments with the final overall model.
In another aspect of the present invention, a speech emotion recognition method based on machine learning is provided, which comprises the following steps:
S1, acquiring recording data, and performing noise-reduction preprocessing on the recording data with a related algorithm;
S2, receiving the recording data transmitted by the recording noise reduction module, and cutting the recording data into segments according to phonetic features;
S3, receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm, and recognizing the speakers according to the classification;
S4, receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
S5, receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms, and integrating the prediction results of each model with an ensemble algorithm.
Further, the noise-reduction preprocessing in step S1 comprises:
S11, inputting lossy data, and training and learning on the lossy data;
S12, taking the corresponding lossless data as the output of the deep learning algorithm;
S13, processing other lossy data with the trained model.
Further, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification in step S3 comprises:
S31, dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
S32, integrating the classification results of the models.
Further, deriving segment features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41, extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
S42, processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
S43, extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
Further, training the emotion prediction models and integrating the prediction results of each model with an ensemble algorithm in step S5 comprises:
S51, feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
S52, integrating the different prediction results obtained after each model has been trained;
S53, obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and predicting the emotion of other unknown segments with the final overall model.
The beneficial effects of the invention are as follows: the system is designed and built for a Chinese language environment and an actual production environment, so it effectively achieves good performance in the Chinese language environment and in the actual production environment of customer-service calls; the required recordings can be retrieved efficiently and quickly according to emotion, which improves the efficiency of work including, but not limited to, telephone quality inspection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a speech emotion recognition system based on machine learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a speech emotion recognition method based on machine learning according to an embodiment of the present invention;
FIG. 3 is a second flowchart of the speech emotion recognition method based on machine learning according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
As shown in FIG. 1, the speech emotion recognition system based on machine learning according to an embodiment of the present invention includes a recording noise reduction module, a sentence segmentation module, a speaker recognition module, a feature extraction module and an emotion recognition module. The recording noise reduction module, the sentence segmentation module and the feature extraction module perform data preprocessing: they provide the basis for prediction, improve the accuracy and stability of the prediction process, and supply the features used for prediction. The speaker recognition module and the emotion recognition module perform prediction: they use the segments and features obtained by the data preprocessing to predict the speaker and the emotion of each segment.
The recording noise reduction module is used for acquiring recording data and performing noise-reduction preprocessing on the recording data with a related algorithm.
Specifically, the recording noise reduction module processes the recording with a related algorithm to handle different types of noise such as environmental noise, so as to reduce or eliminate the influence of noise on speech recognition and enhance the performance of the other modules of the system; the denoised recording is then transmitted to the sentence segmentation module.
The sentence segmentation module is used for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features. The sentence segmentation module includes:
a second classification module, used for dividing the different speech frames into two classes, human voice and non-human voice, with a clustering method from machine learning;
and an aggregation module, used for aggregating the classified speech frames into human-voice and non-human-voice segments with a rule-based method.
Specifically, the short speech frames are clustered according to the phonetic differences between human voice and non-human voice; the classified speech frames are then aggregated by rules, and similar speech frames that are consecutive in time are grouped into the same segment, which ensures that one utterance of one speaker is contained in a single segment and that different speakers fall into different segments. Finally, the generated segments are transmitted to the speaker recognition module and the feature extraction module, respectively.
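The patent provides no code for this step; the following is a minimal illustrative sketch of the idea, assuming librosa for frame-level features and scikit-learn KMeans for the two-way voice/non-voice clustering. The function name, feature choice and thresholds are assumptions, not part of the original disclosure.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def split_into_segments(wav_path, frame_len=2048, hop=512, min_frames=10):
    """Cluster short frames into voice / non-voice, then merge runs of
    consecutive voice frames into sentence-like segments (rule-based step)."""
    y, sr = librosa.load(wav_path, sr=None)
    # Per-frame features: RMS energy and spectral centroid roughly separate
    # speech from silence/background noise in call recordings.
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop)[0]
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_len, hop_length=hop)[0]
    feats = np.stack([rms, centroid], axis=1)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    # Assume the cluster with higher mean energy corresponds to human voice.
    voice_label = int(np.argmax([rms[labels == k].mean() for k in (0, 1)]))

    segments, start = [], None
    for i, lab in enumerate(labels):
        if lab == voice_label and start is None:
            start = i
        elif lab != voice_label and start is not None:
            if i - start >= min_frames:          # rule: drop very short blips
                segments.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:
        segments.append((start * hop / sr, len(labels) * hop / sr))
    return segments  # list of (start_sec, end_sec) voice segments
```

In practice the rule-based aggregation could also merge voice runs separated by very short pauses, so that one utterance is never split across segments.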
The speaker recognition module is used for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification.
Specifically, a machine learning algorithm is used to divide all segments cut from one recording into two or more categories; two or more speakers are identified from these categories, and each sentence-length segment is assigned to one of the speakers.
The feature extraction module is used for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing.
Specifically, the feature extraction module is a preliminary step of emotion recognition. For each short segment it extracts indexes commonly used in the field of speech recognition (audio to text), namely Mel-Frequency Cepstral Coefficients (MFCCs), and performs related processing on them, for example by deriving further features or adding other features. Because the MFCCs still contain a large amount of feature information, statistical features such as the mean and standard deviation can be further extracted from them; these statistical features supplement the MFCCs and improve the performance of the emotion prediction model. The features extracted in this way represent a segment and are transmitted to the emotion recognition module.
The emotion recognition module is used for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm.
Specifically, first, emotion prediction models are trained with the features extracted by the feature extraction module and related machine learning and deep learning algorithms, such as Convolutional Neural Networks (CNN) and Support Vector Machines (SVM), using speech segments close to the Chinese language environment and a training method oriented towards production-level application. Second, the trained models predict the features of unknown segments to obtain the emotion each segment represents; the predictions of the different models on the same recording segment may not be identical. Finally, the prediction results of the models are integrated with an ensemble algorithm to obtain the final emotion of each recording segment.
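A minimal sketch of training several heterogeneous emotion classifiers on the segment features and integrating their predictions by voting is given below, using scikit-learn. The SVM, logistic regression and random forest estimators stand in for the mix of models mentioned above (a CNN is omitted for brevity); the label set and data names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_emotion_ensemble():
    """Several different classifiers trained on the same segment features,
    integrated by soft voting over their predicted class probabilities."""
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    return VotingClassifier(
        estimators=[("svm", svm), ("logreg", logreg), ("rf", forest)],
        voting="soft",
    )

# X_train: (n_segments, n_features) segment features; y_train: emotion labels
# ("angry", "calm", ...).  X_new holds features of segments with unknown emotion.
# ensemble = build_emotion_ensemble().fit(X_train, y_train)
# predicted_emotions = ensemble.predict(X_new)
```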
In an embodiment of the present invention, the noise-reduction preprocessing in the recording noise reduction module comprises:
the training and learning module, used for inputting lossy data and training and learning on the lossy data;
the output module, used for taking the corresponding lossless data as the output of the deep learning algorithm;
and the first processing module, used for processing other lossy data with the trained model.
Specifically, the noise reduction method uses a DAE (Denoising AutoEncoder), a deep learning algorithm, whose principle is as follows: lossy data (which can take various forms) is given as input, training and learning are carried out on the lossy data with the corresponding lossless data as the target output of the deep learning algorithm, and other lossy data is then processed with the trained model.
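A minimal sketch of such a denoising autoencoder is shown below in PyTorch, trained to map noisy feature frames (for example magnitude-spectrum slices) to their clean counterparts. The network size, feature dimension and training loop are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps a lossy (noisy) feature frame to its lossless (clean) counterpart;
    trained on paired noisy/clean frames as described above."""
    def __init__(self, n_features=257, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_features), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, noisy, clean, epochs=50, lr=1e-3):
    """noisy/clean: float tensors of shape (n_frames, n_features)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(noisy), clean)   # lossy input, lossless target
        loss.backward()
        opt.step()
    return model

# dae = train_dae(DenoisingAutoencoder(), noisy_frames, clean_frames)
# denoised = dae(other_noisy_frames)        # process other lossy data
```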
In one embodiment of the present invention, the speaker recognition module comprises:
the classification module, used for dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
and the first integration module, used for integrating the classification results of the models and labelling the speech segments of the different speakers in real time and in batches.
Specifically, because different speakers differ in their voice characteristics and indexes, the different segments and speech frames in one recording are divided into two or more classes with different modeling methods such as clustering, Gaussian mixture models and CNN (convolutional neural networks), and the classification results of the models are integrated, so that the speech segments of the different speakers are labelled in real time and in batches. For example, in an insurance call the agent and the customer can be distinguished by this method. In addition, for some application scenarios the system can sample and model the speech of a known speaker in advance, and applying the pre-trained speaker model further improves the accuracy of speaker recognition; for example, a model pre-trained on samples of the customer-service agent's voice improves the recognition of the unknown customer.
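One of the modeling methods named above, the Gaussian mixture model, is sketched below with scikit-learn: segment-level feature vectors (for example the MFCC means/stds from the earlier sketch) are clustered into two speakers. The function and argument names are illustrative, not from the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def assign_speakers(segment_feats, n_speakers=2):
    """segment_feats: (n_segments, n_features) array of per-segment features.
    Returns one speaker id (0 .. n_speakers-1) per segment."""
    gmm = GaussianMixture(n_components=n_speakers,
                          covariance_type="diag", random_state=0)
    return gmm.fit_predict(segment_feats)

# Example: decide which of two speakers produced each segment of one call.
# speaker_ids = assign_speakers(np.vstack(per_segment_feature_vectors))
```

With pre-collected agent samples, a separate GaussianMixture could instead be fitted on the agent's segments alone and used to score unknown segments, in the spirit of the pre-trained speaker model mentioned above.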
In a specific embodiment of the present invention, the feature extraction module includes:
the extraction module, used for extracting a variety of different feature indexes for each model according to the modeling requirements of that model, the feature indexes being Mel-frequency cepstral coefficients;
the second processing module, used for processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
and the conversion module, used for extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
Specifically, the models of the system select and extract different feature indexes according to their modeling requirements: the classification models adopt the widely used MFCC extraction method and further process the different dimensional indexes of the MFCCs to generate more effective recognition features, while models such as CNN extract graphical features from the spectrogram generated by converting the original speech signal.
For example, the feature extraction module performs pre-emphasis on the original recording signal, divides the complete recording signal into frames (each frame representing a short stretch of the original recording), multiplies each frame by a Hamming window, applies a fast Fourier transform to obtain the energy spectrum of each frame, passes the energy spectrum through a triangular filter bank, and applies a Discrete Cosine Transform (DCT) to the logarithmic energy output by each triangular filter to obtain features of the speech segment such as the MFCC coefficients.
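The following sketch mirrors these steps (pre-emphasis, framing, Hamming window, FFT, triangular mel filterbank, log, DCT) in NumPy/SciPy, borrowing only the mel filterbank from librosa. Frame length, hop and coefficient counts are assumed values, not taken from the patent.

```python
import numpy as np
import librosa
from scipy.fft import dct

def mfcc_from_signal(y, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_mfcc=13):
    """Pre-emphasis -> framing -> Hamming window -> FFT power spectrum ->
    triangular mel filterbank -> log -> DCT, yielding MFCCs per frame."""
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                    # pre-emphasis
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)                       # framing + Hamming window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft       # energy spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # triangular filter bank
    log_energy = np.log(power @ mel_fb.T + 1e-10)                 # log filterbank energies
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]  # DCT -> MFCC
```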
In a specific embodiment of the present invention, the emotion recognition module includes:
the input module, used for feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
the second integration module, used for integrating the different prediction results obtained after each model has been trained;
and the prediction module, used for obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and for predicting the emotion of other unknown segments with the final overall model.
Specifically, in the emotion recognition module, emotion labels of historical samples serve as training samples. The various features extracted in the previous steps are fed into different machine learning algorithm models for training and learning. Different algorithm models learn and recognize different features, and their recognition and prediction performance also differs, so the system integrates the prediction results obtained after the models have been trained. By adjusting the integration weights, the target evaluation function and other settings of the different models, an overall model with the best prediction performance, in terms of metrics such as accuracy and recall, is obtained for the given application scenario, and this final model is then used to predict the emotion of other unknown segments.
For example, the speech segments in different calls represent the emotions of different customers and agents. The speech segments are processed according to the method of the feature extraction module, the segment samples are fed into machine learning algorithm models (for example, support vector machines) for training and integration, and the resulting model is used to predict other speech segments whose emotion is unknown.
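A sketch of choosing the integration weights against a target evaluation function is shown below: the per-model class probabilities on a validation set are mixed with a coarse grid of weights, and the combination that maximizes recall of one target emotion is kept. All names, the grid step and the choice of recall as the target metric are assumptions for illustration.

```python
import itertools
import numpy as np
from sklearn.metrics import recall_score

def tune_ensemble_weights(val_probas, y_val, classes, target_class="angry", step=0.1):
    """val_probas: list of (n_samples, n_classes) probability matrices, one per
    trained model. Returns the convex weight combination that maximises recall
    of the target emotion on the validation set."""
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_recall = None, -1.0
    for w in itertools.product(grid, repeat=len(val_probas)):
        if abs(sum(w) - 1.0) > 1e-6:
            continue                                    # only convex combinations
        mixed = sum(wi * p for wi, p in zip(w, val_probas))
        pred = np.array(classes)[mixed.argmax(axis=1)]
        r = recall_score(y_val, pred, labels=[target_class], average="macro")
        if r > best_recall:
            best_w, best_recall = w, r
    return best_w, best_recall
```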
As shown in FIG. 2 and FIG. 3, in another aspect of the present invention, a speech emotion recognition method based on machine learning is provided, which comprises the following steps:
S1, acquiring recording data, and performing noise-reduction preprocessing on the recording data with a related algorithm;
S2, receiving the recording data transmitted by the recording noise reduction module, and cutting the recording data into segments according to phonetic features, wherein cutting the recording data into segments according to phonetic features further comprises the steps of: S21, dividing the different speech frames into two classes, human voice and non-human voice, with a clustering method from machine learning; S22, aggregating the classified speech frames into human-voice and non-human-voice segments with a rule-based method.
Specifically, the short speech frames are clustered according to the phonetic differences between human voice and non-human voice; the classified speech frames are then aggregated by rules, and similar speech frames that are consecutive in time are grouped into the same segment, which ensures that one utterance of one speaker is contained in a single segment and that different speakers fall into different segments. Finally, the generated segments are transmitted to the speaker recognition module and the feature extraction module, respectively.
S3, receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm, and recognizing the speakers according to the classification;
S4, receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
S5, receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms, and integrating the prediction results of each model with an ensemble algorithm.
In an embodiment of the present invention, the noise-reduction preprocessing in step S1 comprises:
S11, inputting lossy data, and training and learning on the lossy data;
S12, taking the corresponding lossless data as the output of the deep learning algorithm;
S13, processing other lossy data with the trained model.
In an embodiment of the present invention, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification in step S3 comprises:
S31, dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
S32, integrating the classification results of the models.
In an embodiment of the invention, deriving segment features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41, extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
S42, processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
S43, extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it.
In an embodiment of the present invention, training the emotion prediction models and integrating the prediction results of each model with the ensemble algorithm in step S5 comprises:
S51, feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
S52, integrating the different prediction results obtained after each model has been trained;
S53, obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and predicting the emotion of other unknown segments with the final overall model.
In order to facilitate understanding of the above technical solution of the present invention, the technical solution is described in detail below from the perspective of specific use.
In actual use, the machine learning based speech emotion recognition system applies noise-reduction preprocessing to real telephone recording data with a related algorithm and then, according to the clear difference between the human-voice and non-human-voice parts, cuts the speech into several sentence-length segments. Because the segments in real recording data contain two or more speakers, the speakers are distinguished by the different voice characteristics of each person. On the basis of the sentence segmentation module, each segment is represented by its Mel-frequency cepstral coefficients, which are sent to the emotion recognition module, and the emotion represented by each segment is then recognized from these coefficients.
As the model is deployed and continuously applied, the system keeps accumulating new emotion samples that were predicted correctly or incorrectly during actual use, and it continuously optimizes the parameters, structure and other aspects of the model through methods such as incremental learning, so that the model keeps improving and achieves a better application effect.
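As an illustrative sketch of the incremental-learning idea (the patent does not name a specific algorithm), an estimator supporting partial_fit, here scikit-learn's SGDClassifier, is used as a stand-in for the deployed emotion model; the label set and function names are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# A linear classifier that supports incremental updates via partial_fit,
# standing in for the deployed emotion model.
emotion_classes = np.array(["angry", "calm", "happy"])   # illustrative label set
model = SGDClassifier(random_state=0)

def update_with_new_samples(model, new_feats, new_labels):
    """Fold newly verified production samples (correctly and incorrectly
    predicted, after review) back into the model without full retraining."""
    model.partial_fit(new_feats, new_labels, classes=emotion_classes)
    return model
```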
In conclusion, with the technical solution of the present invention, the system is designed and built for the Chinese language environment and the actual production environment, so it effectively achieves good performance in the Chinese language environment and in the actual production environment of customer-service calls; the required recordings can be retrieved efficiently and quickly according to emotion, which improves the efficiency of work including, but not limited to, telephone quality inspection.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (2)

1. A speech emotion recognition system based on machine learning, characterized by comprising a recording noise reduction module, a sentence segmentation module, a speaker recognition module, a feature extraction module and an emotion recognition module, wherein
the recording noise reduction module is used for acquiring recording data and performing noise-reduction preprocessing on the recording data with a related algorithm;
the sentence segmentation module is used for receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features;
the speaker recognition module is used for receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification;
the feature extraction module is used for receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
the emotion recognition module is used for receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms and integrating the prediction results of each model with an ensemble algorithm;
the sentence segmentation module comprises: a second classification module for dividing the different speech frames into two classes, human voice and non-human voice, with a clustering method from machine learning; and an aggregation module for aggregating the classified speech frames into human-voice and non-human-voice segments with a rule-based method;
the noise-reduction preprocessing in the recording noise reduction module comprises:
the training and learning module, used for inputting lossy data and training and learning on the lossy data;
the output module, used for taking the corresponding lossless data as the output of the deep learning algorithm;
the first processing module, used for processing other lossy data with the trained model;
the speaker recognition module comprises:
the classification module, used for dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
the first integration module, used for integrating the classification results of the models and labelling the speech segments of the different speakers in real time and in batches;
the feature extraction module comprises:
the extraction module, used for extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
the second processing module, used for processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
the conversion module, used for extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it;
the emotion recognition module comprises:
the input module, used for feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
the second integration module, used for integrating the different prediction results obtained after each model has been trained;
and the prediction module, used for obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and for predicting the emotion of other unknown segments with the final overall model.
2. A speech emotion recognition method based on machine learning, characterized by comprising the following steps:
S1, acquiring recording data, and performing noise-reduction preprocessing on the recording data with a related algorithm;
S2, receiving the recording data transmitted by the recording noise reduction module, and cutting the recording data into segments according to phonetic features;
S3, receiving the segments transmitted by the sentence segmentation module, classifying the segments with a machine learning algorithm, and recognizing the speakers according to the classification;
S4, receiving the segments transmitted by the sentence segmentation module, extracting the spectral features and Mel-frequency cepstral coefficients of each segment, and deriving segment features from them after further processing;
S5, receiving the segment features generated by the feature extraction module, training emotion prediction models through machine learning algorithms, and integrating the prediction results of each model with an ensemble algorithm,
wherein the noise-reduction preprocessing in step S1 comprises:
S11, inputting lossy data, and training and learning on the lossy data;
S12, taking the corresponding lossless data as the output of the deep learning algorithm;
S13, processing other lossy data with the trained model;
classifying the segments with a machine learning algorithm and recognizing the speakers according to the classification in step S3 comprises:
S31, dividing the different segments and speech frames in the recording data into two or more classes with different modeling methods;
S32, integrating the classification results of the models;
deriving segment features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41, extracting a variety of different feature indexes for each model according to the modeling requirements of that model;
S42, processing the different dimensional indexes of the Mel-frequency cepstral coefficients to generate recognition features;
S43, extracting image features from the original speech time-domain signal and from the spectrogram generated by converting it;
training the emotion prediction models and integrating the prediction results of each model with the ensemble algorithm in step S5 comprises:
S51, feeding the training samples, represented by the various extracted features, into different machine learning algorithm models for training and learning;
S52, integrating the different prediction results obtained after each model has been trained;
S53, obtaining, by adjusting the different models, a final overall model with the best prediction performance under different application scenarios, and predicting the emotion of other unknown segments with the final overall model.
CN201811186572.8A 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning Active CN109256150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Publications (2)

Publication Number Publication Date
CN109256150A CN109256150A (en) 2019-01-22
CN109256150B true CN109256150B (en) 2021-11-30

Family

ID=65045954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811186572.8A Active CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Country Status (1)

Country Link
CN (1) CN109256150B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697982A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of speaker speech recognition system in instruction scene
CN110211563B (en) * 2019-06-19 2024-05-24 平安科技(深圳)有限公司 Chinese speech synthesis method, device and storage medium for scenes and emotion
CN112151042A (en) * 2019-06-27 2020-12-29 中国电信股份有限公司 Voiceprint recognition method, device and system and computer readable storage medium
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110517694A (en) * 2019-09-06 2019-11-29 北京清帆科技有限公司 A kind of teaching scene voice conversion detection system
CN110933235B (en) * 2019-11-06 2021-07-27 杭州哲信信息技术有限公司 Noise identification method in intelligent calling system based on machine learning
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
US11670286B2 (en) * 2019-12-31 2023-06-06 Beijing Didi Infinity Technology And Development Co., Ltd. Training mechanism of verbal harassment detection systems
US20210201893A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Pattern-based adaptation model for detecting contact information requests in a vehicle
US11664043B2 (en) * 2019-12-31 2023-05-30 Beijing Didi Infinity Technology And Development Co., Ltd. Real-time verbal harassment detection system
CN111583967A (en) * 2020-05-14 2020-08-25 西安医学院 Mental health emotion recognition device based on utterance model and operation method thereof
CN111681681A (en) * 2020-05-22 2020-09-18 深圳壹账通智能科技有限公司 Voice emotion recognition method and device, electronic equipment and storage medium
CN111833653A (en) * 2020-07-13 2020-10-27 江苏理工学院 Driving assistance system, method, device, and storage medium using ambient noise
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN108074576A (en) * 2017-12-14 2018-05-25 讯飞智元信息科技有限公司 Inquest the speaker role's separation method and system under scene
CN108259686A (en) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 A kind of customer service system based on speech analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180018969A1 (en) * 2016-07-15 2018-01-18 Circle River, Inc. Call Forwarding to Unavailable Party Based on Artificial Intelligence

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN108074576A (en) * 2017-12-14 2018-05-25 讯飞智元信息科技有限公司 Inquest the speaker role's separation method and system under scene
CN108259686A (en) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 A kind of customer service system based on speech analysis

Also Published As

Publication number Publication date
CN109256150A (en) 2019-01-22

Similar Documents

Publication Publication Date Title
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN108900725B (en) Voiceprint recognition method and device, terminal equipment and storage medium
CN105632501B (en) A kind of automatic accent classification method and device based on depth learning technology
CN111081279A (en) Voice emotion fluctuation analysis method and device
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
CN105810205A (en) Speech processing method and device
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
Joshi et al. Speech emotion recognition: a review
CN116665676B (en) Semantic recognition method for intelligent voice outbound system
CN114818649A (en) Service consultation processing method and device based on intelligent voice interaction technology
Gupta et al. Speech feature extraction and recognition using genetic algorithm
CN113889090A (en) Multi-language recognition model construction and training method based on multi-task learning
Reddy et al. Audio compression with multi-algorithm fusion and its impact in speech emotion recognition
CN114420169B (en) Emotion recognition method and device and robot
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier
CN113744742A (en) Role identification method, device and system in conversation scene
CN111091840A (en) Method for establishing gender identification model and gender identification method
CN112927723A (en) High-performance anti-noise speech emotion recognition method based on deep neural network
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN110299133A (en) The method for determining illegally to broadcast based on keyword
Singh et al. A comparative study on feature extraction techniques for language identification
Camarena-Ibarrola et al. Speaker identification using entropygrams and convolutional neural networks
Bora et al. Speaker identification for biometric access control using hybrid features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant