CN109256150A - Speech emotion recognition system and method based on machine learning - Google Patents

Speech emotion recognition system and method based on machine learning

Info

Publication number
CN109256150A
Authority
CN
China
Prior art keywords
module
segment
machine learning
model
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811186572.8A
Other languages
Chinese (zh)
Other versions
CN109256150B (en)
Inventor
徐心
胡宇澄
王麒铭
饶鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuangjing Consulting Co Ltd
Original Assignee
Beijing Chuangjing Consulting Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuangjing Consulting Co Ltd filed Critical Beijing Chuangjing Consulting Co Ltd
Priority to CN201811186572.8A priority Critical patent/CN109256150B/en
Publication of CN109256150A publication Critical patent/CN109256150A/en
Application granted granted Critical
Publication of CN109256150B publication Critical patent/CN109256150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The invention discloses a speech emotion recognition system and method based on machine learning. The system comprises a recording noise-reduction module; a segmentation module that receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features; a speaker identification module that receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification; a feature extraction module that receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing; and an emotion recognition module that receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm. The invention has the advantage of achieving good performance in real production environments involving Chinese speech and customer-service calls.

Description

Speech emotion recognition system and method based on machine learning
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition system and method based on machine learning.
Background technique
The field of speech recognition can be broadly divided into two tasks: one converts the content expressed in a speech recording into text; the other identifies the emotion contained in the audio (for example, anger or calm). Speech emotion recognition has been addressed in the foreign literature, but with considerable limitations. Most existing work targets languages other than Chinese and cannot be applied directly to speech in a Chinese-language environment, and most of it relies on a single algorithm, so the reported results are obtained under near-laboratory conditions and do not hold up in real production; such approaches therefore cannot meet the requirements of Chinese production environments.
No effective solution to these problems in the related art has yet been proposed.
Summary of the invention
In view of the above technical problems in the related art, the present invention proposes a speech emotion recognition system and method based on machine learning, which can solve the problem of emotion recognition from Chinese audio recordings, including but not limited to inbound and outbound customer-service calls.
To achieve this technical objective, the technical solution of the present invention is realized as follows.
A speech emotion recognition system based on machine learning comprises a recording noise-reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module, wherein:
the recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it using appropriate algorithms;
the segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features;
the speaker identification module receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification;
the feature extraction module receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing;
the emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
Further, the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module, which takes corrupted data as input and trains on it;
an output module, which uses the corresponding clean data as the target output of the deep learning algorithm;
a first processing module, which applies the trained model to other corrupted data.
Further, the speaker identification module comprises:
a classification module, which uses different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
a first integration module, which combines the classification results of the individual models to label the speech segments of different speakers in real time or in batches.
Further, the feature extraction module comprises:
an extraction module, which extracts the different feature indicators required by each model according to how that model is built;
a second processing module, which processes the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
a conversion module, which extracts image features from the spectrogram generated by converting the raw speech time-domain signal.
Further, the emotion recognition module comprises:
an input module, which feeds the extracted features of training samples into different machine learning models for training;
a second integration module, which combines the different predictions obtained from the trained models;
a prediction module, which tunes the individual models to obtain the final overall model for a given application scenario and uses it to predict the emotion of unlabelled segments.
Another aspect of the present invention provides a speech emotion recognition method based on machine learning, comprising the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it using appropriate algorithms;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features;
S3: receive the segments from the segmentation module, classify them with machine learning algorithms, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and Mel-frequency cepstral coefficients from each segment, and derive segment-level features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
Further, the noise-reduction preprocessing in step S1 comprises:
S11: input corrupted data and train on it;
S12: use the corresponding clean data as the target output of the deep learning algorithm;
S13: apply the trained model to other corrupted data.
Further, classifying the segments with machine learning algorithms and identifying the speakers from the classification in step S3 comprises:
S31: use different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
S32: combine the classification results of the individual models.
Further, deriving segment-level features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the different feature indicators required by how that model is built;
S42: process the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated by converting the raw speech time-domain signal.
Further, training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features of training samples into different machine learning models for training;
S52: combine the different predictions obtained from the trained models;
S53: tune the individual models to obtain the final overall model for a given application scenario and use it to predict the emotion of unlabelled segments.
Beneficial effects of the present invention: the system is designed and built around the Chinese language environment and real production conditions, so it achieves good performance in real production environments involving Chinese speech and customer-service calls; recordings of interest can be retrieved efficiently by emotion, which improves the efficiency of work such as, but not limited to, telephone quality inspection.
Detailed description of the invention
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural diagram of the speech emotion recognition system based on machine learning according to an embodiment of the present invention;
Fig. 2 is the first part of the flow chart of the speech emotion recognition method based on machine learning according to an embodiment of the present invention;
Fig. 3 is the second part of the flow chart of the speech emotion recognition method based on machine learning according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
As shown in Fig. 1, the speech emotion recognition system based on machine learning according to an embodiment of the present invention comprises a recording noise-reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module. The recording noise-reduction module, the segmentation module and the feature extraction module perform data preprocessing: they provide the basis for prediction, improve its stability and accuracy, and supply the features used for prediction. The speaker identification module and the emotion recognition module perform prediction: using the segments and features produced by preprocessing, they predict the speaker and the emotion of each segment.
The recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it using appropriate algorithms.
Specifically, the recording noise-reduction module processes the recording with suitable algorithms to handle different types of noise such as background noise, reducing or eliminating the influence of noise on speech recognition and strengthening the performance of the other modules; the denoised recording is then transmitted to the segmentation module.
The segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features. The segmentation module comprises:
a second classification module, which uses a clustering method from machine learning to divide the speech frames into two classes, speech and non-speech;
an aggregation module, which uses rule-based methods to merge the classified speech frames into speech and non-speech segments.
Specifically, each short speech frame is clustered according to the phonetic features that distinguish speech from non-speech, and the classified frames are then merged by rules: frames of the same class that are consecutive in time are grouped into the same segment, such that a single utterance by the same speaker is contained in one segment and different speakers fall into different segments. The resulting segments are transmitted to the speaker identification module and the feature extraction module (a minimal sketch of this step is given below).
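The sketch below is illustrative only: short frames are clustered into speech and non-speech with k-means over simple acoustic features, and consecutive speech frames are merged into utterance-like segments by rule. The function name, the chosen features and the thresholds are assumptions, not values taken from the patent.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def split_into_segments(wav_path, hop_s=0.02, min_gap_s=0.3):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(hop_s * sr)
    # Frame-level features that separate speech from silence/noise.
    rms = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=2 * hop, hop_length=hop)[0]
    feats = np.stack([rms, zcr], axis=1)

    # Two clusters; the cluster with the higher mean energy is taken as speech.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    speech_label = int(rms[labels == 1].mean() > rms[labels == 0].mean())

    # Rule-based merge: consecutive speech frames form one segment;
    # gaps shorter than min_gap_s do not break a segment.
    segments, start, last = [], None, None
    for i, lab in enumerate(labels):
        t = i * hop_s
        if lab == speech_label:
            if start is None:
                start = t
            last = t
        elif start is not None and t - last > min_gap_s:
            segments.append((start, last + hop_s))
            start = None
    if start is not None:
        segments.append((start, last + hop_s))
    return segments  # list of (start_time, end_time) in seconds
```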
The speaker identification module receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification.
Specifically, all the segments cut from one recording are divided into two or more classes using machine learning algorithms, and two or more speakers are identified from the resulting classes, so that each utterance-length segment is assigned to the speaker it belongs to.
The feature extraction module receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing.
Specifically, feature extraction is the step preceding emotion recognition. For each short segment the module extracts Mel-frequency cepstral coefficients (MFCCs), a feature commonly used in speech recognition (audio-to-text), and then processes them further, for example by extracting additional statistics or adding other features. Because the MFCCs still contain a large amount of information, statistics such as the mean and standard deviation of the extracted MFCCs are computed as complementary features to improve the emotion prediction model. Each segment is represented by the features obtained in this way, and the extracted features are transmitted to the emotion recognition module (see the aggregation sketch below).
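The sketch below illustrates this segment-level aggregation: MFCCs are extracted for a segment and summarised with per-coefficient mean and standard deviation, giving a fixed-length feature vector regardless of segment length. The helper name and parameter values are illustrative assumptions.

```python
import librosa
import numpy as np

def segment_features(y, sr, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    # Statistics over time give a fixed-length vector for any segment length.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (2 * n_mfcc,)
```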
The emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
Specifically, the features extracted by the feature extraction module are first used with suitable machine learning and deep learning algorithms, such as convolutional neural networks (CNNs) and support vector machines (SVMs). The emotion prediction models are then trained with speech segments from a Chinese-language environment and with a training procedure oriented towards production use. Next, the trained models predict the emotion represented by previously unseen segments from their features; because the individual models may not give exactly the same prediction for the same segment, an ensemble algorithm finally combines their predictions to obtain the final emotion of each segment.
In a particular embodiment of the present invention, the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module, which takes corrupted data as input and trains on it;
an output module, which uses the corresponding clean data as the target output of the deep learning algorithm;
a first processing module, which applies the trained model to other corrupted data.
Specifically, the noise-reduction method is a denoising autoencoder (DAE), a deep learning algorithm whose principle is as follows: corrupted data (the corruption can take various forms) is fed in as input, the model is trained to reproduce the corresponding clean data as its output, and the trained model is then applied to other corrupted data. Since the noise contained in a recording makes it corrupted data, a DAE can be used to denoise the recording, as sketched below.
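The following sketch illustrates the denoising-autoencoder idea; the layer sizes, the feature dimension and the training loop are assumptions for illustration, not values from the patent. Noisy spectral frames are the input and the corresponding clean frames are the training target.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=257):          # e.g. magnitude-spectrum bins per frame
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, 64), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(noisy, clean, epochs=20, lr=1e-3):
    """noisy, clean: float tensors of shape (n_frames, dim)."""
    model = DenoisingAutoencoder(dim=noisy.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(noisy), clean)   # reconstruct the clean frame
        loss.backward()
        opt.step()
    return model
```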
In a particular embodiment of the present invention, the speaker identification module comprises:
a classification module, which uses different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
a first integration module, which combines the classification results of the individual models to label the speech segments of different speakers in real time or in batches.
Specifically, because different speakers differ in their acoustic features and indicators, modelling methods such as clustering, Gaussian mixture models and CNNs are used to divide the segments and speech frames of one recording into two or more classes, and the classification results of the individual models are combined to label the speech segments of the different speakers in real time or in batches. In an insurance call, for example, this method can be used to distinguish the agent from the customer. In addition, for certain application scenarios the system can build speaker models in advance from speech samples of known speakers and use these pre-trained models to further improve the accuracy of speaker identification; for instance, a model pre-trained on the agents' voices obtained in advance improves recognition of the unknown customer (a minimal clustering sketch is given below).
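The sketch below separates the speakers of one call by clustering segment-level MFCC statistics with a Gaussian mixture model; it reuses the segment_features helper sketched earlier, and the two-speaker assumption (agent vs. customer) and model choice are illustrative, not prescribed by the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def label_speakers(segment_waveforms, sr, n_speakers=2):
    # One fixed-length feature vector per segment (see segment_features above).
    feats = np.array([segment_features(y, sr) for y in segment_waveforms])
    gmm = GaussianMixture(n_components=n_speakers, covariance_type="diag",
                          random_state=0).fit(feats)
    return gmm.predict(feats)   # one speaker label per segment
```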
In a particular embodiment of the present invention, the feature extraction module comprises:
an extraction module, which extracts the different feature indicators required by each model according to how that model is built, the feature indicators being Mel-frequency cepstral coefficients;
a second processing module, which processes the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
a conversion module, which extracts image features from the spectrogram generated by converting the raw speech time-domain signal.
Specifically, the models of the system extract different feature indicators according to how they are built: the classification models use the common MFCC extraction method and further process the different MFCC dimensions to produce more effective discriminative features, while models such as CNNs extract graphical features from the spectrogram generated by converting the raw speech signal.
For example, the feature extraction module first applies pre-emphasis to the raw recording signal, divides the complete signal into multiple frames (each frame representing a short stretch of the original recording), multiplies each frame by a Hamming window, applies a fast Fourier transform to obtain the energy spectrum of each frame, passes the energy spectrum through a bank of triangular filters, and applies a discrete cosine transform (DCT) to the log energies output by the filter bank, which yields the MFCC coefficients and related segment features (sketched step by step below).
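The sketch below walks through these steps; the frame length, hop size, filter count and coefficient count are illustrative choices rather than values specified in the patent.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_pipeline(y, sr, frame_len=400, hop=160, n_mels=26, n_mfcc=13):
    # 1. Pre-emphasis boosts high frequencies.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # 2. Split into overlapping frames and 3. apply a Hamming window.
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 4. FFT -> power (energy) spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 5. Triangular mel filter bank, 6. log energies, 7. DCT -> MFCCs.
    fbank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```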
In a particular embodiment of the present invention, the emotion recognition module comprises:
an input module, which feeds the extracted features of training samples into different machine learning models for training;
a second integration module, which combines the different predictions obtained from the trained models;
a prediction module, which tunes the individual models to obtain the final overall model for a given application scenario and uses it to predict the emotion of unlabelled segments.
Specifically, in the emotion recognition module the emotion labels of historical samples serve as training data: the features extracted in the preceding steps are fed into different machine learning models for training. Because different models learn and recognise somewhat different aspects of the features, their recognition and prediction performance also differs, so the system combines the predictions obtained from the trained models. By adjusting the ensemble weights of the individual models and using objective evaluation measures such as precision and recall, the overall model with the best prediction performance for the given application scenario is obtained, and this final model is then used to predict the emotion of unlabelled segments.
For example, the speech segments in different calls represent the emotions of different customers and agents. After these segments are extracted and processed as described for the feature extraction module, the segment samples are fed into machine learning models (such as a support vector machine) for training and ensembling, and the resulting model is used to predict segments whose emotion is unknown (see the training-and-ensemble sketch below).
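The sketch below assumes a feature matrix X with one row per labelled segment (e.g. from segment_features) and a vector y of emotion labels; the particular estimators and the soft-voting scheme are illustrative choices, not the patent's prescribed models.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_ensemble(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    ensemble = VotingClassifier(
        estimators=[
            ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ],
        voting="soft",          # average the predicted class probabilities
    )
    ensemble.fit(X_tr, y_tr)
    # Per-emotion precision and recall, used to tune the ensemble.
    print(classification_report(y_te, ensemble.predict(X_te)))
    return ensemble
```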
As shown in Figs. 2 and 3, another aspect of the present invention provides a speech emotion recognition method based on machine learning, comprising the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it using appropriate algorithms;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features. Cutting the recording data into segments according to phonetic features further comprises: S21, using a clustering method from machine learning to divide the speech frames into two classes, speech and non-speech; S22, using rule-based methods to merge the classified speech frames into speech and non-speech segments.
Specifically, each short speech frame is clustered according to the phonetic features that distinguish speech from non-speech, and the classified frames are then merged by rules: frames of the same class that are consecutive in time are grouped into the same segment, such that a single utterance by the same speaker is contained in one segment and different speakers fall into different segments. The resulting segments are transmitted to the speaker identification module and the feature extraction module.
S3: receive the segments from the segmentation module, classify them with machine learning algorithms, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and Mel-frequency cepstral coefficients from each segment, and derive segment-level features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
In a particular embodiment of the present invention, the noise-reduction preprocessing in step S1 comprises:
S11: input corrupted data and train on it;
S12: use the corresponding clean data as the target output of the deep learning algorithm;
S13: apply the trained model to other corrupted data.
In a particular embodiment of the present invention, classifying the segments with machine learning algorithms and identifying the speakers from the classification in step S3 comprises:
S31: use different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
S32: combine the classification results of the individual models.
In a particular embodiment of the present invention, deriving segment-level features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the different feature indicators required by how that model is built;
S42: process the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated by converting the raw speech time-domain signal.
In a particular embodiment of the present invention, training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features of training samples into different machine learning models for training;
S52: combine the different predictions obtained from the trained models;
S53: tune the individual models to obtain the final overall model for a given application scenario and use it to predict the emotion of unlabelled segments.
To facilitate understanding of the above technical solution of the present invention, it is described in detail below in terms of how it is used in practice.
In practice, the speech emotion recognition system based on machine learning according to the present invention takes real telephone recording data, applies noise-reduction preprocessing to it with appropriate algorithms, and then cuts the recording into multiple utterance-length segments according to the clear differences between voiced and non-voiced portions. Because a real recording contains two or more speakers, the speakers are distinguished by their different voice characteristics. Building on the segmentation, each segment is represented by its Mel-frequency cepstral coefficients, these coefficients are sent to the emotion recognition module, and the emotion represented by each segment is identified from them.
As the model is deployed and used over time, the system continuously accumulates samples whose emotion predictions turned out to be correct or incorrect in actual use, and continuously optimises the parameters and structure of the model through methods such as incremental learning, improving steadily towards better application performance (a minimal incremental-update sketch is given below).
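The sketch below illustrates the incremental-learning idea with an assumed linear classifier and label set: newly reviewed production segments update the model via partial_fit instead of full retraining.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array(["angry", "calm"])           # assumed emotion label set
model = SGDClassifier(loss="log_loss", random_state=0)

def update_with_feedback(model, X_new, y_new):
    """X_new: features of newly reviewed segments; y_new: their corrected labels."""
    model.partial_fit(X_new, y_new, classes=classes)
    return model
```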
In conclusion by means of above-mentioned technical proposal of the invention, according to the language environment and actual production environment of Chinese It designs and builds premised on, it can be more effectively in the actual production environment of Chinese language environment and customer service call In obtain good performance;Required recording efficiently can be quickly retrieved according to mood, so that improving includes but is not limited to electricity Talk about the efficiency of quality inspection work.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (10)

1. A speech emotion recognition system based on machine learning, characterised in that it comprises a recording noise-reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module, wherein:
the recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it using appropriate algorithms;
the segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features;
the speaker identification module receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification;
the feature extraction module receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing;
the emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
2. The speech emotion recognition system based on machine learning according to claim 1, characterised in that the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module, which takes corrupted data as input and trains on it;
an output module, which uses the corresponding clean data as the target output of the deep learning algorithm;
a first processing module, which applies the trained model to other corrupted data.
3. The speech emotion recognition system based on machine learning according to claim 1, characterised in that the speaker identification module comprises:
a classification module, which uses different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
a first integration module, which combines the classification results of the individual models to label the speech segments of different speakers in real time or in batches.
4. The speech emotion recognition system based on machine learning according to claim 3, characterised in that the feature extraction module comprises:
an extraction module, which extracts the different feature indicators required by each model according to how that model is built;
a second processing module, which processes the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
a conversion module, which extracts image features from the spectrogram generated by converting the raw speech time-domain signal.
5. The speech emotion recognition system based on machine learning according to claim 3 or 4, characterised in that the emotion recognition module comprises:
an input module, which feeds the extracted features of training samples into different machine learning models for training;
a second integration module, which combines the different predictions obtained from the trained models;
a prediction module, which tunes the individual models to obtain the final overall model for a given application scenario and uses it to predict the emotion of unlabelled segments.
6. A speech emotion recognition method based on machine learning, characterised in that it comprises the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it using appropriate algorithms;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features;
S3: receive the segments from the segmentation module, classify them with machine learning algorithms, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and Mel-frequency cepstral coefficients from each segment, and derive segment-level features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
7. The speech emotion recognition method based on machine learning according to claim 6, characterised in that the noise-reduction preprocessing in step S1 comprises:
S11: input corrupted data and train on it;
S12: use the corresponding clean data as the target output of the deep learning algorithm;
S13: apply the trained model to other corrupted data.
8. The speech emotion recognition method based on machine learning according to claim 6, characterised in that classifying the segments with machine learning algorithms and identifying the speakers from the classification in step S3 comprises:
S31: use different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
S32: combine the classification results of the individual models.
9. The speech emotion recognition method based on machine learning according to claim 8, characterised in that deriving segment-level features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the different feature indicators required by how that model is built;
S42: process the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated by converting the raw speech time-domain signal.
10. The speech emotion recognition method based on machine learning according to claim 8 or 9, characterised in that training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features of training samples into different machine learning models for training;
S52: combine the different predictions obtained from the trained models;
S53: tune the individual models to obtain the final overall model for a given application scenario and use it to predict the emotion of unlabelled segments.
CN201811186572.8A 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning Active CN109256150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Publications (2)

Publication Number Publication Date
CN109256150A true CN109256150A (en) 2019-01-22
CN109256150B CN109256150B (en) 2021-11-30

Family

ID=65045954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811186572.8A Active CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Country Status (1)

Country Link
CN (1) CN109256150B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
US20180018969A1 (en) * 2016-07-15 2018-01-18 Circle River, Inc. Call Forwarding to Unavailable Party Based on Artificial Intelligence
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN108074576A (en) * 2017-12-14 2018-05-25 讯飞智元信息科技有限公司 Inquest the speaker role's separation method and system under scene
CN108259686A (en) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 A kind of customer service system based on speech analysis

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697982A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of speaker speech recognition system in instruction scene
WO2020253509A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
CN112151042A (en) * 2019-06-27 2020-12-29 中国电信股份有限公司 Voiceprint recognition method, device and system and computer readable storage medium
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110517694A (en) * 2019-09-06 2019-11-29 北京清帆科技有限公司 A kind of teaching scene voice conversion detection system
CN110933235B (en) * 2019-11-06 2021-07-27 杭州哲信信息技术有限公司 Noise identification method in intelligent calling system based on machine learning
CN110933235A (en) * 2019-11-06 2020-03-27 杭州哲信信息技术有限公司 Noise removing method in intelligent calling system based on machine learning
CN110956953A (en) * 2019-11-29 2020-04-03 中山大学 Quarrel identification method based on audio analysis and deep learning
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
US20210201892A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Training mechanism of verbal harassment detection systems
US20210201893A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Pattern-based adaptation model for detecting contact information requests in a vehicle
US20210201934A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Real-time verbal harassment detection system
US11664043B2 (en) * 2019-12-31 2023-05-30 Beijing Didi Infinity Technology And Development Co., Ltd. Real-time verbal harassment detection system
US11670286B2 (en) * 2019-12-31 2023-06-06 Beijing Didi Infinity Technology And Development Co., Ltd. Training mechanism of verbal harassment detection systems
CN111583967A (en) * 2020-05-14 2020-08-25 西安医学院 Mental health emotion recognition device based on utterance model and operation method thereof
WO2021232594A1 (en) * 2020-05-22 2021-11-25 深圳壹账通智能科技有限公司 Speech emotion recognition method and apparatus, electronic device, and storage medium
CN111833653A (en) * 2020-07-13 2020-10-27 江苏理工学院 Driving assistance system, method, device, and storage medium using ambient noise
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call

Also Published As

Publication number Publication date
CN109256150B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN109256150A (en) Speech emotion recognition system and method based on machine learning
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN106503805B (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
WO2016150257A1 (en) Speech summarization program
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
Sarthak et al. Spoken language identification using convnets
CN111489765A (en) Telephone traffic service quality inspection method based on intelligent voice technology
CN111312292A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
CN116665676B (en) Semantic recognition method for intelligent voice outbound system
CN108735200A (en) A kind of speaker's automatic marking method
CN109714608A (en) Video data handling procedure, device, computer equipment and storage medium
Huang et al. Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering
Reddy et al. Audio compression with multi-algorithm fusion and its impact in speech emotion recognition
Venkatesan et al. Automatic language identification using machine learning techniques
CN113611286B (en) Cross-language speech emotion recognition method and system based on common feature extraction
Krishna et al. Language independent gender identification from raw waveform using multi-scale convolutional neural networks
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN111091840A (en) Method for establishing gender identification model and gender identification method
Johar Paralinguistic profiling using speech recognition
CN111009262A (en) Voice gender identification method and system
CN111427996A (en) Method and device for extracting date and time from human-computer interaction text
US11398239B1 (en) ASR-enhanced speech compression
CN115063155A (en) Data labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant