CN109256150A - Speech emotion recognition system and method based on machine learning - Google Patents

Speech emotion recognition system and method based on machine learning

Info

Publication number
CN109256150A
Authority
CN
China
Prior art keywords
module
segment
machine learning
model
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811186572.8A
Other languages
Chinese (zh)
Other versions
CN109256150B (en)
Inventor
徐心
胡宇澄
王麒铭
饶鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Chuangjing Consulting Co Ltd
Original Assignee
Beijing Chuangjing Consulting Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Chuangjing Consulting Co Ltd filed Critical Beijing Chuangjing Consulting Co Ltd
Priority to CN201811186572.8A priority Critical patent/CN109256150B/en
Publication of CN109256150A publication Critical patent/CN109256150A/en
Application granted granted Critical
Publication of CN109256150B publication Critical patent/CN109256150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Abstract

The invention discloses a speech emotion recognition system and method based on machine learning. The system comprises a recording noise-reduction module; a segmentation module that receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features; a speaker identification module that receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification; a feature extraction module that receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing; and an emotion recognition module that receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm. The invention has the advantage of achieving good performance in real production environments involving Chinese speech and customer-service calls.

Description

Speech emotion recognition system and method based on machine learning
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition system and method based on machine learning.
Background technique
The field of speech recognition can be broadly divided into two tasks: one converts the content expressed in a speech recording into text; the other identifies the emotion contained in the audio (for example, anger or calm). Speech emotion recognition has been addressed in the foreign literature, but with considerable limitations. Most existing work targets languages other than Chinese and cannot be applied directly to speech in a Chinese-language environment, and most of it relies on a single algorithm, so the reported results are obtained under near-laboratory conditions and do not hold up in real production; such approaches therefore cannot meet the requirements of Chinese production environments.
No effective solution to these problems in the related art has yet been proposed.
Summary of the invention
In view of the above technical problems in the related art, the present invention proposes a speech emotion recognition system and method based on machine learning, which can solve the problem of emotion recognition from Chinese audio recordings, including but not limited to inbound and outbound customer-service calls.
To achieve this technical objective, the technical solution of the present invention is realized as follows.
A speech emotion recognition system based on machine learning comprises a recording noise-reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module, wherein:
the recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it using appropriate algorithms;
the segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features;
the speaker identification module receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification;
the feature extraction module receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing;
the emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
Further, the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module, which takes corrupted data as input and trains on it;
an output module, which uses the corresponding clean data as the target output of the deep learning algorithm;
a first processing module, which applies the trained model to other corrupted data.
Further, the speaker identification module comprises:
a classification module, which uses different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
a first integration module, which combines the classification results of the individual models to label the speech segments of different speakers in real time or in batches.
Further, the feature extraction module comprises:
an extraction module, which extracts the different feature indicators required by each model according to how that model is built;
a second processing module, which processes the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
a conversion module, which extracts image features from the spectrogram generated by converting the raw speech time-domain signal.
Further, the emotion recognition module comprises:
an input module, which feeds the extracted features of training samples into different machine learning models for training;
a second integration module, which combines the different predictions obtained from the trained models;
a prediction module, which tunes the individual models to obtain the final overall model for a given application scenario and uses it to predict the emotion of unlabelled segments.
Another aspect of the present invention provides a speech emotion recognition method based on machine learning, comprising the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it using appropriate algorithms;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features;
S3: receive the segments from the segmentation module, classify them with machine learning algorithms, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and Mel-frequency cepstral coefficients from each segment, and derive segment-level features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
Further, the noise-reduction preprocessing in step S1 comprises:
S11: input corrupted data and train on it;
S12: use the corresponding clean data as the target output of the deep learning algorithm;
S13: apply the trained model to other corrupted data.
Further, classifying the segments with machine learning algorithms and identifying the speakers from the classification in step S3 comprises:
S31: use different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
S32: combine the classification results of the individual models.
Further, deriving segment-level features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the different feature indicators required by how that model is built;
S42: process the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated by converting the raw speech time-domain signal.
Further, training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features of training samples into different machine learning models for training;
S52: combine the different predictions obtained from the trained models;
S53: tune the individual models to obtain the final overall model for a given application scenario and use it to predict the emotion of unlabelled segments.
Beneficial effects of the present invention: the system is designed and built around the Chinese language environment and real production conditions, so it achieves good performance in real production environments involving Chinese speech and customer-service calls; recordings of interest can be retrieved efficiently by emotion, which improves the efficiency of work such as, but not limited to, telephone quality inspection.
Detailed description of the invention
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural diagram of the speech emotion recognition system based on machine learning according to an embodiment of the present invention;
Fig. 2 is the first part of the flow chart of the speech emotion recognition method based on machine learning according to an embodiment of the present invention;
Fig. 3 is the second part of the flow chart of the speech emotion recognition method based on machine learning according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
As shown in Fig. 1, the speech emotion recognition system based on machine learning according to an embodiment of the present invention comprises a recording noise-reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module. The recording noise-reduction module, the segmentation module and the feature extraction module perform data preprocessing: they provide the basis for prediction, improve its stability and accuracy, and supply the features used for prediction. The speaker identification module and the emotion recognition module perform prediction: using the segments and features produced by preprocessing, they predict the speaker and the emotion of each segment.
The recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it using appropriate algorithms.
Specifically, the recording noise-reduction module processes the recording with suitable algorithms to handle different types of noise such as background noise, reducing or eliminating the influence of noise on speech recognition and strengthening the performance of the other modules; the denoised recording is then transmitted to the segmentation module.
The segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features. The segmentation module comprises:
a second classification module, which uses a clustering method from machine learning to divide the speech frames into two classes, speech and non-speech;
an aggregation module, which uses rule-based methods to merge the classified speech frames into speech and non-speech segments.
Specifically, each short speech frame is clustered according to the phonetic features that distinguish speech from non-speech, and the classified frames are then merged by rules: frames of the same class that are consecutive in time are grouped into the same segment, such that a single utterance by the same speaker is contained in one segment and different speakers fall into different segments. The resulting segments are transmitted to the speaker identification module and the feature extraction module (a minimal sketch of this step is given below).
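The sketch below is illustrative only: short frames are clustered into speech and non-speech with k-means over simple acoustic features, and consecutive speech frames are merged into utterance-like segments by rule. The function name, the chosen features and the thresholds are assumptions, not values taken from the patent.

```python
import librosa
import numpy as np
from sklearn.cluster import KMeans

def split_into_segments(wav_path, hop_s=0.02, min_gap_s=0.3):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(hop_s * sr)
    # Frame-level features that separate speech from silence/noise.
    rms = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=2 * hop, hop_length=hop)[0]
    feats = np.stack([rms, zcr], axis=1)

    # Two clusters; the cluster with the higher mean energy is taken as speech.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
    speech_label = int(rms[labels == 1].mean() > rms[labels == 0].mean())

    # Rule-based merge: consecutive speech frames form one segment;
    # gaps shorter than min_gap_s do not break a segment.
    segments, start, last = [], None, None
    for i, lab in enumerate(labels):
        t = i * hop_s
        if lab == speech_label:
            if start is None:
                start = t
            last = t
        elif start is not None and t - last > min_gap_s:
            segments.append((start, last + hop_s))
            start = None
    if start is not None:
        segments.append((start, last + hop_s))
    return segments  # list of (start_time, end_time) in seconds
```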
The speaker identification module receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification.
Specifically, all the segments cut from one recording are divided into two or more classes using machine learning algorithms, and two or more speakers are identified from the resulting classes, so that each utterance-length segment is assigned to the speaker it belongs to.
The feature extraction module receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing.
Specifically, feature extraction is the step preceding emotion recognition. For each short segment the module extracts Mel-frequency cepstral coefficients (MFCCs), a feature commonly used in speech recognition (audio-to-text), and then processes them further, for example by extracting additional statistics or adding other features. Because the MFCCs still contain a large amount of information, statistics such as the mean and standard deviation of the extracted MFCCs are computed as complementary features to improve the emotion prediction model. Each segment is represented by the features obtained in this way, and the extracted features are transmitted to the emotion recognition module (see the aggregation sketch below).
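The sketch below illustrates this segment-level aggregation: MFCCs are extracted for a segment and summarised with per-coefficient mean and standard deviation, giving a fixed-length feature vector regardless of segment length. The helper name and parameter values are illustrative assumptions.

```python
import librosa
import numpy as np

def segment_features(y, sr, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    # Statistics over time give a fixed-length vector for any segment length.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (2 * n_mfcc,)
```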
The emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
Specifically, the features extracted by the feature extraction module are first used with suitable machine learning and deep learning algorithms, such as convolutional neural networks (CNNs) and support vector machines (SVMs). The emotion prediction models are then trained with speech segments from a Chinese-language environment and with a training procedure oriented towards production use. Next, the trained models predict the emotion represented by previously unseen segments from their features; because the individual models may not give exactly the same prediction for the same segment, an ensemble algorithm finally combines their predictions to obtain the final emotion of each segment.
In a particular embodiment of the present invention, the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module, which takes corrupted data as input and trains on it;
an output module, which uses the corresponding clean data as the target output of the deep learning algorithm;
a first processing module, which applies the trained model to other corrupted data.
Specifically, the noise-reduction method is a denoising autoencoder (DAE), a deep learning algorithm whose principle is as follows: corrupted data (the corruption can take various forms) is fed in as input, the model is trained to reproduce the corresponding clean data as its output, and the trained model is then applied to other corrupted data. Since the noise contained in a recording makes it corrupted data, a DAE can be used to denoise the recording, as sketched below.
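The following sketch illustrates the denoising-autoencoder idea; the layer sizes, the feature dimension and the training loop are assumptions for illustration, not values from the patent. Noisy spectral frames are the input and the corresponding clean frames are the training target.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim=257):          # e.g. magnitude-spectrum bins per frame
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, 64), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                                     nn.Linear(128, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(noisy, clean, epochs=20, lr=1e-3):
    """noisy, clean: float tensors of shape (n_frames, dim)."""
    model = DenoisingAutoencoder(dim=noisy.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(noisy), clean)   # reconstruct the clean frame
        loss.backward()
        opt.step()
    return model
```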
In a particular embodiment of the present invention, the speaker identification module comprises:
a classification module, which uses different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
a first integration module, which combines the classification results of the individual models to label the speech segments of different speakers in real time or in batches.
Specifically, because different speakers differ in their acoustic features and indicators, modelling methods such as clustering, Gaussian mixture models and CNNs are used to divide the segments and speech frames of one recording into two or more classes, and the classification results of the individual models are combined to label the speech segments of the different speakers in real time or in batches. In an insurance call, for example, this method can be used to distinguish the agent from the customer. In addition, for certain application scenarios the system can build speaker models in advance from speech samples of known speakers and use these pre-trained models to further improve the accuracy of speaker identification; for instance, a model pre-trained on the agents' voices obtained in advance improves recognition of the unknown customer (a minimal clustering sketch is given below).
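The sketch below separates the speakers of one call by clustering segment-level MFCC statistics with a Gaussian mixture model; it reuses the segment_features helper sketched earlier, and the two-speaker assumption (agent vs. customer) and model choice are illustrative, not prescribed by the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def label_speakers(segment_waveforms, sr, n_speakers=2):
    # One fixed-length feature vector per segment (see segment_features above).
    feats = np.array([segment_features(y, sr) for y in segment_waveforms])
    gmm = GaussianMixture(n_components=n_speakers, covariance_type="diag",
                          random_state=0).fit(feats)
    return gmm.predict(feats)   # one speaker label per segment
```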
In a particular embodiment of the present invention, the feature extraction module comprises:
an extraction module, which extracts the different feature indicators required by each model according to how that model is built, the feature indicators being Mel-frequency cepstral coefficients;
a second processing module, which processes the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
a conversion module, which extracts image features from the spectrogram generated by converting the raw speech time-domain signal.
Specifically, the models of the system extract different feature indicators according to how they are built: the classification models use the common MFCC extraction method and further process the different MFCC dimensions to produce more effective discriminative features, while models such as CNNs extract graphical features from the spectrogram generated by converting the raw speech signal.
For example, the feature extraction module first applies pre-emphasis to the raw recording signal, divides the complete signal into multiple frames (each frame representing a short stretch of the original recording), multiplies each frame by a Hamming window, applies a fast Fourier transform to obtain the energy spectrum of each frame, passes the energy spectrum through a bank of triangular filters, and applies a discrete cosine transform (DCT) to the log energies output by the filter bank, which yields the MFCC coefficients and related segment features (sketched step by step below).
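The sketch below walks through these steps; the frame length, hop size, filter count and coefficient count are illustrative choices rather than values specified in the patent.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_pipeline(y, sr, frame_len=400, hop=160, n_mels=26, n_mfcc=13):
    # 1. Pre-emphasis boosts high frequencies.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    # 2. Split into overlapping frames and 3. apply a Hamming window.
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # 4. FFT -> power (energy) spectrum of each frame.
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # 5. Triangular mel filter bank, 6. log energies, 7. DCT -> MFCCs.
    fbank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```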
In a particular embodiment of the present invention, the emotion recognition module comprises:
an input module, which feeds the extracted features of training samples into different machine learning models for training;
a second integration module, which combines the different predictions obtained from the trained models;
a prediction module, which tunes the individual models to obtain the final overall model for a given application scenario and uses it to predict the emotion of unlabelled segments.
Specifically, in the emotion recognition module the emotion labels of historical samples serve as training data: the features extracted in the preceding steps are fed into different machine learning models for training. Because different models learn and recognise somewhat different aspects of the features, their recognition and prediction performance also differs, so the system combines the predictions obtained from the trained models. By adjusting the ensemble weights of the individual models and using objective evaluation measures such as precision and recall, the overall model with the best prediction performance for the given application scenario is obtained, and this final model is then used to predict the emotion of unlabelled segments.
For example, the speech segments in different calls represent the emotions of different customers and agents. After these segments are extracted and processed as described for the feature extraction module, the segment samples are fed into machine learning models (such as a support vector machine) for training and ensembling, and the resulting model is used to predict segments whose emotion is unknown (see the training-and-ensemble sketch below).
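The sketch below assumes a feature matrix X with one row per labelled segment (e.g. from segment_features) and a vector y of emotion labels; the particular estimators and the soft-voting scheme are illustrative choices, not the patent's prescribed models.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_ensemble(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    ensemble = VotingClassifier(
        estimators=[
            ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ],
        voting="soft",          # average the predicted class probabilities
    )
    ensemble.fit(X_tr, y_tr)
    # Per-emotion precision and recall, used to tune the ensemble.
    print(classification_report(y_te, ensemble.predict(X_te)))
    return ensemble
```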
As shown in Figs. 2 and 3, another aspect of the present invention provides a speech emotion recognition method based on machine learning, comprising the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it using appropriate algorithms;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features. Cutting the recording data into segments according to phonetic features further comprises: S21, using a clustering method from machine learning to divide the speech frames into two classes, speech and non-speech; S22, using rule-based methods to merge the classified speech frames into speech and non-speech segments.
Specifically, each short speech frame is clustered according to the phonetic features that distinguish speech from non-speech, and the classified frames are then merged by rules: frames of the same class that are consecutive in time are grouped into the same segment, such that a single utterance by the same speaker is contained in one segment and different speakers fall into different segments. The resulting segments are transmitted to the speaker identification module and the feature extraction module.
S3: receive the segments from the segmentation module, classify them with machine learning algorithms, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and Mel-frequency cepstral coefficients from each segment, and derive segment-level features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
In a particular embodiment of the present invention, the noise-reduction preprocessing in step S1 comprises:
S11: input corrupted data and train on it;
S12: use the corresponding clean data as the target output of the deep learning algorithm;
S13: apply the trained model to other corrupted data.
In a particular embodiment of the present invention, classifying the segments with machine learning algorithms and identifying the speakers from the classification in step S3 comprises:
S31: use different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
S32: combine the classification results of the individual models.
In a particular embodiment of the present invention, deriving segment-level features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the different feature indicators required by how that model is built;
S42: process the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated by converting the raw speech time-domain signal.
In a particular embodiment of the present invention, training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features of training samples into different machine learning models for training;
S52: combine the different predictions obtained from the trained models;
S53: tune the individual models to obtain the final overall model for a given application scenario and use it to predict the emotion of unlabelled segments.
To facilitate understanding of the above technical solution of the present invention, it is described in detail below in terms of how it is used in practice.
In practice, the speech emotion recognition system based on machine learning according to the present invention takes real telephone recording data, applies noise-reduction preprocessing to it with appropriate algorithms, and then cuts the recording into multiple utterance-length segments according to the clear differences between voiced and non-voiced portions. Because a real recording contains two or more speakers, the speakers are distinguished by their different voice characteristics. Building on the segmentation, each segment is represented by its Mel-frequency cepstral coefficients, these coefficients are sent to the emotion recognition module, and the emotion represented by each segment is identified from them.
As the model is deployed and used over time, the system continuously accumulates samples whose emotion predictions turned out to be correct or incorrect in actual use, and continuously optimises the parameters and structure of the model through methods such as incremental learning, improving steadily towards better application performance (a minimal incremental-update sketch is given below).
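The sketch below illustrates the incremental-learning idea with an assumed linear classifier and label set: newly reviewed production segments update the model via partial_fit instead of full retraining.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array(["angry", "calm"])           # assumed emotion label set
model = SGDClassifier(loss="log_loss", random_state=0)

def update_with_feedback(model, X_new, y_new):
    """X_new: features of newly reviewed segments; y_new: their corrected labels."""
    model.partial_fit(X_new, y_new, classes=classes)
    return model
```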
In conclusion by means of above-mentioned technical proposal of the invention, according to the language environment and actual production environment of Chinese It designs and builds premised on, it can be more effectively in the actual production environment of Chinese language environment and customer service call In obtain good performance;Required recording efficiently can be quickly retrieved according to mood, so that improving includes but is not limited to electricity Talk about the efficiency of quality inspection work.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (10)

1. A speech emotion recognition system based on machine learning, characterised in that it comprises a recording noise-reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module, wherein:
the recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it using appropriate algorithms;
the segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features;
the speaker identification module receives the segments from the segmentation module, classifies them with machine learning algorithms, and identifies the speakers from the classification;
the feature extraction module receives the segments from the segmentation module, extracts spectral features and Mel-frequency cepstral coefficients from each segment, and derives segment-level features after further processing;
the emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
2. The speech emotion recognition system based on machine learning according to claim 1, characterised in that the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module, which takes corrupted data as input and trains on it;
an output module, which uses the corresponding clean data as the target output of the deep learning algorithm;
a first processing module, which applies the trained model to other corrupted data.
3. The speech emotion recognition system based on machine learning according to claim 1, characterised in that the speaker identification module comprises:
a classification module, which uses different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
a first integration module, which combines the classification results of the individual models to label the speech segments of different speakers in real time or in batches.
4. The speech emotion recognition system based on machine learning according to claim 3, characterised in that the feature extraction module comprises:
an extraction module, which extracts the different feature indicators required by each model according to how that model is built;
a second processing module, which processes the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
a conversion module, which extracts image features from the spectrogram generated by converting the raw speech time-domain signal.
5. The speech emotion recognition system based on machine learning according to claim 3 or 4, characterised in that the emotion recognition module comprises:
an input module, which feeds the extracted features of training samples into different machine learning models for training;
a second integration module, which combines the different predictions obtained from the trained models;
a prediction module, which tunes the individual models to obtain the final overall model for a given application scenario and uses it to predict the emotion of unlabelled segments.
6. A speech emotion recognition method based on machine learning, characterised in that it comprises the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it using appropriate algorithms;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features;
S3: receive the segments from the segmentation module, classify them with machine learning algorithms, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and Mel-frequency cepstral coefficients from each segment, and derive segment-level features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
7. The speech emotion recognition method based on machine learning according to claim 6, characterised in that the noise-reduction preprocessing in step S1 comprises:
S11: input corrupted data and train on it;
S12: use the corresponding clean data as the target output of the deep learning algorithm;
S13: apply the trained model to other corrupted data.
8. The speech emotion recognition method based on machine learning according to claim 6, characterised in that classifying the segments with machine learning algorithms and identifying the speakers from the classification in step S3 comprises:
S31: use different modelling methods to divide the segments and speech frames of the recording data into two or more classes;
S32: combine the classification results of the individual models.
9. The speech emotion recognition method based on machine learning according to claim 8, characterised in that deriving segment-level features from the spectral features and Mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the different feature indicators required by how that model is built;
S42: process the different dimensions of the Mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated by converting the raw speech time-domain signal.
10. The speech emotion recognition method based on machine learning according to claim 8 or 9, characterised in that training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features of training samples into different machine learning models for training;
S52: combine the different predictions obtained from the trained models;
S53: tune the individual models to obtain the final overall model for a given application scenario and use it to predict the emotion of unlabelled segments.
CN201811186572.8A 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning Active CN109256150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811186572.8A CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Publications (2)

Publication Number Publication Date
CN109256150A true CN109256150A (en) 2019-01-22
CN109256150B CN109256150B (en) 2021-11-30

Family

ID=65045954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811186572.8A Active CN109256150B (en) 2018-10-12 2018-10-12 Speech emotion recognition system and method based on machine learning

Country Status (1)

Country Link
CN (1) CN109256150B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810994A (en) * 2013-09-05 2014-05-21 江苏大学 Method and system for voice emotion inference on basis of emotion context
CN103985381A (en) * 2014-05-16 2014-08-13 清华大学 Voice frequency indexing method based on parameter fusion optimized decision
US20180018969A1 (en) * 2016-07-15 2018-01-18 Circle River, Inc. Call Forwarding to Unavailable Party Based on Artificial Intelligence
CN106340309A (en) * 2016-08-23 2017-01-18 南京大空翼信息技术有限公司 Dog bark emotion recognition method and device based on deep learning
CN107068161A (en) * 2017-04-14 2017-08-18 百度在线网络技术(北京)有限公司 Voice de-noising method, device and computer equipment based on artificial intelligence
CN108074576A (en) * 2017-12-14 2018-05-25 讯飞智元信息科技有限公司 Inquest the speaker role's separation method and system under scene
CN108259686A (en) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 A kind of customer service system based on speech analysis

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697982A (en) * 2019-02-01 2019-04-30 北京清帆科技有限公司 A kind of speaker speech recognition system in instruction scene
WO2020253509A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
CN112151042A (en) * 2019-06-27 2020-12-29 中国电信股份有限公司 Voiceprint recognition method, device and system and computer readable storage medium
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110517694A (en) * 2019-09-06 2019-11-29 北京清帆科技有限公司 A kind of teaching scene voice conversion detection system
CN110933235B (en) * 2019-11-06 2021-07-27 杭州哲信信息技术有限公司 Noise identification method in intelligent calling system based on machine learning
CN110933235A (en) * 2019-11-06 2020-03-27 杭州哲信信息技术有限公司 Noise removing method in intelligent calling system based on machine learning
CN110956953A (en) * 2019-11-29 2020-04-03 中山大学 Quarrel identification method based on audio analysis and deep learning
CN110956953B (en) * 2019-11-29 2023-03-10 中山大学 Quarrel recognition method based on audio analysis and deep learning
US20210201892A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Training mechanism of verbal harassment detection systems
US20210201893A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Pattern-based adaptation model for detecting contact information requests in a vehicle
US20210201934A1 (en) * 2019-12-31 2021-07-01 Beijing Didi Infinity Technology And Development Co., Ltd. Real-time verbal harassment detection system
US11664043B2 (en) * 2019-12-31 2023-05-30 Beijing Didi Infinity Technology And Development Co., Ltd. Real-time verbal harassment detection system
US11670286B2 (en) * 2019-12-31 2023-06-06 Beijing Didi Infinity Technology And Development Co., Ltd. Training mechanism of verbal harassment detection systems
CN111583967A (en) * 2020-05-14 2020-08-25 西安医学院 Mental health emotion recognition device based on utterance model and operation method thereof
WO2021232594A1 (en) * 2020-05-22 2021-11-25 深圳壹账通智能科技有限公司 Speech emotion recognition method and apparatus, electronic device, and storage medium
CN111833653A (en) * 2020-07-13 2020-10-27 江苏理工学院 Driving assistance system, method, device, and storage medium using ambient noise
CN113361647A (en) * 2021-07-06 2021-09-07 青岛洞听智能科技有限公司 Method for identifying type of missed call

Also Published As

Publication number Publication date
CN109256150B (en) 2021-11-30

Similar Documents

Publication Publication Date Title
CN109256150A (en) Speech emotion recognition system and method based on machine learning
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
CN106503805B (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis method
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
WO2016150257A1 (en) Speech summarization program
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
Sarthak et al. Spoken language identification using convnets
CN111489765A (en) Telephone traffic service quality inspection method based on intelligent voice technology
CN111312292A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
CN116665676B (en) Semantic recognition method for intelligent voice outbound system
CN108735200A (en) A kind of speaker's automatic marking method
CN109714608A (en) Video data handling procedure, device, computer equipment and storage medium
Huang et al. Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering
Reddy et al. Audio compression with multi-algorithm fusion and its impact in speech emotion recognition
Venkatesan et al. Automatic language identification using machine learning techniques
CN113611286B (en) Cross-language speech emotion recognition method and system based on common feature extraction
Krishna et al. Language independent gender identification from raw waveform using multi-scale convolutional neural networks
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN111091840A (en) Method for establishing gender identification model and gender identification method
Johar Paralinguistic profiling using speech recognition
CN111009262A (en) Voice gender identification method and system
CN111427996A (en) Method and device for extracting date and time from human-computer interaction text
US11398239B1 (en) ASR-enhanced speech compression
CN115063155A (en) Data labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant