CN109256150A - Speech emotion recognition system and method based on machine learning - Google Patents
- Publication number
- CN109256150A (application CN201811186572.8A)
- Authority
- CN
- China
- Prior art keywords: module, segment, machine learning, model, different
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The invention discloses a speech emotion recognition system and method based on machine learning, comprising: a recording noise-reduction module; a sentence-segmentation module that receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features; a speaker identification module that receives the segments from the segmentation module, classifies them with a machine learning algorithm, and identifies the speakers from the classification; a feature extraction module that receives the segments from the segmentation module, extracts spectral features and mel-frequency cepstral coefficients from each segment, and derives segment features after further processing; and an emotion recognition module that receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm. The invention has the advantage of achieving good performance in real production environments involving Chinese speech and customer-service calls.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech emotion recognition system and method based on machine learning.
Background technique
The field of speech recognition can be broadly divided into two tasks: one converts the content expressed in speech audio into text; the other identifies the emotion contained in the audio (for example, anger or calm). Speech emotion recognition has been addressed in foreign literature, but with significant limitations: existing work mostly targets languages other than Chinese and cannot be applied directly to identify emotion in speech under a Chinese-language environment, and it mostly recognizes emotion with a single algorithm, so reported results are close to laboratory conditions but unsatisfactory in actual production, failing to meet the requirements of Chinese production environments.
No effective solution to these problems in the related art has yet been proposed.
Summary of the invention
To address the above technical problems in the related art, the present invention proposes a speech emotion recognition system and method based on machine learning that solves emotion recognition for Chinese audio recordings, including but not limited to inbound and outbound customer-service calls.
To achieve this technical purpose, the technical scheme of the present invention is realized as follows:
A speech emotion recognition system based on machine learning comprises a recording noise-reduction module, a sentence-segmentation module, a speaker identification module, a feature extraction module, and an emotion recognition module, wherein:
the recording noise-reduction module obtains recording data and applies noise-reduction preprocessing to it;
the sentence-segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features;
the speaker identification module receives the segments from the segmentation module, classifies them with a machine learning algorithm, and identifies the speakers from the classification;
the feature extraction module receives the segments from the segmentation module, extracts spectral features and mel-frequency cepstral coefficients (MFCCs) from each segment, and derives segment features after further processing;
the emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm.
Further, the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module that takes damaged (noisy) data as input and trains on it;
an output module that uses the undamaged data as the target output of the deep learning algorithm;
a first processing module that applies the trained model to other damaged data.
Further, the speaker identification module comprises:
a classification module that uses different modeling methods to divide the segments and speech frames of the recording into two or more classes;
a first integration module that combines the classification results of the individual models to label the speech segments of different speakers, in real time or in batches.
Further, the feature extraction module comprises:
an extraction module that extracts, for each model, the feature indices its modeling requires;
a second processing module that processes the different dimensions of the mel-frequency cepstral coefficients into discriminative features;
a conversion module that extracts image features from the spectrogram generated from the raw speech time-domain signal.
Further, the emotion recognition module comprises:
an input module that feeds the extracted features, as training samples, into different machine learning models for training;
a second integration module that combines the different predictions obtained after training each model;
a prediction module that tunes the individual models to obtain the overall model that predicts best for a given application scenario, and uses that final overall model to predict the emotion of unseen segments.
Another aspect of the present invention provides a speech emotion recognition method based on machine learning, comprising the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features;
S3: receive the segments from the segmentation module, classify them with a machine learning algorithm, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and mel-frequency cepstral coefficients from each segment, and derive segment features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
Further, the noise-reduction preprocessing in step S1 comprises:
S11: input the damaged data and train on it;
S12: use the undamaged data as the target output of the deep learning algorithm;
S13: apply the trained model to other damaged data.
Further, classifying the segments with a machine learning algorithm and identifying the speakers from the classification in step S3 comprises:
S31: use different modeling methods to divide the segments and speech frames of the recording into two or more classes;
S32: combine the classification results of the individual models.
Further, deriving segment features from the spectral features and mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the feature indices its modeling requires;
S42: process the different dimensions of the mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated from the raw speech time-domain signal.
Further, training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features, as training samples, into different machine learning models for training;
S52: combine the different predictions obtained after training each model;
S53: tune the individual models to obtain the overall model that predicts best for a given application scenario, and use that final overall model to predict the emotion of unseen segments.
Beneficial effects of the present invention: the system is designed and built around the Chinese language environment and real production use, so it performs well in production settings involving Chinese speech and customer-service calls; recordings of interest can be retrieved efficiently by emotion, improving the efficiency of work such as, but not limited to, telephone quality inspection.
Detailed description of the invention
To explain the embodiments of the invention or the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a structural diagram of the speech emotion recognition system based on machine learning according to an embodiment of the present invention;
Fig. 2 is the first part of the flow chart of the speech emotion recognition method based on machine learning according to an embodiment of the present invention;
Fig. 3 is the second part of the flow chart of the speech emotion recognition method based on machine learning according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the invention, not all of them; all other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, the speech emotion recognition system based on machine learning according to an embodiment of the present invention includes a recording noise-reduction module, a sentence-segmentation module, a speaker identification module, a feature extraction module, and an emotion recognition module. The noise-reduction, segmentation, and feature extraction modules perform data preprocessing: they provide the basis for prediction, improve its stability and accuracy, and supply the features used to predict. The speaker identification and emotion recognition modules perform prediction: using the segments and features produced by preprocessing, they predict the speaker and emotion of each segment.
The recording noise-reduction module obtains the recording data and applies noise-reduction preprocessing to it. Specifically, for the different types of noise, such as background noise, it processes the recording with the relevant algorithm to reduce or eliminate the influence of noise on speech recognition and so strengthen the performance of the other modules, and then transfers the denoised recording to the segmentation module.
The sentence-segmentation module receives the recording data transmitted by the noise-reduction module and cuts it into segments according to phonetic features. The segmentation module includes:
a second classification module that divides the speech frames into two classes, speech and non-speech, using a clustering method from machine learning;
an aggregation module that merges the classified speech frames into speech and non-speech segments using rule-based methods.
Specifically, based on the phonetic differences between speech and non-speech, each short speech frame is clustered, and the classified frames are then merged by rules: consecutive similar frames that are adjacent in time are grouped into the same segment, so that one utterance of one speaker is contained in one segment and different speakers fall into different segments. The resulting segments are transmitted to both the speaker identification module and the feature extraction module.
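The classify-then-merge step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: short-time energy thresholding stands in for the clustering step, and `threshold` and `min_gap` are hypothetical parameters.

```python
def segment_speech(frames, threshold=0.02, min_gap=3):
    # Label each frame speech/non-speech by short-time energy (a simple
    # stand-in for the clustering step), then merge by rule: speech runs
    # separated by fewer than `min_gap` non-speech frames stay together,
    # so one utterance ends up in one segment.
    is_speech = [sum(s * s for s in f) / len(f) > threshold for f in frames]
    segments, start, gap = [], None, 0
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i          # a new segment begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # the pause is long enough: close segment
                segments.append((start, i - gap))
                start, gap = None, 0
    if start is not None:          # recording ended mid-segment
        segments.append((start, len(is_speech) - 1 - gap))
    return segments
```

Applied to a frame sequence with a four-frame silence in the middle, this yields two segments, each given as (first frame, last frame) indices.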
The speaker identification module receives the segments from the segmentation module, classifies them with a machine learning algorithm, and identifies the speakers from the classification. Specifically, all the segments cut out of one recording are divided into two or more classes using machine learning algorithms, two or more speakers are identified from the classes thus separated, and each utterance-length segment is attributed to its speaker.
The feature extraction module receives the segments from the segmentation module, extracts spectral features and mel-frequency cepstral coefficients from each segment, and derives segment features after further processing. Specifically, feature extraction is the step before emotion recognition: for each short segment it extracts an index commonly used in the speech recognition (audio-to-text) field, the mel-frequency cepstral coefficients (MFCCs), and performs further processing on them, such as extracting additional features or adding other features. Because the MFCCs still contain a large amount of characteristic information, statistics such as the mean and standard deviation can be extracted from them as complementary features to improve the emotion prediction model. A segment is represented by the features extracted in this way, and those features are transmitted to the emotion recognition module.
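The statistics step described above — collapsing per-frame MFCCs into one fixed-length segment vector of means and standard deviations — can be sketched as below. This is an illustrative sketch; the patent does not fix which statistics are used beyond mentioning mean and standard deviation.

```python
import numpy as np

def segment_features(mfcc_frames):
    # mfcc_frames: (num_frames, num_coeffs) matrix of per-frame MFCCs.
    # Returns one vector per segment: per-coefficient mean followed by
    # per-coefficient standard deviation (other statistics could be
    # appended the same way).
    m = np.asarray(mfcc_frames, dtype=float)
    return np.concatenate([m.mean(axis=0), m.std(axis=0)])
```

A segment with any number of frames is thus reduced to a vector of length `2 * num_coeffs`, which downstream classifiers can consume directly.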
The emotion recognition module receives the segment features produced by the feature extraction module, trains emotion prediction models with machine learning algorithms, and combines the predictions of the individual models with an ensemble algorithm. Specifically, first, emotion prediction models are trained on the extracted features using machine learning and deep learning algorithms such as convolutional neural networks (CNNs) and support vector machines (SVMs), with speech segments from a Chinese-language environment and a training regime oriented toward production use. Next, the trained models predict the emotion represented by unseen segments from their features. Because the individual models may not give exactly the same prediction for the same segment, an ensemble algorithm finally combines their predictions into the final emotion of each segment.
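One simple instance of the final integration step is a majority vote over the per-model predictions. In this sketch each `model` is assumed to be any trained callable (a CNN or SVM wrapper, say) mapping a feature vector to an emotion label; the patent does not specify the ensemble rule, so plain voting is an assumption.

```python
from collections import Counter

def ensemble_emotion(models, segment_features):
    # Collect one label per model, then return the most common one.
    votes = [model(segment_features) for model in models]
    return Counter(votes).most_common(1)[0][0]
```

With three models voting "angry", "calm", "angry" on a segment, the ensemble output is "angry".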
In one particular embodiment of the present invention, the noise-reduction preprocessing in the recording noise-reduction module comprises:
a training module that takes damaged data as input and trains on it;
an output module that uses the undamaged data as the target output of the deep learning algorithm;
a first processing module that applies the trained model to other damaged data.
Specifically, the noise-reduction method is the denoising autoencoder (DAE), a deep learning algorithm whose principle is: damaged data (of any form) is fed in as input, the model is trained on it with the undamaged data as the target output of the deep learning algorithm, and the trained model is then used to process other damaged data. Because the noise contained in a recording makes the recording damaged data, a DAE can be used to denoise it.
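The noisy-in, clean-out training principle can be illustrated with a deliberately tiny model. This is not the patent's network: a two-layer linear autoencoder trained by plain gradient descent stands in for the deep nonlinear DAE, and all sizes and rates are illustrative.

```python
import numpy as np

def train_dae(noisy, clean, hidden=2, lr=0.01, epochs=300, seed=0):
    # Input: damaged frames; training target: the corresponding
    # undamaged frames -- exactly the DAE setup described in the text.
    rng = np.random.default_rng(seed)
    d = noisy.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden))   # encoder
    W2 = rng.normal(0.0, 0.1, (hidden, d))   # decoder
    losses = []
    for _ in range(epochs):
        h = noisy @ W1
        err = h @ W2 - clean                 # error against the CLEAN data
        losses.append(float((err ** 2).mean()))
        g1 = noisy.T @ (err @ W2.T) / len(noisy)
        g2 = h.T @ err / len(noisy)
        W1 -= lr * g1
        W2 -= lr * g2
    return W1, W2, losses

def denoise(x, W1, W2):
    # Apply the trained model to other damaged data.
    return x @ W1 @ W2
```

On synthetic data whose clean part lies on a low-dimensional subspace, the reconstruction loss falls as training proceeds.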
In one particular embodiment of the present invention, the speaker identification module comprises:
a classification module that uses different modeling methods to divide the segments and speech frames of the recording into two or more classes;
a first integration module that combines the classification results of the individual models to label the speech segments of different speakers, in real time or in batches.
Specifically, because different speakers differ in their voice features and indices, the segments and speech frames of one recording are divided into two or more classes by different modeling methods such as clustering, Gaussian mixture models, and CNNs, and the classification results of the models are combined to label the speech segments of the different speakers in real time or in batches. In an insurance call, for example, the above method can separate the agent from the customer. In addition, for certain application scenarios the system can sample and model the speech of known speakers in advance, using the pretrained speaker models to further improve identification accuracy; for instance, a sample model pretrained on the voice of the customer-service agent, obtained in advance, improves recognition of the unknown customer.
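Of the modeling methods listed, clustering is the simplest to sketch. Below, plain k-means with k = 2 splits per-segment feature vectors into agent and customer; the naive initialisation and the two-speaker assumption are illustrative simplifications (a GMM or k-means++ would be more robust in practice).

```python
import numpy as np

def split_two_speakers(features, iters=20):
    # features: one vector per segment. Returns a 0/1 cluster label per
    # segment; which cluster is the agent must be decided separately
    # (e.g. from a pretrained agent voice model, as the text suggests).
    x = np.asarray(features, dtype=float)
    centers = x[:2].copy()   # naive init from the first two segments
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = x[labels == k].mean(axis=0)
    return labels.tolist()
```

On well-separated voice features the two speakers' segments end up in different clusters.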
In one particular embodiment of the present invention, the feature extraction module comprises:
an extraction module that extracts, for each model, the feature indices its modeling requires, where the feature index is the mel-frequency cepstral coefficients;
a second processing module that processes the different dimensions of the mel-frequency cepstral coefficients into discriminative features;
a conversion module that extracts image features from the spectrogram generated from the raw speech time-domain signal.
Specifically, each of the system's models selectively extracts the feature indices its modeling requires: the classification models use the common MFCC extraction method and further process the different MFCC dimensions into more effective discriminative features, while models such as CNNs extract graphic features from the spectrogram generated from the raw speech signal.
For example, the feature extraction module first applies pre-emphasis to the raw recording signal and divides the complete signal into multiple frames (each frame representing a short piece of the recording). Each frame is multiplied by a Hamming window and passed through a fast Fourier transform to obtain its energy spectrum; the energy spectrum is then passed through a bank of triangular filters, and a discrete cosine transform (DCT) of the log energies output by the filter bank yields segment features such as the MFCC coefficients.
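The pipeline just described maps directly onto code. The sketch below follows the stated steps — pre-emphasis, framing, Hamming window, FFT energy spectrum, triangular mel filter bank, log, DCT — with illustrative parameter values (sample rate, frame size, filter and coefficient counts are not taken from the patent).

```python
import numpy as np

def mfcc(signal, sr=8000, frame_len=256, hop=128, n_filters=20, n_coeffs=12):
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis boosts the high-frequency part of the spectrum.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(sig) - frame_len) // hop
    frames = np.stack([sig[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 3. FFT -> per-frame energy spectrum.
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2 / frame_len
    # 4. Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for j in range(n_filters):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fbank[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    # 5. Log filter-bank energies, then DCT to obtain the cepstral coefficients.
    log_energy = np.log(power @ fbank.T + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n_filters))
    return log_energy @ dct.T
```

For one second of audio at 8 kHz with these defaults, the result is a (61, 12) matrix: twelve cepstral coefficients per frame.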
In one particular embodiment of the present invention, the emotion recognition module comprises:
an input module that feeds the extracted features, as training samples, into different machine learning models for training;
a second integration module that combines the different predictions obtained after training each model;
a prediction module that tunes the individual models to obtain the overall model that predicts best for a given application scenario, and uses that final overall model to predict the emotion of unseen segments.
Specifically, in the emotion recognition module the emotion labels of historical samples serve as training targets, and the features extracted in the preceding steps are fed into different machine learning models for training. Because different models learn and recognize different features, their recognition and prediction performance also differs, so the system combines the predictions obtained after training each model. By adjusting the ensemble weights of the models against objective evaluation functions such as accuracy and recall, it obtains the overall model that predicts best for the application scenario, and then uses that final model to predict the emotion of unseen segments.
For example, the speech segments in different calls represent the emotions of different customers and agents. After these segments have been extracted and processed by the method of the feature extraction module, the segment samples are fed into machine learning models (for example, support vector machines) for training and ensembling, and the resulting model is used to predict speech segments whose emotion is unknown.
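The weight-adjustment idea described above can be sketched as a weighted vote, where each model's weight comes from an objective evaluation function on held-out data. Weighting by validation accuracy is one crude choice, assumed here for illustration; the patent only says weights are tuned against metrics such as accuracy or recall.

```python
from collections import defaultdict

def accuracy_weights(models, val_x, val_y):
    # Score each model by its accuracy on a validation set; recall or
    # F1 could be substituted as the evaluation function.
    return [sum(m(x) == y for x, y in zip(val_x, val_y)) / len(val_y)
            for m in models]

def weighted_vote(predictions, weights):
    # Sum each label's supporting weight and return the heaviest label.
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)
```

A model that is right two times out of three on validation data outvotes one that is right once out of three.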
As shown in Figs. 2 and 3, another aspect of the present invention provides a speech emotion recognition method based on machine learning, comprising the following steps:
S1: obtain recording data and apply noise-reduction preprocessing to it;
S2: receive the recording data transmitted by the noise-reduction module and cut it into segments according to phonetic features. Cutting the recording into segments further comprises: S21, dividing the speech frames into two classes, speech and non-speech, using a clustering method from machine learning; S22, merging the classified speech frames into speech and non-speech segments using rule-based methods.
Specifically, based on the phonetic differences between speech and non-speech, each short speech frame is clustered, and the classified frames are then merged by rules: consecutive similar frames that are adjacent in time are grouped into the same segment, so that one utterance of one speaker is contained in one segment and different speakers fall into different segments. The resulting segments are transmitted to both the speaker identification module and the feature extraction module.
S3: receive the segments from the segmentation module, classify them with a machine learning algorithm, and identify the speakers from the classification;
S4: receive the segments from the segmentation module, extract spectral features and mel-frequency cepstral coefficients from each segment, and derive segment features after further processing;
S5: receive the segment features produced by the feature extraction module, train emotion prediction models with machine learning algorithms, and combine the predictions of the individual models with an ensemble algorithm.
In one particular embodiment of the present invention, the noise-reduction preprocessing in step S1 comprises:
S11: input the damaged data and train on it;
S12: use the undamaged data as the target output of the deep learning algorithm;
S13: apply the trained model to other damaged data.
In one particular embodiment of the present invention, classifying the segments with a machine learning algorithm and identifying the speakers from the classification in step S3 comprises:
S31: use different modeling methods to divide the segments and speech frames of the recording into two or more classes;
S32: combine the classification results of the individual models.
In one particular embodiment of the present invention, deriving segment features from the spectral features and mel-frequency cepstral coefficients in step S4 comprises:
S41: extract, for each model, the feature indices its modeling requires;
S42: process the different dimensions of the mel-frequency cepstral coefficients into discriminative features;
S43: extract image features from the spectrogram generated from the raw speech time-domain signal.
In one particular embodiment of the present invention, training the emotion prediction models and combining their predictions with an ensemble algorithm in step S5 comprises:
S51: feed the extracted features, as training samples, into different machine learning models for training;
S52: combine the different predictions obtained after training each model;
S53: tune the individual models to obtain the overall model that predicts best for a given application scenario, and use that final overall model to predict the emotion of unseen segments.
To facilitate understanding, the above technical solutions of the invention are described in detail below in terms of their practical use.
In use, the speech emotion recognition system based on machine learning obtains real telephone recording data, applies noise-reduction preprocessing to it, and cuts the recording into utterance-length segments according to the marked differences between the vocal and non-vocal parts. Because real recordings contain two or more speakers, the speakers are distinguished by the differences in their individual voice features. On the basis of the segmentation, each segment is represented by its mel-frequency cepstral coefficients, which are sent to the emotion recognition module; the emotion represented by each segment is then identified from those coefficients.
As the model is deployed and applied over time, the system continuously accumulates correct and incorrect emotion predictions from actual use, and continuously optimizes the parameters and structure of the model by methods such as incremental learning, improving steadily toward better application results.
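The accumulate-and-update loop can be illustrated with a toy incremental learner. This is a stand-in, not the patent's method: a running mean feature vector per emotion label is updated as newly verified samples arrive, so no full retraining is needed.

```python
class IncrementalCentroids:
    """Keeps one running centroid per emotion label; `update` folds in a
    newly verified sample, `predict` returns the nearest label."""
    def __init__(self):
        self.sums = {}
        self.counts = {}

    def update(self, x, label):
        s = self.sums.setdefault(label, [0.0] * len(x))
        self.sums[label] = [a + b for a, b in zip(s, x)]
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        def sq_dist(label):
            c = [v / self.counts[label] for v in self.sums[label]]
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.sums, key=sq_dist)
```

Each verified production sample nudges its label's centroid, so the classifier keeps improving without retraining from scratch.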
In conclusion by means of above-mentioned technical proposal of the invention, according to the language environment and actual production environment of Chinese
It designs and builds premised on, it can be more effectively in the actual production environment of Chinese language environment and customer service call
In obtain good performance;Required recording efficiently can be quickly retrieved according to mood, so that improving includes but is not limited to electricity
Talk about the efficiency of quality inspection work.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall be included within its scope of protection.
Claims (10)
1. A speech emotion recognition system based on machine learning, characterized by comprising a recording noise reduction module, a segmentation module, a speaker identification module, a feature extraction module and an emotion recognition module, wherein:
the recording noise reduction module is configured to obtain recording data and perform noise reduction pre-processing on the recording data using a corresponding algorithm;
the segmentation module is configured to receive the recording data transmitted by the recording noise reduction module and cut the recording data into segments according to phonetic features;
the speaker identification module is configured to receive the segments transmitted by the segmentation module, classify the segments using a machine learning algorithm, and identify the speakers according to the classification;
the feature extraction module is configured to receive the segments transmitted by the segmentation module, extract spectral features and mel cepstrum coefficients from each segment, and derive segment features after processing them;
the emotion recognition module is configured to receive the segment features generated by the feature extraction module, train emotion prediction models using machine learning algorithms, and integrate the prediction results of each model using an ensemble algorithm.
2. The speech emotion recognition system based on machine learning according to claim 1, characterized in that the noise reduction pre-processing in the recording noise reduction module comprises:
a training module, configured to take damaged data as input and train on the damaged data;
an output module, configured to take the corresponding undamaged data as the output of the deep learning algorithm;
a first processing module, configured to process other damaged data according to the trained model.
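The training setup of claim 2, damaged data as input and the corresponding undamaged data as the target output, is the standard denoising pattern. The sketch below substitutes a single linear layer fitted by least squares for the deep network, and the data shapes, subspace structure, and noise level are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Clean frames lie on a low-dimensional subspace, a stand-in for the
# structure a real denoising network would exploit.
mix = rng.normal(size=(4, 16))
clean = rng.normal(size=(200, 4)) @ mix             # undamaged data (target)
noisy = clean + 0.3 * rng.normal(size=clean.shape)  # damaged data (input)

# Training module + output module: fit W so that noisy @ W ~= clean,
# i.e. damaged data in, undamaged data as the learning target.
W, *_ = np.linalg.lstsq(noisy, clean, rcond=None)

# First processing module: apply the trained model to *other* damaged data.
test_clean = rng.normal(size=(50, 4)) @ mix
test_noisy = test_clean + 0.3 * rng.normal(size=test_clean.shape)
denoised = test_noisy @ W

err_before = np.mean((test_noisy - test_clean) ** 2)
err_after = np.mean((denoised - test_clean) ** 2)
```

The learned map reduces the reconstruction error on held-out damaged data, which is the whole point of the claim's input/output pairing.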
3. The speech emotion recognition system based on machine learning according to claim 1, characterized in that the speaker identification module comprises:
a classification module, configured to divide the different segments and speech frames of the recording data into two or more classes using different modeling methods;
a first integration module, configured to integrate the classification results of each model so as to mark the speech segments of different speakers in real time and in batches.
4. The speech emotion recognition system based on machine learning according to claim 3, characterized in that the feature extraction module comprises:
an extraction module, configured to extract, for each model, the different feature indexes its modeling requires;
a second processing module, configured to process the different dimension indexes of the mel cepstrum coefficients to generate identification features;
a conversion module, configured to extract image features from the spectrogram generated by converting the raw speech time-domain signal.
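The mel cepstrum coefficients of claim 4 can be computed roughly as follows. Production systems would typically call a library such as librosa, so this numpy-only version is an illustrative sketch with assumed frame size, filterbank size, and coefficient count.

```python
import numpy as np

def mfcc(signal, sr, n_mfcc=13, n_fft=512, hop=256, n_mels=26):
    """Minimal mel-cepstral-coefficient sketch (illustrative only)."""
    # 1. frame the signal and take the power spectrum of each frame
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    # 2. triangular mel filterbank between 0 Hz and sr/2
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    logmel = np.log(spec @ fbank.T + 1e-10)
    # 3. DCT-II over the filterbank axis -> cepstral coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[:, None] + 0.5) * np.arange(n_mfcc)[None, :])
    return logmel @ dct  # shape: (n_frames, n_mfcc)
```

Each row is one frame's coefficient vector; claim 4's "different dimension indexes" are then statistics taken over these columns.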
5. The speech emotion recognition system based on machine learning according to claim 3 or 4, characterized in that the emotion recognition module comprises:
an input module, configured to input training samples, using each category of extracted features, into different machine learning algorithm models for training;
a second integration module, configured to integrate the different prediction results obtained after training each model;
a prediction module, configured to obtain, by adjusting the different models, a final overall model whose prediction performance holds across different application scenarios, and to predict the emotions of other unknown segments using the final overall model.
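The integration of per-model predictions in claim 5 can be sketched as a weighted average of each model's emotion probabilities, with the final label taken per segment. The function name, the probability-array interface, and the weighting scheme are all hypothetical illustrations.

```python
import numpy as np

def integrate_predictions(model_probs, weights=None):
    """Combine per-model emotion probabilities into one label per segment.

    model_probs: list of (n_segments, n_emotions) arrays, one per trained
    model; weights lets better-performing models count for more.
    """
    stacked = np.stack(model_probs)            # (n_models, n_seg, n_emo)
    w = np.ones(len(model_probs)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()                            # normalise the model weights
    avg = np.tensordot(w, stacked, axes=1)     # weighted mean over models
    return avg.argmax(axis=1)                  # final emotion per segment
```

Adjusting `weights` per application scenario corresponds to the claim's tuning of the different models into a final overall model.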
6. A speech emotion recognition method based on machine learning, characterized by comprising the following steps:
S1: obtaining recording data and performing noise reduction pre-processing on the recording data using a corresponding algorithm;
S2: receiving the recording data transmitted by the recording noise reduction module and cutting the recording data into segments according to phonetic features;
S3: receiving the segments transmitted by the segmentation module, classifying the segments using a machine learning algorithm, and identifying the speakers according to the classification;
S4: receiving the segments transmitted by the segmentation module, extracting spectral features and mel cepstrum coefficients from each segment, and deriving segment features after processing them;
S5: receiving the segment features generated by the feature extraction module, training emotion prediction models using machine learning algorithms, and integrating the prediction results of each model using an ensemble algorithm.
7. The speech emotion recognition method based on machine learning according to claim 6, characterized in that the noise reduction pre-processing in step S1 comprises:
S11: inputting damaged data and training on the damaged data;
S12: taking the corresponding undamaged data as the output of the deep learning algorithm;
S13: processing other damaged data according to the trained model.
8. The speech emotion recognition method based on machine learning according to claim 6, characterized in that, in step S3, classifying the segments using a machine learning algorithm and identifying the speakers according to the classification comprises:
S31: dividing the different segments and speech frames of the recording data into two or more classes using different modeling methods;
S32: integrating the classification results of each model.
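The division of segments into speaker classes in steps S31 and S32 can be illustrated with a tiny k-means over per-segment feature vectors; real systems would more likely cluster learned speaker embeddings. Everything below (the function name, the farthest-point initialisation, the iteration count) is an assumption made for a reproducible toy example.

```python
import numpy as np

def split_speakers(features, n_speakers=2, iters=20):
    """Divide per-segment feature vectors into speaker classes via k-means.

    Deterministic farthest-point initialisation keeps the toy example
    reproducible; each resulting label stands for one speaker class.
    """
    centers = [features[0].astype(float)]
    for _ in range(n_speakers - 1):
        # next center: the point farthest from all chosen centers
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(d)].astype(float))
    centers = np.array(centers)
    for _ in range(iters):
        # assign each segment to its nearest center, then move the centers
        labels = np.argmin(((features[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels  # one speaker-class label per segment
```

Integrating several such clusterings built from different features is the role step S32 assigns to the integration stage.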
9. The speech emotion recognition method based on machine learning according to claim 8, characterized in that, in step S4, deriving segment features after processing the spectral features and mel cepstrum coefficients comprises:
S41: extracting, for each model, the different feature indexes its modeling requires;
S42: processing the different dimension indexes of the mel cepstrum coefficients to generate identification features;
S43: extracting image features from the spectrogram generated by converting the raw speech time-domain signal.
10. The speech emotion recognition method based on machine learning according to claim 8 or 9, characterized in that, in step S5, training the emotion prediction models and integrating the prediction results of each model using an ensemble algorithm comprises:
S51: inputting training samples, using each category of extracted features, into different machine learning algorithm models for training;
S52: integrating the different prediction results obtained after training each model;
S53: obtaining, by adjusting the different models, a final overall model whose prediction performance holds across different application scenarios, and predicting the emotions of other unknown segments using the final overall model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811186572.8A CN109256150B (en) | 2018-10-12 | 2018-10-12 | Speech emotion recognition system and method based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109256150A true CN109256150A (en) | 2019-01-22 |
CN109256150B CN109256150B (en) | 2021-11-30 |
Family
ID=65045954
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811186572.8A Active CN109256150B (en) | 2018-10-12 | 2018-10-12 | Speech emotion recognition system and method based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109256150B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810994A (en) * | 2013-09-05 | 2014-05-21 | 江苏大学 | Method and system for voice emotion inference on basis of emotion context |
CN103985381A (en) * | 2014-05-16 | 2014-08-13 | 清华大学 | Voice frequency indexing method based on parameter fusion optimized decision |
CN106340309A (en) * | 2016-08-23 | 2017-01-18 | 南京大空翼信息技术有限公司 | Dog bark emotion recognition method and device based on deep learning |
CN107068161A (en) * | 2017-04-14 | 2017-08-18 | 百度在线网络技术(北京)有限公司 | Voice de-noising method, device and computer equipment based on artificial intelligence |
US20180018969A1 (en) * | 2016-07-15 | 2018-01-18 | Circle River, Inc. | Call Forwarding to Unavailable Party Based on Artificial Intelligence |
CN108074576A (en) * | 2017-12-14 | 2018-05-25 | 讯飞智元信息科技有限公司 | Inquest the speaker role's separation method and system under scene |
CN108259686A (en) * | 2017-12-28 | 2018-07-06 | 合肥凯捷技术有限公司 | A kind of customer service system based on speech analysis |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697982A (en) * | 2019-02-01 | 2019-04-30 | 北京清帆科技有限公司 | A kind of speaker speech recognition system in instruction scene |
WO2020253509A1 (en) * | 2019-06-19 | 2020-12-24 | 平安科技(深圳)有限公司 | Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium |
CN112151042A (en) * | 2019-06-27 | 2020-12-29 | 中国电信股份有限公司 | Voiceprint recognition method, device and system and computer readable storage medium |
CN110473566A (en) * | 2019-07-25 | 2019-11-19 | 深圳壹账通智能科技有限公司 | Audio separation method, device, electronic equipment and computer readable storage medium |
CN110517694A (en) * | 2019-09-06 | 2019-11-29 | 北京清帆科技有限公司 | A kind of teaching scene voice conversion detection system |
CN110933235B (en) * | 2019-11-06 | 2021-07-27 | 杭州哲信信息技术有限公司 | Noise identification method in intelligent calling system based on machine learning |
CN110933235A (en) * | 2019-11-06 | 2020-03-27 | 杭州哲信信息技术有限公司 | Noise removing method in intelligent calling system based on machine learning |
CN110956953A (en) * | 2019-11-29 | 2020-04-03 | 中山大学 | Quarrel identification method based on audio analysis and deep learning |
CN110956953B (en) * | 2019-11-29 | 2023-03-10 | 中山大学 | Quarrel recognition method based on audio analysis and deep learning |
US20210201892A1 (en) * | 2019-12-31 | 2021-07-01 | Beijing Didi Infinity Technology And Development Co., Ltd. | Training mechanism of verbal harassment detection systems |
US20210201893A1 (en) * | 2019-12-31 | 2021-07-01 | Beijing Didi Infinity Technology And Development Co., Ltd. | Pattern-based adaptation model for detecting contact information requests in a vehicle |
US20210201934A1 (en) * | 2019-12-31 | 2021-07-01 | Beijing Didi Infinity Technology And Development Co., Ltd. | Real-time verbal harassment detection system |
US11664043B2 (en) * | 2019-12-31 | 2023-05-30 | Beijing Didi Infinity Technology And Development Co., Ltd. | Real-time verbal harassment detection system |
US11670286B2 (en) * | 2019-12-31 | 2023-06-06 | Beijing Didi Infinity Technology And Development Co., Ltd. | Training mechanism of verbal harassment detection systems |
CN111583967A (en) * | 2020-05-14 | 2020-08-25 | 西安医学院 | Mental health emotion recognition device based on utterance model and operation method thereof |
WO2021232594A1 (en) * | 2020-05-22 | 2021-11-25 | 深圳壹账通智能科技有限公司 | Speech emotion recognition method and apparatus, electronic device, and storage medium |
CN111833653A (en) * | 2020-07-13 | 2020-10-27 | 江苏理工学院 | Driving assistance system, method, device, and storage medium using ambient noise |
CN113361647A (en) * | 2021-07-06 | 2021-09-07 | 青岛洞听智能科技有限公司 | Method for identifying type of missed call |
Also Published As
Publication number | Publication date |
---|---|
CN109256150B (en) | 2021-11-30 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |