CN109785863A - A speech emotion recognition method and system based on deep belief network - Google Patents
A speech emotion recognition method and system based on deep belief network
- Publication number
- CN109785863A, CN201910173690.3A, CN201910173690A
- Authority
- CN
- China
- Prior art keywords
- speech
- belief network
- voice signal
- emotion recognition
- deep belief
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000008909 emotion recognition Effects 0.000 title claims abstract description 39
- 238000000034 method Methods 0.000 title claims abstract description 33
- 238000012706 support-vector machine Methods 0.000 claims abstract description 25
- 238000012549 training Methods 0.000 claims description 29
- 230000008451 emotion Effects 0.000 claims description 13
- 238000000605 extraction Methods 0.000 claims description 9
- 238000007781 pre-processing Methods 0.000 claims description 8
- 230000002996 emotional effect Effects 0.000 abstract description 20
- 238000013528 artificial neural network Methods 0.000 description 7
- 230000006870 function Effects 0.000 description 7
- 239000013598 vector Substances 0.000 description 4
- 230000003993 interaction Effects 0.000 description 3
- 238000011161 development Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a speech emotion recognition method and system based on a deep belief network. The recognition method includes: acquiring a speech signal; preprocessing the speech signal to obtain a preprocessed speech signal; performing unsupervised speech-signal feature extraction on the preprocessed speech signal using a deep belief network to obtain speech signal features; and classifying the speech signal features by speech emotion using a support vector machine to obtain a speech emotion recognition result. By adopting a multi-classifier model based on the deep belief network and restricted Boltzmann machines, a multi-classifier speech emotion recognition system is established, and the recognition rate of speech emotion is improved.
Description
Technical field
The present invention relates to the field of speech recognition, and more particularly to a speech emotion recognition method and system based on a deep belief network.
Background art
With the development of cloud computing, the mobile Internet, and big data, machines serve humans with ever greater intelligence, and the dream of humans and machines conversing in natural language is gradually approaching reality, so people's demands on machine interaction capabilities keep rising. Simple recognition of speech content no longer satisfies these demands; processing, recognizing, and understanding the emotion carried in speech has become particularly important in practical applications. Speech emotion recognition has very broad application prospects: it can be applied not only to human-machine interaction systems but also to speech recognition, where it enhances robustness, and to speaker identification, where it improves the speaker recognition rate. Speech emotion recognition technology is widely used in intelligent human-machine interaction and in human-computer interactive teaching. Research on automatic speech emotion recognition will not only push computer technology further forward, it will also greatly increase people's efficiency at work and study and improve their quality of life.
Various external emotion signals are sampled in order to recognize the corresponding emotions. In deep-neural-network research the accuracy of emotion classification remains low; in pattern recognition, prior-art approaches that extract the emotion in speech with neural networks achieve rather low recognition rates for sad, excited, happy, and angry emotions, and adaptive neural networks likewise recognize speech emotional states relatively poorly.
When a traditional neural network is trained, all layers of the network are trained together as a whole, so when the data become large the training time increases and the network converges more slowly. Back-propagation, the most widely used method for training neural networks, trains the whole network iteratively: the network parameters are initialized randomly, and the difference between the output value computed at the top layer and the true value of the data is used to adjust the parameters of every layer by traditional gradient descent, the goal of each parameter update being to bring the network's prediction closer to the true value. With random initialization, however, the error-correction signal becomes weaker the further down it propagates during an update, and the gradient becomes sparser, so the network easily falls into a local optimum. The result is a low recognition rate for speech emotional states.
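To make this criticism concrete, the following minimal sketch (for illustration only; the toy network shapes, targets, and learning rate are assumptions, not part of the invention) shows the randomly initialized, whole-network gradient-descent training described above:

```python
# Minimal sketch of randomly initialized, whole-network gradient-descent
# training, as criticized above. All shapes, targets, and the learning
# rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 40))          # 32 frames, 40 acoustic features
y = rng.integers(0, 2, size=(32, 1))   # dummy binary targets

W1 = rng.normal(scale=0.01, size=(40, 16))   # random initialization
W2 = rng.normal(scale=0.01, size=(16, 1))
lr = 0.1

for _ in range(100):
    h = np.tanh(X @ W1)                      # forward pass
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid output
    d2 = (p - y) / len(X)                    # output-layer error
    d1 = (d2 @ W2.T) * (1.0 - h ** 2)        # error propagated one layer down
    W2 -= lr * (h.T @ d2)                    # gradient-descent updates
    W1 -= lr * (X.T @ d1)
```

In deeper stacks the propagated error is multiplied by a factor such as (1 - h^2) at every layer, which is exactly the weakening correction signal and sparse gradient that the background refers to.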
Summary of the invention
The object of the present invention is to provide a speech emotion recognition method and system based on a deep belief network that can improve the speech emotion recognition rate.
To achieve the above object, the present invention provides the following solutions:
A speech emotion recognition method based on a deep belief network, characterized in that the recognition method comprises:
acquiring a speech signal;
preprocessing the speech signal to obtain a preprocessed speech signal;
performing unsupervised speech-signal feature extraction on the preprocessed speech signal using a deep belief network to obtain speech signal features;
classifying the speech signal features by speech emotion using a support vector machine to obtain a speech emotion recognition result.
Optionally, performing unsupervised speech-signal feature extraction on the preprocessed speech signal using the deep belief network to obtain the speech signal features specifically comprises:
stacking N layers of restricted Boltzmann machines from the lowest layer to the highest to obtain the deep belief network;
performing unsupervised training on the i-th layer of restricted Boltzmann machine according to the preprocessed speech signal to obtain the i-th optimal parameters, the i-th optimal parameters being the optimal parameters of the i-th layer of restricted Boltzmann machine, where i takes the values 1, 2, ..., N in turn;
performing unsupervised training on the (i+1)-th layer of restricted Boltzmann machine according to the i-th optimal parameters and the preprocessed speech signal to obtain the (i+1)-th optimal parameters;
fine-tuning the multiple optimal parameters by a global training method so that the deep belief network converges to the global optimum, obtaining multiple fine-tuned optimal parameters;
extracting the speech signal features of the preprocessed speech signal according to the fine-tuned optimal parameters.
Optionally, classifying the speech signal features by speech emotion using the support vector machine to obtain the speech emotion recognition result specifically comprises:
mapping the sample points of the speech signal features to a high-dimensional feature space using a kernel function to obtain samples that are linearly separable in that space;
performing, by the support vector machine, a logical decision on the speech signal features according to the linearly separable samples to obtain the speech emotion recognition result.
A speech emotion recognition system based on a deep belief network, the recognition system comprising:
a speech signal acquisition module for acquiring a speech signal;
a speech signal preprocessing module for preprocessing the speech signal to obtain a preprocessed speech signal;
a feature extraction module for performing unsupervised speech-signal feature extraction on the preprocessed speech signal using a deep belief network to obtain speech signal features;
an emotion recognition module for classifying the speech signal features by speech emotion using a support vector machine to obtain a speech emotion recognition result.
Optionally, the feature extraction module specifically comprises:
a deep belief network building unit for stacking N layers of restricted Boltzmann machines from the lowest layer to the highest to obtain the deep belief network;
an unsupervised training unit for performing unsupervised training on the i-th layer of restricted Boltzmann machine according to the preprocessed speech signal to obtain the i-th optimal parameters, the i-th optimal parameters being the optimal parameters of the i-th layer of restricted Boltzmann machine, where i takes the values 1, 2, ..., N in turn, and for performing unsupervised training on the (i+1)-th layer of restricted Boltzmann machine according to the i-th optimal parameters and the preprocessed speech signal to obtain the (i+1)-th optimal parameters;
a parameter fine-tuning unit for fine-tuning the multiple optimal parameters by a global training method so that the deep belief network converges to the global optimum, obtaining multiple fine-tuned optimal parameters;
a feature extraction unit for extracting the speech signal features of the preprocessed speech signal according to the fine-tuned optimal parameters.
Optionally, the emotion recognition module specifically comprises:
a kernel function unit for mapping the sample points of the speech signal features to a high-dimensional feature space using a kernel function to obtain linearly separable samples;
a logical decision unit for performing, by the support vector machine, a logical decision on the speech signal features according to the linearly separable samples to obtain the speech emotion recognition result.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects: the invention discloses a speech emotion recognition method and system based on a deep belief network. The recognition method includes: acquiring a speech signal; preprocessing the speech signal to obtain a preprocessed speech signal; performing unsupervised speech-signal feature extraction on the preprocessed speech signal using a deep belief network to obtain speech signal features; and classifying the speech signal features by speech emotion using a support vector machine to obtain a speech emotion recognition result. The entire deep belief network is trained by training each restricted Boltzmann machine layer by layer, and a multi-classifier model based on the deep belief network and the restricted Boltzmann machines is used to establish a multi-classifier speech emotion recognition system, improving the recognition rate of speech emotion.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a flow chart of the speech emotion recognition method based on a deep belief network provided by the present invention;
Fig. 2 is a structural diagram of the speech emotion recognition system based on a deep belief network provided by the present invention;
Fig. 3 is a block diagram of the emotion recognition system based on a support vector machine provided by the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The object of the present invention is to provide a speech emotion recognition method and system based on a deep belief network that can improve the speech emotion recognition rate.
In order to make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, a speech emotion recognition method based on a deep belief network comprises:
Step 100: acquiring a speech signal;
Step 200: preprocessing the speech signal to obtain a preprocessed speech signal;
Step 300: performing unsupervised speech-signal feature extraction on the preprocessed speech signal using a deep belief network to obtain speech signal features;
Step 400: classifying the speech signal features by speech emotion using a support vector machine to obtain a speech emotion recognition result.
Step 300, performing unsupervised speech-signal feature extraction on the preprocessed speech signal using the deep belief network to obtain the speech signal features, specifically comprises:
stacking N layers of restricted Boltzmann machines from the lowest layer to the highest to obtain the deep belief network;
performing unsupervised training on the i-th layer of restricted Boltzmann machine according to the preprocessed speech signal to obtain the i-th optimal parameters, the i-th optimal parameters being the optimal parameters of the i-th layer of restricted Boltzmann machine, where i takes the values 1, 2, ..., N in turn;
performing unsupervised training on the (i+1)-th layer of restricted Boltzmann machine according to the i-th optimal parameters and the preprocessed speech signal to obtain the (i+1)-th optimal parameters;
fine-tuning the multiple optimal parameters by a global training method so that the deep belief network converges to the global optimum, obtaining multiple fine-tuned optimal parameters;
extracting the speech signal features of the preprocessed speech signal according to the fine-tuned optimal parameters. A code sketch of this layer-wise procedure is given below for illustration.
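The following is a minimal sketch of the greedy layer-wise pretraining just described, assuming scikit-learn's BernoulliRBM as the restricted Boltzmann machine; the layer sizes, learning rate, and random input are illustrative assumptions, not values taken from the patent:

```python
# Greedy layer-wise pretraining sketch: each RBM is trained unsupervised
# on the hidden activations of the layer below it, from lowest to highest.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = rng.random((200, 120))       # 200 preprocessed frames, 120 features in [0, 1]

# N = 2 stacked RBM layers (assumed sizes)
rbms = [BernoulliRBM(n_components=n, learning_rate=0.05, n_iter=20,
                     random_state=0) for n in (64, 32)]

h = X
for rbm in rbms:
    rbm.fit(h)                   # unsupervised training of the i-th layer
    h = rbm.transform(h)         # hidden activations feed the (i+1)-th layer

features = h                     # speech-signal features for the SVM stage
```

The subsequent global fine-tuning step, which the method uses to converge all layer parameters to a global optimum, is omitted from this sketch because BernoulliRBM exposes only the unsupervised pretraining stage.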
Step 400, classifying the speech signal features by speech emotion using the support vector machine to obtain the speech emotion recognition result, specifically comprises:
mapping the sample points of the speech signal features to a high-dimensional feature space using a kernel function to obtain samples that are linearly separable in that space;
performing, by the support vector machine, a logical decision on the speech signal features according to the linearly separable samples to obtain the speech emotion recognition result.
As shown in Fig. 2, a speech emotion recognition system based on a deep belief network comprises:
a speech signal acquisition module 1 for acquiring a speech signal;
a speech signal preprocessing module 2 for preprocessing the speech signal to obtain a preprocessed speech signal;
a feature extraction module 3 for performing unsupervised speech-signal feature extraction on the preprocessed speech signal using a deep belief network to obtain speech signal features;
an emotion recognition module 4 for classifying the speech signal features by speech emotion using a support vector machine to obtain a speech emotion recognition result.
The feature extraction module 3 specifically comprises:
a deep belief network building unit for stacking N layers of restricted Boltzmann machines from the lowest layer to the highest to obtain the deep belief network;
an unsupervised training unit for performing unsupervised training on the i-th layer of restricted Boltzmann machine according to the preprocessed speech signal to obtain the i-th optimal parameters, the i-th optimal parameters being the optimal parameters of the i-th layer of restricted Boltzmann machine, where i takes the values 1, 2, ..., N in turn, and for performing unsupervised training on the (i+1)-th layer of restricted Boltzmann machine according to the i-th optimal parameters and the preprocessed speech signal to obtain the (i+1)-th optimal parameters;
a parameter fine-tuning unit for fine-tuning the multiple optimal parameters by a global training method so that the deep belief network converges to the global optimum, obtaining multiple fine-tuned optimal parameters;
a feature extraction unit for extracting the speech signal features of the preprocessed speech signal according to the fine-tuned optimal parameters.
The emotion recognition module 4 specifically comprises:
a kernel function unit for mapping the sample points of the speech signal features to a high-dimensional feature space using a kernel function to obtain linearly separable samples;
a logical decision unit for performing, by the support vector machine, a logical decision on the speech signal features according to the linearly separable samples to obtain the speech emotion recognition result.
After the deep belief network has extracted a multidimensional feature vector of the emotional features in the speech signal, a suitable emotion classifier is needed. This method uses a support vector machine in one-versus-one mode to classify four emotions (surprise, happiness, anger, sadness). The multidimensional feature vectors extracted by the deep belief network serve as the input of the support vector machine classifier, and a kernel function maps the input sample points to a high-dimensional feature space, turning the nonlinearly separable speech emotion problem into one whose corresponding sample space is linearly separable. The block diagram of the emotion recognition system based on the support vector machine is shown in Fig. 3.
The one-versus-one mode constructs a hyperplane for every pair of emotions, which requires training k(k-1)/2 sub-classifiers; with k = 4 the whole training process needs 4 x 3 / 2 = 6 support vector machine sub-classifiers in total. Each sub-classifier is trained on two of the four emotion feature classes (surprise, happiness, anger, sadness), namely: happiness-anger, happiness-sadness, happiness-surprise, anger-sadness, anger-surprise, and sadness-surprise. One classifier is trained between every two classes; when an unknown speech emotion is classified, each classifier makes its own judgment and "casts a vote" for the corresponding class, and the class with the most votes is taken as the class of the unknown emotion. Because the decision stage uses voting, several classes may receive the same number of votes, so an unknown sample may appear to belong to several classes at once, which degrades classification precision. A code sketch of the kernel mapping is given below for illustration.
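The sketch below illustrates the kernel mapping described above, assuming an RBF kernel (the patent does not name the kernel function): the support vector machine never computes the high-dimensional mapping explicitly but only evaluates the kernel between pairs of sample points. The features, labels, and gamma value are synthetic placeholders.

```python
# Implicit kernel mapping sketch: train an SVM on a precomputed RBF
# kernel matrix over the DBN feature vectors.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 32))      # placeholder DBN feature vectors
y = rng.integers(0, 4, size=80)    # 0..3: surprise, happiness, anger, sadness

K = rbf_kernel(X, X, gamma=0.1)    # pairwise kernel values K(x_i, x_j)
svm = SVC(kernel='precomputed').fit(K, y)
print(svm.predict(rbf_kernel(X[:5], X, gamma=0.1)))   # classify 5 samples
```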
Before the support vector machine classifiers are trained and used for recognition, a label must be designed for every emotional speech signal to indicate the emotion class to which that signal belongs; the label type must be binary. During emotion recognition, the feature vector is input to all the support vector machines simultaneously, the outputs of the individual support vector machines then pass through a logical decision that selects the most probable emotion class, and the emotion with the highest weight (the most votes) is finally taken as the emotional state of the speech signal to be recognized, yielding the recognition result. A code sketch of this voting scheme is given below for illustration.
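The following sketch implements the one-versus-one voting just described: k = 4 emotions give k(k-1)/2 = 6 binary SVM sub-classifiers, and the class collecting the most votes wins. The features and labels are synthetic placeholders, and ties are broken here by the lowest class index, whereas the text above notes that ties can genuinely degrade precision.

```python
# One-versus-one voting sketch: one binary SVM per emotion pair, majority
# vote over the 6 sub-classifiers decides the emotion.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
emotions = ["surprise", "happiness", "anger", "sadness"]
X = rng.normal(size=(80, 32))            # placeholder DBN features
y = rng.integers(0, 4, size=80)          # placeholder emotion indices

classifiers = {}
for a, b in combinations(range(4), 2):   # the 6 emotion pairs
    mask = (y == a) | (y == b)
    classifiers[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])

def classify(x):
    votes = np.zeros(4, dtype=int)
    for clf in classifiers.values():
        votes[int(clf.predict(x[None, :])[0])] += 1   # each pair casts a vote
    return emotions[int(np.argmax(votes))]            # ties: lowest index wins

print(classify(rng.normal(size=32)))
```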
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.
Specific examples are used herein to illustrate the principle and implementation of the present invention; the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and scope of application in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910173690.3A CN109785863A (en) | 2019-02-28 | 2019-02-28 | A speech emotion recognition method and system based on deep belief network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910173690.3A CN109785863A (en) | 2019-02-28 | 2019-02-28 | A speech emotion recognition method and system based on deep belief network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109785863A true CN109785863A (en) | 2019-05-21 |
Family
ID=66486177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910173690.3A Pending CN109785863A (en) | 2019-02-28 | 2019-02-28 | A speech emotion recognition method and system based on deep belief network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109785863A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101561651B1 (en) * | 2014-05-23 | 2015-11-02 | 서강대학교산학협력단 | Interest detecting method and apparatus based feature data of voice signal using Deep Belief Network, recording medium recording program of the method |
CN106297825A (en) * | 2016-07-25 | 2017-01-04 | 华南理工大学 | A kind of speech-emotion recognition method based on integrated degree of depth belief network |
CN107092895A (en) * | 2017-05-09 | 2017-08-25 | 重庆邮电大学 | A kind of multi-modal emotion identification method based on depth belief network |
CN108717856A (en) * | 2018-06-16 | 2018-10-30 | 台州学院 | A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network |
CN109036468A (en) * | 2018-11-06 | 2018-12-18 | 渤海大学 | Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core |
Non-Patent Citations (2)
Title |
---|
黄晨晨 et al., "基于深度信念网络的语音情感识别的研究" (Research on speech emotion recognition based on deep belief networks), 《计算机研究与发展》 (Journal of Computer Research and Development) * |
黄驹斌, "基于深度信念网络的语音情感识别" (Speech emotion recognition based on deep belief networks), 《中国优秀硕士学位论文全文数据库》 (China Masters' Theses Full-text Database) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110619893A (en) * | 2019-09-02 | 2019-12-27 | 合肥工业大学 | Time-frequency feature extraction and artificial intelligence emotion monitoring method of voice signal |
CN112687294A (en) * | 2020-12-21 | 2021-04-20 | 重庆科技学院 | Vehicle-mounted noise identification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kim et al. | Towards speech emotion recognition "in the wild" using aggregated corpora and deep multi-task learning | |
CN105956560B (en) | A kind of model recognizing method based on the multiple dimensioned depth convolution feature of pondization | |
CN108717856B (en) | A speech emotion recognition method based on multi-scale deep convolutional neural network | |
CN110136690A (en) | Phoneme synthesizing method, device and computer readable storage medium | |
CN108171318B (en) | Convolution neural network integration method based on simulated annealing-Gaussian function | |
CN108804453A (en) | A kind of video and audio recognition methods and device | |
Kurpukdee et al. | Speech emotion recognition using convolutional long short-term memory neural network and support vector machines | |
CN106875007A (en) | End-to-end deep neural network is remembered based on convolution shot and long term for voice fraud detection | |
Atkar et al. | Speech emotion recognition using dialogue emotion decoder and CNN Classifier | |
US20220121949A1 (en) | Personalized neural network pruning | |
CN113763965B (en) | A speaker recognition method based on fusion of multiple attention features | |
CN104077598B (en) | A kind of emotion identification method based on voice fuzzy cluster | |
Reddy et al. | Handwritten Hindi character recognition using deep learning techniques | |
Shinde et al. | Real time two way communication approach for hearing impaired and dumb person based on image processing | |
CN115146057A (en) | Supply chain ecological region image-text fusion emotion recognition method based on interactive attention | |
CN113628640A (en) | Cross-library speech emotion recognition method based on sample equalization and maximum mean difference | |
Fu et al. | An adversarial training based speech emotion classifier with isolated Gaussian regularization | |
CN109785863A (en) | A speech emotion recognition method and system based on deep belief network | |
CN117711443A (en) | Lightweight speech emotion recognition method and system based on multi-scale attention | |
Shareef et al. | A review: isolated Arabic words recognition using artificial intelligent techniques | |
Sun et al. | A Novel Convolutional Neural Network Voiceprint Recognition Method Based on Improved Pooling Method and Dropout Idea. | |
Duduka et al. | A neural network approach to accent classification | |
Pham et al. | Vietnamese scene text detection and recognition using deep learning: an empirical study | |
Saranya et al. | AI based speech recognition of literacy to improve tribal English knowledge | |
Trabelsi et al. | Improved frame level features and SVM supervectors approach for the recogniton of emotional states from speech: Application to categorical and dimensional states |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190521 |