CN110111797A - Speaker recognition method based on Gaussian supervector and deep neural network - Google Patents
- Publication number: CN110111797A (application CN201910271166.XA)
- Authority: CN (China)
- Prior art keywords: layer, neural network, model, parameter, speaker
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates (under G10L17/06 — Decision making techniques; pattern matching strategies)
- G10L17/18 — Artificial neural networks; connectionist approaches
Abstract
The invention discloses a speaker recognition method based on Gaussian supervectors and a deep neural network, comprising a speaker feature extraction stage, a deep neural network design stage, and a speaker recognition and decision stage. The invention fuses a deep neural network with the speaker recognition system model, combining the remarkable effect of the Gaussian supervector and of the multilayer structure of the deep neural network on improving the characterization ability of the evaluation model. The proposed speaker recognition method can effectively improve the recognition performance of the system in an ambient-noise environment, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speaker recognition method based on Gaussian supervectors and a deep neural network.
Background art
Speaker recognition is a special biometric identification technology based on voice information. After decades of development, speaker recognition technology has become relatively mature under noise-free conditions. The current mainstream methods are GMM-UBM, GMM-SVM, and i-vector. In practical application environments, however, the performance of speaker recognition algorithms degrades noticeably due to the presence of ambient noise and channel noise. How to improve the noise robustness of existing speaker recognition systems has therefore become a research hotspot in the field in recent years.
To address this problem, researchers have made attempts at different levels of speech signal processing. The literature confirms that the effectiveness of the classical algorithms of the signal processing field depends on the noise type and the signal-to-noise ratio. For speech, the true probability distribution of the features depends on the specific speaker and is multi-modal. In practical application scenarios, however, factors such as channel mismatch and additive noise can destroy the true probability distribution of the features. Related studies combine noise-robust speech features with techniques such as cepstral mean and variance normalization, which under certain conditions can adjust the probability distribution of the features and thereby reduce the influence of noise on system performance. The feature warping algorithm maps the distributions of the feature vectors of the training and test speech onto a unified probability distribution, so that every dimension of the mapped feature vectors obeys the standard normal distribution; this compensates to a certain extent for the influence of channel mismatch and additive noise on the feature distribution. Comparing recognizers based on different speech features, however, shows that whether the recognition performance improves is also closely related to the type of noise and the signal-to-noise ratio. When the environment contains only a small amount of noise, feature-domain algorithms that account for the influence of noise on the feature distribution can improve the noise robustness of the system by adjusting the feature distribution through such mappings. As the signal-to-noise ratio decreases, however, the noise not only affects the feature distribution but also alters the speaker-relevant information in the speech; system performance then declines sharply, and the performance gain obtained by adjusting the feature distribution becomes insignificant.
In recent years, with the improvement of machine learning algorithms and of computer storage and computing capability, deep neural networks (Deep Neural Network, DNN) have been applied in the speaker recognition field and have achieved significant results. The production and perception of human speech is itself a complex process, which biologically involves multi-level and deep processing structures. For a complex signal such as speech, processing with a shallow-structure model therefore has obvious limitations; using a deep structure and extracting the structured and high-level information in the speech signal through multiple layers of nonlinear transformation is the more reasonable choice.
MFCCs (Mel Frequency Cepstral Coefficients) are features widely used in automatic speech and speaker recognition. Their advantage is that they do not depend on the nature of the signal and make no assumptions about or impose restrictions on the input signal. The time lengths of the collected voice data in a dataset are inconsistent, so the MFCC feature sizes of the individual speech segments also differ. The input of a neural network usually has to be of uniform size; truncating or zero-padding the MFCC features can satisfy this requirement, but such operations destroy the correlations within the data and reduce the expressive power of the features, causing a substantial drop in the recognition rate of the system. The present invention therefore further processes the MFCC features using the MAP technique to extract Gaussian supervectors, uses the extraction result as a new robust feature, and combines it with a deep neural network, thereby proposing a speaker recognition system of strong robustness.
Summary of the invention
The present invention aims to solve at least some of the technical problems in the related art. To this end, one purpose of the present invention is to propose a speaker recognition method based on Gaussian supervectors and a deep neural network, so as to improve the characterization ability of the evaluation model, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
A speaker recognition method based on Gaussian supervectors and a deep neural network according to an embodiment of the present invention comprises:
S1: speaker feature extraction;
1-1) Acquire the original speech signal and successively apply pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filterbank filtering, logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effects caused by the vocal cords and lips during phonation and to compensate the high-frequency part of the speech signal suppressed by the articulatory system:
y(n) = x(n) - a*x(n-1), 0.95 < a < 0.97  (1)
where x(n) denotes the input signal;
1-12) Framing: N sampling points are grouped into one observation unit, called a frame;
1-13) Windowing: each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, where x(n) denotes the signal after framing;
1-14) Fast Fourier transform (FFT): the time-domain signal is transformed into the frequency domain for subsequent spectral analysis:
X(k) = sum_{n=0}^{N-1} s(n) e^{-j*2*pi*k*n/N}, 0 <= k < N
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
1-15) The energy spectrum is passed through a set of Mel-scale triangular filters, defined as a filterbank of M triangular filters with center frequencies f(m), m = 1, 2, ..., M, where the spacing between the f(m) is proportional to m;
1-16) The MFCC coefficients are obtained through the discrete cosine transform (DCT):
C(l) = sum_{m=1}^{M} log E(m) * cos(pi*l*(m - 0.5)/M), l = 1, 2, ..., L
where the logarithmic filterbank energies E(m) are brought into the discrete cosine transform, M is the number of triangular filters, and L is the order of the MFCC coefficients, taken as 12-16;
1-17) Differencing: to make the features better reflect temporal continuity, dimensions of inter-frame information can be appended to the feature dimension, the most common being first-order and second-order differences;
1-18) Cepstral mean and variance normalization eliminates the influence of stationary channels and improves the robustness of the features;
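The front-end steps 1-11) through 1-14) above can be sketched as follows. This is a minimal NumPy illustration under assumed parameter values (e.g. 25 ms frames with 10 ms shift at 16 kHz), not the patented implementation itself, and the function names are illustrative:

```python
import numpy as np

def preemphasis(x, a=0.97):
    # y(n) = x(n) - a*x(n-1): boosts the high-frequency part (formula (1))
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_signal(x, frame_len, frame_step):
    # split the signal into overlapping frames, zero-padding the tail
    n = 1 + max(0, int(np.ceil((len(x) - frame_len) / frame_step)))
    padded = np.concatenate([x, np.zeros(n * frame_step + frame_len - len(x))])
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n)[:, None]
    return padded[idx]

def power_spectrum(frames, nfft=512):
    # Hamming-window each frame, then take the squared FFT magnitude
    win = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * win, nfft)) ** 2
```

The remaining steps (Mel filterbank, log, DCT, deltas, CMVN) operate per frame on this power spectrum.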
1-2) A set of training speech is provided and its MFCC features are extracted by step 1-1) to train a universal background model (Universal Background Model, UBM);
1-21) Let the features corresponding to a piece of voice data be X = {x_1, x_2, ..., x_T}, with dimension D. The likelihood function is computed as:
p(x_t | λ) = sum_{k=1}^{K} w_k p_k(x_t)
where the density function is a weighted sum of K single Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1×D and D×D respectively;
the mixture weights w_k satisfy sum_{k=1}^{K} w_k = 1. Let λ denote the set of model parameters, λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K; the model is obtained by expectation-maximization (EM) iterative training;
1-22) The parameters λ are generally obtained with the EM algorithm: an initial value of λ is given first, then a new parameter set λ' is estimated such that the likelihood at λ' is higher, i.e. p(X | λ') ≥ p(X | λ); the new parameters are then used as the current parameters and training continues iteratively. The re-estimation formulas of the parameters are:
γ_t(k) = w_k p_k(x_t) / sum_{j=1}^{K} w_j p_j(x_t)
w_k' = (1/T) sum_{t=1}^{T} γ_t(k)
μ_k' = sum_{t} γ_t(k) x_t / sum_{t} γ_t(k)
Σ_k' = sum_{t} γ_t(k) (x_t - μ_k')(x_t - μ_k')^T / sum_{t} γ_t(k)
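As a hedged sketch of the EM re-estimation in step 1-22), the following trains a small diagonal-covariance GMM (diagonal covariances are a common simplification for a UBM, whereas the text above states full D×D covariances); the iteration count, seed, and regularization constant are illustrative assumptions:

```python
import numpy as np

def em_gmm_diag(X, K, iters=20, seed=0):
    # EM training of a diagonal-covariance GMM, following the
    # gamma / w / mu / Sigma re-estimation formulas of step 1-22)
    rng = np.random.default_rng(seed)
    T, D = X.shape
    mu = X[rng.choice(T, K, replace=False)]          # init means from data
    var = np.tile(X.var(0) + 1e-6, (K, 1))
    w = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: log w_k + log N(x_t | mu_k, var_k) for each frame/component
        logp = (-0.5 * (np.log(2 * np.pi * var).sum(1)
                + (((X[:, None, :] - mu) ** 2) / var).sum(2))
                + np.log(w))
        logp -= logp.max(1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(1, keepdims=True)         # responsibilities
        Nk = gamma.sum(0)                            # effective counts
        # M-step: re-estimate weights, means, variances
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```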
1-3) First apply step 1-1) to the speech of the target speakers and of the speaker to be identified to extract MFCC features, then apply the UBM model of step 1-2) and perform a maximum a posteriori (Maximum a posteriori, MAP) operation on each feature vector to extract the Gaussian supervector;
1-31) In this stage, the traditional GMM-UBM model is first trained on the feature vectors of the S speakers to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S. In the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM model, and the probability P(λ_i | X) is computed according to MAP; the model with the largest probability is the recognition result:
P(λ_i | X) = P(X | λ_i) P(λ_i) / P(X)  (8)
where P(X) is a constant; under the premise that every speaker is equally probable, P(λ_i) = 1/S, formula (8) can be simplified to:
i* = argmax_i P(X | λ_i)  (9)
Assuming that the speech features of the individual frames are mutually independent, simplification finally yields formula (10):
i* = argmax_i sum_{t=1}^{T} log p(x_t | λ_i)  (10)
1-32) In this stage, the present invention treats each feature vector as one category, which in effect performs a renewed extraction operation on the MFCC features;
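Step 1-3) — MAP adaptation of the UBM followed by concatenation of the adapted means into a Gaussian supervector — can be sketched as below. The mean-only adaptation, the relevance factor r, and the diagonal-covariance assumption are illustrative; they are common choices in GMM-UBM systems but are not spelled out in the text above:

```python
import numpy as np

def map_adapt_supervector(X, w, mu, var, r=16.0):
    # Mean-only MAP adaptation of a diagonal-covariance UBM (w, mu, var),
    # then concatenation of the adapted means into a supervector.
    logp = (-0.5 * (np.log(2 * np.pi * var).sum(1)
            + (((X[:, None, :] - mu) ** 2) / var).sum(2))
            + np.log(w))
    logp -= logp.max(1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(1, keepdims=True)             # posteriors gamma_t(k)
    Nk = gamma.sum(0)                                # zeroth-order statistics
    Ex = (gamma.T @ X) / np.maximum(Nk[:, None], 1e-10)  # first-order stats
    alpha = Nk / (Nk + r)                            # data/prior interpolation
    mu_adapted = alpha[:, None] * Ex + (1 - alpha[:, None]) * mu
    return mu_adapted.ravel()                        # supervector, length K*D
```

Components with little data stay close to the UBM prior, which is what makes the supervector robust for short or noisy utterances.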
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feed-forward artificial neural network (Artificial Neural Network, ANN), with more hidden layers and stronger expressive power. Training such a multilayer structure with the random parameter initialization and back-propagation (Back-Propagation, BP) algorithm common in shallow networks easily makes the model fall into a locally optimal solution. The success of the DNN benefits from the unsupervised generative pre-training algorithm proposed in recent years, which obtains better initial parameters for the model; on this basis, the model parameters are then further tuned by supervised training;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
In pre-training (Pre-training), restricted Boltzmann machines (Restricted Boltzmann Machine, RBM) are trained with an unsupervised learning algorithm; the RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
P(v, h) = (1/Z) exp(v^T W h + b^T v + c^T h)  (11)
where W is the connection matrix between the visible and hidden layers, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor. Using gradient descent with the contrastive divergence (Contrastive Divergence, CD) learning algorithm, the model parameters are obtained by maximizing the visible-layer probability distribution P(v);
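A minimal sketch of one contrastive-divergence (CD-1) update for a binary RBM with sigmoid units follows; the learning rate, shapes, and sampling seed are illustrative assumptions, not the patented training procedure:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, rng=None):
    # One CD-1 step: sample h0 from P(h|v0), reconstruct v1,
    # and update (W, b, c) with the contrastive gradient estimate.
    rng = rng or np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + c)                  # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sampled hidden states
    pv1 = sigmoid(h0 @ W.T + b)                # reconstruction P(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + c)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(0)
    c += lr * (ph0 - ph1).mean(0)
    return W, b, c
```

Stacking trained RBMs layer by layer, each layer's hidden activations becoming the next layer's visible data, yields the DBN used for initialization.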
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning):
After the pre-training of the DBN is completed, its layer-wise network parameters are used as the initial model parameters of the DNN, a softmax layer is added on top of the last layer, and the model parameters of the DNN are then learned from labeled data using a traditional neural network learning algorithm (such as the BP algorithm);
Assume that layer 0 is the input layer, layer L is the output layer, and layers 1 to L-1 are hidden layers. For hidden layer l (l = 1, 2, ..., L-1), the node output activation values are computed as:
z^l = W^{l-1} h^{l-1} + b^{l-1}
h^l = σ(z^l)  (12)
where W^{l-1} and b^{l-1} are the weight matrix and bias, z^l is the weighted sum of the inputs of layer l, and σ(·) is the activation function, generally the sigmoid or tanh function;
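The layer computation of formula (12), with a softmax output layer as described in step 2-12), can be sketched as follows (the layer sizes are arbitrary and the sigmoid hidden activation is one of the two options the text allows):

```python
import numpy as np

def forward(x, weights, biases):
    # Formula (12): z^l = W^{l-1} h^{l-1} + b^{l-1}, h^l = sigma(z^l)
    # for hidden layers, with a softmax at the output layer.
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        if l < len(weights) - 1:
            h = 1.0 / (1.0 + np.exp(-z))       # sigmoid hidden activation
        else:
            e = np.exp(z - z.max())            # numerically stable softmax
            h = e / e.sum()
    return h
```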
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model that has been widely used in the image domain. Compared with the DNN, by using local filtering and max-pooling techniques the CNN can learn more robust features directly from the spectrogram, which, compared with traditional speech features, reduces the information loss in the time and frequency domains. Moreover, thanks to local connectivity and weight sharing, CNN features possess translation invariance and can overcome the variability of the speech signal itself. The present invention adds convolution and pooling to the network to build the new DNN;
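A hedged sketch of the local filtering and max pooling referred to in step 2-13), operating on a small spectrogram patch; the filter size and the 2×2 pooling window are illustrative assumptions:

```python
import numpy as np

def conv2d_valid(x, k):
    # 'valid' 2-D cross-correlation of a spectrogram patch x
    # with one local filter k (the CNN's local connectivity)
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + h, j:j + w] * k).sum()
    return out

def maxpool2(x):
    # non-overlapping 2x2 max pooling, the source of translation invariance
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max((1, 3))
```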
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, after the Gaussian supervector of a test speech is given, the speech is first compared with all speaker models to obtain the test probabilities, i.e. the test scores;
For the output layer, the Softmax function is used:
p_s = e^{z_s} / sum_{k=1}^{K} e^{z_k}  (13)
where k is the index of the output category, i.e. the category index of the target speaker, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the maximum score is compared with the claimed label; if they are the same, this speech segment is regarded as the voice of the speaker it claims to be, otherwise it is rejected;
3-3) The probability that all test speech is correctly identified, i.e. the recognition rate of the system, is computed.
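The decision of steps 3-2) and 3-3) amounts to an argmax over the speaker scores followed by an accuracy count; sketched below with illustrative arrays (a closed-set identification view of comparing the best-scoring label with the claimed label):

```python
import numpy as np

def identify(scores):
    # scores: (num_utts, num_speakers) softmax outputs;
    # the decision is the index of the highest-scoring speaker model
    return scores.argmax(axis=1)

def accuracy(scores, labels):
    # fraction of test utterances identified correctly (step 3-3)
    return float((identify(scores) == np.asarray(labels)).mean())
```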
In the present invention, a deep neural network is fused with the speaker recognition system model, combining the remarkable effect of the Gaussian supervector and of the multilayer structure of the deep neural network on improving the characterization ability of the evaluation model. The speaker recognition method proposed by the present invention can effectively improve the recognition performance of the system in an ambient-noise environment, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
Detailed description of the drawings
The attached drawings provide a further understanding of the present invention and constitute part of the specification; together with the embodiments of the present invention they serve to explain the present invention and are not to be construed as limiting the present invention. In the drawings:
Fig. 1 is a flow block diagram of a speaker recognition method based on Gaussian supervectors and a deep neural network proposed by the present invention;
Fig. 2 is a schematic flow diagram of the MFCC feature extraction proposed by the present invention;
Fig. 3 is a schematic flow diagram of the Gaussian supervector extraction proposed by the present invention;
Fig. 4 is a system block diagram of the deep neural network proposed by the present invention.
Specific embodiment
Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete
Site preparation description, it is clear that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.
Examples of the embodiments are shown in the accompanying drawings, and in which the same or similar labels are throughly indicated identical or classes
As element or element with the same or similar functions.The embodiments described below with reference to the accompanying drawings are exemplary, purport
It is being used to explain the present invention, and is being not considered as limiting the invention.
Referring to Figs. 1-4, a speaker recognition method based on Gaussian supervectors and a deep neural network comprises the steps S1 (speaker feature extraction), S2 (deep neural network design), and S3 (speaker recognition and decision), carried out as set forth in steps 1-1) through 3-3) of the Summary of the invention above.
In conclusion depth nerve net should be passed through based on the method for distinguishing speek person of Gauss super vector and deep neural network
Network is blended with Speaker Recognition System model, is improving evaluation in conjunction with Gauss super vector and the multilayered structure of deep neural network
Remarkable result in terms of the characterization ability of model, and method for distinguishing speek person proposed by the present invention is in the environment of ambient noise
It is capable of the recognition performance of effective lifting system, while reducing noise influences system performance, improves system noise robustness,
Optimization system structure improves the competitiveness of corresponding Speaker Identification product.
To verify the recognition effect of the present invention, white noise is used as the ambient noise, and the recognition performance of the system is tested at signal-to-noise ratios of 10, 20, and 30, with the GMM-UBM and GSV-SVM systems chosen for comparison. The present invention uses the clean subset of the Librispeech dataset: the data of 150 speakers is selected to train a UBM with 256 Gaussians, and another 34 speakers with their corresponding 50 utterances each are randomly selected for the later identification stage. The identification accuracy of the different systems under the three signal-to-noise-ratio conditions is compared in Table 1.
Table 1. Accuracy (%) of the speaker recognition systems under white noise
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art who, within the technical scope disclosed by the present invention, makes equivalent substitutions or changes according to the technical solution of the present invention and its inventive concept shall be covered by the scope of protection of the present invention.
Claims (1)
1. A speaker recognition method based on Gaussian supervectors and a deep neural network, applied to speaker recognition, characterized in that the speaker recognition method based on Gaussian supervectors and a deep neural network comprises:
S1: speaker feature extraction;
1-1) Collect the original speech signal and successively apply pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filtering, taking the logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effect of the vocal cords and lips during vocalization and to compensate the high-frequency part of the speech signal that is suppressed by the articulation system:
y(n) = x(n) - a*x(n-1), 0.95 < a < 0.97 (1)
where x(n) denotes the input signal;
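A one-line NumPy sketch of formula (1); a = 0.96 is an arbitrary value inside the stated range:

```python
import numpy as np

def pre_emphasis(x, a=0.96):
    """y(n) = x(n) - a*x(n-1) per formula (1); the first sample is passed through unchanged."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y

y = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
```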
1-12) Framing: N consecutive sampling points are grouped into one observation unit, called a frame;
1-13) Windowing: each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame, where x(n) denotes the signal after framing;
1-14) Fast Fourier transform (FFT): the time-domain signal is transformed to the frequency domain for subsequent spectral analysis,
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
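Steps 1-12) through 1-14) can be sketched as follows; the 25 ms frame length, 10 ms hop, and 16 kHz sample rate are common assumptions, not values stated in the patent:

```python
import numpy as np

def frames_windowed(x, frame_len=400, hop=160):
    """1-12)/1-13): split the signal into frames of N=frame_len samples and apply a Hamming window."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(frame_len)

def power_spectrum(frames, nfft=512):
    """1-14): FFT each frame and return the power spectrum for spectral analysis."""
    return np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

x = np.random.randn(16000)                 # one second of toy "speech" at an assumed 16 kHz
ps = power_spectrum(frames_windowed(x))    # shape: (frames, nfft // 2 + 1)
```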
1-15) The energy spectrum is passed through a set of triangular filters on the Mel scale, defined as a filter bank of M triangular filters with center frequencies f(m), m = 1, 2, ..., M; the spacing between the f(m) grows in proportion to m;
1-16) The MFCC coefficients are obtained through the discrete cosine transform (DCT):
the above logarithmic energies are substituted into the discrete cosine transform, where M is the number of triangular filters and L is the order of the MFCC coefficients;
1-17) Differencing: to make the features better reflect temporal continuity, dimensions carrying information from preceding and following frames can be appended to the feature vector; the most commonly used are first-order and second-order differences;
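A sketch of the first- and second-order differences of 1-17), using a standard regression formula over ±2 frames (the window width is an assumption):

```python
import numpy as np

def delta(feat, k=2):
    """First-order difference over +/-k neighbouring frames (regression formula)."""
    pad = np.pad(feat, ((k, k), (0, 0)), mode='edge')
    num = sum(j * (pad[k + j: len(feat) + k + j] - pad[k - j: len(feat) + k - j])
              for j in range(1, k + 1))
    return num / (2 * sum(j * j for j in range(1, k + 1)))

c = np.random.rand(98, 13)                           # toy static MFCCs
full = np.hstack([c, delta(c), delta(delta(c))])     # static + first- + second-order differences
```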
1-18) Cepstral mean and variance normalization (CMVN) can eliminate the influence of a stationary channel and improve the robustness of the features;
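Per-utterance CMVN as described in 1-18) is a mean subtraction and variance scaling along the time axis:

```python
import numpy as np

def cmvn(feat):
    """Cepstral mean and variance normalization over the frames of one utterance."""
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-10)

f = cmvn(np.random.randn(98, 39) * 3.0 + 5.0)   # toy features with a channel offset
```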
1-2) A set of training speech is provided and its MFCC features are extracted by step 1-1) to train the universal background model (Universal Background Model, UBM);
1-21) Let the features corresponding to a piece of speech data be X, where X = {x_1, x_2, ..., x_T}, and assume their dimension is D; the formula used to compute its likelihood function is:
where the density function is obtained by weighting K single-Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1 × D and D × D respectively;
the mixture weights w_k satisfy Σ_(k=1)^K w_k = 1. Let λ denote the set of model parameters, so λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K; the model is obtained by iterative expectation-maximization (EM) training;
1-22) The parameters λ are generally obtained with the EM algorithm: first an initial value of λ is given, then a new parameter λ' is estimated such that the likelihood under λ' is higher, i.e. p(X | λ') ≥ p(X | λ); the new parameters then serve as the current parameters for further training, iterating continuously. The re-estimation formula for each parameter is:
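The EM training of 1-21) and 1-22) can be sketched for a diagonal-covariance GMM as follows; K = 2, the iteration count, and the two toy clusters are assumptions for illustration, not the patent's configuration:

```python
import numpy as np

def gmm_em(X, K=2, iters=50, seed=0):
    """EM training of a diagonal-covariance GMM: re-estimates w_k, mu_k and the variances."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(T, K, replace=False)].copy()      # initialize means from data points
    var = np.tile(X.var(axis=0), (K, 1))
    for _ in range(iters):
        # E-step: responsibilities gamma_tk proportional to w_k * p_k(x_t)
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimation formulas for each parameter
        Nk = np.maximum(gamma.sum(axis=0), 1e-10)
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ X ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))])
w, mu, var = gmm_em(X, K=2)
```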
1-3) First apply step 1-1) to the speech of the target speaker and of the speakers to be identified to extract MFCC features, then apply the UBM model in step 1-2) to perform a maximum a posteriori (Maximum a posteriori, MAP) operation on each feature vector and extract the Gaussian supervector;
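A sketch of MAP mean adaptation followed by stacking the adapted means into a Gaussian supervector; the relevance factor r = 16, the two-component UBM, and the data are illustrative assumptions:

```python
import numpy as np

def map_adapt_means(X, w, mu, var, r=16.0):
    """MAP adaptation of the UBM means; concatenating them yields the K*D Gaussian supervector."""
    comp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(-1)
            + np.log(w))
    comp -= comp.max(axis=1, keepdims=True)
    gamma = np.exp(comp)
    gamma /= gamma.sum(axis=1, keepdims=True)        # posterior of each UBM component
    Nk = gamma.sum(axis=0)
    Ex = (gamma.T @ X) / np.maximum(Nk, 1e-10)[:, None]
    alpha = (Nk / (Nk + r))[:, None]                 # data-dependent adaptation weight
    mu_map = alpha * Ex + (1 - alpha) * mu
    return mu_map.ravel()                            # the Gaussian supervector

w = np.array([0.5, 0.5])                             # toy two-component "UBM"
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
var = np.ones((2, 2))
rng = np.random.default_rng(0)
sv = map_adapt_means(rng.standard_normal((100, 2)), w, mu, var)   # supervector of length 4
```

Components with little data stay close to the UBM means, which is the point of the interpolation weight alpha.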
1-31) In this stage, the traditional GMM-UBM model first trains the feature vectors of the S speakers separately to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S. In the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM model, and the probability P(λ_i | X) is computed according to MAP; the model with the maximum probability gives the recognition result;
where P(X) is a constant. If the prior probability of every speaker is assumed equal, i.e. P(λ_i) = 1/S, formula (8) can be simplified as:
Assuming the speech feature frames are mutually independent, simplification finally yields formula (10):
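Under the frame-independence assumption of formula (10), recognition reduces to summing per-frame log-likelihoods and taking the argmax over models; a toy sketch with two invented single-component speaker models:

```python
import numpy as np

def gmm_loglik(X, w, mu, var):
    """log p(X | lambda) = sum_t log sum_k w_k N(x_t; mu_k, var_k), frames assumed independent."""
    comp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(-1)
            + np.log(w))
    m = comp.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))).sum()

# two hypothetical speaker models lambda_1, lambda_2 (weights, means, diagonal variances)
models = [(np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
          (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]]))]

rng = np.random.default_rng(1)
X = rng.normal(5.0, 1.0, (50, 2))       # test utterance drawn near the second model
best = int(np.argmax([gmm_loglik(X, *m) for m in models]))   # recognition result
```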
1-32) In this stage the present invention treats each feature vector as a class, which is in effect a further extraction operation on the MFCC features;
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feed-forward artificial neural network (Artificial neural network, ANN); on this basis, the model parameters are further tuned using supervised training;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
Pre-training (Pre-training) trains restricted Boltzmann machines (Restricted Boltzmann machine, RBM) with an unsupervised learning algorithm; the RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
where W is the connection matrix between the visible and hidden layers, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor. The model parameters can be obtained by maximizing the visible-layer node probability distribution P(v) using gradient descent and the contrastive divergence (Contrastive Divergence, CD) learning algorithm;
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning):
After the pre-training of the DBN is completed, its layer-by-layer network parameters are used as the initial model parameters of the DNN; a softmax layer is added after the last layer, and then, using labeled data, the model parameters of the DNN are learned with a traditional neural-network learning algorithm (such as the BP algorithm);
Assume layer 0 is the input layer, layer L is the output layer, and layers 1 to L-1 are hidden layers; for a hidden layer l (l = 1, 2, ..., L-1), the node output activation values can be computed as:
z^l = W^(l-1) h^(l-1) + b^(l-1)
h^l = σ(z^l) (12)
where W^(l-1) and b^(l-1) are the weight matrix and bias, z^l is the weighted input to layer l, and σ(·) is the activation function, typically sigmoid or tanh;
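Formula (12) applied layer by layer is a plain forward pass; a minimal sketch with tanh and invented layer sizes:

```python
import numpy as np

def forward(h0, params, sigma=np.tanh):
    """Forward pass per formula (12): z^l = W^(l-1) h^(l-1) + b^(l-1), h^l = sigma(z^l)."""
    h = h0
    for W, b in params:
        h = sigma(W @ h + b)
    return h

params = [(np.random.randn(8, 4), np.zeros(8)),   # layer 1: 4 -> 8 units (toy sizes)
          (np.random.randn(3, 8), np.zeros(3))]   # layer 2: 8 -> 3 units
out = forward(np.random.randn(4), params)
```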
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model; the present invention adds convolution and pooling to the network to build a new DNN;
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, given the Gaussian supervector of a test utterance, the utterance is first compared against all speaker models to obtain the test probabilities, i.e. the test scores;
The output layer uses the Softmax function:
where k is the index of the output class, i.e. the class index of the target speaker, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the maximum score is compared with the claimed label; if they are the same, this segment of speech is regarded as the voice of the claimed speaker; otherwise it is rejected;
3-3) The proportion of test utterances that are correctly identified, i.e. the recognition rate of the system, is computed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271166.XA CN110111797A (en) | 2019-04-04 | 2019-04-04 | Method for distinguishing speek person based on Gauss super vector and deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110111797A (en) | 2019-08-09
Family
ID=67485160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910271166.XA Withdrawn CN110111797A (en) | 2019-04-04 | 2019-04-04 | Method for distinguishing speek person based on Gauss super vector and deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110111797A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111149154A (en) * | 2019-12-24 | 2020-05-12 | 广州国音智能科技有限公司 | Voiceprint recognition method, device, equipment and storage medium |
CN111161744A (en) * | 2019-12-06 | 2020-05-15 | 华南理工大学 | Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation |
CN111177970A (en) * | 2019-12-10 | 2020-05-19 | 浙江大学 | Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network |
CN111402901A (en) * | 2020-03-27 | 2020-07-10 | 广东外语外贸大学 | CNN voiceprint recognition method and system based on RGB mapping characteristics of color image |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111666996A (en) * | 2020-05-29 | 2020-09-15 | 湖北工业大学 | High-precision equipment source identification method based on attention mechanism |
CN111755012A (en) * | 2020-06-24 | 2020-10-09 | 湖北工业大学 | Robust speaker recognition method based on depth layer feature fusion |
CN111933155A (en) * | 2020-09-18 | 2020-11-13 | 北京爱数智慧科技有限公司 | Voiceprint recognition model training method and device and computer system |
CN112151067A (en) * | 2020-09-27 | 2020-12-29 | 湖北工业大学 | Passive detection method for digital audio tampering based on convolutional neural network |
CN112259106A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, storage medium and computer equipment |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140114660A1 (en) * | 2011-12-16 | 2014-04-24 | Huawei Technologies Co., Ltd. | Method and Device for Speaker Recognition |
CN103810999A (en) * | 2014-02-27 | 2014-05-21 | 清华大学 | Linguistic model training method and system based on distributed neural networks |
US20150301796A1 (en) * | 2014-04-17 | 2015-10-22 | Qualcomm Incorporated | Speaker verification |
CN106469560A (en) * | 2016-07-27 | 2017-03-01 | 江苏大学 | A kind of speech-emotion recognition method being adapted to based on unsupervised domain |
CN106683661A (en) * | 2015-11-05 | 2017-05-17 | 阿里巴巴集团控股有限公司 | Role separation method and device based on voice |
CN106782518A (en) * | 2016-11-25 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of audio recognition method based on layered circulation neutral net language model |
CN107293291A (en) * | 2016-03-30 | 2017-10-24 | 中国科学院声学研究所 | A kind of audio recognition method end to end based on autoadapted learning rate |
CN107301864A (en) * | 2017-08-16 | 2017-10-27 | 重庆邮电大学 | A kind of two-way LSTM acoustic models of depth based on Maxout neurons |
CN108831486A (en) * | 2018-05-25 | 2018-11-16 | 南京邮电大学 | Method for distinguishing speek person based on DNN and GMM model |
CN108877775A (en) * | 2018-06-04 | 2018-11-23 | 平安科技(深圳)有限公司 | Voice data processing method, device, computer equipment and storage medium |
CN108922559A (en) * | 2018-07-06 | 2018-11-30 | 华南理工大学 | Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming |
CN109074822A (en) * | 2017-10-24 | 2018-12-21 | 深圳和而泰智能控制股份有限公司 | Specific sound recognition methods, equipment and storage medium |
CN109192199A (en) * | 2018-06-30 | 2019-01-11 | 中国人民解放军战略支援部队信息工程大学 | A kind of data processing method of combination bottleneck characteristic acoustic model |
CN109346084A (en) * | 2018-09-19 | 2019-02-15 | 湖北工业大学 | Method for distinguishing speek person based on depth storehouse autoencoder network |
2019-04-04 CN CN201910271166.XA patent/CN110111797A/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
Feng Yong (酆勇): "Research on Speaker Recognition Modeling Based on Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111161744A (en) * | 2019-12-06 | 2020-05-15 | 华南理工大学 | Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation |
CN111161744B (en) * | 2019-12-06 | 2023-04-28 | 华南理工大学 | Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation |
CN111177970A (en) * | 2019-12-10 | 2020-05-19 | 浙江大学 | Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network |
CN111177970B (en) * | 2019-12-10 | 2021-11-19 | 浙江大学 | Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network |
WO2021127994A1 (en) * | 2019-12-24 | 2021-07-01 | 广州国音智能科技有限公司 | Voiceprint recognition method, apparatus and device, and storage medium |
CN111149154A (en) * | 2019-12-24 | 2020-05-12 | 广州国音智能科技有限公司 | Voiceprint recognition method, device, equipment and storage medium |
CN111149154B (en) * | 2019-12-24 | 2021-08-24 | 广州国音智能科技有限公司 | Voiceprint recognition method, device, equipment and storage medium |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111461173B (en) * | 2020-03-06 | 2023-06-20 | 华南理工大学 | Multi-speaker clustering system and method based on attention mechanism |
CN111402901A (en) * | 2020-03-27 | 2020-07-10 | 广东外语外贸大学 | CNN voiceprint recognition method and system based on RGB mapping characteristics of color image |
CN111402901B (en) * | 2020-03-27 | 2023-04-18 | 广东外语外贸大学 | CNN voiceprint recognition method and system based on RGB mapping characteristics of color image |
CN111666996A (en) * | 2020-05-29 | 2020-09-15 | 湖北工业大学 | High-precision equipment source identification method based on attention mechanism |
CN111666996B (en) * | 2020-05-29 | 2023-09-19 | 湖北工业大学 | High-precision equipment source identification method based on attention mechanism |
CN111755012A (en) * | 2020-06-24 | 2020-10-09 | 湖北工业大学 | Robust speaker recognition method based on depth layer feature fusion |
CN111933155B (en) * | 2020-09-18 | 2020-12-25 | 北京爱数智慧科技有限公司 | Voiceprint recognition model training method and device and computer system |
CN111933155A (en) * | 2020-09-18 | 2020-11-13 | 北京爱数智慧科技有限公司 | Voiceprint recognition model training method and device and computer system |
CN112151067A (en) * | 2020-09-27 | 2020-12-29 | 湖北工业大学 | Passive detection method for digital audio tampering based on convolutional neural network |
CN112259106A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, storage medium and computer equipment |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112992125B (en) * | 2021-04-20 | 2021-08-03 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111797A (en) | Method for distinguishing speek person based on Gauss super vector and deep neural network | |
CN110400579B (en) | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network | |
Zhang et al. | Text-independent speaker verification based on triplet convolutional neural network embeddings | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
CN110853680B (en) | double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy | |
CN106952643A (en) | A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering | |
CN111583964B (en) | Natural voice emotion recognition method based on multimode deep feature learning | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
CN110085263B (en) | Music emotion classification and machine composition method | |
Zhou et al. | Deep learning based affective model for speech emotion recognition | |
CN110827857B (en) | Speech emotion recognition method based on spectral features and ELM | |
CN109559736A (en) | A kind of film performer's automatic dubbing method based on confrontation network | |
Ghai et al. | Emotion recognition on speech signals using machine learning | |
CN110148408A (en) | A kind of Chinese speech recognition method based on depth residual error | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
Zhang et al. | A pairwise algorithm using the deep stacking network for speech separation and pitch estimation | |
CN110349588A (en) | A kind of LSTM network method for recognizing sound-groove of word-based insertion | |
Sarkar et al. | Time-contrastive learning based deep bottleneck features for text-dependent speaker verification | |
US20180277146A1 (en) | System and method for anhedonia measurement using acoustic and contextual cues | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN113763965A (en) | Speaker identification method with multiple attention characteristics fused | |
Ng et al. | Teacher-student training for text-independent speaker recognition | |
Mishra et al. | Gender differentiated convolutional neural networks for speech emotion recognition | |
Wu et al. | The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge. | |
CN114678030A (en) | Voiceprint identification method and device based on depth residual error network and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20190809 |
|