CN110111797A - Speaker recognition method based on Gaussian supervector and deep neural network - Google Patents
- Publication number: CN110111797A (application CN201910271166.XA)
- Authority: CN (China)
- Prior art keywords: layer, neural network, model, parameter, speaker
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/00 — Speaker identification or verification techniques
- G10L17/02 — Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates (under G10L17/06 — Decision making techniques; pattern matching strategies)
- G10L17/18 — Artificial neural networks; connectionist approaches
Abstract
The invention discloses a speaker recognition method based on Gaussian supervectors and a deep neural network, comprising a speaker feature extraction stage, a deep neural network design stage, and a speaker recognition and decision stage. The invention fuses a deep neural network with the speaker recognition system model, combining the remarkable effect of the Gaussian supervector and of the multilayer structure of the deep neural network on improving the characterization ability of the evaluation model. The proposed speaker recognition method can effectively improve the recognition performance of the system in an ambient-noise environment, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speaker recognition method based on Gaussian supervectors and a deep neural network.
Background art
Speaker recognition is a special biometric identification technology based on voice information. After decades of development, speaker recognition technology has become relatively mature under noise-free conditions. The current mainstream methods are GMM-UBM, GMM-SVM, and i-vector. In practical application environments, however, the performance of speaker recognition algorithms degrades noticeably due to the presence of ambient noise and channel noise. How to improve the noise robustness of existing speaker recognition systems has therefore become a research hotspot in the field in recent years.
To address this problem, researchers have made attempts at different levels of speech signal processing. The literature confirms that the effectiveness of the classical algorithms of the signal processing field depends on the noise type and the signal-to-noise ratio. For speech, the true probability distribution of the features depends on the specific speaker and is multi-modal. In practical application scenarios, however, factors such as channel mismatch and additive noise can destroy the true probability distribution of the features. Related studies combine noise-robust speech features with techniques such as cepstral mean and variance normalization, which under certain conditions can adjust the probability distribution of the features and thereby reduce the influence of noise on system performance. The feature warping algorithm maps the distributions of the feature vectors of the training and test speech onto a unified probability distribution, so that every dimension of the mapped feature vectors obeys the standard normal distribution; this compensates to a certain extent for the influence of channel mismatch and additive noise on the feature distribution. Comparing recognizers based on different speech features, however, shows that whether the recognition performance improves is also closely related to the type of noise and the signal-to-noise ratio. When the environment contains only a small amount of noise, feature-domain algorithms that account for the influence of noise on the feature distribution can improve the noise robustness of the system by adjusting the feature distribution through such mappings. As the signal-to-noise ratio decreases, however, the noise not only affects the feature distribution but also alters the speaker-relevant information in the speech; system performance then declines sharply, and the performance gain obtained by adjusting the feature distribution becomes insignificant.
In recent years, with the improvement of machine learning algorithms and of computer storage and computing capability, deep neural networks (Deep Neural Network, DNN) have been applied in the speaker recognition field and have achieved significant results. The production and perception of human speech is itself a complex process, which biologically involves multi-level and deep processing structures. For a complex signal such as speech, processing with a shallow-structure model therefore has obvious limitations; using a deep structure and extracting the structured and high-level information in the speech signal through multiple layers of nonlinear transformation is the more reasonable choice.
MFCCs (Mel Frequency Cepstral Coefficients) are features widely used in automatic speech and speaker recognition. Their advantage is that they do not depend on the nature of the signal and make no assumptions about or impose restrictions on the input signal. The time lengths of the collected voice data in a dataset are inconsistent, so the MFCC feature sizes of the individual speech segments also differ. The input of a neural network usually has to be of uniform size; truncating or zero-padding the MFCC features can satisfy this requirement, but such operations destroy the correlations within the data and reduce the expressive power of the features, causing a substantial drop in the recognition rate of the system. The present invention therefore further processes the MFCC features using the MAP technique to extract Gaussian supervectors, uses the extraction result as a new robust feature, and combines it with a deep neural network, thereby proposing a speaker recognition system of strong robustness.
Summary of the invention
The present invention aims to solve at least some of the technical problems in the related art. To this end, one purpose of the present invention is to propose a speaker recognition method based on Gaussian supervectors and a deep neural network, so as to improve the characterization ability of the evaluation model, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
A speaker recognition method based on Gaussian supervectors and a deep neural network according to an embodiment of the present invention comprises:
S1: speaker feature extraction;
1-1) Acquire the original speech signal and successively apply pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filterbank filtering, logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effects caused by the vocal cords and lips during phonation and to compensate the high-frequency part of the speech signal suppressed by the articulatory system:
y(n) = x(n) - a*x(n-1), 0.95 < a < 0.97  (1)
where x(n) denotes the input signal;
1-12) Framing: N sampling points are grouped into one observation unit, called a frame;
1-13) Windowing: each frame is multiplied by a Hamming window to increase the continuity between the left and right ends of the frame, where x(n) denotes the signal after framing;
1-14) Fast Fourier transform (FFT): the time-domain signal is transformed into the frequency domain for subsequent spectral analysis:
X(k) = sum_{n=0}^{N-1} s(n) e^{-j*2*pi*k*n/N}, 0 <= k < N
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
1-15) The energy spectrum is passed through a set of Mel-scale triangular filters, defined as a filterbank of M triangular filters with center frequencies f(m), m = 1, 2, ..., M, where the spacing between the f(m) is proportional to m;
1-16) The MFCC coefficients are obtained through the discrete cosine transform (DCT):
C(l) = sum_{m=1}^{M} log E(m) * cos(pi*l*(m - 0.5)/M), l = 1, 2, ..., L
where the logarithmic filterbank energies E(m) are brought into the discrete cosine transform, M is the number of triangular filters, and L is the order of the MFCC coefficients, taken as 12-16;
1-17) Differencing: to make the features better reflect temporal continuity, dimensions of inter-frame information can be appended to the feature dimension, the most common being first-order and second-order differences;
1-18) Cepstral mean and variance normalization eliminates the influence of stationary channels and improves the robustness of the features;
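The front-end steps 1-11) through 1-14) above can be sketched as follows. This is a minimal NumPy illustration under assumed parameter values (e.g. 25 ms frames with 10 ms shift at 16 kHz), not the patented implementation itself, and the function names are illustrative:

```python
import numpy as np

def preemphasis(x, a=0.97):
    # y(n) = x(n) - a*x(n-1): boosts the high-frequency part (formula (1))
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_signal(x, frame_len, frame_step):
    # split the signal into overlapping frames, zero-padding the tail
    n = 1 + max(0, int(np.ceil((len(x) - frame_len) / frame_step)))
    padded = np.concatenate([x, np.zeros(n * frame_step + frame_len - len(x))])
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n)[:, None]
    return padded[idx]

def power_spectrum(frames, nfft=512):
    # Hamming-window each frame, then take the squared FFT magnitude
    win = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * win, nfft)) ** 2
```

The remaining steps (Mel filterbank, log, DCT, deltas, CMVN) operate per frame on this power spectrum.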
1-2) A set of training speech is provided and its MFCC features are extracted by step 1-1) to train a universal background model (Universal Background Model, UBM);
1-21) Let the features corresponding to a piece of voice data be X = {x_1, x_2, ..., x_T}, with dimension D. The likelihood function is computed as:
p(x_t | λ) = sum_{k=1}^{K} w_k p_k(x_t)
where the density function is a weighted sum of K single Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1×D and D×D respectively;
the mixture weights w_k satisfy sum_{k=1}^{K} w_k = 1. Let λ denote the set of model parameters, λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K; the model is obtained by expectation-maximization (EM) iterative training;
1-22) The parameters λ are generally obtained with the EM algorithm: an initial value of λ is given first, then a new parameter set λ' is estimated such that the likelihood at λ' is higher, i.e. p(X | λ') ≥ p(X | λ); the new parameters are then used as the current parameters and training continues iteratively. The re-estimation formulas of the parameters are:
γ_t(k) = w_k p_k(x_t) / sum_{j=1}^{K} w_j p_j(x_t)
w_k' = (1/T) sum_{t=1}^{T} γ_t(k)
μ_k' = sum_{t} γ_t(k) x_t / sum_{t} γ_t(k)
Σ_k' = sum_{t} γ_t(k) (x_t - μ_k')(x_t - μ_k')^T / sum_{t} γ_t(k)
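As a hedged sketch of the EM re-estimation in step 1-22), the following trains a small diagonal-covariance GMM (diagonal covariances are a common simplification for a UBM, whereas the text above states full D×D covariances); the iteration count, seed, and regularization constant are illustrative assumptions:

```python
import numpy as np

def em_gmm_diag(X, K, iters=20, seed=0):
    # EM training of a diagonal-covariance GMM, following the
    # gamma / w / mu / Sigma re-estimation formulas of step 1-22)
    rng = np.random.default_rng(seed)
    T, D = X.shape
    mu = X[rng.choice(T, K, replace=False)]          # init means from data
    var = np.tile(X.var(0) + 1e-6, (K, 1))
    w = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: log w_k + log N(x_t | mu_k, var_k) for each frame/component
        logp = (-0.5 * (np.log(2 * np.pi * var).sum(1)
                + (((X[:, None, :] - mu) ** 2) / var).sum(2))
                + np.log(w))
        logp -= logp.max(1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(1, keepdims=True)         # responsibilities
        Nk = gamma.sum(0)                            # effective counts
        # M-step: re-estimate weights, means, variances
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ (X ** 2)) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var
```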
1-3) First apply step 1-1) to the speech of the target speakers and of the speaker to be identified to extract MFCC features, then apply the UBM model of step 1-2) and perform a maximum a posteriori (Maximum a posteriori, MAP) operation on each feature vector to extract the Gaussian supervector;
1-31) In this stage, the traditional GMM-UBM model is first trained on the feature vectors of the S speakers to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S. In the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM model, and the probability P(λ_i | X) is computed according to MAP; the model with the largest probability is the recognition result:
P(λ_i | X) = P(X | λ_i) P(λ_i) / P(X)  (8)
where P(X) is a constant; under the premise that every speaker is equally probable, P(λ_i) = 1/S, formula (8) can be simplified to:
i* = argmax_i P(X | λ_i)  (9)
Assuming that the speech features of the individual frames are mutually independent, simplification finally yields formula (10):
i* = argmax_i sum_{t=1}^{T} log p(x_t | λ_i)  (10)
1-32) In this stage, the present invention treats each feature vector as one category, which in effect performs a renewed extraction operation on the MFCC features;
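Step 1-3) — MAP adaptation of the UBM followed by concatenation of the adapted means into a Gaussian supervector — can be sketched as below. The mean-only adaptation, the relevance factor r, and the diagonal-covariance assumption are illustrative; they are common choices in GMM-UBM systems but are not spelled out in the text above:

```python
import numpy as np

def map_adapt_supervector(X, w, mu, var, r=16.0):
    # Mean-only MAP adaptation of a diagonal-covariance UBM (w, mu, var),
    # then concatenation of the adapted means into a supervector.
    logp = (-0.5 * (np.log(2 * np.pi * var).sum(1)
            + (((X[:, None, :] - mu) ** 2) / var).sum(2))
            + np.log(w))
    logp -= logp.max(1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(1, keepdims=True)             # posteriors gamma_t(k)
    Nk = gamma.sum(0)                                # zeroth-order statistics
    Ex = (gamma.T @ X) / np.maximum(Nk[:, None], 1e-10)  # first-order stats
    alpha = Nk / (Nk + r)                            # data/prior interpolation
    mu_adapted = alpha[:, None] * Ex + (1 - alpha[:, None]) * mu
    return mu_adapted.ravel()                        # supervector, length K*D
```

Components with little data stay close to the UBM prior, which is what makes the supervector robust for short or noisy utterances.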
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feed-forward artificial neural network (Artificial Neural Network, ANN), with more hidden layers and stronger expressive power. Training such a multilayer structure with the random parameter initialization and back-propagation (Back-Propagation, BP) algorithm common in shallow networks easily makes the model fall into a locally optimal solution. The success of the DNN benefits from the unsupervised generative pre-training algorithm proposed in recent years, which obtains better initial parameters for the model; on this basis, the model parameters are then further tuned by supervised training;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
In pre-training (Pre-training), restricted Boltzmann machines (Restricted Boltzmann Machine, RBM) are trained with an unsupervised learning algorithm; the RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
P(v, h) = (1/Z) exp(v^T W h + b^T v + c^T h)  (11)
where W is the connection matrix between the visible and hidden layers, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor. Using gradient descent with the contrastive divergence (Contrastive Divergence, CD) learning algorithm, the model parameters are obtained by maximizing the visible-layer probability distribution P(v);
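A minimal sketch of one contrastive-divergence (CD-1) update for a binary RBM with sigmoid units follows; the learning rate, shapes, and sampling seed are illustrative assumptions, not the patented training procedure:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, rng=None):
    # One CD-1 step: sample h0 from P(h|v0), reconstruct v1,
    # and update (W, b, c) with the contrastive gradient estimate.
    rng = rng or np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + c)                  # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sampled hidden states
    pv1 = sigmoid(h0 @ W.T + b)                # reconstruction P(v=1 | h0)
    ph1 = sigmoid(pv1 @ W + c)
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(0)
    c += lr * (ph0 - ph1).mean(0)
    return W, b, c
```

Stacking trained RBMs layer by layer, each layer's hidden activations becoming the next layer's visible data, yields the DBN used for initialization.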
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning):
After the pre-training of the DBN is completed, its layer-wise network parameters are used as the initial model parameters of the DNN, a softmax layer is added on top of the last layer, and the model parameters of the DNN are then learned from labeled data using a traditional neural network learning algorithm (such as the BP algorithm);
Assume that layer 0 is the input layer, layer L is the output layer, and layers 1 to L-1 are hidden layers. For hidden layer l (l = 1, 2, ..., L-1), the node output activation values are computed as:
z^l = W^{l-1} h^{l-1} + b^{l-1}
h^l = σ(z^l)  (12)
where W^{l-1} and b^{l-1} are the weight matrix and bias, z^l is the weighted sum of the inputs of layer l, and σ(·) is the activation function, generally the sigmoid or tanh function;
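The layer computation of formula (12), with a softmax output layer as described in step 2-12), can be sketched as follows (the layer sizes are arbitrary and the sigmoid hidden activation is one of the two options the text allows):

```python
import numpy as np

def forward(x, weights, biases):
    # Formula (12): z^l = W^{l-1} h^{l-1} + b^{l-1}, h^l = sigma(z^l)
    # for hidden layers, with a softmax at the output layer.
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        if l < len(weights) - 1:
            h = 1.0 / (1.0 + np.exp(-z))       # sigmoid hidden activation
        else:
            e = np.exp(z - z.max())            # numerically stable softmax
            h = e / e.sum()
    return h
```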
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model that has been widely used in the image domain. Compared with the DNN, by using local filtering and max-pooling techniques the CNN can learn more robust features directly from the spectrogram, which, compared with traditional speech features, reduces the information loss in the time and frequency domains. Moreover, thanks to local connectivity and weight sharing, CNN features possess translation invariance and can overcome the variability of the speech signal itself. The present invention adds convolution and pooling to the network to build the new DNN;
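A hedged sketch of the local filtering and max pooling referred to in step 2-13), operating on a small spectrogram patch; the filter size and the 2×2 pooling window are illustrative assumptions:

```python
import numpy as np

def conv2d_valid(x, k):
    # 'valid' 2-D cross-correlation of a spectrogram patch x
    # with one local filter k (the CNN's local connectivity)
    H, W = x.shape
    h, w = k.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + h, j:j + w] * k).sum()
    return out

def maxpool2(x):
    # non-overlapping 2x2 max pooling, the source of translation invariance
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max((1, 3))
```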
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, after the Gaussian supervector of a test speech is given, the speech is first compared with all speaker models to obtain the test probabilities, i.e. the test scores;
For the output layer, the Softmax function is used:
p_s = e^{z_s} / sum_{k=1}^{K} e^{z_k}  (13)
where k is the index of the output category, i.e. the category index of the target speaker, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the maximum score is compared with the claimed label; if they are the same, this speech segment is regarded as the voice of the speaker it claims to be, otherwise it is rejected;
3-3) The probability that all test speech is correctly identified, i.e. the recognition rate of the system, is computed.
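The decision of steps 3-2) and 3-3) amounts to an argmax over the speaker scores followed by an accuracy count; sketched below with illustrative arrays (a closed-set identification view of comparing the best-scoring label with the claimed label):

```python
import numpy as np

def identify(scores):
    # scores: (num_utts, num_speakers) softmax outputs;
    # the decision is the index of the highest-scoring speaker model
    return scores.argmax(axis=1)

def accuracy(scores, labels):
    # fraction of test utterances identified correctly (step 3-3)
    return float((identify(scores) == np.asarray(labels)).mean())
```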
In the present invention, a deep neural network is fused with the speaker recognition system model, combining the remarkable effect of the Gaussian supervector and of the multilayer structure of the deep neural network on improving the characterization ability of the evaluation model. The speaker recognition method proposed by the present invention can effectively improve the recognition performance of the system in an ambient-noise environment, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
Detailed description of the drawings
The attached drawings provide a further understanding of the present invention and constitute part of the specification; together with the embodiments of the present invention they serve to explain the present invention and are not to be construed as limiting the present invention. In the drawings:
Fig. 1 is a flow block diagram of a speaker recognition method based on Gaussian supervectors and a deep neural network proposed by the present invention;
Fig. 2 is a schematic flow diagram of the MFCC feature extraction proposed by the present invention;
Fig. 3 is a schematic flow diagram of the Gaussian supervector extraction proposed by the present invention;
Fig. 4 is a system block diagram of the deep neural network proposed by the present invention.
Specific embodiment
Following will be combined with the drawings in the embodiments of the present invention, and technical solution in the embodiment of the present invention carries out clear, complete
Site preparation description, it is clear that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.
Examples of the embodiments are shown in the accompanying drawings, and in which the same or similar labels are throughly indicated identical or classes
As element or element with the same or similar functions.The embodiments described below with reference to the accompanying drawings are exemplary, purport
It is being used to explain the present invention, and is being not considered as limiting the invention.
Referring to Figs. 1-4, a speaker recognition method based on Gaussian supervectors and a deep neural network comprises the steps S1 (speaker feature extraction), S2 (deep neural network design), and S3 (speaker recognition and decision), carried out as set forth in steps 1-1) through 3-3) of the Summary of the invention above.
In conclusion depth nerve net should be passed through based on the method for distinguishing speek person of Gauss super vector and deep neural network
Network is blended with Speaker Recognition System model, is improving evaluation in conjunction with Gauss super vector and the multilayered structure of deep neural network
Remarkable result in terms of the characterization ability of model, and method for distinguishing speek person proposed by the present invention is in the environment of ambient noise
It is capable of the recognition performance of effective lifting system, while reducing noise influences system performance, improves system noise robustness,
Optimization system structure improves the competitiveness of corresponding Speaker Identification product.
To verify the recognition effect of the present invention, white noise is used as the ambient noise, and the recognition performance of the system is tested at signal-to-noise ratios of 10, 20, and 30, with the GMM-UBM and GSV-SVM systems chosen for comparison. The present invention uses the clean subset of the Librispeech dataset: the data of 150 speakers is selected to train a UBM with 256 Gaussians, and another 34 speakers with their corresponding 50 utterances each are randomly selected for the later identification stage. The identification accuracy of the different systems under the three signal-to-noise-ratio conditions is compared in Table 1.
Table 1. Accuracy (%) of the speaker recognition systems under white noise
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art who, within the technical scope disclosed by the present invention, makes equivalent substitutions or changes according to the technical solution of the present invention and its inventive concept shall be covered by the scope of protection of the present invention.
Claims (1)
1. A speaker recognition method based on Gaussian supervectors and a deep neural network, applied to speaker recognition, characterized in that the speaker recognition method based on Gaussian supervectors and a deep neural network comprises:
S1: speaker feature extraction;
1-1) Collect the original speech signal and successively apply pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filtering, taking the logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effect of the vocal cords and lips during vocalization and to compensate the high-frequency part of the speech signal that is suppressed by the articulation system:
y(n) = x(n) - a*x(n-1), 0.95 < a < 0.97 (1)
where x(n) denotes the input signal;
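A one-line NumPy sketch of formula (1); a = 0.96 is an arbitrary value inside the stated range:

```python
import numpy as np

def pre_emphasis(x, a=0.96):
    """y(n) = x(n) - a*x(n-1) per formula (1); the first sample is passed through unchanged."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y

y = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
```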
1-12) Framing: N consecutive sampling points are grouped into one observation unit, called a frame;
1-13) Windowing: each frame is multiplied by a Hamming window to increase the continuity at the left and right ends of the frame, where x(n) denotes the signal after framing;
1-14) Fast Fourier transform (FFT): the time-domain signal is transformed to the frequency domain for subsequent spectral analysis,
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
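Steps 1-12) through 1-14) can be sketched as follows; the 25 ms frame length, 10 ms hop, and 16 kHz sample rate are common assumptions, not values stated in the patent:

```python
import numpy as np

def frames_windowed(x, frame_len=400, hop=160):
    """1-12)/1-13): split the signal into frames of N=frame_len samples and apply a Hamming window."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(frame_len)

def power_spectrum(frames, nfft=512):
    """1-14): FFT each frame and return the power spectrum for spectral analysis."""
    return np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

x = np.random.randn(16000)                 # one second of toy "speech" at an assumed 16 kHz
ps = power_spectrum(frames_windowed(x))    # shape: (frames, nfft // 2 + 1)
```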
1-15) The energy spectrum is passed through a set of triangular filters on the Mel scale, defined as a filter bank of M triangular filters with center frequencies f(m), m = 1, 2, ..., M; the spacing between the f(m) grows in proportion to m;
1-16) The MFCC coefficients are obtained through the discrete cosine transform (DCT):
the above logarithmic energies are substituted into the discrete cosine transform, where M is the number of triangular filters and L is the order of the MFCC coefficients;
1-17) Differencing: to make the features better reflect temporal continuity, dimensions carrying information from preceding and following frames can be appended to the feature vector; the most commonly used are first-order and second-order differences;
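A sketch of the first- and second-order differences of 1-17), using a standard regression formula over ±2 frames (the window width is an assumption):

```python
import numpy as np

def delta(feat, k=2):
    """First-order difference over +/-k neighbouring frames (regression formula)."""
    pad = np.pad(feat, ((k, k), (0, 0)), mode='edge')
    num = sum(j * (pad[k + j: len(feat) + k + j] - pad[k - j: len(feat) + k - j])
              for j in range(1, k + 1))
    return num / (2 * sum(j * j for j in range(1, k + 1)))

c = np.random.rand(98, 13)                           # toy static MFCCs
full = np.hstack([c, delta(c), delta(delta(c))])     # static + first- + second-order differences
```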
1-18) Cepstral mean and variance normalization (CMVN) can eliminate the influence of a stationary channel and improve the robustness of the features;
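Per-utterance CMVN as described in 1-18) is a mean subtraction and variance scaling along the time axis:

```python
import numpy as np

def cmvn(feat):
    """Cepstral mean and variance normalization over the frames of one utterance."""
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-10)

f = cmvn(np.random.randn(98, 39) * 3.0 + 5.0)   # toy features with a channel offset
```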
1-2) A set of training speech is provided and its MFCC features are extracted by step 1-1) to train the universal background model (Universal Background Model, UBM);
1-21) Let the features corresponding to a piece of speech data be X, where X = {x_1, x_2, ..., x_T}, and assume their dimension is D; the formula used to compute its likelihood function is:
where the density function is obtained by weighting K single-Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1 × D and D × D respectively;
the mixture weights w_k satisfy Σ_(k=1)^K w_k = 1. Let λ denote the set of model parameters, so λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K; the model is obtained by iterative expectation-maximization (EM) training;
1-22) The parameters λ are generally obtained with the EM algorithm: first an initial value of λ is given, then a new parameter λ' is estimated such that the likelihood under λ' is higher, i.e. p(X | λ') ≥ p(X | λ); the new parameters then serve as the current parameters for further training, iterating continuously. The re-estimation formula for each parameter is:
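The EM training of 1-21) and 1-22) can be sketched for a diagonal-covariance GMM as follows; K = 2, the iteration count, and the two toy clusters are assumptions for illustration, not the patent's configuration:

```python
import numpy as np

def gmm_em(X, K=2, iters=50, seed=0):
    """EM training of a diagonal-covariance GMM: re-estimates w_k, mu_k and the variances."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(T, K, replace=False)].copy()      # initialize means from data points
    var = np.tile(X.var(axis=0), (K, 1))
    for _ in range(iters):
        # E-step: responsibilities gamma_tk proportional to w_k * p_k(x_t)
        logp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimation formulas for each parameter
        Nk = np.maximum(gamma.sum(axis=0), 1e-10)
        w = Nk / T
        mu = (gamma.T @ X) / Nk[:, None]
        var = (gamma.T @ X ** 2) / Nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 1.0, (200, 2)), rng.normal(3.0, 1.0, (200, 2))])
w, mu, var = gmm_em(X, K=2)
```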
1-3) First apply step 1-1) to the speech of the target speaker and of the speakers to be identified to extract MFCC features, then apply the UBM model in step 1-2) to perform a maximum a posteriori (Maximum a posteriori, MAP) operation on each feature vector and extract the Gaussian supervector;
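A sketch of MAP mean adaptation followed by stacking the adapted means into a Gaussian supervector; the relevance factor r = 16, the two-component UBM, and the data are illustrative assumptions:

```python
import numpy as np

def map_adapt_means(X, w, mu, var, r=16.0):
    """MAP adaptation of the UBM means; concatenating them yields the K*D Gaussian supervector."""
    comp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(-1)
            + np.log(w))
    comp -= comp.max(axis=1, keepdims=True)
    gamma = np.exp(comp)
    gamma /= gamma.sum(axis=1, keepdims=True)        # posterior of each UBM component
    Nk = gamma.sum(axis=0)
    Ex = (gamma.T @ X) / np.maximum(Nk, 1e-10)[:, None]
    alpha = (Nk / (Nk + r))[:, None]                 # data-dependent adaptation weight
    mu_map = alpha * Ex + (1 - alpha) * mu
    return mu_map.ravel()                            # the Gaussian supervector

w = np.array([0.5, 0.5])                             # toy two-component "UBM"
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
var = np.ones((2, 2))
rng = np.random.default_rng(0)
sv = map_adapt_means(rng.standard_normal((100, 2)), w, mu, var)   # supervector of length 4
```

Components with little data stay close to the UBM means, which is the point of the interpolation weight alpha.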
1-31) In this stage, the traditional GMM-UBM model first trains the feature vectors of the S speakers separately to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S. In the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM model, and the probability P(λ_i | X) is computed according to MAP; the model with the maximum probability gives the recognition result;
where P(X) is a constant. If the prior probability of every speaker is assumed equal, i.e. P(λ_i) = 1/S, formula (8) can be simplified as:
Assuming the speech feature frames are mutually independent, simplification finally yields formula (10):
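Under the frame-independence assumption of formula (10), recognition reduces to summing per-frame log-likelihoods and taking the argmax over models; a toy sketch with two invented single-component speaker models:

```python
import numpy as np

def gmm_loglik(X, w, mu, var):
    """log p(X | lambda) = sum_t log sum_k w_k N(x_t; mu_k, var_k), frames assumed independent."""
    comp = (-0.5 * (((X[:, None, :] - mu) ** 2 / var) + np.log(2 * np.pi * var)).sum(-1)
            + np.log(w))
    m = comp.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))).sum()

# two hypothetical speaker models lambda_1, lambda_2 (weights, means, diagonal variances)
models = [(np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
          (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]]))]

rng = np.random.default_rng(1)
X = rng.normal(5.0, 1.0, (50, 2))       # test utterance drawn near the second model
best = int(np.argmax([gmm_loglik(X, *m) for m in models]))   # recognition result
```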
1-32) In this stage the present invention treats each feature vector as a class, which is in effect a further extraction operation on the MFCC features;
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feed-forward artificial neural network (Artificial neural network, ANN); on this basis, the model parameters are further tuned using supervised training;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
Pre-training (Pre-training) trains restricted Boltzmann machines (Restricted Boltzmann machine, RBM) with an unsupervised learning algorithm; the RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
where W is the connection matrix between the visible and hidden layers, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor. The model parameters can be obtained by maximizing the visible-layer node probability distribution P(v) using gradient descent and the contrastive divergence (Contrastive Divergence, CD) learning algorithm;
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning):
After the pre-training of the DBN is completed, its layer-by-layer network parameters are used as the initial model parameters of the DNN; a softmax layer is added after the last layer, and then, using labeled data, the model parameters of the DNN are learned with a traditional neural-network learning algorithm (such as the BP algorithm);
Assume layer 0 is the input layer, layer L is the output layer, and layers 1 to L-1 are hidden layers; for a hidden layer l (l = 1, 2, ..., L-1), the node output activation values can be computed as:
z^l = W^(l-1) h^(l-1) + b^(l-1)
h^l = σ(z^l) (12)
where W^(l-1) and b^(l-1) are the weight matrix and bias, z^l is the weighted input to layer l, and σ(·) is the activation function, typically sigmoid or tanh;
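Formula (12) applied layer by layer is a plain forward pass; a minimal sketch with tanh and invented layer sizes:

```python
import numpy as np

def forward(h0, params, sigma=np.tanh):
    """Forward pass per formula (12): z^l = W^(l-1) h^(l-1) + b^(l-1), h^l = sigma(z^l)."""
    h = h0
    for W, b in params:
        h = sigma(W @ h + b)
    return h

params = [(np.random.randn(8, 4), np.zeros(8)),   # layer 1: 4 -> 8 units (toy sizes)
          (np.random.randn(3, 8), np.zeros(3))]   # layer 2: 8 -> 3 units
out = forward(np.random.randn(4), params)
```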
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model; the present invention adds convolution and pooling to the network to build a new DNN;
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, given the Gaussian supervector of a test utterance, the utterance is first compared against all speaker models to obtain the test probabilities, i.e. the test scores;
The output layer uses the Softmax function:
where k is the index of the output class, i.e. the class index of the target speaker, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the maximum score is compared with the claimed label; if they are the same, this segment of speech is regarded as the voice of the claimed speaker; otherwise it is rejected;
3-3) The proportion of test utterances that are correctly identified, i.e. the recognition rate of the system, is computed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910271166.XA CN110111797A (en) | 2019-04-04 | 2019-04-04 | Method for distinguishing speek person based on Gauss super vector and deep neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110111797A (en) | 2019-08-09
Family
ID=67485160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910271166.XA Withdrawn CN110111797A (en) | 2019-04-04 | 2019-04-04 | Method for distinguishing speek person based on Gauss super vector and deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110111797A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111149154A (en) * | 2019-12-24 | 2020-05-12 | 广州国音智能科技有限公司 | Voiceprint recognition method, device, equipment and storage medium |
CN111161744A (en) * | 2019-12-06 | 2020-05-15 | 华南理工大学 | Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation |
CN111177970A (en) * | 2019-12-10 | 2020-05-19 | 浙江大学 | Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network |
CN111402901A (en) * | 2020-03-27 | 2020-07-10 | 广东外语外贸大学 | CNN voiceprint recognition method and system based on RGB mapping characteristics of color image |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111666996A (en) * | 2020-05-29 | 2020-09-15 | 湖北工业大学 | High-precision equipment source identification method based on attention mechanism |
CN111755012A (en) * | 2020-06-24 | 2020-10-09 | 湖北工业大学 | Robust speaker recognition method based on depth layer feature fusion |
CN111933155A (en) * | 2020-09-18 | 2020-11-13 | 北京爱数智慧科技有限公司 | Voiceprint recognition model training method and device and computer system |
CN112151067A (en) * | 2020-09-27 | 2020-12-29 | 湖北工业大学 | Passive detection method for digital audio tampering based on convolutional neural network |
CN112259106A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, storage medium and computer equipment |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140114660A1 (en) * | 2011-12-16 | 2014-04-24 | Huawei Technologies Co., Ltd. | Method and Device for Speaker Recognition |
CN103810999A (en) * | 2014-02-27 | 2014-05-21 | 清华大学 | Linguistic model training method and system based on distributed neural networks |
US20150301796A1 (en) * | 2014-04-17 | 2015-10-22 | Qualcomm Incorporated | Speaker verification |
CN106469560A (en) * | 2016-07-27 | 2017-03-01 | 江苏大学 | A kind of speech-emotion recognition method being adapted to based on unsupervised domain |
CN106683661A (en) * | 2015-11-05 | 2017-05-17 | 阿里巴巴集团控股有限公司 | Role separation method and device based on voice |
CN106782518A (en) * | 2016-11-25 | 2017-05-31 | 深圳市唯特视科技有限公司 | A kind of audio recognition method based on layered circulation neutral net language model |
CN107293291A (en) * | 2016-03-30 | 2017-10-24 | 中国科学院声学研究所 | A kind of audio recognition method end to end based on autoadapted learning rate |
CN107301864A (en) * | 2017-08-16 | 2017-10-27 | 重庆邮电大学 | A kind of two-way LSTM acoustic models of depth based on Maxout neurons |
CN108831486A (en) * | 2018-05-25 | 2018-11-16 | 南京邮电大学 | Method for distinguishing speek person based on DNN and GMM model |
CN108877775A (en) * | 2018-06-04 | 2018-11-23 | 平安科技(深圳)有限公司 | Voice data processing method, device, computer equipment and storage medium |
CN108922559A (en) * | 2018-07-06 | 2018-11-30 | 华南理工大学 | Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming |
CN109074822A (en) * | 2017-10-24 | 2018-12-21 | 深圳和而泰智能控制股份有限公司 | Specific sound recognition methods, equipment and storage medium |
CN109192199A (en) * | 2018-06-30 | 2019-01-11 | 中国人民解放军战略支援部队信息工程大学 | A kind of data processing method of combination bottleneck characteristic acoustic model |
CN109346084A (en) * | 2018-09-19 | 2019-02-15 | 湖北工业大学 | Method for distinguishing speek person based on depth storehouse autoencoder network |
2019-04-04 CN CN201910271166.XA patent/CN110111797A/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
Feng Yong (酆勇): "Research on Speaker Recognition Modeling Based on Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111161744A (en) * | 2019-12-06 | 2020-05-15 | 华南理工大学 | Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation |
CN111161744B (en) * | 2019-12-06 | 2023-04-28 | 华南理工大学 | Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation |
CN111177970A (en) * | 2019-12-10 | 2020-05-19 | 浙江大学 | Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network |
CN111177970B (en) * | 2019-12-10 | 2021-11-19 | 浙江大学 | Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network |
WO2021127994A1 (en) * | 2019-12-24 | 2021-07-01 | 广州国音智能科技有限公司 | Voiceprint recognition method, apparatus and device, and storage medium |
CN111149154A (en) * | 2019-12-24 | 2020-05-12 | 广州国音智能科技有限公司 | Voiceprint recognition method, device, equipment and storage medium |
CN111149154B (en) * | 2019-12-24 | 2021-08-24 | 广州国音智能科技有限公司 | Voiceprint recognition method, device, equipment and storage medium |
CN111461173A (en) * | 2020-03-06 | 2020-07-28 | 华南理工大学 | Attention mechanism-based multi-speaker clustering system and method |
CN111461173B (en) * | 2020-03-06 | 2023-06-20 | 华南理工大学 | Multi-speaker clustering system and method based on attention mechanism |
CN111402901A (en) * | 2020-03-27 | 2020-07-10 | 广东外语外贸大学 | CNN voiceprint recognition method and system based on RGB mapping characteristics of color image |
CN111402901B (en) * | 2020-03-27 | 2023-04-18 | 广东外语外贸大学 | CNN voiceprint recognition method and system based on RGB mapping characteristics of color image |
CN111666996A (en) * | 2020-05-29 | 2020-09-15 | 湖北工业大学 | High-precision equipment source identification method based on attention mechanism |
CN111666996B (en) * | 2020-05-29 | 2023-09-19 | 湖北工业大学 | High-precision equipment source identification method based on attention mechanism |
CN111755012A (en) * | 2020-06-24 | 2020-10-09 | 湖北工业大学 | Robust speaker recognition method based on depth layer feature fusion |
CN111933155B (en) * | 2020-09-18 | 2020-12-25 | 北京爱数智慧科技有限公司 | Voiceprint recognition model training method and device and computer system |
CN111933155A (en) * | 2020-09-18 | 2020-11-13 | 北京爱数智慧科技有限公司 | Voiceprint recognition model training method and device and computer system |
CN112151067A (en) * | 2020-09-27 | 2020-12-29 | 湖北工业大学 | Passive detection method for digital audio tampering based on convolutional neural network |
CN112259106A (en) * | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Voiceprint recognition method and device, storage medium and computer equipment |
CN112992125A (en) * | 2021-04-20 | 2021-06-18 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112992125B (en) * | 2021-04-20 | 2021-08-03 | 北京沃丰时代数据科技有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111797A (en) | Method for distinguishing speek person based on Gauss super vector and deep neural network | |
CN110400579B (en) | Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network | |
Zhang et al. | Text-independent speaker verification based on triplet convolutional neural network embeddings | |
WO2021208287A1 (en) | Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium | |
CN110853680B (en) | double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy | |
CN106952643A (en) | A kind of sound pick-up outfit clustering method based on Gaussian mean super vector and spectral clustering | |
CN111583964B (en) | Natural voice emotion recognition method based on multimode deep feature learning | |
CN107731233A (en) | A kind of method for recognizing sound-groove based on RNN | |
CN110085263B (en) | Music emotion classification and machine composition method | |
Zhou et al. | Deep learning based affective model for speech emotion recognition | |
CN110827857B (en) | Speech emotion recognition method based on spectral features and ELM | |
CN109559736A (en) | A kind of film performer's automatic dubbing method based on confrontation network | |
Ghai et al. | Emotion recognition on speech signals using machine learning | |
CN110148408A (en) | A kind of Chinese speech recognition method based on depth residual error | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
Zhang et al. | A pairwise algorithm using the deep stacking network for speech separation and pitch estimation | |
CN110349588A (en) | A kind of LSTM network method for recognizing sound-groove of word-based insertion | |
Sarkar et al. | Time-contrastive learning based deep bottleneck features for text-dependent speaker verification | |
US20180277146A1 (en) | System and method for anhedonia measurement using acoustic and contextual cues | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN113763965A (en) | Speaker identification method with multiple attention characteristics fused | |
Ng et al. | Teacher-student training for text-independent speaker recognition | |
Mishra et al. | Gender differentiated convolutional neural networks for speech emotion recognition | |
Wu et al. | The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge. | |
CN114678030A (en) | Voiceprint identification method and device based on depth residual error network and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20190809 |
|