CN110111797A - Speaker recognition method based on Gaussian supervector and deep neural network - Google Patents

Speaker recognition method based on Gaussian supervector and deep neural network

Info

Publication number
CN110111797A
Authority
CN
China
Prior art keywords
layer
neural network
model
parameter
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910271166.XA
Other languages
Chinese (zh)
Inventor
曾春艳
马超峰
武明虎
朱栋梁
赵楠
朱莉
王娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201910271166.XA priority Critical patent/CN110111797A/en
Publication of CN110111797A publication Critical patent/CN110111797A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 — Speaker identification or verification techniques
    • G10L17/02 — Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/06 — Decision making techniques; Pattern matching strategies
    • G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L17/18 — Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a speaker recognition method based on Gaussian supervectors and a deep neural network, comprising a speaker feature extraction stage, a deep neural network design stage, and a speaker recognition and decision stage. The invention fuses the deep neural network with the speaker recognition system model, combining the Gaussian supervector with the multilayer structure of the deep neural network to markedly improve the characterization ability of the evaluation model. The proposed speaker recognition method can effectively improve the recognition performance of the system in noisy environments, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.

Description

Speaker recognition method based on Gaussian supervector and deep neural network
Technical field
The present invention relates to the technical field of voice recognition, and in particular to a speaker recognition method based on Gaussian supervectors and a deep neural network.
Background technique
Speaker recognition is a biometric identification technology based on voice information. After decades of development, speaker recognition technology has become relatively mature under noise-free conditions. The mainstream methods are currently GMM-UBM, GMM-SVM and i-vector. In practical application environments, however, the performance of speaker recognition algorithms degrades significantly due to ambient noise and channel noise. How to improve the noise robustness of existing speaker recognition systems has therefore become a research hotspot in this field in recent years.
To solve this problem, researchers have made attempts at different levels of speech signal processing. The literature confirms that the effectiveness of the classical algorithms in the signal processing field depends on the noise type and the signal-to-noise ratio. For speech, the true probability distribution of a feature depends on the specific speaker and is multi-modal. In practical application scenarios, however, factors such as channel mismatch and additive noise destroy the true probability distribution of the features. Related studies have combined noise-robust speech features with techniques such as cepstral mean and variance normalization, which can, under certain conditions, adjust the probability distribution of the features and thereby reduce the influence of noise on system performance. The feature warping algorithm maps the distributions of the feature vectors of training and test speech into a unified probability distribution, so that each dimension of the mapped feature vectors obeys the standard normal distribution, compensating to some extent for the influence of channel mismatch and additive noise on the feature distribution. Comparing recognition algorithms based on different speech features, however, reveals that whether recognition performance improves is likewise closely related to the noise type and signal-to-noise ratio. When the environment contains a small amount of noise, feature-domain algorithms account for the influence of noise on the feature distribution, and adjusting the feature distribution through such mappings can improve the noise robustness of the system. But as the signal-to-noise ratio decreases, noise not only affects the feature distribution but also alters the speaker-related information in the speech; system performance then drops sharply, and the performance gain obtained by adjusting the feature distribution becomes insignificant.
In recent years, with the improvement of machine learning algorithms and of computer storage and computing capability, deep neural networks (Deep Neural Network, DNN) have been applied in the field of speaker recognition with remarkable results. The generation and perception of human speech is itself a complicated process that, biologically, involves substantial multi-level and deep processing structures. For a complex signal such as speech, processing with a shallow-structure model is therefore clearly very limited, whereas a deep structure, which uses multi-layer nonlinear transformations to extract the structured and high-level information in the speech signal, is the more reasonable choice.
MFCCs (Mel Frequency Cepstral Coefficients) are features widely used in automatic speech and speaker recognition. Their advantage is that they do not depend on the properties of the signal and make no assumptions about or place restrictions on the input signal. The durations of the collected voice data in a data set are inconsistent, so the MFCC feature sizes of different speech segments also differ. The inputs of a neural network must usually be of the same size; truncating or zero-padding the MFCC features would satisfy this requirement, but such operations destroy the correlation within the data, reduce the expressive power of the features, and greatly lower the system recognition rate. The present invention therefore further processes the MFCC features using the MAP technique and extracts Gaussian supervectors as new robust features, combining them with a deep neural network to propose a robust speaker recognition system.
Summary of the invention
The present invention aims to solve at least some of the technical problems in the related art. To this end, one purpose of the invention is to propose a speaker recognition method based on Gaussian supervectors and a deep neural network, so as to improve the characterization ability of the evaluation model, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
A speaker recognition method based on Gaussian supervectors and a deep neural network according to an embodiment of the present invention includes:
S1: speaker feature extraction;
1-1) Acquire the raw speech signal and apply, in order: pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filtering, taking the logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effect of the vocal cords and lips during phonation and to compensate the high-frequency components of the speech signal suppressed by the articulatory system:
y(n) = x(n) - a·x(n-1), 0.95 < a < 0.97 (1)
where x(n) denotes the input signal;
1-12) Framing: group every N sampling points into one observation unit, called a frame;
1-13) Windowing: multiply each frame by a Hamming window to increase the continuity at the left and right ends of the frame; x(n) denotes the framed signal;
1-14) Fast Fourier transform (FFT): transform the time-domain signal to the frequency domain for subsequent spectral analysis:
X(k) = Σ_{n=0}^{N-1} s(n) e^{-j2πnk/N}, 0 ≤ k ≤ N-1
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
1-15) Pass the energy spectrum through a set of Mel-scale triangular filters, defined as a filter bank with M triangular filters whose center frequencies are f(m), m = 1, 2, ..., M; the spacing between adjacent f(m) grows in proportion to m;
1-16) Obtain the MFCC coefficients through the discrete cosine transform (DCT): substitute the above logarithmic energies into the discrete cosine transform
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
where s(m) is the logarithmic energy output by the m-th filter, M is the number of triangular filters, and L is the order of the MFCC coefficients, taken as 12-16;
1-17) Differences: so that the features better reflect temporal continuity, dimensions carrying information from preceding and following frames can be added to the feature dimension; the most commonly used are the first-order and second-order differences;
1-18) Cepstral mean and variance normalization removes the influence of stationary channels and improves the robustness of the features;
1-2) Provide a group of training speech, extract MFCC features through step 1-1), and train the universal background model (Universal Background Model, UBM);
1-21) Let the feature corresponding to a piece of voice data be X, where X = {x_1, x_2, ..., x_T}, and assume its dimension is D; the formula for calculating its likelihood function is:
p(X|λ) = Π_{t=1}^{T} Σ_{k=1}^{K} w_k p_k(x_t)
where the density function is a weighted sum of K single Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1×D and D×D respectively;
The mixture weights w_k satisfy Σ_{k=1}^{K} w_k = 1; letting λ denote the set of model parameters, λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K, the model is obtained by expectation-maximization (EM) iterative training;
1-22) The parameters λ are generally obtained with the EM algorithm: first give λ an initial value, then estimate new parameters λ' such that the likelihood under λ' is higher, i.e. p(X|λ') ≥ p(X|λ); the new parameters are taken as the current parameters and training continues iteratively. The re-estimation formulas of the parameters are:
γ_t(k) = w_k p_k(x_t) / Σ_{j=1}^{K} w_j p_j(x_t)
w_k' = (1/T) Σ_{t=1}^{T} γ_t(k),  μ_k' = Σ_{t=1}^{T} γ_t(k) x_t / Σ_{t=1}^{T} γ_t(k),  Σ_k' = Σ_{t=1}^{T} γ_t(k)(x_t - μ_k')(x_t - μ_k')ᵀ / Σ_{t=1}^{T} γ_t(k)
1-3) First apply step 1-1) to the speech of the target speakers and of the speaker to be identified to extract MFCC features, then use the UBM model from step 1-2) to perform a maximum a posteriori (Maximum A Posteriori, MAP) operation on each feature vector, extracting the Gaussian supervector;
1-31) In this stage, the traditional GMM-UBM model first trains the feature vectors of S persons separately to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S; in the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM in turn, the probability P(λ_i|X) is calculated according to MAP, and the model with the highest probability gives the recognition result:
P(λ_i|X) = P(X|λ_i)P(λ_i) / P(X) (8)
where P(X) is a constant; on the premise that each speaker's prior probability is equal, P(λ_i) = 1/S, formula (8) can be simplified to finding the model that maximizes P(X|λ_i);
If the speech features of the individual frames are assumed mutually independent, further simplification finally yields formula (10):
i* = argmax_i Σ_{t=1}^{T} log p(x_t|λ_i) (10)
1-32) In this stage the present invention takes each feature vector as one class, in effect performing a re-extraction operation on the MFCC features;
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feedforward artificial neural network (Artificial Neural Network, ANN), with more hidden layers and stronger expressive ability. Training such a multilayer structure with the random parameter initialization and back-propagation (Back-Propagation, BP) algorithm common in shallow networks easily traps the model in a locally optimal solution; the success of the DNN benefits from the unsupervised generative pre-training algorithm proposed in recent years, which obtains better initial parameters for the model, after which supervised training is used to further tune the model parameters;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
Pre-training: the restricted Boltzmann machine (Restricted Boltzmann Machine, RBM) is trained with an unsupervised learning algorithm; RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
P(v, h) = (1/Z) exp(bᵀv + cᵀh + vᵀWh)
where W is the connection matrix between the visible layer and the hidden layer, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor. Gradient descent and the contrastive divergence (Contrastive Divergence, CD) learning algorithm are used, and the model parameters are obtained by maximizing the visible-layer probability distribution P(v);
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning)
After the pre-training of the DBN is completed, its layer-wise network parameters are used as the initial model parameters of the DNN, a softmax layer is added on top of the last layer, and the model parameters of the DNN are then learned from labeled data with a traditional neural-network learning algorithm (such as the BP algorithm);
Assume the 0th layer is the input layer, the Lth layer is the output layer, and layers 1 to L-1 are hidden layers; for hidden layer l (l = 1, 2, ..., L-1), the node output activations may be calculated as:
z^l = W^{l-1} h^{l-1} + b^{l-1}
h^l = σ(z^l) (12)
where W^{l-1} and b^{l-1} are the weight matrix and bias, z^l is the weighted sum of the inputs of layer l, and σ(·) is the activation function, generally the sigmoid or tanh function;
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model that has been widely used in the image domain. Compared with the DNN, by using local filters and max pooling the CNN can learn more robust features directly from the spectrogram, reducing the information loss in the time and frequency domains compared with traditional speech features; at the same time, because CNN features are locally connected and weight-shared, the CNN is translation invariant and can overcome the variability of the speech signal itself. The present invention adds convolution and pooling to the network to build a new DNN;
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, given the Gaussian supervector of a test utterance, the utterance is first compared with all speaker models to obtain the test probabilities, i.e. the test scores;
The output layer uses the softmax function:
p_s = exp(z_s) / Σ_{k=1}^{K} exp(z_k)
where k is the index over the output classes, i.e. the class index of the target speakers, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the highest score is compared with the claimed label; if they are the same, this segment of speech is regarded as the voice of the claimed speaker, otherwise it is rejected;
3-3) Calculate the proportion of all test utterances that are correctly identified, i.e. the recognition rate of the system.
In the present invention, the deep neural network is fused with the speaker recognition system model, combining the Gaussian supervector with the multilayer structure of the deep neural network to markedly improve the characterization ability of the evaluation model. The proposed speaker recognition method can effectively improve the recognition performance of the system in noisy environments, reduce the influence of noise on system performance while improving the noise robustness of the system, optimize the system structure, and improve the competitiveness of corresponding speaker recognition products.
Detailed description of the invention
The accompanying drawings are provided for further understanding of the present invention and constitute part of the specification; together with the embodiments of the invention they serve to explain the invention and are not to be construed as limiting the invention. In the drawings:
Fig. 1 is a flow block diagram of the speaker recognition method based on Gaussian supervectors and a deep neural network proposed by the present invention;
Fig. 2 is a schematic flow diagram of the MFCC feature extraction proposed by the present invention;
Fig. 3 is a schematic flow diagram of the Gaussian supervector extraction proposed by the present invention;
Fig. 4 is a system block diagram of the deep neural network proposed by the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments; obviously, the described embodiments are only some of the embodiments of the present invention, not all of them.
Examples of the embodiments are shown in the accompanying drawings, where identical or similar reference labels throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting the invention.
Referring to Figs. 1-4, a speaker recognition method based on Gaussian supervectors and a deep neural network comprises:
S1: speaker feature extraction;
1-1) Acquire the raw speech signal and apply, in order: pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filtering, taking the logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effect of the vocal cords and lips during phonation and to compensate the high-frequency components of the speech signal suppressed by the articulatory system:
y(n) = x(n) - a·x(n-1), 0.95 < a < 0.97 (1)
where x(n) denotes the input signal;
1-12) Framing: group every N sampling points into one observation unit, called a frame;
1-13) Windowing: multiply each frame by a Hamming window to increase the continuity at the left and right ends of the frame; x(n) denotes the framed signal;
1-14) Fast Fourier transform (FFT): transform the time-domain signal to the frequency domain for subsequent spectral analysis:
X(k) = Σ_{n=0}^{N-1} s(n) e^{-j2πnk/N}, 0 ≤ k ≤ N-1
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
1-15) Pass the energy spectrum through a set of Mel-scale triangular filters, defined as a filter bank with M triangular filters whose center frequencies are f(m), m = 1, 2, ..., M; the spacing between adjacent f(m) grows in proportion to m;
1-16) Obtain the MFCC coefficients through the discrete cosine transform (DCT): substitute the above logarithmic energies into the discrete cosine transform
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
where s(m) is the logarithmic energy output by the m-th filter, M is the number of triangular filters, and L is the order of the MFCC coefficients, taken as 12-16;
1-17) Differences: so that the features better reflect temporal continuity, dimensions carrying information from preceding and following frames can be added to the feature dimension; the most commonly used are the first-order and second-order differences;
1-18) Cepstral mean and variance normalization removes the influence of stationary channels and improves the robustness of the features; a brief code sketch of the complete feature extraction pipeline is given below;
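The following is a minimal sketch of steps 1-11) to 1-18), assuming the librosa and numpy Python libraries; the sampling rate, frame parameters, filter count and MFCC order are illustrative values, not fixed by the invention.

```python
import numpy as np
import librosa

def extract_mfcc_cmvn(wav_path, sr=16000, n_mfcc=13, pre_emph=0.96):
    """Steps 1-11) to 1-18): MFCC + deltas + CMVN for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 1-11) pre-emphasis: y(n) = x(n) - a*x(n-1), 0.95 < a < 0.97
    y = np.append(y[0], y[1:] - pre_emph * y[:-1])
    # 1-12) to 1-16): framing, Hamming windowing, FFT, Mel triangular
    # filtering, log and DCT all happen inside librosa.feature.mfcc
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=512,
                                hop_length=160, window="hamming", n_mels=26)
    # 1-17) first- and second-order differences
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    feat = np.vstack([mfcc, d1, d2])              # (3*n_mfcc, T)
    # 1-18) cepstral mean and variance normalization, per utterance
    feat = (feat - feat.mean(axis=1, keepdims=True)) \
           / (feat.std(axis=1, keepdims=True) + 1e-8)
    return feat.T                                 # (T, 3*n_mfcc)
```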
1-2) Provide a group of training speech, extract MFCC features through step 1-1), and train the universal background model (Universal Background Model, UBM);
1-21) Let the feature corresponding to a piece of voice data be X, where X = {x_1, x_2, ..., x_T}, and assume its dimension is D; the formula for calculating its likelihood function is:
p(X|λ) = Π_{t=1}^{T} Σ_{k=1}^{K} w_k p_k(x_t)
where the density function is a weighted sum of K single Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1×D and D×D respectively;
The mixture weights w_k satisfy Σ_{k=1}^{K} w_k = 1; letting λ denote the set of model parameters, λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K, the model is obtained by expectation-maximization (EM) iterative training;
1-22) The parameters λ are generally obtained with the EM algorithm: first give λ an initial value, then estimate new parameters λ' such that the likelihood under λ' is higher, i.e. p(X|λ') ≥ p(X|λ); the new parameters are taken as the current parameters and training continues iteratively. The re-estimation formulas of the parameters are:
γ_t(k) = w_k p_k(x_t) / Σ_{j=1}^{K} w_j p_j(x_t)
w_k' = (1/T) Σ_{t=1}^{T} γ_t(k),  μ_k' = Σ_{t=1}^{T} γ_t(k) x_t / Σ_{t=1}^{T} γ_t(k),  Σ_k' = Σ_{t=1}^{T} γ_t(k)(x_t - μ_k')(x_t - μ_k')ᵀ / Σ_{t=1}^{T} γ_t(k)
A sketch of this UBM training step follows.
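As an illustration of the EM training in steps 1-2) to 1-22), the sketch below delegates the re-estimation formulas to scikit-learn's GaussianMixture; the use of scikit-learn, diagonal covariances and 256 components (the value used in the experiment reported below) are assumptions of this sketch, not requirements of the invention.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=256):
    """Step 1-2): train the UBM by EM on pooled background frames.
    background_features: list of (T_i, D) MFCC matrices."""
    X = np.vstack(background_features)
    ubm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", max_iter=100)
    ubm.fit(X)   # EM: E-step computes gamma_t(k), M-step updates w, mu, Sigma
    return ubm   # parameters in ubm.weights_, ubm.means_, ubm.covariances_
```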
1-3) First apply step 1-1) to the speech of the target speakers and of the speaker to be identified to extract MFCC features, then use the UBM model from step 1-2) to perform a maximum a posteriori (Maximum A Posteriori, MAP) operation on each feature vector, extracting the Gaussian supervector;
1-31) In this stage, the traditional GMM-UBM model first trains the feature vectors of S persons separately to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S; in the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM in turn, the probability P(λ_i|X) is calculated according to MAP, and the model with the highest probability gives the recognition result:
P(λ_i|X) = P(X|λ_i)P(λ_i) / P(X) (8)
where P(X) is a constant; on the premise that each speaker's prior probability is equal, P(λ_i) = 1/S, formula (8) can be simplified to finding the model that maximizes P(X|λ_i);
If the speech features of the individual frames are assumed mutually independent, further simplification finally yields formula (10):
i* = argmax_i Σ_{t=1}^{T} log p(x_t|λ_i) (10)
1-32) In this stage the present invention takes each feature vector as one class, in effect performing a re-extraction operation on the MFCC features; a sketch of this supervector extraction follows;
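A sketch of the MAP adaptation and Gaussian supervector extraction of step 1-3), assuming the diagonal-covariance UBM from the previous sketch and the standard mean-only relevance-MAP formulation; the relevance factor r = 16 is an illustrative choice not specified by the patent.

```python
import numpy as np

def gaussian_supervector(ubm, X, r=16.0):
    """Step 1-3): MAP-adapt the UBM means to utterance X of shape (T, D)
    and concatenate the adapted means into a Gaussian supervector."""
    gamma = ubm.predict_proba(X)                   # (T, K) posteriors gamma_t(k)
    n_k = gamma.sum(axis=0)                        # zeroth-order statistics
    E_k = (gamma.T @ X) / (n_k[:, None] + 1e-10)   # first-order statistics
    alpha = n_k / (n_k + r)                        # adaptation coefficients
    mu_hat = alpha[:, None] * E_k + (1.0 - alpha)[:, None] * ubm.means_
    return mu_hat.reshape(-1)                      # (K*D,) supervector
```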
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feedforward artificial neural network (Artificial Neural Network, ANN), with more hidden layers and stronger expressive ability. Training such a multilayer structure with the random parameter initialization and back-propagation (Back-Propagation, BP) algorithm common in shallow networks easily traps the model in a locally optimal solution; the success of the DNN benefits from the unsupervised generative pre-training algorithm proposed in recent years, which obtains better initial parameters for the model, after which supervised training is used to further tune the model parameters;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
Pre-training: the restricted Boltzmann machine (Restricted Boltzmann Machine, RBM) is trained with an unsupervised learning algorithm; RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
P(v, h) = (1/Z) exp(bᵀv + cᵀh + vᵀWh)
where W is the connection matrix between the visible layer and the hidden layer, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor. Gradient descent and the contrastive divergence (Contrastive Divergence, CD) learning algorithm are used, and the model parameters are obtained by maximizing the visible-layer probability distribution P(v); a minimal sketch of one CD update follows;
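A minimal numpy sketch of one contrastive-divergence (CD-1) update for the RBM of step 2-11), assuming binary units, sigmoid activations and an illustrative learning rate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, v0, lr=0.01, rng=None):
    """Step 2-11): one CD-1 update raising P(v) for a batch v0 (B, nv).
    W: (nv, nh) connection matrix; b, c: visible/hidden biases."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h0_prob = sigmoid(v0 @ W + c)                  # P(h=1|v0), positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob)     # sample hidden states
    v1_prob = sigmoid(h0 @ W.T + b)                # reconstruction P(v=1|h0)
    h1_prob = sigmoid(v1_prob @ W + c)             # negative phase
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
    b += lr * (v0 - v1_prob).mean(axis=0)
    c += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b, c
```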
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning)
After the pre-training of the DBN is completed, its layer-wise network parameters are used as the initial model parameters of the DNN, a softmax layer is added on top of the last layer, and the model parameters of the DNN are then learned from labeled data with a traditional neural-network learning algorithm (such as the BP algorithm);
Assume the 0th layer is the input layer, the Lth layer is the output layer, and layers 1 to L-1 are hidden layers; for hidden layer l (l = 1, 2, ..., L-1), the node output activations may be calculated as:
z^l = W^{l-1} h^{l-1} + b^{l-1}
h^l = σ(z^l) (12)
where W^{l-1} and b^{l-1} are the weight matrix and bias, z^l is the weighted sum of the inputs of layer l, and σ(·) is the activation function, generally the sigmoid or tanh function; a numpy illustration of this layer computation follows;
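The layer computation of equation (12), written out in numpy for a single input vector; in the method described above the weights would be initialized from the DBN pre-training rather than at random.

```python
import numpy as np

def dnn_forward(x, params):
    """Equation (12) for one input vector x; params is a list of (W, b)
    pairs for layers 1..L, with a softmax output layer as in section S3."""
    h = x
    for i, (W, b) in enumerate(params):
        z = h @ W + b                    # z^l = W^{l-1} h^{l-1} + b^{l-1}
        if i < len(params) - 1:
            h = np.tanh(z)               # h^l = sigma(z^l), tanh or sigmoid
        else:
            e = np.exp(z - z.max())      # numerically stable softmax
            h = e / e.sum()
    return h                             # class probabilities p_s
```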
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model that has been widely used in the image domain. Compared with the DNN, by using local filters and max pooling the CNN can learn more robust features directly from the spectrogram, reducing the information loss in the time and frequency domains compared with traditional speech features; at the same time, because CNN features are locally connected and weight-shared, the CNN is translation invariant and can overcome the variability of the speech signal itself. The present invention adds convolution and pooling to the network to build a new DNN; a sketch of such a convolution-plus-pooling network follows;
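A sketch of the convolution-plus-pooling front end of step 2-13), assuming the PyTorch library; the input shape, channel counts and the 34-speaker output size (matching the experiment reported below) are illustrative assumptions, since the patent does not fix the architecture.

```python
import torch
import torch.nn as nn

class ConvDNN(nn.Module):
    """Step 2-13): DNN with a convolution + max-pooling front end."""
    def __init__(self, in_shape=(64, 39), n_speakers=34):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # local filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # max pooling
        )
        flat = 16 * (in_shape[0] // 2) * (in_shape[1] // 2)
        self.dense = nn.Sequential(
            nn.Linear(flat, 512), nn.Sigmoid(),           # hidden layer, eq. (12)
            nn.Linear(512, n_speakers),                   # softmax layer (S3)
        )

    def forward(self, x):                # x: (batch, 1, H, W) feature map
        z = self.front(x).flatten(1)
        return torch.log_softmax(self.dense(z), dim=1)
```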
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, given the Gaussian supervector of a test utterance, the utterance is first compared with all speaker models to obtain the test probabilities, i.e. the test scores;
The output layer uses the softmax function:
p_s = exp(z_s) / Σ_{k=1}^{K} exp(z_k)
where k is the index over the output classes, i.e. the class index of the target speakers, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the highest score is compared with the claimed label; if they are the same, this segment of speech is regarded as the voice of the claimed speaker, otherwise it is rejected;
3-3) Calculate the proportion of all test utterances that are correctly identified, i.e. the recognition rate of the system; a sketch of this decision procedure follows.
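A sketch of the test-stage decision of steps 3-1) to 3-3), assuming the network's softmax outputs have been collected into an array; the variable names are illustrative.

```python
import numpy as np

def evaluate(probs, claimed, true_labels):
    """Steps 3-1) to 3-3): decide and score.
    probs: (N, S) softmax outputs for N test utterances;
    claimed: (N,) claimed identities; true_labels: (N,) ground truth."""
    best = probs.argmax(axis=1)          # 3-2) label with the maximum score
    accepted = best == claimed           # accept iff it matches the claim
    recognition_rate = (best == true_labels).mean()   # 3-3) recognition rate
    return accepted, recognition_rate
```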
In conclusion depth nerve net should be passed through based on the method for distinguishing speek person of Gauss super vector and deep neural network Network is blended with Speaker Recognition System model, is improving evaluation in conjunction with Gauss super vector and the multilayered structure of deep neural network Remarkable result in terms of the characterization ability of model, and method for distinguishing speek person proposed by the present invention is in the environment of ambient noise It is capable of the recognition performance of effective lifting system, while reducing noise influences system performance, improves system noise robustness, Optimization system structure improves the competitiveness of corresponding Speaker Identification product.
To verify the recognition effect of the present invention, white noise is used as the ambient noise and the recognition performance of the system is tested at signal-to-noise ratios of 10, 20 and 30, with the GMM-UBM and GSV-SVM systems selected for comparison. The present invention uses the clean subset of the Librispeech data set; the data of 150 of the speakers are selected to train a UBM with 256 Gaussians, and another 34 speakers, each with 50 corresponding utterances, are randomly selected for the later identification. The identification accuracies of the different systems under the three signal-to-noise conditions are compared in Table 1.
Table 1: Accuracy (%) of the speaker recognition systems under white noise
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution of the present invention and its inventive concept, shall be covered by the protection scope of the present invention.

Claims (1)

1. A speaker recognition method based on Gaussian supervectors and a deep neural network, applied to speaker recognition, characterized in that the speaker recognition method based on Gaussian supervectors and a deep neural network comprises:
S1: speaker feature extraction;
1-1) Acquire the raw speech signal and apply, in order: pre-emphasis, framing, windowing, fast Fourier transform (FFT), triangular filtering, taking the logarithm, discrete cosine transform (DCT), differential parameters, and cepstral mean and variance normalization (CMVN);
1-11) Pre-emphasis: to eliminate the effect of the vocal cords and lips during phonation and to compensate the high-frequency components of the speech signal suppressed by the articulatory system:
y(n) = x(n) - a·x(n-1), 0.95 < a < 0.97 (1)
where x(n) denotes the input signal;
1-12) Framing: group every N sampling points into one observation unit, called a frame;
1-13) Windowing: multiply each frame by a Hamming window to increase the continuity at the left and right ends of the frame; x(n) denotes the framed signal;
1-14) Fast Fourier transform (FFT): transform the time-domain signal to the frequency domain for subsequent spectral analysis:
X(k) = Σ_{n=0}^{N-1} s(n) e^{-j2πnk/N}, 0 ≤ k ≤ N-1
where s(n) denotes the input speech signal and N denotes the number of points of the Fourier transform;
1-15) Pass the energy spectrum through a set of Mel-scale triangular filters, defined as a filter bank with M triangular filters whose center frequencies are f(m), m = 1, 2, ..., M; the spacing between adjacent f(m) grows in proportion to m;
1-16) Obtain the MFCC coefficients through the discrete cosine transform (DCT): substitute the above logarithmic energies into the discrete cosine transform
C(n) = Σ_{m=1}^{M} s(m) cos(πn(m - 0.5)/M), n = 1, 2, ..., L
where s(m) is the logarithmic energy output by the m-th filter, M is the number of triangular filters and L is the order of the MFCC coefficients;
1-17) Differences: so that the features better reflect temporal continuity, dimensions carrying information from preceding and following frames can be added to the feature dimension; the most commonly used are the first-order and second-order differences;
1-18) Cepstral mean and variance normalization removes the influence of stationary channels and improves the robustness of the features;
1-2) Provide a group of training speech, extract MFCC features through step 1-1), and train the universal background model (Universal Background Model, UBM);
1-21) Let the feature corresponding to a piece of voice data be X, where X = {x_1, x_2, ..., x_T}, and assume its dimension is D; the formula for calculating its likelihood function is:
p(X|λ) = Π_{t=1}^{T} Σ_{k=1}^{K} w_k p_k(x_t)
where the density function is a weighted sum of K single Gaussian density functions p_k(x_t), and the mean μ_k and covariance Σ_k of each Gaussian component have sizes 1×D and D×D respectively;
The mixture weights w_k satisfy Σ_{k=1}^{K} w_k = 1; letting λ denote the set of model parameters, λ = {w_k, μ_k, Σ_k}, k = 1, 2, ..., K, the model is obtained by expectation-maximization (EM) iterative training;
1-22) The parameters λ are generally obtained with the EM algorithm: first give λ an initial value, then estimate new parameters λ' such that the likelihood under λ' is higher, i.e. p(X|λ') ≥ p(X|λ); the new parameters are taken as the current parameters and training continues iteratively. The re-estimation formulas of the parameters are:
γ_t(k) = w_k p_k(x_t) / Σ_{j=1}^{K} w_j p_j(x_t)
w_k' = (1/T) Σ_{t=1}^{T} γ_t(k),  μ_k' = Σ_{t=1}^{T} γ_t(k) x_t / Σ_{t=1}^{T} γ_t(k),  Σ_k' = Σ_{t=1}^{T} γ_t(k)(x_t - μ_k')(x_t - μ_k')ᵀ / Σ_{t=1}^{T} γ_t(k)
1-3) First apply step 1-1) to the speech of the target speakers and of the speaker to be identified to extract MFCC features, then use the UBM model from step 1-2) to perform a maximum a posteriori (Maximum A Posteriori, MAP) operation on each feature vector, extracting the Gaussian supervector;
1-31) In this stage, the traditional GMM-UBM model first trains the feature vectors of S persons separately to obtain the speaker-specific GMMs, denoted λ_1, λ_2, ..., λ_S; in the recognition stage, the feature sequence X = {x_t, t = 1, 2, ..., T} of the target speaker is matched against each GMM in turn, the probability P(λ_i|X) is calculated according to MAP, and the model with the highest probability gives the recognition result:
P(λ_i|X) = P(X|λ_i)P(λ_i) / P(X) (8)
where P(X) is a constant; on the premise that each speaker's prior probability is equal, P(λ_i) = 1/S, formula (8) can be simplified to finding the model that maximizes P(X|λ_i);
If the speech features of the individual frames are assumed mutually independent, further simplification finally yields formula (10):
i* = argmax_i Σ_{t=1}^{T} log p(x_t|λ_i) (10)
1-32) In this stage the present invention takes each feature vector as one class, in effect performing a re-extraction operation on the MFCC features;
S2: deep neural network design;
2-1) The DNN is an extension of the conventional feedforward artificial neural network (Artificial Neural Network, ANN); on this basis, supervised training is used to further tune the model parameters;
2-11) Parameter pre-training based on the restricted Boltzmann machine;
Pre-training: the restricted Boltzmann machine (Restricted Boltzmann Machine, RBM) is trained with an unsupervised learning algorithm; RBMs are trained layer by layer and stacked into a deep belief network (DBN). Structurally, an RBM consists of one visible layer and one hidden layer, with no connections between nodes of the same layer. Let the visible layer of the RBM be v and the hidden layer be h; the joint probability distribution of (v, h) is defined as:
P(v, h) = (1/Z) exp(bᵀv + cᵀh + vᵀWh)
where W is the connection matrix between the visible layer and the hidden layer, b and c are the visible-layer and hidden-layer biases respectively, and Z is the normalization factor; gradient descent and the contrastive divergence (Contrastive Divergence, CD) learning algorithm are used, and the model parameters are obtained by maximizing the visible-layer probability distribution P(v);
2-12) Parameter fine-tuning based on the back-propagation algorithm (Fine-tuning)
After the pre-training of the DBN is completed, its layer-wise network parameters are used as the initial model parameters of the DNN, a softmax layer is added on top of the last layer, and the model parameters of the DNN are then learned from labeled data with a traditional neural-network learning algorithm (such as the BP algorithm);
Assume the 0th layer is the input layer, the Lth layer is the output layer, and layers 1 to L-1 are hidden layers; for hidden layer l (l = 1, 2, ..., L-1), the node output activations may be calculated as:
z^l = W^{l-1} h^{l-1} + b^{l-1}
h^l = σ(z^l) (12)
where W^{l-1} and b^{l-1} are the weight matrix and bias, z^l is the weighted sum of the inputs of layer l, and σ(·) is the activation function, generally the sigmoid or tanh function;
2-13) The convolutional neural network (Convolutional Neural Network, CNN) is another well-known deep learning model; the present invention adds convolution and pooling to the network to build a new DNN;
S3: speaker recognition and decision (softmax):
3-1) In the back-end test stage, given the Gaussian supervector of a test utterance, the utterance is first compared with all speaker models to obtain the test probabilities, i.e. the test scores;
The output layer uses the softmax function:
p_s = exp(z_s) / Σ_{k=1}^{K} exp(z_k)
where k is the index over the output classes, i.e. the class index of the target speakers, and p_s denotes the output value of the speaker to be identified for class s, i.e. the output probability;
3-2) The label corresponding to the highest score is compared with the claimed label; if they are the same, this segment of speech is regarded as the voice of the claimed speaker, otherwise it is rejected;
3-3) Calculate the proportion of all test utterances that are correctly identified, i.e. the recognition rate of the system.
CN201910271166.XA 2019-04-04 2019-04-04 Speaker recognition method based on Gaussian supervector and deep neural network Withdrawn CN110111797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910271166.XA CN110111797A (en) 2019-04-04 2019-04-04 Speaker recognition method based on Gaussian supervector and deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910271166.XA CN110111797A (en) 2019-04-04 2019-04-04 Speaker recognition method based on Gaussian supervector and deep neural network

Publications (1)

Publication Number Publication Date
CN110111797A true CN110111797A (en) 2019-08-09

Family

ID=67485160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910271166.XA Withdrawn CN110111797A (en) 2019-04-04 2019-04-04 Method for distinguishing speek person based on Gauss super vector and deep neural network

Country Status (1)

Country Link
CN (1) CN110111797A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111149154A (en) * 2019-12-24 2020-05-12 广州国音智能科技有限公司 Voiceprint recognition method, device, equipment and storage medium
CN111161744A (en) * 2019-12-06 2020-05-15 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation
CN111177970A (en) * 2019-12-10 2020-05-19 浙江大学 Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network
CN111402901A (en) * 2020-03-27 2020-07-10 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111461173A (en) * 2020-03-06 2020-07-28 华南理工大学 Attention mechanism-based multi-speaker clustering system and method
CN111666996A (en) * 2020-05-29 2020-09-15 湖北工业大学 High-precision equipment source identification method based on attention mechanism
CN111755012A (en) * 2020-06-24 2020-10-09 湖北工业大学 Robust speaker recognition method based on depth layer feature fusion
CN111933155A (en) * 2020-09-18 2020-11-13 北京爱数智慧科技有限公司 Voiceprint recognition model training method and device and computer system
CN112151067A (en) * 2020-09-27 2020-12-29 湖北工业大学 Passive detection method for digital audio tampering based on convolutional neural network
CN112259106A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voiceprint recognition method and device, storage medium and computer equipment
CN112992125A (en) * 2021-04-20 2021-06-18 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140114660A1 (en) * 2011-12-16 2014-04-24 Huawei Technologies Co., Ltd. Method and Device for Speaker Recognition
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
US20150301796A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Speaker verification
CN106469560A (zh) * 2016-07-27 2017-03-01 江苏大学 Speech emotion recognition method based on unsupervised domain adaptation
CN106683661A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN106782518A (zh) * 2016-11-25 2017-05-31 深圳市唯特视科技有限公司 Speech recognition method based on a hierarchical recurrent neural network language model
CN107293291A (zh) * 2016-03-30 2017-10-24 中国科学院声学研究所 End-to-end speech recognition method based on an adaptive learning rate
CN107301864A (en) * 2017-08-16 2017-10-27 重庆邮电大学 A kind of two-way LSTM acoustic models of depth based on Maxout neurons
CN108831486A (zh) * 2018-05-25 2018-11-16 南京邮电大学 Speaker recognition method based on DNN and GMM models
CN108877775A (en) * 2018-06-04 2018-11-23 平安科技(深圳)有限公司 Voice data processing method, device, computer equipment and storage medium
CN108922559A (en) * 2018-07-06 2018-11-30 华南理工大学 Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN109074822A (en) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 Specific sound recognition methods, equipment and storage medium
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN109346084A (zh) * 2018-09-19 2019-02-15 湖北工业大学 Speaker recognition method based on a deep stacked autoencoder network

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140114660A1 (en) * 2011-12-16 2014-04-24 Huawei Technologies Co., Ltd. Method and Device for Speaker Recognition
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
US20150301796A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Speaker verification
CN106683661A (en) * 2015-11-05 2017-05-17 阿里巴巴集团控股有限公司 Role separation method and device based on voice
CN107293291A (zh) * 2016-03-30 2017-10-24 中国科学院声学研究所 End-to-end speech recognition method based on an adaptive learning rate
CN106469560A (zh) * 2016-07-27 2017-03-01 江苏大学 Speech emotion recognition method based on unsupervised domain adaptation
CN106782518A (zh) * 2016-11-25 2017-05-31 深圳市唯特视科技有限公司 Speech recognition method based on a hierarchical recurrent neural network language model
CN107301864A (en) * 2017-08-16 2017-10-27 重庆邮电大学 A kind of two-way LSTM acoustic models of depth based on Maxout neurons
CN109074822A (en) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 Specific sound recognition methods, equipment and storage medium
CN108831486A (zh) * 2018-05-25 2018-11-16 南京邮电大学 Speaker recognition method based on DNN and GMM models
CN108877775A (en) * 2018-06-04 2018-11-23 平安科技(深圳)有限公司 Voice data processing method, device, computer equipment and storage medium
CN109192199A (en) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 A kind of data processing method of combination bottleneck characteristic acoustic model
CN108922559A (en) * 2018-07-06 2018-11-30 华南理工大学 Recording terminal clustering method based on voice time-frequency conversion feature and integral linear programming
CN109346084A (zh) * 2018-09-19 2019-02-15 湖北工业大学 Speaker recognition method based on a deep stacked autoencoder network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
酆勇: "Research on Speaker Recognition Modeling Based on Deep Learning" (基于深度学习的说话人识别建模研究), China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161744A (en) * 2019-12-06 2020-05-15 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker classification estimation
CN111161744B (en) * 2019-12-06 2023-04-28 华南理工大学 Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN111177970A (en) * 2019-12-10 2020-05-19 浙江大学 Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network
CN111177970B (en) * 2019-12-10 2021-11-19 浙江大学 Multi-stage semiconductor process virtual metering method based on Gaussian process and convolutional neural network
WO2021127994A1 (en) * 2019-12-24 2021-07-01 广州国音智能科技有限公司 Voiceprint recognition method, apparatus and device, and storage medium
CN111149154A (en) * 2019-12-24 2020-05-12 广州国音智能科技有限公司 Voiceprint recognition method, device, equipment and storage medium
CN111149154B (en) * 2019-12-24 2021-08-24 广州国音智能科技有限公司 Voiceprint recognition method, device, equipment and storage medium
CN111461173A (en) * 2020-03-06 2020-07-28 华南理工大学 Attention mechanism-based multi-speaker clustering system and method
CN111461173B (en) * 2020-03-06 2023-06-20 华南理工大学 Multi-speaker clustering system and method based on attention mechanism
CN111402901A (en) * 2020-03-27 2020-07-10 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111402901B (en) * 2020-03-27 2023-04-18 广东外语外贸大学 CNN voiceprint recognition method and system based on RGB mapping characteristics of color image
CN111666996A (en) * 2020-05-29 2020-09-15 湖北工业大学 High-precision equipment source identification method based on attention mechanism
CN111666996B (en) * 2020-05-29 2023-09-19 湖北工业大学 High-precision equipment source identification method based on attention mechanism
CN111755012A (en) * 2020-06-24 2020-10-09 湖北工业大学 Robust speaker recognition method based on depth layer feature fusion
CN111933155B (en) * 2020-09-18 2020-12-25 北京爱数智慧科技有限公司 Voiceprint recognition model training method and device and computer system
CN111933155A (en) * 2020-09-18 2020-11-13 北京爱数智慧科技有限公司 Voiceprint recognition model training method and device and computer system
CN112151067A (en) * 2020-09-27 2020-12-29 湖北工业大学 Passive detection method for digital audio tampering based on convolutional neural network
CN112259106A (en) * 2020-10-20 2021-01-22 网易(杭州)网络有限公司 Voiceprint recognition method and device, storage medium and computer equipment
CN112992125A (en) * 2021-04-20 2021-06-18 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium
CN112992125B (en) * 2021-04-20 2021-08-03 北京沃丰时代数据科技有限公司 Voice recognition method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN110111797A (en) Speaker recognition method based on Gaussian supervector and deep neural network
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
Zhang et al. Text-independent speaker verification based on triplet convolutional neural network embeddings
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
CN106952643A (en) Recording device clustering method based on Gaussian mean supervector and spectral clustering
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN107731233A (en) Voiceprint recognition method based on RNN
CN110085263B (en) Music emotion classification and machine composition method
Zhou et al. Deep learning based affective model for speech emotion recognition
CN110827857B (en) Speech emotion recognition method based on spectral features and ELM
CN109559736A (en) Automatic dubbing method for film actors based on adversarial networks
Ghai et al. Emotion recognition on speech signals using machine learning
CN110148408A (en) Chinese speech recognition method based on deep residual networks
CN109346084A (en) Speaker recognition method based on a deep stacked autoencoder network
Zhang et al. A pairwise algorithm using the deep stacking network for speech separation and pitch estimation
CN110349588A (en) Voiceprint recognition method using an LSTM network based on word embeddings
Sarkar et al. Time-contrastive learning based deep bottleneck features for text-dependent speaker verification
US20180277146A1 (en) System and method for anhedonia measurement using acoustic and contextual cues
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
Ng et al. Teacher-student training for text-independent speaker recognition
Mishra et al. Gender differentiated convolutional neural networks for speech emotion recognition
Wu et al. The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.
CN114678030A (en) Voiceprint identification method and device based on a deep residual network and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190809

WW01 Invention patent application withdrawn after publication