CN105989843A - Method and device of realizing missing feature reconstruction - Google Patents

Publication number
CN105989843A
Authority
CN
China
Prior art keywords
speech
frame
speech frame
tested
tested speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201510044910.4A
Other languages
Chinese (zh)
Inventor
王金明
尹海明
朱长宝
袁浩
徐志军
唐景山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201510044910.4A
Priority to PCT/CN2015/093901 (WO2016119501A1)
Publication of CN105989843A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Abstract

The present invention discloses a method and device for missing feature reconstruction. The method comprises: obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters; dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the frame; dividing the speech frame into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension; determining from the unreliable part that the speech frame needs missing feature reconstruction; and performing missing feature reconstruction on the unreliable part of the speech frame according to the weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame. The scheme of the present invention improves the precision of missing feature reconstruction.

Description

A method and apparatus for missing feature reconstruction
Technical field
The present invention relates to voiceprint recognition (VPR, Voiceprint Recognition) technology, and in particular to a method and apparatus for missing feature reconstruction.
Background
The recognition performance of a voiceprint recognition system drops sharply as environmental noise increases. To improve the recognition rate of the system in noisy environments, missing feature reconstruction, an effective front-end processing method from the field of speech recognition, has been applied to voiceprint recognition and has achieved good results under experimental conditions.
Existing methods for missing feature reconstruction generally comprise:
Dividing the test speech into two or more speech frames; for each speech frame of the test speech, using a spectral subtraction algorithm to obtain the training speech frame corresponding to the speech frame of the test speech, and calculating the signal-to-noise ratio of each dimension of the test speech frame from the obtained training speech frame; dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension; obtaining the weight, mean vector and covariance matrix corresponding to each training speech frame; obtaining from them the weight, mean vector and covariance matrix corresponding to the speech frame of the test speech; and, based on that weight, mean vector and covariance matrix and on the reliable part of the speech frame of the test speech, using maximum a posteriori estimation to perform missing feature reconstruction on the unreliable part of the speech frame of the test speech.
In existing methods for missing feature reconstruction, the spectral subtraction algorithm assumes by default that the noise is stationary, whereas actual noise is non-stationary, so the missing feature reconstruction process has a relatively large error.
Summary of the invention
To solve the above problem, the present invention proposes a method and apparatus for missing feature reconstruction that reduce the error and thereby improve the precision of missing feature reconstruction.
To achieve the above object, the present invention proposes a method for missing feature reconstruction, including:
Obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;
Dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the improved minima controlled recursive averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech;
Dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, judging from the unreliable part that the speech frame of the test speech needs missing feature reconstruction, and performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame of the test speech.
Preferably, when it is judged from the unreliable part of the speech frame of the test speech that the speech frame does not need missing feature reconstruction, the method further includes: discarding the speech frame of the test speech.
Preferably, obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters includes:
Obtaining two or more training speech signals in advance and, for each training speech, dividing the training speech into two or more speech frames;
Obtaining the Mel-domain log power spectrum feature vector of each speech frame of the training speech, and obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speech.
Preferably, obtaining the Mel-domain log power spectrum feature vector of each speech frame of the training speech includes:
Performing a Fourier transform on the speech frame of the training speech, and taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame of the training speech;
Squaring the amplitude spectrum of the speech frame of the training speech to obtain the power spectrum of the speech frame of the training speech;
Passing the power spectrum of the speech frame of the training speech through a Mel filter bank to obtain the Mel-domain power spectrum feature vector of the speech frame, and taking the logarithm of the Mel-domain power spectrum feature vector to obtain the Mel-domain log power spectrum feature vector of the speech frame of the training speech.
Preferably, obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speech includes:
Setting the number of Gaussian mixture clusters, and initializing the mean vector, covariance matrix and weight of each Gaussian mixture cluster;
Using a Gaussian mixture clustering algorithm, based on the initialized mean vector, covariance matrix and weight of each Gaussian mixture cluster and the Mel-domain log power spectrum feature vectors of the speech frames of the training speech, to obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster.
Preferably, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech includes:
Obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech;
Using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech from the Mel-domain power spectrum feature vector of the speech frame.
Preferably, obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech includes:
Performing a Fourier transform on the speech frame of the test speech, and taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame of the test speech;
Squaring the amplitude spectrum of the speech frame of the test speech to obtain the power spectrum of the speech frame, and passing the power spectrum through a Mel filter bank to obtain the Mel-domain power spectrum feature vector of the speech frame of the test speech.
Preferably, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech from the Mel-domain power spectrum feature vector of the speech frame includes:
Calculating the noise power of each dimension of the speech frame of the test speech according to the formula
$$D^2(\lambda, k_2) = \alpha_d(\lambda, k_2)\, D^2(\lambda - 1, k_2) + [1 - \alpha_d(\lambda, k_2)]\, Y^2(\lambda, k_2)$$
and calculating the signal-to-noise ratio of each dimension of the speech frame of the test speech according to the formula
$$SNR(\lambda, k_2) = 20\log_{10}\big(Y(\lambda, k_2) - D(\lambda, k_2)\big) - 20\log_{10} D(\lambda, k_2)$$
Wherein, $D^2(\lambda, k_2)$ is the value of the $k_2$-th dimension of the Mel-domain noise power of the $\lambda$-th speech frame of the test speech, $k_2$ is the dimension index of the Mel-domain power spectrum feature vector of the speech frame of the test speech, $\lambda$ is the speech frame index of the test speech, $\alpha_d$ is the smoothing parameter, $Y$ is the value of the $k_2$-th dimension of the Mel-domain power spectrum feature vector of the speech frame of the test speech, and $Y^2(\lambda, k_2)$ is the value of the $k_2$-th dimension of the Mel-domain power spectrum feature vector of the $\lambda$-th speech frame of the test speech.
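The two formulas above can be sketched in Python as follows. This is a minimal illustration, not the full IMCRA algorithm: the smoothing parameter alpha_d is taken as an input (IMCRA derives it from a speech presence probability, which is not shown), Y and D are assumed to be the square roots of the powers Y^2 and D^2, and the function names and the small floor that keeps the logarithm defined are hypothetical.

```python
import math

def update_noise_power(prev_noise_power, frame_power, alpha_d):
    """One recursion step per dimension:
    D^2(l, k) = alpha_d(l, k) * D^2(l-1, k) + (1 - alpha_d(l, k)) * Y^2(l, k)."""
    return [a * d2 + (1.0 - a) * y2
            for d2, y2, a in zip(prev_noise_power, frame_power, alpha_d)]

def snr_db(frame_power, noise_power, floor=1e-10):
    """SNR(l, k) = 20*log10(Y - D) - 20*log10(D) per dimension.
    Assumption: Y and D are the square roots of the Mel-domain powers."""
    result = []
    for y2, d2 in zip(frame_power, noise_power):
        y = math.sqrt(y2)
        d = max(math.sqrt(d2), floor)
        diff = max(y - d, floor)  # assumption: floor applied when Y <= D
        result.append(20.0 * math.log10(diff) - 20.0 * math.log10(d))
    return result
```

For example, a dimension with power 9 and noise power 1 gives Y = 3 and D = 1, so the SNR is 20·log10(2), roughly 6 dB.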
Preferably, dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame includes:
When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be greater than a preset threshold, determining that this dimension belongs to the reliable part of the speech frame of the test speech;
When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be less than or equal to the preset threshold, determining that this dimension belongs to the unreliable part of the speech frame of the test speech.
Preferably, before performing missing feature reconstruction on the unreliable part of the speech frame of the test speech, the method further includes: judging from the unreliable part of the speech frame of the test speech whether the speech frame needs missing feature reconstruction, including:
When the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, judging that the speech frame of the test speech needs missing feature reconstruction;
When the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is less than the preset ratio, judging that the speech frame of the test speech does not need missing feature reconstruction.
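The threshold test and the ratio test described above can be sketched as two small functions. The function and parameter names are hypothetical, and the threshold and ratio values are left to the caller because the text does not give them.

```python
def split_reliable(snr_per_dim, snr_threshold):
    """Return one boolean per dimension: True if reliable (SNR > threshold),
    False if unreliable (SNR <= threshold)."""
    return [snr > snr_threshold for snr in snr_per_dim]

def needs_reconstruction(reliable_mask, preset_ratio):
    """True when the share of unreliable dimensions is >= the preset ratio."""
    unreliable = sum(1 for reliable in reliable_mask if not reliable)
    return unreliable / len(reliable_mask) >= preset_ratio
```

With SNR values [10, -3, 0] and a threshold of 0, the mask is [True, False, False]; two of three dimensions are unreliable, so a preset ratio of 0.5 marks the frame for reconstruction.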
Preferably, performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame of the test speech includes:
Judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster;
Performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the formula
$$\hat{X}_m = U_{km} + \theta_{kmo}\,\theta_{koo}^{-1}\,(X_o - U_{ko})$$
Wherein, $\hat{X}_m$ is the unreliable part of the speech frame of the test speech; $X_o$ is the reliable part of the speech frame of the test speech; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the covariance matrix of that Gaussian mixture cluster at the intersections of the rows corresponding to the unreliable part of the speech frame and the columns corresponding to the reliable part of the speech frame; $\theta_{koo}$ is the matrix formed by the elements of that covariance matrix at the intersections of the rows corresponding to the reliable part of the speech frame and the columns corresponding to the reliable part of the speech frame.
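As an illustration of how the quantities defined above fit together, the following NumPy sketch assumes the reconstruction takes the standard conditional-mean form of a Gaussian, U_km + θ_kmo θ_koo^{-1} (X_o − U_ko). The function name and argument layout are hypothetical; the sub-blocks of the covariance matrix are picked out with a boolean reliability mask.

```python
import numpy as np

def reconstruct_missing(x, reliable_mask, mean, cov):
    """Replace the unreliable dimensions of x by their conditional mean
    given the reliable dimensions, under one Gaussian (mean, cov)."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    o = np.asarray(reliable_mask, dtype=bool)   # reliable (observed) dimensions
    m = ~o                                      # unreliable (missing) dimensions
    theta_mo = cov[np.ix_(m, o)]                # rows: unreliable, columns: reliable
    theta_oo = cov[np.ix_(o, o)]                # rows and columns: reliable
    x_hat = x.copy()
    # Solve theta_oo * z = (X_o - U_ko) instead of forming the inverse explicitly.
    x_hat[m] = mean[m] + theta_mo @ np.linalg.solve(theta_oo, x[o] - mean[o])
    return x_hat
```

Using a linear solve rather than an explicit matrix inverse is a numerical choice of this sketch, not something stated in the text.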
Preferably, judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster includes:
Judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the formula
$$\hat{k}_4 = \arg\max_{k_4} P(X \mid \lambda_{k_4})$$
Wherein, $X$ is the Mel-domain log power spectrum feature vector of the speech frame of the test speech, $\lambda_{k_4}$ is the $k_4$-th Gaussian mixture cluster, $\hat{k}_4$ is the value of $k_4$ corresponding to the maximum likelihood, $P$ is the likelihood between $X$ and $\lambda_{k_4}$, and argmax gives the value of $k_4$ at which $P$ is maximal;
Wherein,
$$P(X \mid \lambda_{k_4}) = \omega_{k_4} \prod_{y \in X_o} N_2(y, \mu_{k_4,y}, \sigma_{k_4,y}) \prod_{y \in X_m} 0.5\left(1 + \operatorname{erf}\left(\frac{y - \mu_{k_4,y}}{\sqrt{2}\,\sigma_{k_4,y}}\right)\right)$$
$$N_2(y, \mu_{k_4,y}, \sigma_{k_4,y}) = \frac{1}{\sqrt{2\pi\sigma_{k_4,y}^2}} \exp\left(-0.5\,\frac{(y - \mu_{k_4,y})^2}{\sigma_{k_4,y}^2}\right)$$
Wherein, $y$ is the data of one dimension of $X$, $X_o$ and $X_m$ are the reliable and unreliable parts of $X$, $\omega_{k_4}$ is the weight of the Gaussian mixture cluster $\lambda_{k_4}$, $\mu_{k_4,y}$ is the mean of $\lambda_{k_4}$ corresponding to dimension $y$, and $\sigma_{k_4,y}$ is the diagonal value of the covariance matrix of $\lambda_{k_4}$ corresponding to dimension $y$. By comparing the likelihood $P(X \mid \lambda_{k_4})$ of $X$ under each Gaussian mixture cluster, the Gaussian mixture cluster $\lambda_{k_4}$ with the largest likelihood is selected.
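A possible sketch of the cluster selection follows. Reliable dimensions contribute a one-dimensional Gaussian density, and unreliable dimensions contribute the Gaussian cumulative probability 0.5(1 + erf(·)). Two assumptions are made here: the sum is computed in the log domain (which leaves the argmax unchanged and avoids underflow), and sigma is treated as a standard deviation so that sigma squared is the variance in the density. The function names and the (weight, means, sigmas) tuple layout are hypothetical.

```python
import math

def log_likelihood(x, reliable_mask, weight, means, sigmas):
    """log P(X | cluster) with per-dimension (diagonal) Gaussian parameters."""
    total = math.log(weight)
    for value, reliable, mu, sigma in zip(x, reliable_mask, means, sigmas):
        if reliable:
            # log of the 1-D Gaussian density N2(y, mu, sigma)
            total += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                      - 0.5 * (value - mu) ** 2 / sigma ** 2)
        else:
            # log of 0.5 * (1 + erf((y - mu) / (sqrt(2) * sigma)))
            cdf = 0.5 * (1.0 + math.erf((value - mu) / (math.sqrt(2.0) * sigma)))
            total += math.log(max(cdf, 1e-300))  # floor avoids log(0)
    return total

def select_cluster(x, reliable_mask, clusters):
    """clusters: list of (weight, means, sigmas); returns the argmax index."""
    scores = [log_likelihood(x, reliable_mask, w, mu, s) for w, mu, s in clusters]
    return max(range(len(scores)), key=scores.__getitem__)
```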
The present invention also proposes a device for missing feature reconstruction, at least including:
An acquisition module, configured to obtain in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;
A computing module, configured to divide the test speech into two or more speech frames and, for each speech frame of the test speech, use the improved minima controlled recursive averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech;
A reconstruction module, configured to divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, judge from the unreliable part that the speech frame of the test speech needs missing feature reconstruction, and perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame of the test speech.
Preferably, the reconstruction module is further configured to:
Discard the speech frame of the test speech when it is judged from the unreliable part of the speech frame that the speech frame does not need missing feature reconstruction.
Preferably, the acquisition module is specifically configured to:
Obtain two or more training speech signals in advance and, for each training speech, divide the training speech into two or more speech frames; obtain the Mel-domain log power spectrum feature vector of each speech frame of the training speech, and obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speech.
Preferably, the computing module is specifically configured to:
Divide the test speech into two or more speech frames and, for each speech frame of the test speech, obtain the Mel-domain power spectrum feature vector of the speech frame; use the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech from the Mel-domain power spectrum feature vector of the speech frame.
Preferably, the reconstruction module is specifically configured to:
When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be greater than a preset threshold, determine that this dimension belongs to the reliable part of the speech frame of the test speech;
When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be less than or equal to the preset threshold, determine that this dimension belongs to the unreliable part of the speech frame of the test speech;
Judge from the unreliable part of the speech frame of the test speech that the speech frame needs missing feature reconstruction, and perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame of the test speech.
Preferably, the reconstruction module is specifically configured to:
Divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame;
When the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, judge that the speech frame of the test speech needs missing feature reconstruction;
Perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame of the test speech.
Preferably, the reconstruction module is specifically configured to:
Divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and judge from the unreliable part that the speech frame of the test speech needs missing feature reconstruction;
Judge the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster;
Perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the formula
$$\hat{X}_m = U_{km} + \theta_{kmo}\,\theta_{koo}^{-1}\,(X_o - U_{ko})$$
Wherein, $\hat{X}_m$ is the unreliable part of the speech frame of the test speech; $X_o$ is the reliable part of the speech frame of the test speech; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the covariance matrix of that Gaussian mixture cluster at the intersections of the rows corresponding to the unreliable part of the speech frame and the columns corresponding to the reliable part of the speech frame; $\theta_{koo}$ is the matrix formed by the elements of that covariance matrix at the intersections of the rows corresponding to the reliable part of the speech frame and the columns corresponding to the reliable part of the speech frame.
Compared with the prior art, the present invention includes: obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters; dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame; dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension, judging from the unreliable part that the speech frame needs missing feature reconstruction, and performing missing feature reconstruction on the unreliable part of the speech frame according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame. Because the IMCRA algorithm tracks noise effectively in both stationary and non-stationary noise environments, the scheme of the present invention improves the precision of missing feature reconstruction in both kinds of environment and thereby improves the recognition rate of the voiceprint recognition system.
Brief description of the drawings
The drawings of the embodiments of the present invention are described below. The drawings are provided for a further understanding of the present invention and, together with the description, serve to explain the present invention; they do not limit its scope of protection.
Fig. 1 is a flowchart of the method for missing feature reconstruction of the present invention;
Fig. 2 is a schematic structural diagram of the device for missing feature reconstruction of the present invention.
Detailed description
To help those skilled in the art understand the present invention, it is further described below with reference to the drawings; this description is not intended to limit the scope of protection of the present invention. It should be noted that, where no conflict arises, the embodiments of this application and the various implementations in the embodiments may be combined with one another.
Referring to Fig. 1, the present invention proposes a method for missing feature reconstruction, including:
Step 100: obtain in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters. This specifically includes:
Obtaining two or more training speech signals in advance and, for each training speech, dividing the training speech into two or more speech frames; obtaining the Mel-domain log power spectrum feature vector of each speech frame of the training speech, and obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speech.
Wherein, when the training speech is divided into two or more speech frames, the length of each speech frame of the training speech can be between 20 milliseconds (ms) and 30 ms, with an overlap of 25% to 50% between adjacent speech frames of the training speech.
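The framing described above can be sketched as follows. The default of 25 ms frames with 50% overlap is one choice within the stated ranges; the function name is hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption of this sketch.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, overlap=0.5):
    """Split a sample sequence into overlapping fixed-length frames.

    frame_ms: frame length in milliseconds (the text suggests 20 to 30 ms).
    overlap:  fraction shared by adjacent frames (the text suggests 0.25 to 0.5).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    # Assumption: trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```

At a sampling frequency of 8000 Hz, a 25 ms frame holds 200 samples and a 50% overlap gives a hop of 100 samples.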
Wherein, obtaining the Mel-domain log power spectrum feature vector of each speech frame of the training speech includes:
Performing a Fourier transform on the speech frame of the training speech; taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame; squaring the amplitude spectrum to obtain the power spectrum of the speech frame; passing the power spectrum through a Mel filter bank to obtain the Mel-domain power spectrum feature vector of the speech frame; and taking the logarithm of the Mel-domain power spectrum feature vector to obtain the Mel-domain log power spectrum feature vector of the speech frame of the training speech.
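The chain of operations above (Fourier transform, modulus, square, Mel filtering, logarithm) can be sketched with NumPy. The filter bank is passed in as a matrix with one row per Mel band and one column per FFT bin; the natural logarithm and the small constant eps that avoids log(0) are assumptions, since the text does not specify the logarithm base. Function names are hypothetical.

```python
import numpy as np

def mel_power(frame, filterbank):
    """Mel-domain power spectrum feature vector of one speech frame."""
    spectrum = np.fft.rfft(np.asarray(frame, dtype=float))  # Fourier transform
    amplitude = np.abs(spectrum)                            # modulus: amplitude spectrum
    power = amplitude ** 2                                  # square: power spectrum
    return np.asarray(filterbank, dtype=float) @ power      # Mel filter bank

def log_mel_power(frame, filterbank, eps=1e-10):
    """Mel-domain log power spectrum feature vector (natural log assumed)."""
    return np.log(mel_power(frame, filterbank) + eps)
```

For a frame of N samples, `np.fft.rfft` returns N // 2 + 1 bins, so the filter bank matrix needs that many columns.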
Wherein, the Mel filter bank can consist of triangular filters or hyperbolic filters.
Wherein, the triangular filters can be designed as follows.
The time-domain frequency is converted to the Mel-domain frequency according to formula (1):
$$f_{mel}(f) = 2595 \lg\left(1 + \frac{f}{700}\right) \quad (1)$$
Wherein, $f$ is the time-domain frequency and $f_{mel}$ is the Mel-domain frequency.
The maximum frequency of the speech signal is then calculated:
$$f_g = \frac{f_s}{2} \quad (2)$$
Wherein, $f_s$ is the sampling frequency, which is generally 8000 Hz, and $f_g$ is the maximum of the original frequency of the speech signal.
Combining formula (1) and formula (2) and setting $f = f_g$ gives the maximum Mel-domain frequency of the speech:
$$f_{max}[mel] = 2595 \lg\left(1 + \frac{f_s}{1400}\right)\ \text{mel} \quad (3)$$
Wherein, $f_{max}$ is the maximum Mel-domain frequency.
For a $K$-order Mel-domain power spectrum feature vector, the Mel-domain spacing is:
$$\Delta mel = \frac{f_{max}}{K+1}\ \text{mel} \quad (4)$$
Using the inverse function of formula (1), the vector $M = \{\Delta mel, 2\Delta mel, 3\Delta mel, \ldots, (K+1)\Delta mel\}$ is converted to time-domain frequencies to obtain $f_{center} = \{f_1, f_2, f_3, \ldots, f_{K+1}\}$; finally, $f_{center}$ is used to design the triangular filters. The designed triangular filters have a gain of 1; the center frequency of each triangle is $f_n$, $n \in \{1, \ldots, K\}$; the bandwidth to the left of the center frequency is $f_n - f_{n-1}$ and the bandwidth to the right is $f_{n+1} - f_n$. The bandwidth to the left of the center frequency of the first triangle is $f_1$.
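The design procedure in formulas (1) to (4) can be sketched in plain Python. The sketch makes several assumptions that the text leaves open: a gain of 1 is read as a peak value of 1 at each center frequency, the filters are sampled on num_bins equally spaced FFT bins from 0 to f_s/2, and f_0 is taken as 0 so that the first triangle's left bandwidth is f_1. All function names are hypothetical.

```python
import math

def hz_to_mel(f):
    """Formula (1): f_mel = 2595 * lg(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse function of formula (1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def center_frequencies(sample_rate, num_filters):
    """f_1 .. f_{K+1} in Hz, equally spaced in the Mel domain."""
    f_max_mel = hz_to_mel(sample_rate / 2.0)       # formulas (2) and (3)
    delta_mel = f_max_mel / (num_filters + 1)      # formula (4)
    return [mel_to_hz(n * delta_mel) for n in range(1, num_filters + 2)]

def triangular_filterbank(sample_rate, num_filters, num_bins):
    """Return num_filters rows, each with num_bins triangular weights."""
    # Prepend f_0 = 0 so that the first triangle's left bandwidth is f_1.
    edges = [0.0] + center_frequencies(sample_rate, num_filters)
    bin_hz = (sample_rate / 2.0) / (num_bins - 1)
    bank = []
    for n in range(1, num_filters + 1):
        left, mid, right = edges[n - 1], edges[n], edges[n + 1]
        row = []
        for b in range(num_bins):
            f = b * bin_hz
            if left < f <= mid:
                row.append((f - left) / (mid - left))    # rising edge
            elif mid < f < right:
                row.append((right - f) / (right - mid))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank
```

Because the centers are equally spaced in the Mel domain, the triangles widen with frequency once converted back to Hz, and f_{K+1} coincides with f_s/2.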
Wherein, obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speech includes:
Setting the number of Gaussian mixture clusters and initializing the mean vector, covariance matrix and weight of each Gaussian mixture cluster; then using a Gaussian mixture clustering algorithm, based on the initialized mean vectors, covariance matrices and weights and on the Mel-domain log power spectrum feature vectors of the speech frames of the training speech, to obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster.
Wherein, the number of Gaussian mixture clusters can be set to 128 based on experience.
Wherein, the LBG vector quantization (VQ, Vector Quantization) algorithm can be used to initialize the mean vector of each Gaussian mixture cluster. Its specific implementation is common knowledge to those skilled in the art and does not limit the scope of protection of the present invention, so it is not repeated here.
Wherein, the covariance matrix of each Gaussian mixture cluster can be randomly initialized with numbers between 0 and 2.
Wherein, when the weights of the Gaussian mixture clusters are randomly initialized, the weights of all Gaussian mixture clusters must sum to 1.
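A minimal sketch of the random initialization follows. It assumes that only the diagonal of each covariance matrix is drawn from the interval (0, 2], which keeps the matrix positive definite; the text does not say how off-diagonal entries are set. Initializing the mean vectors with LBG-VQ is not shown, and the function names are hypothetical.

```python
import random

def init_weights(num_clusters, rng):
    """Random positive weights normalized so that they sum to 1."""
    raw = [rng.random() + 1e-6 for _ in range(num_clusters)]
    total = sum(raw)
    return [r / total for r in raw]

def init_diag_covariances(num_clusters, dim, rng):
    """Assumption: one random variance in (0, 2] per dimension and cluster."""
    return [[rng.uniform(1e-3, 2.0) for _ in range(dim)]
            for _ in range(num_clusters)]
```

Passing a seeded `random.Random` instance as `rng` makes the initialization reproducible.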
Wherein, Gaussian Mixture clustering algorithm can be EM algorithm.EM algorithm belongs to existing algorithm, The protection domain that concrete implementation is not intended to limit the present invention.EM algorithm is implemented as follows:
Clustering for each Gaussian Mixture, circulation performs formula (5) to formula (9).
γ ( i , k 1 ) = ω k 1 N 1 ( x i , μ k 1 , θ k 1 ) Σ j = 1 M ω j N 1 ( x i , μ j , θ j ) - - - ( 5 )
Wherein, i represents the sequence number of the speech frame of training voice, and N1 is Gaussian function, ωk1For kth 1 The weight of Gaussian Mixture cluster, xiMel territory log power spectrum for the speech frame of i-th training voice is special Levy vector parameters, μk1For the mean vector of 1 Gaussian Mixture cluster of kth, θk1Mix for 1 Gauss of kth Close the covariance matrix of cluster, μjFor the mean vector of jth Gaussian Mixture cluster, θjFor jth Gauss The covariance matrix of mixing cluster, j, k1 are the sequence number of Gaussian Mixture cluster, and M is Gaussian Mixture cluster Number.
S k 1 = Σ i - 1 n γ ( i , k 1 ) - - - ( 6 )
Wherein, n is the number of the speech frame training voice.
μ k 1 = 1 S k 1 Σ i = 1 n γ ( i , k 1 ) x i - - - ( 7 )
θ k 1 = 1 S k 1 Σ i = 1 n γ ( i , k 1 ) ( x i - μ k 1 ) ( x i - μ k 1 ) T - - - ( 8 )
ω k 1 = S k 1 Σ k = 1 M S k 1 - - - ( 9 )
The number of loop iterations may be preset; the more iterations, the higher the precision. For example, it may be set to 10.
The Gaussian function is
$$N_1(X,U,\theta)=\frac{1}{\sqrt{(2\pi)^{d}\,|\theta|}}\exp\!\left[-0.5\,(X-U)^{T}\theta^{-1}(X-U)\right] \tag{10}$$
where $X$ is the Mel-domain log power spectrum feature vector of a speech frame of the training speech, $U$ is the mean vector of the Gaussian mixture cluster, $\theta$ is the covariance matrix of the Gaussian mixture cluster, and $d$ is the dimension of the Mel-domain log power spectrum feature vector of the speech frame of the training speech.
The 128 sets of ω, μ and θ obtained after the 10 iterations are saved; these parameters are considered to represent the common characteristics of human speech.
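One pass of formulas (5) to (9) can be sketched as below. To stay short and dependency-free, the sketch assumes scalar features (d = 1), so the covariance matrix reduces to a single variance. The function names, the variance floor `1e-6`, and the toy data in the usage note are illustrative assumptions, not part of the source.

```python
import math

def gauss(x, mu, var):
    """Formula (10) for d = 1: scalar Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, weights, means, variances):
    """One pass of formulas (5)-(9) for scalar features."""
    m, n = len(weights), len(xs)
    # (5) responsibility of cluster k for frame i
    gamma = []
    for x in xs:
        num = [weights[k] * gauss(x, means[k], variances[k]) for k in range(m)]
        denom = sum(num)
        gamma.append([v / denom for v in num])
    # (6) soft frame count per cluster
    s = [sum(gamma[i][k] for i in range(n)) for k in range(m)]
    # (7) updated means
    new_means = [sum(gamma[i][k] * xs[i] for i in range(n)) / s[k] for k in range(m)]
    # (8) updated variances (floored to stay positive; the floor is an added safeguard)
    new_vars = [max(sum(gamma[i][k] * (xs[i] - new_means[k]) ** 2 for i in range(n)) / s[k], 1e-6)
                for k in range(m)]
    # (9) updated weights
    total = sum(s)
    new_weights = [sk / total for sk in s]
    return new_weights, new_means, new_vars

def fit(xs, weights, means, variances, iterations=10):
    """Repeat the EM pass a preset number of times (10 in the source's example)."""
    for _ in range(iterations):
        weights, means, variances = em_step(xs, weights, means, variances)
    return weights, means, variances
```

For example, `fit([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])` separates the two groups of samples, with the weights still summing to 1.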
Step 101: the test speech is divided into two or more speech frames, and for each speech frame of the test speech, the improved minima controlled recursive averaging (IMCRA) algorithm is used to calculate the signal-to-noise ratio of each dimension of the speech frame.
In this step, a dimension of a speech frame of the test speech refers to one subband of the Mel-domain power spectrum feature vector extracted from that speech frame. For example, when the Mel-domain power spectrum feature vector of a speech frame is [2, 3, 4], the speech frame has three dimensions, with values 2, 3 and 4 respectively.
In this step, when the test speech is divided into two or more speech frames, the length of each speech frame may be between 20 ms and 30 ms, with an overlap of 25% to 50% between two adjacent speech frames.
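The framing described above can be sketched as follows. The function name `split_frames`, the default of 25 ms frames with 50% overlap, and the decision to drop a trailing partial frame are illustrative assumptions within the ranges the text gives.

```python
def split_frames(samples, sample_rate, frame_ms=25.0, overlap=0.5):
    """Split a sample sequence into fixed-length, overlapping frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # The hop is the non-overlapping part of each frame.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```

At an 8 kHz sampling rate this gives 200-sample frames that advance by 100 samples.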
In this step, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of a speech frame of the test speech includes:
obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech, and using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame from that feature vector.
Obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech includes:
performing a Fourier transform on the speech frame; taking the modulus of the transformed speech frame to obtain its amplitude spectrum; squaring the amplitude spectrum to obtain the power spectrum of the speech frame; and passing the power spectrum through a Mel comb filter to obtain the Mel-domain power spectrum feature vector of the speech frame.
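The feature extraction chain above (Fourier transform, modulus, square, Mel filtering) can be sketched as below. The source does not specify the filter shapes, so the triangular filters and the common Mel-scale mapping 2595·log10(1 + f/700) are assumptions, as are all function names. A naive DFT is used only to keep the sketch self-contained; a real system would use an FFT.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT; modulus gives the amplitude spectrum, its square the power spectrum."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale (assumed shape)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [h * n_fft / sample_rate for h in edges_hz]  # fractional bin positions
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mel_power_features(frame, bank, take_log=False):
    """Mel-domain power spectrum features; take_log=True gives the log variant."""
    spec = power_spectrum(frame)
    feats = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    if take_log:
        feats = [math.log(max(f, 1e-12)) for f in feats]
    return feats
```

The `take_log` switch reflects that the training features are log power spectra, while the SNR estimation works on the power spectrum itself.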
Using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame from its Mel-domain power spectrum feature vector includes:
calculating the noise power of each dimension of the speech frame according to
$$D^{2}(\lambda,k2)=\alpha_d(\lambda,k2)\,D^{2}(\lambda-1,k2)+\left[1-\alpha_d(\lambda,k2)\right]Y^{2}(\lambda,k2)$$
and calculating the signal-to-noise ratio of each dimension of the speech frame according to
$$SNR(\lambda,k2)=20\log_{10}\big(Y(\lambda,k2)-D(\lambda,k2)\big)-20\log_{10}D(\lambda,k2)$$
where $D^{2}(\lambda,k2)$ is the value of the k2-th dimension of the Mel-domain noise power of the λ-th speech frame of the test speech, $k2$ is the dimension index of the Mel-domain power spectrum feature vector of the speech frame, $\lambda$ is the speech frame index of the test speech, $\alpha_d$ is the smoothing parameter, $Y$ is the value of the k2-th dimension of the Mel-domain power spectrum feature vector of the speech frame, and $Y^{2}(\lambda,k2)$ is the value of the k2-th dimension of the Mel-domain power spectrum feature vector of the λ-th speech frame of the test speech.
Here,
$$\alpha_d(\lambda,k2)=\alpha+(1-\alpha)\,p(\lambda,k2) \tag{11}$$
where $\alpha$ is a constant and $p(\lambda,k2)$ is the probability that speech is present in the k2-th dimension of the λ-th speech frame of the test speech.
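The noise recursion, formula (11) and the SNR expression can be sketched per dimension as below. The function names, the numeric floor inside the logarithms, and the example value α = 0.85 are assumptions for illustration; the speech presence probability `p` is taken as an input because its computation is covered separately by the IMCRA steps that follow.

```python
import math

def smoothing_parameter(alpha, p):
    """Formula (11): alpha_d = alpha + (1 - alpha) * p."""
    return alpha + (1.0 - alpha) * p

def update_noise_power(prev_d2, y2, alpha, p):
    """D^2(l, k) = alpha_d * D^2(l - 1, k) + (1 - alpha_d) * Y^2(l, k)."""
    a = smoothing_parameter(alpha, p)
    return a * prev_d2 + (1.0 - a) * y2

def snr_db(y, d, floor=1e-10):
    """SNR = 20 log10(Y - D) - 20 log10(D); the floor is an added safeguard."""
    return 20.0 * math.log10(max(y - d, floor)) - 20.0 * math.log10(max(d, floor))
```

When speech is certainly present (`p = 1`) the noise estimate is frozen; when speech is certainly absent (`p = 0`) the estimate moves toward the current frame at the rate set by α.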
where $q(\lambda,k2)$ is the probability that speech is absent in the k2-th dimension of the λ-th speech frame of the test speech, $\gamma(\lambda,k2)$ is the a posteriori signal-to-noise ratio at the k2-th dimension of the Mel-domain power spectrum feature vector of the λ-th speech frame, and $\zeta(\lambda,k2)$ is the a priori signal-to-noise ratio at the k2-th dimension of the Mel-domain power spectrum feature vector of the λ-th speech frame.
$$\gamma(\lambda,k2)=\frac{|Y(\lambda,k2)|^{2}}{B_{min}\,S_{min}(\lambda,k2)} \tag{13}$$
where $B_{min}$ is a bias factor and $S_{min}$ is the minimum of $S(\lambda,k2-1-k3)$ to $S(\lambda,k2-1)$. $k3$ may be preset.
Here,
$$S(\lambda,k)=\alpha_S\,S(\lambda-1,k)+(1-\alpha_S)\,S_f(\lambda,k) \tag{15}$$
where $\alpha_S$ is a constant smoothing factor, and
$$S_f(\lambda,k)=\sum_{i=-L_w}^{L_w}w(i)\,|Y(\lambda,k-i)|^{2} \tag{16}$$
where $w(i)$ is a Hanning window function of length $2L_w+1$.
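Formulas (15) and (16) can be sketched as below, reading the summand of (16) as |Y(λ, k−i)|², which is an assumption about the flattened formula. The window normalization (weights summing to 1), the edge clamping, the function names, and the example value of the smoothing factor are likewise illustrative choices, not taken from the source.

```python
import math

def hann_window(half_width):
    """Hann window of length 2 * half_width + 1, normalised to sum to 1."""
    length = 2 * half_width + 1
    raw = [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 1) / (length + 1))
           for i in range(length)]
    total = sum(raw)
    return [r / total for r in raw]

def smooth_over_dims(y2, half_width=1):
    """Formula (16): windowed sum over neighbouring dimensions (edges clamped)."""
    w = hann_window(half_width)
    n = len(y2)
    out = []
    for k in range(n):
        acc = 0.0
        for idx, i in enumerate(range(-half_width, half_width + 1)):
            j = min(max(k - i, 0), n - 1)
            acc += w[idx] * y2[j]
        out.append(acc)
    return out

def smooth_over_frames(prev_s, s_f, alpha_s=0.9):
    """Formula (15): first-order recursive smoothing across frames."""
    return [alpha_s * p + (1.0 - alpha_s) * c for p, c in zip(prev_s, s_f)]
```

With a normalised window, a flat spectrum passes through the dimension smoothing unchanged, which makes the smoothed value directly comparable with the unsmoothed power.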
where $\gamma_1$ is a constant threshold.
$$\tilde{\gamma}(\lambda,k2)=\frac{|Y(\lambda,k2)|^{2}}{B_{min}\,\tilde{S}_{min}(\lambda,k2)} \tag{18}$$
$$\tilde{S}(\lambda,k2)=\alpha_S\,\tilde{S}(\lambda-1,k2)+(1-\alpha_S)\,\tilde{S}_f(\lambda,k2) \tag{20}$$
where $\gamma_0$ and $\zeta_0$ are constant thresholds.
Step 102: according to the signal-to-noise ratio of each dimension of the speech frame of the test speech, the speech frame is divided into a reliable part and an unreliable part; when it is determined from the unreliable part that the speech frame requires missing feature reconstruction, missing feature reconstruction is performed on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
In this step, when it is determined from the unreliable part of the speech frame that the speech frame does not require missing feature reconstruction, the speech frame of the test speech is discarded.
In this step, dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame includes:
when the signal-to-noise ratio of a dimension of the speech frame is greater than a preset threshold, determining that this dimension belongs to the reliable part of the speech frame; when the signal-to-noise ratio of a dimension of the speech frame is less than or equal to the preset threshold, determining that this dimension belongs to the unreliable part of the speech frame.
The speech frame may be divided into the reliable part and the unreliable part by labelling, specifically with a labelling formula in which L is the preset threshold and M(λ, k2) is the label value.
Determining from the unreliable part of the speech frame whether the speech frame requires missing feature reconstruction includes:
when the ratio of the number of dimensions in the unreliable part of the speech frame to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, determining that the speech frame requires missing feature reconstruction; when that ratio is less than the preset ratio, determining that the speech frame does not require missing feature reconstruction.
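The labelling and the ratio test can be sketched as below. Because the label values of the source's formula are not recoverable from the text, the sketch uses booleans (`True` for reliable), which is an assumption; the function names and the threshold values in the usage note are also illustrative.

```python
def reliability_mask(snr_per_dim, threshold):
    """True marks a reliable dimension: SNR strictly greater than the threshold."""
    return [snr > threshold for snr in snr_per_dim]

def needs_reconstruction(mask, min_unreliable_ratio):
    """Reconstruct when the unreliable share reaches the preset ratio."""
    unreliable = sum(1 for reliable in mask if not reliable)
    return unreliable / len(mask) >= min_unreliable_ratio
```

For example, SNRs of `[10.0, -3.0, 0.0, 7.5]` with a threshold of 0 give two unreliable dimensions out of four, so the frame is reconstructed for a preset ratio of 0.5 but not for 0.75.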
In this step, performing missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame includes:
determining the Gaussian mixture cluster to which the speech frame belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster, and performing missing feature reconstruction on the unreliable part of the speech frame according to the weight, mean vector and covariance matrix of that Gaussian mixture cluster and the reconstruction formula.
In the reconstruction formula, the reconstructed quantity is the unreliable part of the speech frame of the test speech; $X_o$ is the reliable part of the speech frame; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the cluster's covariance matrix at the intersections of the rows corresponding to the unreliable part and the columns corresponding to the reliable part of the speech frame; and $\theta_{koo}$ is the matrix formed by the elements of the cluster's covariance matrix at the intersections of the rows corresponding to the reliable part and the columns corresponding to the reliable part of the speech frame.
That is, once the Gaussian mixture cluster to which the speech frame of the test speech belongs has been obtained, the mean vector of that cluster can be rearranged as $U_k=[U_{ko},U_{km}]$, and its covariance matrix can be rearranged as
$$\theta_k=\begin{bmatrix}\theta_{koo}&\theta_{kom}\\ \theta_{kmo}&\theta_{kmm}\end{bmatrix}$$
For example, suppose the total number of dimensions of the speech frame is 6, where dimensions 1, 3 and 5 form the reliable part and dimensions 2, 4 and 6 form the unreliable part. Then $U_{ko}$ consists of dimensions 1, 3 and 5 of the mean vector of the Gaussian mixture cluster to which the speech frame belongs; $\theta_{kmo}$ is the matrix formed by the elements at the intersections of rows 2, 4, 6 and columns 1, 3, 5 of the cluster's covariance matrix; and $\theta_{koo}$ is the matrix formed by the elements at the intersections of rows 1, 3, 5 and columns 1, 3, 5 of the cluster's covariance matrix.
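The partitioning above can be sketched in code. The reconstruction formula itself did not survive in this text, so the sketch assumes the standard conditional-mean estimate for a partitioned Gaussian, U_km + θ_kmo θ_koo⁻¹ (X_o − U_ko), which uses exactly the quantities defined above; this is an assumption, not a confirmed reading of the source. The function names and the small linear solver are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def reconstruct(frame, mask, mean, cov):
    """Replace unreliable dimensions by the assumed conditional-mean estimate."""
    obs = [i for i, ok in enumerate(mask) if ok]       # reliable indices
    mis = [i for i, ok in enumerate(mask) if not ok]   # unreliable indices
    if not mis:
        return list(frame)
    if not obs:
        return list(mean)
    cov_oo = [[cov[r][c] for c in obs] for r in obs]   # theta_koo
    diff = [frame[i] - mean[i] for i in obs]           # X_o - U_ko
    z = solve(cov_oo, diff)                            # theta_koo^{-1} (X_o - U_ko)
    out = list(frame)
    for r in mis:
        # Row r of theta_kmo times z, added to the matching entry of U_km.
        out[r] = mean[r] + sum(cov[r][c] * zc for c, zc in zip(obs, z))
    return out
```

Solving the linear system instead of forming θ_koo⁻¹ explicitly is a numerical choice of this sketch; the result is the same product.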
Determining the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster includes:
determining the Gaussian mixture cluster to which the speech frame belongs according to a maximum likelihood formula,
where $X$ is the Mel-domain log power spectrum feature vector of the speech frame of the test speech, $\lambda_{k4}$ is the k4-th Gaussian mixture cluster, the result of the formula is the value of k4 corresponding to the maximum likelihood value, $P$ is the likelihood between $X$ and $\lambda_{k4}$, and argmax returns the value of k4 at which $P$ is maximal.
Here,
$$P(X\mid\lambda_{k4})=\omega_{k4}\prod_{y\in X_o}N_2(y,\mu_{k4,y},\sigma_{k4,y})\prod_{y\in X_m}0.5\left(1+\operatorname{erf}\!\left(\frac{y-\mu_{k4,y}}{\sqrt{2}\,\sigma_{k4,y}}\right)\right) \tag{23}$$
$$N_2(y,\mu_{k4,y},\sigma_{k4,y})=\frac{1}{\sqrt{2\pi\sigma_{k4,y}^{2}}}\exp\!\left(-\frac{0.5\,(y-\mu_{k4,y})^{2}}{\sigma_{k4,y}^{2}}\right) \tag{24}$$
where $y$ is the data of one dimension of $X$, $\omega_{k4}$ is the weight of the Gaussian mixture cluster $\lambda_{k4}$, $\mu_{k4,y}$ is the mean of $\lambda_{k4}$ corresponding to dimension $y$, and $\sigma_{k4,y}$ is the diagonal value of the covariance matrix of $\lambda_{k4}$ corresponding to dimension $y$.
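The cluster selection by formulas (23) and (24) can be sketched as below. The sketch works with log-likelihoods, which leaves the argmax unchanged because the logarithm is monotonic; that choice, the numeric floor on the error-function term, and the function names are assumptions for illustration.

```python
import math

def log_likelihood(frame, mask, weight, means, sigmas):
    """Logarithm of formula (23) for one cluster with per-dimension statistics."""
    total = math.log(weight)
    for y, reliable, mu, sigma in zip(frame, mask, means, sigmas):
        if reliable:
            # Reliable dimension: Gaussian density N2, formula (24).
            total += (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                      - 0.5 * (y - mu) ** 2 / sigma ** 2)
        else:
            # Unreliable dimension: Gaussian cumulative term 0.5 * (1 + erf(...)).
            cdf = 0.5 * (1.0 + math.erf((y - mu) / (math.sqrt(2.0) * sigma)))
            total += math.log(max(cdf, 1e-300))
    return total

def best_cluster(frame, mask, weights, means, sigmas):
    """Index k4 of the cluster with the largest likelihood."""
    scores = [log_likelihood(frame, mask, w, m, s)
              for w, m, s in zip(weights, means, sigmas)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Summing logarithms instead of multiplying densities avoids underflow when the feature vector has many dimensions.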
With the method of the invention, because the IMCRA algorithm can track noise effectively in non-stationary noise environments, the precision of missing feature reconstruction is improved.
Referring to Fig. 2, the invention also proposes a device for realizing missing feature reconstruction, which at least includes:
an acquisition module, configured to obtain in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;
a computing module, configured to divide the test speech into two or more speech frames and, for each speech frame of the test speech, use the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame;
a reconstruction module, configured to divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and, when it is determined from the unreliable part that the speech frame requires missing feature reconstruction, perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
In the device of the invention, the reconstruction module is further configured to:
discard the speech frame of the test speech when it is determined from the unreliable part of the speech frame that the speech frame does not require missing feature reconstruction.
In the device of the invention, the acquisition module is specifically configured to:
obtain two or more training speeches in advance and, for each training speech, divide the training speech into two or more speech frames; obtain the Mel-domain log power spectrum feature vector of each speech frame of each training speech; and obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speeches.
In the device of the invention, the computing module is specifically configured to:
divide the test speech into two or more speech frames and, for each speech frame of the test speech, obtain the Mel-domain power spectrum feature vector of the speech frame; and use the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame from its Mel-domain power spectrum feature vector.
In the device of the invention, the reconstruction module is specifically configured to:
when the signal-to-noise ratio of a dimension of the speech frame of the test speech is greater than a preset threshold, determine that this dimension belongs to the reliable part of the speech frame;
when the signal-to-noise ratio of a dimension of the speech frame of the test speech is less than or equal to the preset threshold, determine that this dimension belongs to the unreliable part of the speech frame;
when it is determined from the unreliable part of the speech frame that the speech frame requires missing feature reconstruction, perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
In the device of the invention, the reconstruction module is specifically configured to:
divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame;
when the ratio of the number of dimensions in the unreliable part of the speech frame to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, determine that the speech frame requires missing feature reconstruction;
perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
In the device of the invention, the reconstruction module is specifically configured to:
divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and determine from the unreliable part that the speech frame requires missing feature reconstruction;
determine the Gaussian mixture cluster to which the speech frame belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster;
perform missing feature reconstruction on the unreliable part of the speech frame according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the reconstruction formula;
where, in the reconstruction formula, the reconstructed quantity is the unreliable part of the speech frame of the test speech; $X_o$ is the reliable part of the speech frame; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the cluster's covariance matrix at the intersections of the rows corresponding to the unreliable part and the columns corresponding to the reliable part of the speech frame; and $\theta_{koo}$ is the matrix formed by the elements of the cluster's covariance matrix at the intersections of the rows corresponding to the reliable part and the columns corresponding to the reliable part of the speech frame.
It should be noted that the embodiments described above are provided only to help those skilled in the art understand the invention and do not limit its scope of protection. Any obvious replacement or improvement made by those skilled in the art without departing from the inventive concept falls within the scope of protection of the invention.

Claims (19)

1. A method for realizing missing feature reconstruction, characterized by comprising:
obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;
dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the improved minima controlled recursive averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame;
dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and, when it is determined from the unreliable part that the speech frame requires missing feature reconstruction, performing missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
2. The method according to claim 1, characterized in that, when it is determined from the unreliable part of the speech frame of the test speech that the speech frame does not require missing feature reconstruction, the method further comprises: discarding the speech frame of the test speech.
3. The method according to claim 1 or 2, characterized in that obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters comprises:
obtaining two or more training speeches in advance and, for each training speech, dividing the training speech into two or more speech frames;
obtaining the Mel-domain log power spectrum feature vector of each speech frame of each training speech, and obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speeches.
4. The method according to claim 3, characterized in that obtaining the Mel-domain log power spectrum feature vector of each speech frame of each training speech comprises:
performing a Fourier transform on the speech frame of the training speech, and taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame;
squaring the amplitude spectrum of the speech frame to obtain the power spectrum of the speech frame;
passing the power spectrum of the speech frame through a Mel comb filter to obtain the Mel-domain power spectrum feature vector of the speech frame, and taking the logarithm of the Mel-domain power spectrum feature vector to obtain the Mel-domain log power spectrum feature vector of the speech frame of the training speech.
5. The method according to claim 3, characterized in that obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the Mel-domain log power spectrum feature vectors of the speech frames of the training speeches comprises:
setting the number of Gaussian mixture clusters, and initializing the mean vector, covariance matrix and weight of each Gaussian mixture cluster;
using a Gaussian mixture clustering algorithm to obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the initialized mean vector, covariance matrix and weight of each Gaussian mixture cluster and the Mel-domain log power spectrum feature vectors of the speech frames of the training speeches.
6. The method according to claim 1 or 2, characterized in that using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech comprises:
obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech;
using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame from its Mel-domain power spectrum feature vector.
7. The method according to claim 6, characterized in that obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech comprises:
performing a Fourier transform on the speech frame of the test speech, and taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame;
squaring the amplitude spectrum of the speech frame to obtain the power spectrum of the speech frame, and passing the power spectrum through a Mel comb filter to obtain the Mel-domain power spectrum feature vector of the speech frame of the test speech.
8. The method according to claim 6, characterized in that using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech from its Mel-domain power spectrum feature vector comprises:
calculating the noise power of each dimension of the speech frame according to $D^{2}(\lambda,k2)=\alpha_d(\lambda,k2)\,D^{2}(\lambda-1,k2)+[1-\alpha_d(\lambda,k2)]\,Y^{2}(\lambda,k2)$, and calculating the signal-to-noise ratio of each dimension of the speech frame according to $SNR(\lambda,k2)=20\log_{10}(Y(\lambda,k2)-D(\lambda,k2))-20\log_{10}D(\lambda,k2)$;
where $D^{2}(\lambda,k2)$ is the value of the k2-th dimension of the Mel-domain noise power of the λ-th speech frame of the test speech, $k2$ is the dimension index of the Mel-domain power spectrum feature vector of the speech frame, $\lambda$ is the speech frame index of the test speech, $\alpha_d$ is the smoothing parameter, $Y$ is the value of the k2-th dimension of the Mel-domain power spectrum feature vector of the speech frame, and $Y^{2}(\lambda,k2)$ is the value of the k2-th dimension of the Mel-domain power spectrum feature vector of the λ-th speech frame of the test speech.
9. The method according to claim 1 or 2, characterized in that dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame comprises:
when the signal-to-noise ratio of a dimension of the speech frame is greater than a preset threshold, determining that this dimension belongs to the reliable part of the speech frame;
when the signal-to-noise ratio of a dimension of the speech frame is less than or equal to the preset threshold, determining that this dimension belongs to the unreliable part of the speech frame.
10. The method according to claim 1 or 2, characterized in that, before performing missing feature reconstruction on the unreliable part of the speech frame of the test speech, the method further comprises determining from the unreliable part of the speech frame whether the speech frame requires missing feature reconstruction, including:
when the ratio of the number of dimensions in the unreliable part of the speech frame to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, determining that the speech frame requires missing feature reconstruction;
when the ratio of the number of dimensions in the unreliable part of the speech frame to the total number of dimensions of the speech frame is less than the preset ratio, determining that the speech frame does not require missing feature reconstruction.
11. The method according to claim 1 or 2, characterized in that performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame comprises:
determining the Gaussian mixture cluster to which the speech frame belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster;
performing missing feature reconstruction on the unreliable part of the speech frame according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the reconstruction formula;
where, in the reconstruction formula, the reconstructed quantity is the unreliable part of the speech frame of the test speech; $X_o$ is the reliable part of the speech frame; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the cluster's covariance matrix at the intersections of the rows corresponding to the unreliable part and the columns corresponding to the reliable part of the speech frame; and $\theta_{koo}$ is the matrix formed by the elements of the cluster's covariance matrix at the intersections of the rows corresponding to the reliable part and the columns corresponding to the reliable part of the speech frame.
12. The method according to claim 11, characterized in that determining the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster comprises:
determining the Gaussian mixture cluster to which the speech frame belongs according to a maximum likelihood formula;
where $X$ is the Mel-domain log power spectrum feature vector of the speech frame of the test speech, $\lambda_{k4}$ is the k4-th Gaussian mixture cluster, the result of the formula is the value of k4 corresponding to the maximum likelihood value, $P$ is the likelihood between $X$ and $\lambda_{k4}$, and argmax returns the value of k4 at which $P$ is maximal;
where
$$P(X\mid\lambda_{k4})=\omega_{k4}\prod_{y\in X_o}N_2(y,\mu_{k4,y},\sigma_{k4,y})\prod_{y\in X_m}0.5\left(1+\operatorname{erf}\!\left(\frac{y-\mu_{k4,y}}{\sqrt{2}\,\sigma_{k4,y}}\right)\right);$$
$$N_2(y,\mu_{k4,y},\sigma_{k4,y})=\frac{1}{\sqrt{2\pi\sigma_{k4,y}^{2}}}\exp\!\left(-\frac{0.5\,(y-\mu_{k4,y})^{2}}{\sigma_{k4,y}^{2}}\right);$$
where $y$ is the data of one dimension of $X$, $\omega_{k4}$ is the weight of the Gaussian mixture cluster $\lambda_{k4}$, $\mu_{k4,y}$ is the mean of $\lambda_{k4}$ corresponding to dimension $y$, and $\sigma_{k4,y}$ is the diagonal value of the covariance matrix of $\lambda_{k4}$ corresponding to dimension $y$;
by comparing the likelihood values $P(X\mid\lambda_{k4})$ of $X$ with respect to each Gaussian mixture cluster, the Gaussian mixture cluster $\lambda_{k4}$ with the largest likelihood value is selected.
13. A device for realizing missing feature reconstruction, characterized by at least comprising:
an acquisition module, configured to obtain in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;
a computing module, configured to divide the test speech into two or more speech frames and, for each speech frame of the test speech, use the improved minima controlled recursive averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame;
a reconstruction module, configured to divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and, when it is determined from the unreliable part that the speech frame requires missing feature reconstruction, perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
14. devices according to claim 13, it is characterised in that described reconstruction module is additionally operable to:
Unreliable part at the speech frame according to described tested speech judges the voice of described tested speech When frame is made without lacking feature reconstruction, abandon the speech frame of described tested speech.
15. The device according to claim 13 or 14, characterized in that the acquisition module is specifically configured to:
obtain two or more pieces of training speech in advance and, for each piece of training speech, divide the training speech into two or more speech frames; obtain the Mel-domain log power spectrum feature vector of each speech frame of each piece of training speech, and obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector parameters of the speech frames of each piece of training speech.
16. The device according to claim 13 or 14, characterized in that the computing module is specifically configured to:
divide the test speech into two or more speech frames and, for each speech frame of the test speech, obtain the Mel-domain power spectrum feature vector parameters of the speech frame; and calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech from the Mel-domain power spectrum feature vector parameters of the speech frame using the IMCRA algorithm.
17. The device according to claim 13 or 14, characterized in that the reconstruction module is specifically configured to:
determine, when the signal-to-noise ratio of a dimension of the speech frame of the test speech is greater than a preset threshold, that this dimension belongs to the reliable part of the speech frame of the test speech;
determine, when the signal-to-noise ratio of a dimension of the speech frame of the test speech is less than or equal to the preset threshold, that this dimension belongs to the unreliable part of the speech frame of the test speech;
determine, according to the unreliable part of the speech frame of the test speech, that the speech frame needs missing feature reconstruction, and perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
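The threshold rule in this claim amounts to partitioning the dimension indices of a frame by their signal-to-noise ratio. A minimal sketch, assuming the per-dimension SNR values are already available as a list; the function name, the sample values and the threshold of 0.0 are hypothetical.

```python
def split_reliable(snr_per_dim, threshold):
    """Partition dimension indices into (reliable, unreliable).

    A dimension is reliable when its SNR is strictly greater than the
    threshold, and unreliable when it is less than or equal to it.
    """
    reliable = [i for i, s in enumerate(snr_per_dim) if s > threshold]
    unreliable = [i for i, s in enumerate(snr_per_dim) if s <= threshold]
    return reliable, unreliable

# Hypothetical per-dimension SNR values for one four-dimensional frame.
snr = [12.0, -3.0, 0.0, 7.5]
reliable, unreliable = split_reliable(snr, threshold=0.0)
```

A dimension whose SNR equals the threshold exactly falls into the unreliable part, matching the "less than or equal to" wording of the claim.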
18. The device according to claim 13 or 14, characterized in that the reconstruction module is specifically configured to:
divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame;
determine that the speech frame of the test speech needs missing feature reconstruction when the ratio of the number of dimensions in the unreliable part of the speech frame to the total number of dimensions of the speech frame is greater than or equal to a preset ratio;
perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.
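The ratio test of this claim, combined with the discard rule of claim 14, can be sketched as a small decision function. The function names, the returned labels and the preset ratio of 0.5 are hypothetical values chosen for illustration.

```python
def needs_reconstruction(num_unreliable, num_dims, preset_ratio):
    """True when the share of unreliable dimensions reaches the preset ratio."""
    if num_dims <= 0:
        raise ValueError("a frame must have at least one dimension")
    return num_unreliable / num_dims >= preset_ratio

def decide(num_unreliable, num_dims, preset_ratio):
    """Return 'reconstruct' or 'discard' for one frame (labels are hypothetical)."""
    if needs_reconstruction(num_unreliable, num_dims, preset_ratio):
        return "reconstruct"
    return "discard"

# Two of four dimensions unreliable, preset ratio 0.5: the frame is reconstructed.
action = decide(num_unreliable=2, num_dims=4, preset_ratio=0.5)
```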
19. The device according to claim 13 or 14, characterized in that the reconstruction module is specifically configured to:
divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and determine, according to the unreliable part of the speech frame, that the speech frame needs missing feature reconstruction;
determine the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster;
perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the formula $X_m = U_{km} + \theta_{kmo}\,\theta_{koo}^{-1}\,(X_o - U_{ko})$;
wherein $X_m$ is the unreliable part of the speech frame of the test speech; $X_o$ is the reliable part of the speech frame of the test speech; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the covariance matrix of the Gaussian mixture cluster to which the speech frame belongs at the intersections of the rows corresponding to the unreliable part of the speech frame and the columns corresponding to the reliable part of the speech frame; $\theta_{koo}$ is the matrix formed by the elements of that covariance matrix at the intersections of the rows corresponding to the reliable part of the speech frame and the columns corresponding to the reliable part of the speech frame.
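The reconstruction step can be sketched under one assumption: that the formula is the conditional mean of a Gaussian, $X_m = U_{km} + \theta_{kmo}\,\theta_{koo}^{-1}\,(X_o - U_{ko})$, which is the standard estimate built from exactly the four blocks defined above. The code below is an illustration in plain Python, not the patented implementation; `solve` and `reconstruct` are hypothetical names, and a linear solve replaces the explicit matrix inverse.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("reliable-part covariance block is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def reconstruct(frame, reliable, unreliable, mean, cov):
    """Return a copy of frame whose unreliable dimensions are replaced
    by the conditional mean given the reliable dimensions."""
    x_o = [frame[i] for i in reliable]
    u_o = [mean[i] for i in reliable]
    theta_oo = [[cov[i][j] for j in reliable] for i in reliable]
    # z = theta_oo^{-1} (x_o - u_o)
    z = solve(theta_oo, [a - b for a, b in zip(x_o, u_o)])
    out = list(frame)
    for i in unreliable:
        # Row i of theta_mo, restricted to the reliable columns, times z.
        out[i] = mean[i] + sum(cov[i][j] * zj for j, zj in zip(reliable, z))
    return out

# Hypothetical two-dimensional cluster; dimension 1 is treated as unreliable.
mean = [1.0, 2.0]
cov = [[2.0, 1.0], [1.0, 2.0]]
restored = reconstruct([3.0, 0.0], reliable=[0], unreliable=[1], mean=mean, cov=cov)
```

In this example the reliable dimension is 2.0 above its mean, the covariance between the two dimensions is half the reliable variance, and so the unreliable dimension is restored to 2.0 + 0.5 × 2.0 = 3.0. The reliable dimension is left unchanged.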
CN201510044910.4A 2015-01-28 2015-01-28 Method and device of realizing missing feature reconstruction Withdrawn CN105989843A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510044910.4A CN105989843A (en) 2015-01-28 2015-01-28 Method and device of realizing missing feature reconstruction
PCT/CN2015/093901 WO2016119501A1 (en) 2015-01-28 2015-11-05 Method and apparatus for implementing missing feature reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510044910.4A CN105989843A (en) 2015-01-28 2015-01-28 Method and device of realizing missing feature reconstruction

Publications (1)

Publication Number Publication Date
CN105989843A true CN105989843A (en) 2016-10-05

Family

ID=56542342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510044910.4A Withdrawn CN105989843A (en) 2015-01-28 2015-01-28 Method and device of realizing missing feature reconstruction

Country Status (2)

Country Link
CN (1) CN105989843A (en)
WO (1) WO2016119501A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653056A (en) * 2016-11-16 2017-05-10 中国科学院自动化研究所 Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
WO2020034593A1 (en) * 2018-08-13 2020-02-20 平安科技(深圳)有限公司 Method and apparatus for processing missing feature in crowd performance feature prediction

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1475987A (en) * 2003-07-14 2004-02-18 中国科学院声学研究所 Hiaden Markov model edge decipher data reconstitution method f speech sound identification
CN1571012A (en) * 2003-07-11 2005-01-26 中国科学院声学研究所 Method for rebuilding probability weighted average deletion characteristic data of speech recognition
CN101236742A (en) * 2008-03-03 2008-08-06 中兴通讯股份有限公司 Music/ non-music real-time detection method and device
WO2009123387A1 (en) * 2008-03-31 2009-10-08 Transono Inc. Procedure for processing noisy speech signals, and apparatus and computer program therefor
CN101853661A (en) * 2010-05-14 2010-10-06 中国科学院声学研究所 Noise spectrum estimation and voice mobility detection method based on unsupervised learning
CN102820033A (en) * 2012-08-17 2012-12-12 南京大学 Voiceprint identification method
CN103456310A (en) * 2013-08-28 2013-12-18 大连理工大学 Transient noise suppression method based on spectrum estimation
CN103778920A (en) * 2014-02-12 2014-05-07 北京工业大学 Speech enhancing and frequency response compensation fusion method in digital hearing-aid
CN104143327A (en) * 2013-07-10 2014-11-12 腾讯科技(深圳)有限公司 Acoustic model training method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050071656A (en) * 2002-11-05 2005-07-07 코닌클리케 필립스 일렉트로닉스 엔.브이. Spectrogram reconstruction by means of a codebook
EP1918910B1 (en) * 2006-10-31 2009-03-11 Harman Becker Automotive Systems GmbH Model-based enhancement of speech signals
CN103650040B (en) * 2011-05-16 2017-08-25 谷歌公司 Use the noise suppressing method and device of multiple features modeling analysis speech/noise possibility
US9786275B2 (en) * 2012-03-16 2017-10-10 Yale University System and method for anomaly detection and extraction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
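Yin Haiming, Wang Jinming, Li Huanhuan: "Front-end processing for speaker recognition based on signal-to-noise ratio estimation" (基于信噪比估计的说话人识别前端处理), Journal of Military Communications Technology (《军事通信技术》) *
Wang Ning: "Speaker recognition based on missing feature reconstruction" (基于缺失特征重建的说话人识别), Wanfang academic journal database (《万方学术期刊数据库》) *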

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653056A (en) * 2016-11-16 2017-05-10 中国科学院自动化研究所 Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof
CN106653056B (en) * 2016-11-16 2020-04-24 中国科学院自动化研究所 Fundamental frequency extraction model and training method based on LSTM recurrent neural network
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, computer equipment and storage medium
WO2020034593A1 (en) * 2018-08-13 2020-02-20 平安科技(深圳)有限公司 Method and apparatus for processing missing feature in crowd performance feature prediction

Also Published As

Publication number Publication date
WO2016119501A1 (en) 2016-08-04

Similar Documents

Publication Publication Date Title
Calhoun et al. A method for making group inferences from functional MRI data using independent component analysis
DE102017102134B4 (en) Globally optimized post-filtering using the least squares method for speech enhancement
Mitzner Numerical solution for transient scattering from a hard surface of arbitrary shape—retarded potential technique
CN104680495B (en) The self-adaptive solution method of ultrasonoscopy
CN105989843A (en) Method and device of realizing missing feature reconstruction
US20120243763A1 (en) Signal-to-noise enhancement in imaging applications using a time-series of images
CN105044701B (en) Ground target sorting technique based on robustness time-frequency characteristics
CN107358945A (en) A kind of more people's conversation audio recognition methods and system based on machine learning
CN108107475A (en) A kind of borehole microseismic denoising method based on experience wavelet transformation and multi-threshold function
Hsu et al. Online recursive independent component analysis for real-time source separation of high-density EEG
CN104360316B (en) A kind of array antenna Adaptive beamformer method being tapered based on covariance matrix
Calhoun et al. Group ICA of functional MRI data: separability, stationarity, and inference
CN101571949A (en) PCNN-based method for de-noising wavelet domain ultrasonic medical image
CN109101890A (en) Electrical energy power quality disturbance recognition methods and device based on wavelet transformation
CN106228045A (en) A kind of identification system
CN106019256A (en) Radar signal adaptive detection method based on autoregressive model
CN107067407A (en) Profile testing method based on non-classical receptive field and linear non-linear modulation
CN104515984A (en) Broadband radar target reecho denoising method based on Bayes compressed sensing
Vrins et al. Improving independent component analysis performances by variable selection
CN105093189B (en) Airborne radar object detection method based on GCV
CN104360338B (en) A kind of array antenna Adaptive beamformer method loaded based on diagonal angle
CN108613737A (en) The discrimination method of aircraft multifrequency vibration signal based on wavelet packet and STFT
Bonettini et al. Primal-dual first order methods for total variation image restoration in presence of Poisson noise
CN104156925B (en) Speckle and the enhanced processing method in border and system are removed to ultrasonoscopy
CN104007429B (en) Steady-noise complete-polarization broadband target recognition method based on polarization decomposition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20161005
