CN105989843A

CN105989843A - Method and device of realizing missing feature reconstruction

Info

Publication number: CN105989843A
Application number: CN201510044910.4A
Authority: CN
Inventors: 王金明; 尹海明; 朱长宝; 袁浩; 徐志军; 唐景山
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2015-01-28
Filing date: 2015-01-28
Publication date: 2016-10-05
Also published as: WO2016119501A1

Abstract

The present invention discloses a method and device for realizing missing feature reconstruction. The method comprises: obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters; dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame; dividing the speech frame into a reliable part and an unreliable part according to the signal-to-noise ratio of each of its dimensions; determining, according to the unreliable part, that the speech frame needs missing feature reconstruction; and carrying out missing feature reconstruction on the unreliable part of the speech frame according to the weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame. The scheme of the present invention improves the precision of missing feature reconstruction.

Description

A method and device for realizing missing feature reconstruction

Technical field

The present invention relates to voiceprint recognition (VPR, Voiceprint Recognition) technology, and in particular to a method and device for realizing missing feature reconstruction.

Background art

The recognition performance of a voiceprint recognition system degrades sharply as environmental noise increases. To improve the recognition rate of such systems in noisy environments, missing feature reconstruction, an effective front-end processing method from the field of speech recognition, has been applied to voiceprint recognition and has achieved good results under experimental conditions.

An existing method for realizing missing feature reconstruction generally comprises:

The test speech is divided into two or more speech frames. For each speech frame of the test speech, a spectral subtraction algorithm is used to obtain the training speech frame corresponding to the speech frame of the test speech, and the signal-to-noise ratio of each dimension of the speech frame of the test speech is calculated from the obtained training speech frame. The speech frame of the test speech is divided into a reliable part and an unreliable part according to the signal-to-noise ratio of each of its dimensions. The weight, mean vector and covariance matrix corresponding to each training speech frame are obtained, and from them the weight, mean vector and covariance matrix corresponding to the speech frame of the test speech are obtained. Then, according to the weight, mean vector and covariance matrix corresponding to the speech frame of the test speech and the reliable part of that speech frame, maximum a posteriori estimation is used to perform missing feature reconstruction on the unreliable part of the speech frame.

In the existing method, the spectral subtraction algorithm assumes by default that the noise is stationary, whereas actual noise is non-stationary, so the missing feature reconstruction carries a relatively large error.

Summary of the invention

To solve the above problem, the present invention proposes a method and device for realizing missing feature reconstruction that can reduce this error and thereby improve the precision of missing feature reconstruction.

To achieve the above object, the present invention proposes a method for realizing missing feature reconstruction, comprising:

obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;

dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame;

dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each of its dimensions; and, when it is judged from the unreliable part that the speech frame needs missing feature reconstruction, performing missing feature reconstruction on the unreliable part of the speech frame according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame.

Preferably, when it is judged from the unreliable part of the speech frame of the test speech that the speech frame does not need missing feature reconstruction, the method further comprises: discarding the speech frame of the test speech.

Preferably, obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters comprises:

obtaining in advance two or more training speeches and, for each training speech, dividing the training speech into two or more speech frames;

obtaining the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech, and obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech.

Preferably, obtaining the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech comprises:

performing a Fourier transform on the speech frame of the training speech, and taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame of the training speech;

squaring the amplitude spectrum of the speech frame of the training speech to obtain the power spectrum of the speech frame of the training speech;

passing the power spectrum of the speech frame of the training speech through a Mel filter bank to obtain the Mel-domain power spectrum feature vector parameters of the speech frame of the training speech, and taking the logarithm of these parameters to obtain the Mel-domain log power spectrum feature vector parameters of the speech frame of the training speech.

Preferably, obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech comprises:

setting the number of Gaussian mixture clusters, and initializing the mean vector, covariance matrix and weight of each Gaussian mixture cluster;

using a Gaussian mixture clustering algorithm to obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the initialized mean vectors, covariance matrices and weights and from the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech.

Preferably, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech comprises:

obtaining the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech;

using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech according to the Mel-domain power spectrum feature vector parameters of the speech frame.

Preferably, obtaining the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech comprises:

performing a Fourier transform on the speech frame of the test speech, and taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame of the test speech;

squaring the amplitude spectrum of the speech frame of the test speech to obtain the power spectrum of the speech frame, and passing the power spectrum through a Mel filter bank to obtain the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech.

Preferably, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech according to the Mel-domain power spectrum feature vector parameters of the speech frame comprises:

calculating the noise power of each dimension of the speech frame of the test speech according to the formula

$$D^{2}(\lambda, k2) = \alpha_{d}(\lambda, k2)\, D^{2}(\lambda - 1, k2) + [1 - \alpha_{d}(\lambda, k2)]\, Y^{2}(\lambda, k2)$$

and calculating the signal-to-noise ratio of each dimension of the speech frame of the test speech according to the formula

$$SNR(\lambda, k2) = 20 \log_{10}\bigl(Y(\lambda, k2) - D(\lambda, k2)\bigr) - 20 \log_{10} D(\lambda, k2);$$

where $D^{2}(\lambda, k2)$ is the value of the k2-th dimension of the Mel-domain noise power of the λ-th speech frame of the test speech; k2 is the dimension index of the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech; λ is the speech frame index of the test speech; $\alpha_{d}$ is a smoothing parameter; Y is the value of the k2-th dimension of the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech; and $Y^{2}(\lambda, k2)$ is the value of the k2-th dimension of the Mel-domain power spectrum feature vector parameters of the λ-th speech frame of the test speech.

Preferably, dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame comprises:

when the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be greater than a preset threshold, determining that this dimension belongs to the reliable part of the speech frame of the test speech;

when the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be less than or equal to the preset threshold, determining that this dimension belongs to the unreliable part of the speech frame of the test speech.
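The threshold rule above can be sketched in a few lines of Python. This is only an illustration: the function name `split_reliable` and the 0 dB threshold in the usage line are assumptions, since the text only speaks of a "preset threshold".

```python
def split_reliable(snr_db, threshold_db):
    """Split one frame's dimensions by per-dimension SNR.

    Dimensions with SNR strictly above the threshold are reliable;
    dimensions with SNR less than or equal to it are unreliable.
    """
    reliable = [k for k, s in enumerate(snr_db) if s > threshold_db]
    unreliable = [k for k, s in enumerate(snr_db) if s <= threshold_db]
    return reliable, unreliable


# Hypothetical per-dimension SNR values and threshold, for illustration only.
reliable, unreliable = split_reliable([12.0, -3.5, 0.0, 7.2], threshold_db=0.0)
```

A dimension whose SNR equals the threshold exactly falls into the unreliable part, matching the "less than or equal to" wording.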

Preferably, before missing feature reconstruction is performed on the unreliable part of the speech frame of the test speech, the method further comprises: judging, according to the unreliable part of the speech frame of the test speech, whether the speech frame needs missing feature reconstruction, including:

when the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, judging that the speech frame of the test speech needs missing feature reconstruction;

when the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is less than the preset ratio, judging that the speech frame of the test speech does not need missing feature reconstruction.
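As a minimal sketch of this decision, following the comparison exactly as worded above, the check reduces to one ratio test. The function name and the preset ratio of 0.5 in the usage lines are hypothetical.

```python
def needs_reconstruction(num_unreliable, num_dims, preset_ratio):
    """Return True when the share of unreliable dimensions is greater than
    or equal to the preset ratio, as stated in the text."""
    return num_unreliable / num_dims >= preset_ratio


# Hypothetical numbers: a 4-dimensional frame and a preset ratio of 0.5.
decision_a = needs_reconstruction(2, 4, preset_ratio=0.5)
decision_b = needs_reconstruction(1, 4, preset_ratio=0.5)
```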

Preferably, performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame comprises:

judging, according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster, the Gaussian mixture cluster to which the speech frame of the test speech belongs;

performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the formula

$$\hat{X}_{m} = U_{km} + \theta_{kmo}\, \theta_{koo}^{-1} (X_{o} - U_{ko});$$

where $\hat{X}_{m}$ is the unreliable part of the speech frame of the test speech; $X_{o}$ is the reliable part of the speech frame of the test speech; $U_{km}$ is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; $U_{ko}$ is the part of that mean vector that corresponds to the reliable part of the speech frame; $\theta_{kmo}$ is the matrix formed by the elements of the covariance matrix of the Gaussian mixture cluster to which the speech frame belongs at the intersections of the rows corresponding to the unreliable part of the speech frame and the columns corresponding to the reliable part of the speech frame; and $\theta_{koo}$ is the matrix formed by the elements of that covariance matrix at the intersections of the rows corresponding to the reliable part of the speech frame and the columns corresponding to the reliable part of the speech frame.
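The following NumPy sketch illustrates the reconstruction under the assumption that the formula is the standard conditional-mean estimate, X_m = U_km + θ_kmo θ_koo⁻¹ (X_o − U_ko), which is consistent with the symbols defined above. The function name `reconstruct_unreliable` and the toy numbers are hypothetical.

```python
import numpy as np


def reconstruct_unreliable(x, unreliable, mean, cov):
    """Replace the unreliable dimensions of x with their conditional mean
    given the reliable dimensions (assumed form of the reconstruction)."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    m = np.asarray(unreliable, dtype=int)        # unreliable dimension indices
    o = np.setdiff1d(np.arange(x.size), m)       # reliable dimension indices
    theta_mo = cov[np.ix_(m, o)]                 # rows: unreliable, cols: reliable
    theta_oo = cov[np.ix_(o, o)]                 # rows and cols: reliable
    # Solve a linear system rather than forming the explicit inverse.
    correction = theta_mo @ np.linalg.solve(theta_oo, x[o] - mean[o])
    restored = x.copy()
    restored[m] = mean[m] + correction
    return restored


# Toy 2-dimensional cluster: dimension 1 is unreliable and gets re-estimated.
restored = reconstruct_unreliable(
    [2.0, 99.0], [1], mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]]
)
```

In the toy case the reliable value 2.0 lies two units above its mean and the cross-covariance is 0.5, so the unreliable dimension is restored to 1.0.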

Preferably, judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster comprises:

judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the formula

$$\hat{k4} = \arg\max_{k4} P(X \mid \lambda_{k4});$$

where X is the Mel-domain log power spectrum feature vector parameter of the speech frame of the test speech, $\lambda_{k4}$ is the k4-th Gaussian mixture cluster, $\hat{k4}$ is the value of k4 corresponding to the maximum likelihood value, P is the likelihood between X and $\lambda_{k4}$, and argmax returns the value of k4 at which P is maximal;

where

$$P(X \mid \lambda_{k4}) = \omega_{k4} \prod_{y \in X_{o}} N2(y, \mu_{k4,y}, \sigma_{k4,y}) \prod_{y \in X_{m}} 0.5 \left(1 + \operatorname{erf}\left(\frac{y - \mu_{k4,y}}{\sqrt{2}\, \sigma_{k4,y}}\right)\right);$$

$$N2(y, \mu_{k4,y}, \sigma_{k4,y}) = \frac{1}{\sqrt{2 \pi \sigma_{k4,y}^{2}}} \exp\left(-0.5\, \frac{(y - \mu_{k4,y})^{2}}{\sigma_{k4,y}^{2}}\right);$$

where y is one dimension of X, $\omega_{k4}$ is the weight of Gaussian mixture cluster $\lambda_{k4}$, $\mu_{k4,y}$ is the mean of $\lambda_{k4}$ corresponding to dimension y, and $\sigma_{k4,y}$ is the diagonal value of the covariance matrix of $\lambda_{k4}$ corresponding to dimension y. The likelihood values $P(X \mid \lambda_{k4})$ of X under each Gaussian mixture cluster are compared, and the Gaussian mixture cluster $\lambda_{k4}$ with the largest likelihood value is selected.
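A small pure-Python sketch of this selection rule follows. Reliable dimensions contribute a one-dimensional Gaussian density and unreliable dimensions contribute the 0.5(1 + erf(·)) term. The function names and the two toy clusters are hypothetical, and `sigma` is used exactly as it appears in N2; whether that value is the covariance diagonal itself or its square root is left as an assumption.

```python
import math


def n2(y, mu, sigma):
    """One-dimensional Gaussian density N2 as written in the text."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma ** 2) / math.sqrt(2.0 * math.pi * sigma ** 2)


def likelihood(x, reliable, unreliable, weight, mu, sigma):
    """P(X | cluster): weight times the product over reliable and unreliable dimensions."""
    p = weight
    for d in reliable:
        p *= n2(x[d], mu[d], sigma[d])
    for d in unreliable:
        p *= 0.5 * (1.0 + math.erf((x[d] - mu[d]) / (math.sqrt(2.0) * sigma[d])))
    return p


def select_cluster(x, reliable, unreliable, clusters):
    """Index of the cluster with the largest likelihood (the argmax over k4)."""
    scores = [likelihood(x, reliable, unreliable, *c) for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


# Two toy clusters given as (weight, means, sigmas); dimension 1 is unreliable.
clusters = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])]
best = select_cluster([4.8, 9.0], reliable=[0], unreliable=[1], clusters=clusters)
```

With many dimensions the raw product can underflow; summing logarithms instead is a common remedy, although the text itself states the product form.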

The present invention also proposes a device for realizing missing feature reconstruction, at least comprising:

an acquisition module, configured to obtain in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters;

a computing module, configured to divide the test speech into two or more speech frames and, for each speech frame of the test speech, use the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame;

a reconstruction module, configured to divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame and, when it is judged from the unreliable part that the speech frame needs missing feature reconstruction, perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame.

Preferably, the reconstruction module is further configured to:

discard the speech frame of the test speech when it is judged from the unreliable part of the speech frame that the speech frame does not need missing feature reconstruction.

Preferably, the acquisition module is specifically configured to:

obtain in advance two or more training speeches and, for each training speech, divide the training speech into two or more speech frames; obtain the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech; and obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech.

Preferably, the computing module is specifically configured to:

divide the test speech into two or more speech frames and, for each speech frame of the test speech, obtain the Mel-domain power spectrum feature vector parameters of the speech frame; and use the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame according to the Mel-domain power spectrum feature vector parameters of the speech frame.

Preferably, the reconstruction module is specifically configured to:

determine, when the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be less than or equal to a preset threshold, that this dimension belongs to the unreliable part of the speech frame of the test speech;

and, when it is judged from the unreliable part of the speech frame of the test speech that the speech frame needs missing feature reconstruction, perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame.

Preferably, the reconstruction module is specifically configured to:

divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame;

and perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame.

Preferably, the reconstruction module is specifically configured to:

divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame, and judge from the unreliable part of the speech frame that the speech frame needs missing feature reconstruction.

Compared with the prior art, the present invention comprises: obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters; dividing the test speech into two or more speech frames and, for each speech frame of the test speech, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame; dividing the speech frame into a reliable part and an unreliable part according to the signal-to-noise ratio of each of its dimensions; and, when it is judged from the unreliable part that the speech frame needs missing feature reconstruction, performing missing feature reconstruction on the unreliable part of the speech frame according to the obtained weights, mean vectors and covariance matrices of the Gaussian mixture clusters and the reliable part of the speech frame. Because the IMCRA algorithm tracks noise effectively in both stationary and non-stationary noise environments, the scheme of the present invention improves the precision of missing feature reconstruction in both kinds of environment and thereby improves the recognition rate of the voiceprint recognition system.

Brief description of the drawings

The accompanying drawings of the embodiments of the present invention are described below. The drawings provide a further understanding of the present invention and, together with the description, serve to explain the present invention; they do not limit its scope of protection.

Fig. 1 is a flow chart of the method for realizing missing feature reconstruction of the present invention;

Fig. 2 is a schematic structural diagram of the device for realizing missing feature reconstruction of the present invention.

Detailed description

To facilitate understanding by those skilled in the art, the present invention is further described below with reference to the accompanying drawings; the description is not to be used to limit the scope of protection of the present invention. It should be noted that, where no conflict arises, the embodiments in this application and the various modes within the embodiments may be combined with one another.

Referring to Fig. 1, the present invention proposes a method for realizing missing feature reconstruction, comprising:

Step 100: obtain in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters. This specifically comprises:

obtaining in advance two or more training speeches and, for each training speech, dividing the training speech into two or more speech frames; obtaining the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech; and obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech.

When the training speech is divided into two or more speech frames, the length of each speech frame of the training speech may be between 20 milliseconds (ms) and 30 ms, with an overlap of 25% to 50% between adjacent speech frames of the training speech.
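The framing described above can be sketched as follows. The defaults of 25 ms and 50% overlap are merely values picked from inside the stated ranges, the function name is hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption the text does not address.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, overlap=0.5):
    """Cut a sample sequence into fixed-length, overlapping frames.

    frame_ms: frame length in milliseconds (20 to 30 per the text).
    overlap:  fraction shared by adjacent frames (0.25 to 0.5 per the text).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = max(1, int(frame_len * (1.0 - overlap)))
    # Trailing samples that do not fill a whole frame are dropped (assumption).
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


# One second of dummy samples at 8000 Hz: 200-sample frames, 100-sample hop.
samples = list(range(8000))
frames = split_into_frames(samples, sample_rate=8000)
```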

Obtaining the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech comprises:

performing a Fourier transform on the speech frame of the training speech; taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame of the training speech; squaring the amplitude spectrum to obtain the power spectrum of the speech frame of the training speech; passing the power spectrum through a Mel filter bank to obtain the Mel-domain power spectrum feature vector parameters of the speech frame of the training speech; and taking the logarithm of these parameters to obtain the Mel-domain log power spectrum feature vector parameters of the speech frame of the training speech.
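The chain of operations above (Fourier transform, modulus, square, Mel filter bank, logarithm) can be sketched with NumPy as below. The filter bank is passed in as a ready-made matrix; the function name, the small `eps` guard against log(0) and the use of the natural logarithm are assumptions, since the text does not fix the base of the logarithm.

```python
import numpy as np


def log_mel_power(frame, mel_filters, eps=1e-12):
    """Mel-domain log power spectrum features for one speech frame.

    mel_filters: array of shape (K, n_bins), one row per Mel filter,
                 where n_bins = len(frame) // 2 + 1.
    """
    spectrum = np.fft.rfft(frame)          # Fourier transform of the frame
    amplitude = np.abs(spectrum)           # modulus -> amplitude spectrum
    power = amplitude ** 2                 # square -> power spectrum
    mel_power = mel_filters @ power        # Mel-domain power spectrum features
    return np.log(mel_power + eps)         # logarithm -> log power features


# Toy input: a constant 8-sample frame and a hypothetical 2-filter bank.
toy_filters = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                        [0.0, 1.0, 1.0, 0.0, 0.0]])
features = log_mel_power(np.ones(8), toy_filters)
```

For the test speech the same function would simply stop before the logarithm, since the SNR calculation works on the Mel-domain power spectrum itself.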

The Mel filter bank may consist of triangular filters or hyperbolic filters.

The triangular filters may be designed as follows.

According to formula (1), the actual frequency is converted to the Mel-domain frequency:

$$f_{mel}(f) = 2595 \lg\left(1 + \frac{f}{700}\right) \tag{1}$$

where f is the actual frequency and $f_{mel}$ is the Mel-domain frequency.

The maximum frequency of the speech signal is then calculated:

$$f_{g} = \frac{f_{s}}{2} \tag{2}$$

where $f_{s}$ is the sampling frequency, typically 8000 Hz, and $f_{g}$ is the maximum of the original frequency of the speech signal.

Combining formula (1) and formula (2), that is, setting $f = f_{g}$, gives the maximum Mel-domain frequency of the speech:

$$f_{\max}[mel] = 2595 \lg\left(1 + \frac{f_{s}}{1400}\right)\ \text{mel} \tag{3}$$

where $f_{\max}$ is the maximum Mel-domain frequency.

For K-dimensional Mel-domain power spectrum feature vector parameters, the spacing between adjacent center frequencies in the Mel domain is:

$$\Delta mel = \frac{f_{\max}}{K + 1}\ \text{mel} \tag{4}$$

Using the inverse function of formula (1), the vector $M = \{\Delta mel, 2\Delta mel, 3\Delta mel, \ldots, (K+1)\Delta mel\}$ is converted to actual frequencies, giving $f_{center} = \{f_{1}, f_{2}, f_{3}, \ldots, f_{K+1}\}$. Finally, the triangular filters are designed from $f_{center}$. The designed triangular filters must have a gain of 1; the center frequency of each triangle is $f_{n}$, $n \in \{1, \ldots, K\}$; the bandwidth to the left of the center frequency is $f_{n} - f_{n-1}$ and the bandwidth to the right is $f_{n+1} - f_{n}$. The bandwidth to the left of the first triangle's center frequency is $f_{1}$.
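The design procedure of formulas (1) to (4) can be sketched in pure Python as follows. The function names are hypothetical, and two points are assumptions: "gain of 1" is read as a peak gain of 1 at each center frequency, and the filters are sampled on a uniform grid of `num_bins` frequencies from 0 to f_s/2.

```python
import math


def hz_to_mel(f):
    """Formula (1): actual frequency in Hz to Mel-domain frequency."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse function of formula (1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(sample_rate, num_filters):
    """f_1 ... f_{K+1}: equally spaced in the Mel domain, mapped back to Hz."""
    f_max_mel = hz_to_mel(sample_rate / 2.0)           # formulas (2) and (3)
    delta_mel = f_max_mel / (num_filters + 1)          # formula (4)
    return [mel_to_hz(n * delta_mel) for n in range(1, num_filters + 2)]


def triangular_filters(sample_rate, num_filters, num_bins):
    """K triangular filters with peak gain 1 (assumed reading of 'gain is 1')."""
    # f_0 = 0 because the first triangle's left bandwidth equals f_1.
    edges = [0.0] + mel_center_frequencies(sample_rate, num_filters)
    bin_hz = [b * (sample_rate / 2.0) / (num_bins - 1) for b in range(num_bins)]
    bank = []
    for n in range(1, num_filters + 1):
        lo, mid, hi = edges[n - 1], edges[n], edges[n + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))      # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))      # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


centers = mel_center_frequencies(8000, 24)
bank = triangular_filters(8000, 24, 129)
```

At the 8000 Hz sampling frequency mentioned in the text, `hz_to_mel(4000.0)` evaluates to roughly 2146 mel, and the last entry of `centers` maps back to 4000 Hz.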

Obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech comprises:

setting the number of Gaussian mixture clusters; initializing the mean vector, covariance matrix and weight of each Gaussian mixture cluster; and using a Gaussian mixture clustering algorithm to obtain the weight, mean vector and covariance matrix of each Gaussian mixture cluster from the initialized mean vectors, covariance matrices and weights and from the Mel-domain log power spectrum feature vector parameters of the speech frames of each training speech.

The number of Gaussian mixture clusters may be set empirically to 128.

The mean vector of each Gaussian mixture cluster may be initialized with the LBG vector quantization (VQ, Vector Quantization) algorithm. Its implementation is common knowledge to those skilled in the art, does not limit the scope of protection of the present invention, and is not repeated here.

The covariance matrix of each Gaussian mixture cluster may be randomly initialized with numbers between 0 and 2.

When the weights of the Gaussian mixture clusters are randomly initialized, the weights of all Gaussian mixture clusters must sum to 1.

The Gaussian mixture clustering algorithm may be the EM algorithm. The EM algorithm is an existing algorithm, and its specific implementation does not limit the scope of protection of the present invention. The EM algorithm is implemented as follows:

For each Gaussian mixture cluster, formulas (5) to (9) are executed iteratively.

$$\gamma(i, k1) = \frac{\omega_{k1}\, N1(x_{i}, \mu_{k1}, \theta_{k1})}{\sum_{j=1}^{M} \omega_{j}\, N1(x_{i}, \mu_{j}, \theta_{j})} \tag{5}$$

where i is the index of a speech frame of the training speech, N1 is the Gaussian function, $\omega_{k1}$ is the weight of the k1-th Gaussian mixture cluster, $x_{i}$ is the Mel-domain log power spectrum feature vector parameter of the i-th speech frame of the training speech, $\mu_{k1}$ is the mean vector of the k1-th Gaussian mixture cluster, $\theta_{k1}$ is the covariance matrix of the k1-th Gaussian mixture cluster, $\mu_{j}$ is the mean vector of the j-th Gaussian mixture cluster, $\theta_{j}$ is the covariance matrix of the j-th Gaussian mixture cluster, j and k1 are indices of Gaussian mixture clusters, and M is the number of Gaussian mixture clusters.

$$S_{k1} = \sum_{i=1}^{n} \gamma(i, k1) \tag{6}$$

where n is the number of speech frames of the training speech.

$$\mu_{k1} = \frac{1}{S_{k1}} \sum_{i=1}^{n} \gamma(i, k1)\, x_{i} \tag{7}$$

$$\theta_{k1} = \frac{1}{S_{k1}} \sum_{i=1}^{n} \gamma(i, k1)\, (x_{i} - \mu_{k1}) (x_{i} - \mu_{k1})^{T} \tag{8}$$

$$\omega_{k1} = \frac{S_{k1}}{\sum_{j=1}^{M} S_{j}} \tag{9}$$

The number of iterations may be preset; the more iterations, the higher the precision. For example, it may be set to 10.

Here,

$$N1(X, U, \theta) = \frac{1}{\sqrt{(2\pi)^{d}\, |\theta|}} \exp\left[-0.5\, (X - U)^{T} \theta^{-1} (X - U)\right] \tag{10}$$

where X is the Mel-domain log power spectrum feature vector parameter of a speech frame of the training speech, U is the mean vector of the Gaussian mixture cluster, θ is the covariance matrix of the Gaussian mixture cluster, and d is the dimension of the Mel-domain log power spectrum feature vector parameter of the speech frame of the training speech.

The 128 sets of ω, μ and θ obtained after the 10 iterations are saved; these parameters are taken to represent the common characteristics of human voices.
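One pass through formulas (5) to (10) can be sketched with NumPy as below. This is a toy illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the small `reg` term added to each covariance matrix is a numerical safeguard that does not appear in the formulas, the sum in formula (9) is taken over all clusters, and the initial values in the usage lines are hand-picked rather than produced by LBG-VQ.

```python
import numpy as np


def gaussian_pdf(X, mu, theta):
    """Formula (10), evaluated for every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(theta)
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(theta))
    return np.exp(expo) / norm


def em_step(X, weights, means, covs, reg=1e-6):
    """One iteration of formulas (5) to (9) over all M clusters."""
    n, d = X.shape
    M = len(weights)
    dens = np.stack(
        [weights[k] * gaussian_pdf(X, means[k], covs[k]) for k in range(M)], axis=1
    )
    gamma = dens / dens.sum(axis=1, keepdims=True)           # formula (5)
    S = gamma.sum(axis=0)                                    # formula (6)
    new_means = (gamma.T @ X) / S[:, None]                   # formula (7)
    new_covs = np.empty((M, d, d))
    for k in range(M):
        diff = X - new_means[k]
        new_covs[k] = (gamma[:, k, None] * diff).T @ diff / S[k]   # formula (8)
        new_covs[k] += reg * np.eye(d)                       # safeguard (assumption)
    new_weights = S / S.sum()                                # formula (9)
    return new_weights, new_means, new_covs


# Toy data: two well-separated 2-D groups, 2 clusters, 10 iterations.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(6.0, 1.0, (200, 2))])
w = np.array([0.5, 0.5])
mu = np.array([[1.0, 1.0], [5.0, 5.0]])
th = np.stack([np.eye(2), np.eye(2)])
for _ in range(10):
    w, mu, th = em_step(X, w, mu, th)
```

With 128 clusters and high-dimensional features the densities can underflow, so a practical version would usually work in the log domain.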

Step 101: divide the test speech into two or more speech frames and, for each speech frame of the test speech, use the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame.

In this step, a dimension of a speech frame of the test speech refers to one subband of the Mel-domain power spectrum feature vector parameters extracted from the speech frame. For example, when the Mel-domain power spectrum feature vector parameter of a speech frame of the test speech is [2, 3, 4], the speech frame comprises three dimensions, with values 2, 3 and 4 respectively.

In this step, when the test speech is divided into two or more speech frames, the length of each speech frame of the test speech may be between 20 ms and 30 ms, with an overlap of 25% to 50% between adjacent speech frames of the test speech.

In this step, using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech comprises:

obtaining the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech, and using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame according to these parameters.

Obtaining the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech comprises:

performing a Fourier transform on the speech frame of the test speech; taking the modulus of the transformed speech frame to obtain the amplitude spectrum of the speech frame of the test speech; squaring the amplitude spectrum to obtain the power spectrum of the speech frame of the test speech; and passing the power spectrum through a Mel filter bank to obtain the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech.

Using the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame of the test speech according to the Mel-domain power spectrum feature vector parameters of the speech frame comprises:

calculating the noise power of each dimension of the speech frame of the test speech according to the formula

$$D^{2}(\lambda, k2) = \alpha_{d}(\lambda, k2)\, D^{2}(\lambda - 1, k2) + [1 - \alpha_{d}(\lambda, k2)]\, Y^{2}(\lambda, k2)$$

and calculating the signal-to-noise ratio of each dimension of the speech frame of the test speech according to the formula

$$SNR(\lambda, k2) = 20 \log_{10}\bigl(Y(\lambda, k2) - D(\lambda, k2)\bigr) - 20 \log_{10} D(\lambda, k2).$$

Here, $D^{2}(\lambda, k2)$ is the value of the k2-th dimension of the Mel-domain noise power of the λ-th speech frame of the test speech; k2 is the dimension index of the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech; λ is the speech frame index of the test speech; $\alpha_{d}$ is a smoothing parameter; Y is the value of the k2-th dimension of the Mel-domain power spectrum feature vector parameters of the speech frame of the test speech; and $Y^{2}(\lambda, k2)$ is the value of the k2-th dimension of the Mel-domain power spectrum feature vector parameters of the λ-th speech frame of the test speech.

Wherein,

α_d(λ, k2) = α + (1 - α) p(λ, k2) (11)

Wherein, α is a constant, and p(λ, k2) is the probability that speech is present in the k2-th dimension of the λ-th speech frame of the test speech.
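The noise power recursion, the smoothing parameter of formula (11) and the signal-to-noise ratio formula can be illustrated with a short Python sketch for a single dimension k2. The default value of α, the flooring of the logarithm arguments and the function names are assumptions for illustration; the text does not state how the case Y ≤ D is handled.

```python
import math

def update_noise_power(prev_noise_power, y_squared, speech_prob, alpha=0.85):
    """D^2(l, k2) = a_d * D^2(l-1, k2) + (1 - a_d) * Y^2(l, k2).

    alpha=0.85 is an illustrative constant, not a value from the text.
    """
    alpha_d = alpha + (1.0 - alpha) * speech_prob  # formula (11)
    return alpha_d * prev_noise_power + (1.0 - alpha_d) * y_squared

def snr_db(y, d, floor=1e-10):
    """SNR = 20*log10(Y - D) - 20*log10(D).

    The floor keeps the logarithms defined when Y <= D; this guard is an
    assumption added for the sketch.
    """
    return 20.0 * math.log10(max(y - d, floor)) - 20.0 * math.log10(max(d, floor))

# When speech is certainly present (p = 1), the noise estimate is frozen.
frozen = update_noise_power(1.0, 4.0, speech_prob=1.0)
# When speech is certainly absent (p = 0), the estimate moves toward Y^2.
tracked = update_noise_power(1.0, 4.0, speech_prob=0.0, alpha=0.5)
```

The two calls show the intended behaviour of α_d: the larger the speech presence probability, the less the current frame contributes to the noise estimate.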

Wherein, q(λ, k2) is the probability that speech is absent in the k2-th dimension of the λ-th speech frame of the test speech, γ(λ, k2) is the a posteriori signal-to-noise ratio at the k2-th dimension of the Mel-domain power spectrum feature vector of the λ-th speech frame of the test speech, and ζ(λ, k2) is the a priori signal-to-noise ratio at the k2-th dimension of the Mel-domain power spectrum feature vector of the λ-th speech frame of the test speech.

γ(λ, k2) = \frac{|Y(λ, k2)|^{2}}{B_{\min} S_{\min}(λ, k2)} \qquad (13)

Wherein, B_min is the bias factor, and S_min is the minimum value among S(λ, k2-1-k3) to S(λ, k2-1). k3 can be preset.

Wherein,

S(λ, k) = α_S S(λ-1, k) + (1 - α_S) S_f(λ, k) (15)

Wherein, α_S is a constant smoothing factor, and

S_{f}(λ, k) = \sum_{i = -L_{w}}^{L_{w}} w(i) |Y(λ, k - i)|^{2} \qquad (16)

Wherein, w(i) is a Hanning window function with window length 2L_w + 1.
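The two smoothing steps, across dimensions with the window w(i) as in formula (16) and across frames as in formula (15), can be sketched as follows. The normalisation of the window to unit sum, the treatment of dimensions beyond the ends of the vector as zero, and the value of α_S are assumptions for illustration only.

```python
import math

def hann_window(half_width):
    """Hanning window of length 2*L_w + 1, normalised to sum to 1 (assumed)."""
    n = 2 * half_width + 1
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 1) / (n + 1)) for i in range(n)]
    total = sum(w)
    return [v / total for v in w]

def smooth_over_dimensions(power, window):
    """S_f(l, k) = sum_i w(i) * |Y(l, k - i)|^2, zero outside the vector."""
    half = len(window) // 2
    smoothed = []
    for k in range(len(power)):
        acc = 0.0
        for j, w in enumerate(window):
            idx = k - (j - half)          # index k - i for i in [-L_w, L_w]
            if 0 <= idx < len(power):
                acc += w * power[idx]
        smoothed.append(acc)
    return smoothed

def smooth_over_time(prev_s, s_f, alpha_s=0.9):
    """S(l, k) = alpha_S * S(l-1, k) + (1 - alpha_S) * S_f(l, k)."""
    return [alpha_s * p + (1.0 - alpha_s) * f for p, f in zip(prev_s, s_f)]

window = hann_window(1)                    # L_w = 1, window length 3
s_f = smooth_over_dimensions([2.0] * 7, window)
s = smooth_over_time([1.0] * 7, s_f, alpha_s=0.5)
```

With a constant input the interior dimensions are unchanged by the window, which is a quick sanity check on the normalisation.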

Wherein, γ₁ is a constant threshold.

\tilde{γ}(λ, k2) = \frac{|Y(λ, k2)|^{2}}{B_{\min} \tilde{S}_{\min}(λ, k2)} \qquad (18)

\tilde{S}(λ, k2) = α_{S} \tilde{S}(λ - 1, k2) + (1 - α_{S}) \tilde{S}_{f}(λ, k2) \qquad (20)

Wherein, γ₀ and ζ₀ are constant thresholds.

Step 102: divide the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame; when it is judged from the unreliable part of the speech frame that the speech frame needs missing feature reconstruction, perform missing feature reconstruction on the unreliable part of the speech frame according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame.

In this step, when it is judged from the unreliable part of the speech frame of the test speech that the speech frame does not need missing feature reconstruction, the speech frame of the test speech is discarded.

In this step, dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame includes:

When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be greater than a preset threshold, that dimension is determined to belong to the reliable part of the speech frame; when the signal-to-noise ratio of a dimension of the speech frame is judged to be less than or equal to the preset threshold, that dimension is determined to belong to the unreliable part of the speech frame.

Wherein, the speech frame of the test speech can be divided into the reliable part and the unreliable part by means of marking, specifically by a formula that assigns each dimension a mark value M(λ, k2) according to how its signal-to-noise ratio compares with the preset threshold L.
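A minimal Python sketch of the marking step follows. The text does not preserve which mark value denotes a reliable dimension, so the use of `True` for reliable and `False` for unreliable is an assumption, as is the function name.

```python
def reliability_mask(snr_per_dimension, threshold):
    """Mark each dimension of a frame: True = reliable, False = unreliable.

    A dimension is reliable only when its SNR is strictly greater than the
    preset threshold; equality counts as unreliable, as stated in the text.
    """
    return [snr > threshold for snr in snr_per_dimension]

# Illustrative SNR values (dB) for one frame and a threshold L = 0 dB.
mask = reliability_mask([5.0, -3.0, 0.0, 12.5], threshold=0.0)
```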

Wherein, judging whether the speech frame of the test speech needs missing feature reconstruction according to the unreliable part of the speech frame includes:

When the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, it is judged that the speech frame needs missing feature reconstruction; when the ratio of the number of dimensions in the unreliable part of the speech frame to the total number of dimensions of the speech frame is less than the preset ratio, it is judged that the speech frame does not need missing feature reconstruction.
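The ratio test can be written in a few lines of Python. The mask convention (`True` for a reliable dimension) and the function name are assumptions carried over for illustration.

```python
def needs_reconstruction(mask, preset_ratio):
    """Return True when unreliable dimensions / total dimensions >= preset ratio."""
    unreliable = sum(1 for reliable in mask if not reliable)
    return unreliable / len(mask) >= preset_ratio

# Two of four dimensions unreliable, preset ratio 0.5: reconstruction is needed.
decision_a = needs_reconstruction([True, False, True, False], 0.5)
# One of four dimensions unreliable: below the preset ratio.
decision_b = needs_reconstruction([True, True, True, False], 0.5)
```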

In this step, performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame includes:

Judge the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster, and perform missing feature reconstruction on the unreliable part of the speech frame according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the formula X_m = U_km + θ_kmo θ_koo^{-1} (X_o - U_ko).

Wherein, X_m is the unreliable part of the speech frame of the test speech; X_o is the reliable part of the speech frame of the test speech; U_km is the part of the mean vector of the Gaussian mixture cluster to which the speech frame belongs that corresponds to the unreliable part of the speech frame; U_ko is the part of that mean vector that corresponds to the reliable part of the speech frame; θ_kmo is the matrix composed of the elements of the covariance matrix of the Gaussian mixture cluster to which the speech frame belongs at the intersections of the rows corresponding to the unreliable part of the speech frame and the columns corresponding to the reliable part of the speech frame; θ_koo is the matrix composed of the elements of that covariance matrix at the intersections of the rows corresponding to the reliable part of the speech frame and the columns corresponding to the reliable part of the speech frame.

That is to say, after the Gaussian mixture cluster to which the speech frame of the test speech belongs is obtained, the mean vector of that cluster can be rearranged as U_k = [U_ko, U_km], and its covariance matrix can be rearranged as

θ_{k} = \begin{bmatrix} θ_{koo} & θ_{kom} \\ θ_{kmo} & θ_{kmm} \end{bmatrix}.

For example, if the total number of dimensions of the speech frame of the test speech is 6, where dimensions 1, 3 and 5 are the reliable part and dimensions 2, 4 and 6 are the unreliable part, then U_ko consists of dimensions 1, 3 and 5 of the mean vector of the Gaussian mixture cluster to which the speech frame belongs; θ_kmo is the matrix composed of the elements at the intersections of rows 2, 4 and 6 and columns 1, 3 and 5 of the covariance matrix of that cluster; and θ_koo is the matrix composed of the elements at the intersections of rows 1, 3 and 5 and columns 1, 3 and 5 of that covariance matrix.
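The partitioning in this example can be sketched in Python with NumPy. The sketch assumes that the reconstruction formula is the conditional mean of a Gaussian, X_m = U_km + θ_kmo θ_koo^{-1} (X_o - U_ko), which is consistent with the symbol definitions above; the function name and the mask convention (`True` for reliable) are hypothetical.

```python
import numpy as np

def reconstruct_missing(x, reliable, mean, cov):
    """Replace the unreliable dimensions of x by their conditional mean.

    Assumed formula: X_m = U_km + theta_kmo @ inv(theta_koo) @ (X_o - U_ko),
    with mean and cov taken from the Gaussian mixture cluster the frame
    belongs to.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    reliable = np.asarray(reliable, dtype=bool)
    o = np.flatnonzero(reliable)      # indices of the reliable part
    m = np.flatnonzero(~reliable)     # indices of the unreliable part
    cov_mo = cov[np.ix_(m, o)]        # rows: unreliable, columns: reliable
    cov_oo = cov[np.ix_(o, o)]        # rows and columns: reliable
    x_hat = x.copy()
    # Solve a linear system instead of forming the explicit inverse.
    x_hat[m] = mean[m] + cov_mo @ np.linalg.solve(cov_oo, x[o] - mean[o])
    return x_hat

# Two-dimensional toy case: dimension 1 reliable, dimension 2 unreliable.
x_hat_2d = reconstruct_missing([2.0, 99.0], [True, False],
                               [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])

# Six-dimensional case from the text: dimensions 1, 3, 5 reliable.
x_hat_6d = reconstruct_missing(np.full(6, 10.0), [True, False] * 3,
                               np.arange(6.0), np.eye(6))
```

With an identity covariance matrix the reliable part carries no information about the unreliable part, so the unreliable dimensions simply fall back to the cluster mean.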

Wherein, judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster includes:

Judge the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the formula \hat{k4} = \arg\max_{k4} P(X | λ_{k4}).

Wherein, X is the Mel-domain log power spectrum feature vector of the speech frame of the test speech, λ_k4 is the k4-th Gaussian mixture cluster, \hat{k4} is the value of k4 corresponding to the maximum likelihood value, P is the likelihood value between X and λ_k4, and argmax gives the value of k4 at which P is maximal.

Wherein,

P(X | λ_{k4}) = ω_{k4} \prod_{y \in X_{o}} N2(y, μ_{k4,y}, σ_{k4,y}) \prod_{y \in X_{m}} 0.5 (1 + \erf(\frac{y - μ_{k4,y}}{\sqrt{2} σ_{k4,y}})) \qquad (23)

N2(y, μ_{k4,y}, σ_{k4,y}) = \frac{1}{\sqrt{2 π σ_{k4,y}^{2}}} \exp(-0.5 \frac{(y - μ_{k4,y})^{2}}{σ_{k4,y}^{2}}) \qquad (24)

Wherein, y is one dimension of X, ω_k4 is the weight of the Gaussian mixture cluster λ_k4, μ_k4,y is the mean of λ_k4 corresponding to dimension y, and σ_k4,y is the diagonal value of the covariance matrix of the Gaussian mixture cluster λ_k4 corresponding to dimension y.
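Formulas (23) and (24) and the maximum-likelihood selection of the cluster can be sketched in Python as follows. The sketch treats σ as the quantity that appears squared in formula (24), uses `True` to mark a reliable dimension, and represents each cluster as a hypothetical `(weight, means, sigmas)` tuple; all three choices are assumptions for illustration.

```python
import math

def gaussian_pdf(y, mu, sigma):
    """N2(y, mu, sigma) of formula (24)."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma ** 2) / math.sqrt(
        2.0 * math.pi * sigma ** 2)

def cluster_likelihood(x, reliable, weight, means, sigmas):
    """P(X | cluster) of formula (23) for one diagonal-covariance cluster."""
    p = weight
    for y, is_reliable, mu, sigma in zip(x, reliable, means, sigmas):
        if is_reliable:
            # Reliable dimension: Gaussian density at the observed value.
            p *= gaussian_pdf(y, mu, sigma)
        else:
            # Unreliable dimension: Gaussian cumulative probability up to y.
            p *= 0.5 * (1.0 + math.erf((y - mu) / (math.sqrt(2.0) * sigma)))
    return p

def select_cluster(x, reliable, clusters):
    """Return the index k4 of the cluster with the largest likelihood."""
    scores = [cluster_likelihood(x, reliable, *c) for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)

# Two illustrative clusters: (weight, per-dimension means, per-dimension sigmas).
clusters = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),
    (0.5, [5.0, 5.0], [1.0, 1.0]),
]
near_second = select_cluster([5.1, 4.0], [True, False], clusters)
near_first = select_cluster([0.1, 0.0], [True, False], clusters)
```

For feature vectors with many dimensions the product can underflow, so summing logarithms of the factors is a common alternative; that variation is not part of the text.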

By the method for the present invention, owing to IMCRA algorithm is can under nonstationary noise environment effectively The method of tracking noise, thus improve the precision of disappearance feature reconstruction.

See Fig. 2, the invention allows for a kind of device realizing lacking feature reconstruction, at least include:

Computing module, for being divided into two or more speech frames by tested speech, for each The speech frame of tested speech, uses the noise of each dimension of speech frame of IMCRA algorithm calculating tested speech Ratio；

In the device of the present invention, the reconstruction module is further configured to:

When it is judged from the unreliable part of the speech frame of the test speech that the speech frame does not need missing feature reconstruction, discard the speech frame of the test speech.

In the device of the present invention, the acquisition module is specifically configured to:

In the device of the present invention, the computing module is specifically configured to:

Divide the test speech into two or more speech frames; for each speech frame of the test speech, obtain the Mel-domain power spectrum feature vector of the speech frame; and use the IMCRA algorithm to calculate the signal-to-noise ratio of each dimension of the speech frame according to the Mel-domain power spectrum feature vector of the speech frame.

In the device of the present invention, the reconstruction module is specifically configured to:

When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be greater than a preset threshold, determine that this dimension belongs to the reliable part of the speech frame of the test speech;

When the signal-to-noise ratio of a dimension of the speech frame of the test speech is judged to be less than or equal to the preset threshold, determine that this dimension belongs to the unreliable part of the speech frame of the test speech;

In the device of the present invention, the reconstruction module is specifically configured to:

When the ratio of the number of dimensions in the unreliable part of the speech frame of the test speech to the total number of dimensions of the speech frame is greater than or equal to a preset ratio, judge that the speech frame of the test speech needs missing feature reconstruction;

In the device of the present invention, the reconstruction module is specifically configured to:

Judge the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster;

Perform missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the weight, mean vector and covariance matrix of the Gaussian mixture cluster to which the speech frame belongs and the formula X_m = U_km + θ_kmo θ_koo^{-1} (X_o - U_ko);

It should be noted that the embodiments described above are provided only to help those skilled in the art understand the invention and do not limit the protection scope of the present invention. Without departing from the inventive concept of the present invention, any obvious replacement or improvement made to the present invention by those skilled in the art falls within the protection scope of the present invention.

Claims

1. A method for realizing missing feature reconstruction, characterized by comprising:

2. The method according to claim 1, characterized in that, when it is judged from the unreliable part of the speech frame of said test speech that the speech frame of said test speech does not need missing feature reconstruction, the method further comprises: discarding the speech frame of said test speech.

3. The method according to claim 1 or 2, characterized in that said obtaining in advance the weights, mean vectors and covariance matrices of two or more Gaussian mixture clusters comprises:

4. The method according to claim 3, characterized in that said obtaining the Mel-domain log power spectrum feature vector of the speech frame of each training speech comprises:

5. The method according to claim 3, characterized in that said obtaining the weight, mean vector and covariance matrix of each Gaussian mixture cluster according to the Mel-domain log power spectrum feature vector of the speech frame of each training speech comprises:

6. The method according to claim 1 or 2, characterized in that said calculating the signal-to-noise ratio of each dimension of the speech frame of the test speech using the IMCRA algorithm comprises:

7. The method according to claim 6, characterized in that said obtaining the Mel-domain power spectrum feature vector of the speech frame of the test speech comprises:

8. The method according to claim 6, characterized in that said calculating the signal-to-noise ratio of each dimension of the speech frame of the test speech using the IMCRA algorithm according to the Mel-domain power spectrum feature vector of the speech frame of the test speech comprises:

9. The method according to claim 1 or 2, characterized in that said dividing the speech frame of the test speech into a reliable part and an unreliable part according to the signal-to-noise ratio of each dimension of the speech frame of the test speech comprises:

10. The method according to claim 1 or 2, characterized in that, before said performing missing feature reconstruction on the unreliable part of the speech frame of the test speech, the method further comprises: judging, according to the unreliable part of the speech frame of the test speech, whether the speech frame of the test speech needs missing feature reconstruction, comprising:

11. The method according to claim 1 or 2, characterized in that said performing missing feature reconstruction on the unreliable part of the speech frame of the test speech according to the obtained weight, mean vector and covariance matrix of each Gaussian mixture cluster and the reliable part of the speech frame of the test speech comprises:

12. The method according to claim 11, characterized in that said judging the Gaussian mixture cluster to which the speech frame of the test speech belongs according to the weight, mean vector and covariance matrix of each Gaussian mixture cluster comprises:

Wherein,

P(X | λ_{k4}) = ω_{k4} \prod_{y \in X_{o}} N2(y, μ_{k4,y}, σ_{k4,y}) \prod_{y \in X_{m}} 0.5 (1 + \erf(\frac{y - μ_{k4,y}}{\sqrt{2} σ_{k4,y}}));

N2(y, μ_{k4,y}, σ_{k4,y}) = \frac{1}{\sqrt{2 π σ_{k4,y}^{2}}} \exp(-0.5 \frac{(y - μ_{k4,y})^{2}}{σ_{k4,y}^{2}});

Wherein, y is one dimension of X, ω_k4 is the weight of the Gaussian mixture cluster λ_k4, μ_k4,y is the mean of λ_k4 corresponding to dimension y, and σ_k4,y is the diagonal value of the covariance matrix of λ_k4 corresponding to dimension y;

By comparing the likelihood values P(X | λ_k4) of X for each Gaussian mixture cluster, the Gaussian mixture cluster λ_k4 with the largest likelihood value is selected.

13. A device for realizing missing feature reconstruction, characterized by at least comprising:

14. The device according to claim 13, characterized in that said reconstruction module is further configured to:

15. The device according to claim 13 or 14, characterized in that said acquisition module is specifically configured to:

16. The device according to claim 13 or 14, characterized in that said computing module is specifically configured to:

17. The device according to claim 13 or 14, characterized in that said reconstruction module is specifically configured to:

18. The device according to claim 13 or 14, characterized in that said reconstruction module is specifically configured to:

19. The device according to claim 13 or 14, characterized in that said reconstruction module is specifically configured to: