CN103559888A

CN103559888A - Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle

Info

Publication number: CN103559888A
Application number: CN201310548773.9A
Authority: CN
Inventors: 孙成立; 须明; 王希敏; 谢坚筱
Original assignee: KEY LABORATORY OF SCIENCE AND TECHNOLOGY ON AVIONICS INTEGRATION TECHNOLOGIES
Current assignee: KEY LABORATORY OF SCIENCE AND TECHNOLOGY ON AVIONICS INTEGRATION TECHNOLOGIES
Priority date: 2013-11-07
Filing date: 2013-11-07
Publication date: 2014-02-05
Anticipated expiration: 2033-11-07
Also published as: CN103559888B

Abstract

The invention discloses a speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition. The method comprises three steps. First, the noisy speech signal is smoothed, divided into frames, and transformed by the discrete Fourier transform to obtain the noisy speech spectrum. Second, the magnitude spectrum of each frame is taken as a column vector, the columns are arranged in chronological order to form a noisy speech time-frequency matrix, and this matrix is decomposed by non-negative low-rank and sparse matrix decomposition into a non-negative low-rank matrix and a non-negative sparse matrix. Third, the enhanced speech spectrum is reconstructed from the sparse matrix and the phase of the noisy speech, and the enhanced speech is obtained in the time domain by the inverse Fourier transform. The method adapts well to different kinds of noise, requires neither endpoint detection nor model training, has few parameters that are easy to tune, and performs well in strong-noise environments, and therefore has good application prospects.

Description

Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle

Technical field

The present invention relates to the field of signal processing and is applicable to noise suppression for noisy speech; in particular, it relates to a speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition.

Background technology

Speech is the most natural and most effective means by which humans exchange information. As humanity enters the information age, advanced speech processing technology is urgently needed to make human society more intelligent. As early as 2000, Bill Gates predicted that "the coming 10 years will be the era of speech." In recent years, as companies such as Apple, Google, and Microsoft have successively launched intelligent speech services, the intelligent speech industry has become a new industry within information technology, and user awareness and market size have gradually grown. In particular, the smartphone recently released by Apple has a voice assistant function, and iFlytek has released its speech "cloud", both of which give intelligent speech technology broader applications. However, voice communication and speech applications are inevitably subject to interference from the surrounding environment, the communication medium, and noise inside the communication equipment, which seriously hampers the practical use of intelligent speech technology.

Speech enhancement is an effective technique for dealing with noise contamination. It suppresses the interference of noise with speech so that the distortion between the enhanced speech signal and the original clean speech signal is minimized. Over the past few decades, many speech enhancement algorithms have emerged; typical ones include spectral subtraction, minimum mean-square error estimation of the spectral amplitude, subspace methods, and wavelet denoising. In environments with a relatively high signal-to-noise ratio (SNR), speech enhancement has been solved effectively. However, because of the diversity of noise in real environments and the complexity of the speech signal itself, speech enhancement algorithms differ from one application environment to another, which makes the research very difficult, and speech enhancement under strong noise and under multiple noise types has still not been solved well.

Many existing speech enhancement algorithms try to remove as much noise as possible by using probability density function models of the speech and noise signals. Research in recent years, however, shows that no single distribution suits all speech or all noise, so more flexible mathematical models and model estimation algorithms are needed to adapt to the characteristics of the signals themselves. In addition, noise estimation is an indispensable preliminary stage of existing speech enhancement algorithms. The noise power spectrum and the a priori SNR of the speech signal obtained through noise estimation are crucial to improving the enhancement result. Existing methods use voice endpoint detection to divide the captured signal into noise-only segments and noisy speech segments, and use the noise-only segments to estimate and update the noise estimator. This is a suboptimal way of estimating: in practice, the instantaneous noise in the noise-only segments and in the noisy speech segments does not fully agree, so this kind of noise estimation always introduces error. Moreover, existing voice endpoint detection is still immature at low SNR and under non-stationary noise and is prone to misjudgment, which can leave substantial residual noise in the enhanced speech.

Research on compressed sensing theory in recent years shows that many real-world observations can be modeled as the sum of a low-rank component and a sparse component, and that low-rank and sparse matrix decomposition can recover the original data from data contaminated by heavy noise or outliers. Low-rank and sparse matrix decomposition has been applied in many scientific and technical fields, such as image enhancement, video object detection, and data mining.

Stationary random noise and periodic noise are the two most common types of noise. A stationary random process is described by its first-order and second-order statistics; its mean and autocorrelation function do not vary with time. Because the Fourier transform of the autocorrelation function of a random signal is its power spectrum, the power spectrum is the same in every frame, and the time-frequency matrix of stationary random noise is therefore a low-rank matrix of rank 1. Likewise, if the noise is periodic, its time-frequency matrix has non-zero values only at certain fixed frequencies, so its column vectors are strongly correlated and the matrix is necessarily low-rank as well.
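
The rank-1 argument can be checked numerically. The following Python sketch is an illustration only and is not part of the disclosed method: it builds the magnitude time-frequency matrix of synthetic white Gaussian noise and measures how much of its energy lies in the first singular component. The frame length of 200 points, the hop of 120 points, the number of frames, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed framing parameters for this illustration
frame_len, hop, n_frames = 200, 120, 300

# Synthetic stationary noise (white Gaussian), long enough for n_frames frames
noise = rng.standard_normal(hop * (n_frames - 1) + frame_len)

# Windowed frames stacked as columns, in chronological order
window = np.hamming(frame_len)
frames = np.stack(
    [noise[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
    axis=1,
)

# Magnitude time-frequency matrix: one column per frame
Y = np.abs(np.fft.rfft(frames, axis=0))

# Fraction of the matrix energy captured by the first singular component
s = np.linalg.svd(Y, compute_uv=False)
rank1_energy = s[0] ** 2 / np.sum(s ** 2)
print(f"energy in first singular component: {rank1_energy:.2f}")
```

With a finite number of frames the matrix is only approximately rank 1: most of the energy sits in the first singular component, and the remainder reflects frame-to-frame fluctuation of the noise magnitude.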

In summary, the column vectors of the time-frequency matrix of background noise are strongly correlated, so the noise time-frequency matrix is low-rank. In contrast to background noise, the speech source signal is zero or close to zero at most time-frequency points and takes large values at only a small number of points, so it is sparse to some degree and is well described by a sparse matrix. Low-rank and sparse matrix decomposition can therefore be borrowed to solve the speech enhancement problem. A Chinese patent (publication number: CN102915742A) discloses a single-channel unsupervised speech-noise separation method based on low-rank and sparse matrix decomposition. That method first uses the short-time Fourier transform to transform the time-domain waveform of the noisy speech into the time-frequency domain and obtain its magnitude spectrum; it then uses a low-rank and sparse matrix decomposition algorithm to decompose the magnitude spectrum of the noisy speech into the sum of a noise magnitude spectrum, a speech magnitude spectrum, and a residual noise magnitude spectrum; finally, it reconstructs the time-domain speech waveform from the speech magnitude spectrum with the inverse short-time Fourier transform. The shortcoming of that method is that no non-negativity constraint is imposed on the low-rank and sparse matrix decomposition, so the speech magnitude spectrum separated from the noisy speech magnitude spectrum can easily contain negative values. An actual magnitude spectrum is a non-negative physical quantity and should never be negative. Negative magnitude values not only cause decomposition error but also produce musical noise that is unpleasant to the ear, degrading the perceived quality of the speech.

The present invention provides a speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition. The method decomposes the magnitude spectrum of the noisy speech with non-negative low-rank and sparse matrix decomposition, which guarantees that the resulting speech magnitude spectrum is non-negative and effectively improves the result of the low-rank and sparse decomposition. The method is robust, requires no endpoint detection, and has few parameters that are easy to tune, and it is suited to speech enhancement in strong-noise environments.

Summary of the invention

The technical problem to be solved by the present invention is to provide a speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition, which performs low-rank and sparse matrix decomposition in the time-frequency domain under low-rank and sparse constraints on the noise and the speech together with a non-negativity constraint, thereby separating the speech from the noise in noisy speech.

The present invention adopts the following technical solution. The speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition separates the speech signal from the noisy speech by non-negative low-rank and sparse matrix decomposition, and is implemented in the following steps:

(1) Pre-process the discrete noisy speech signal; pre-processing comprises signal smoothing and framing;

(2) Apply the discrete Fourier transform to the framed noisy speech signal to obtain the noisy speech spectrum;

(3) In the frequency domain, take the spectral magnitude of each speech frame as a column vector and arrange the columns in chronological order, so that a number of speech frames form the noisy speech time-frequency matrix;

(4) Decompose the noisy speech time-frequency matrix with the non-negative low-rank and sparse matrix decomposition algorithm to obtain a non-negative low-rank matrix and a non-negative sparse matrix. The decomposition is expressed as:

$$Y = L + S + E \quad \text{subject to} \quad \operatorname{rank}(L) \le r,\ \|S\|_0 \le h,\ L \ge 0,\ S \ge 0;$$

where $Y$ is the noisy speech time-frequency matrix; $L$ is the low-rank matrix, corresponding to the magnitude spectrum of the noise; $S$ is the sparse matrix, corresponding to the magnitude spectrum of the speech; $\|S\|_0$ is the number of non-zero elements in the sparse matrix $S$; $\operatorname{rank}(L)$ is the rank of the matrix $L$; $E$ is the residual matrix; and $r$ and $h$ are the upper bounds of the low-rank and sparsity constraints, respectively;

(5) Reconstruct the enhanced speech spectrum from the sparse matrix $S$ and the phase spectrum of the noisy speech, then apply the inverse Fourier transform to obtain the enhanced speech in time-domain form.

In step (1), the discrete noisy speech signal is pre-processed as follows:

(1) Signal smoothing is performed by taking the mean of the P nearest neighboring samples, in order to smooth the amplitude waveform of the noisy speech;

(2) The noisy speech signal is divided into frames using a Hamming window as the window function, with a window length of 200 points and an overlap of 80 points between adjacent frames.

The low-rank matrix $L$ and the sparse matrix $S$ are computed as follows:

(1) Initialization: $Y_0 = Y$; $L_0 = S_0 = [0]_{n \times K}$;

iteration counter initial value $i = 1$; maximum number of iterations $i_{\max} = 10^{3}$; relative error threshold $\delta = 10^{-3}$;

(2) Update the low-rank matrix with NMF: $(W, H) = \mathrm{NMF}(Y_{i-1})$, $L_i = WH$; $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times K}$;

where NMF denotes non-negative matrix factorization, $W$ and $H$ are the factors of the rank-$r$ NMF, and the Itakura-Saito divergence is used as the NMF cost function;

(3) Update the sparse matrix with the soft-thresholding operator: $S_i = (Y_{i-1} - L_i + S_{i-1} > \lambda) \odot (Y_{i-1} - L_i + S_{i-1} - \lambda)$;

where the symbol $\odot$ denotes the element-wise product of matrices and $\lambda$ is a threshold constant; $\lambda$ depends on the noise level, and the recommended value is $\lambda = \sigma$, where $\sigma$ is the standard deviation of the noise;

(4) Update the superposition matrix: $Y_i = L_i + S_i$;

(5) If $i$ reaches the maximum number of iterations, $i = i_{\max}$, or the relative error falls below the threshold $\delta$, stop iterating and output $L_i$ and $S_i$ as the estimates of $L$ and $S$; otherwise set $i = i + 1$, return to step (2), and continue iterating.
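
The sparse-matrix update of step (3) can be illustrated in isolation. The sketch below is a minimal Python rendering of the one-sided soft-thresholding operator, assuming NumPy; the function name `soft_threshold` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def soft_threshold(residual, lam):
    """One-sided soft-thresholding: (residual > lam) * (residual - lam).

    Entries above lam are kept and shrunk by lam; all other entries,
    including negative ones, become zero, so the result is non-negative.
    """
    return np.where(residual > lam, residual - lam, 0.0)

# Hypothetical residual matrix, standing in for Y_{i-1} - L_i + S_{i-1}
R = np.array([[0.2, 1.5],
              [-0.7, 3.0]])
S = soft_threshold(R, lam=1.0)
print(S)  # only the entries 1.5 and 3.0 exceed the threshold and survive, shrunk to 0.5 and 2.0
```

Because the comparison is one-sided, the operator enforces the constraint S ≥ 0 by construction, which is the property that distinguishes it from the usual two-sided soft threshold.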

In step (5), the enhanced speech spectrum is reconstructed from the sparse matrix and the phase spectrum of the noisy speech:

$$\hat{S}(n,k) = S(n,k)\, e^{j \angle Y(n,k)}$$

where $\angle Y(n,k)$ is the spectral phase of the noisy speech, $S$ is the sparse matrix, $S(n,k)$ is the spectral magnitude value in the sparse matrix, $\hat{S}(n,k)$ is the reconstructed enhanced speech spectrum, $n$ is the time-frame index, and $k$ is the frequency index.

The speech enhancement method provided by the invention uses non-negative low-rank and sparse matrix decomposition, which ensures that all elements of the resulting low-rank matrix and sparse matrix are non-negative. The method requires neither endpoint detection nor model training, is robust, and has few parameters that are easy to tune, and it is particularly suited to speech enhancement in strong-noise environments.

Brief description of the drawings

Fig. 1 is a block diagram of the speech enhancement system of the present invention.

Embodiment

The invention is further described below with reference to the accompanying drawing. Referring to Fig. 1, the speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition comprises the following steps:

1) Pre-process 101 the noisy speech signal y(t). The pre-processing stage 101 comprises signal smoothing and framing, which prepare the signal for subsequent processing. Signal smoothing means replacing the current value of the noisy speech signal with the mean of the P nearest neighboring samples of y(t), in order to smooth the amplitude waveform of the noisy speech signal. In the present invention, P is 3. The window function used for framing is a Hamming window, with a window length of 200 points and an overlap of 80 points between adjacent frames;
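
The pre-processing stage can be sketched as follows in Python with NumPy. This is a minimal illustration under stated assumptions rather than the patented implementation: the smoothing is rendered as a centered 3-point moving average, the 80-point figure is read as the overlap between adjacent frames (so the hop is 120 points), and the function names, the 8000 Hz sampling rate, and the test signal are hypothetical.

```python
import numpy as np

def smooth(y, P=3):
    """Replace each sample by the mean of its P nearest neighbors (centered window)."""
    kernel = np.ones(P) / P
    return np.convolve(y, kernel, mode="same")

def frame_signal(y, frame_len=200, overlap=80):
    """Split y into Hamming-windowed frames, returned as columns in time order."""
    hop = frame_len - overlap            # assumed reading: 80 points shared by neighbors
    n_frames = 1 + (len(y) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack(
        [y[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )

# Hypothetical test signal: one second of a noisy tone at an assumed 8000 Hz rate
fs = 8000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
y = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(fs)

frames = frame_signal(smooth(y, P=3))
print(frames.shape)  # (frame_len, n_frames)
```

Samples at the two ends of the signal are averaged against implicit zeros by `mode="same"`; a production implementation might treat the edges differently.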

2) Apply the DFT (discrete Fourier transform) 102 to the framed noisy speech signal to obtain the signal spectrum, which comprises the magnitude spectrum 104, $|Y(n,k)|$, and the phase spectrum 103, $\angle Y(n,k)$, where $n$ is the frame index, $n = 1, 2, \ldots, N$; $k$ is the frequency index, $k = 1, 2, \ldots, K$; $N$ is the total number of time frames; and $K$ is the number of Fourier transform points;

3) In the frequency domain, take the magnitude spectrum 104 of each speech frame as a column vector and arrange the columns in order, so that the speech frames form an N × K noisy time-frequency matrix Y;
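
Steps 2) and 3) can be sketched together in Python with NumPy. The sketch assumes the frames are already windowed and stored as columns, and it uses the one-sided real FFT (`np.fft.rfft`), which is a choice made for this illustration; the function name `tf_matrices` and the random input are hypothetical. In the sketch the rows index frequency and the columns index time frames, in line with the description of each frame as a column vector.

```python
import numpy as np

def tf_matrices(frames):
    """Return magnitude and phase matrices, one column per frame.

    frames: array of shape (frame_len, n_frames) holding windowed frames.
    """
    spectrum = np.fft.rfft(frames, axis=0)   # DFT of every column
    return np.abs(spectrum), np.angle(spectrum)

# Hypothetical input: 66 frames of 200 points each
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 66))

Y_mag, Y_phase = tf_matrices(frames)
print(Y_mag.shape, Y_phase.shape)  # (frame_len // 2 + 1, n_frames) for each
```

Only `Y_mag` is passed to the decomposition; `Y_phase` is kept aside for the reconstruction step.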

4) Apply NLSMD (non-negative low-rank and sparse matrix decomposition) 105 to the noisy time-frequency matrix Y to compute the non-negative low-rank matrix L and the non-negative sparse matrix S:

$$Y = L + S + E \quad \text{subject to} \quad \operatorname{rank}(L) \le r,\ \|S\|_0 \le h,\ L \ge 0,\ S \ge 0$$

Here $L$ corresponds to the magnitude spectrum of the noise; $S$ corresponds to the magnitude spectrum of the speech; $\|S\|_0$ is the number of non-zero elements in the matrix $S$; $E$ is the residual matrix; $\operatorname{rank}(L)$ is the rank of the matrix $L$; and $r$ and $h$ are the upper bounds of the low-rank and sparsity constraints, respectively. Comparative tests show that good noise reduction is obtained with $r$ between 1 and 3.

NLSMD (non-negative low-rank and sparse matrix decomposition) 105 is computed as follows:

1. Initialization: $Y_0 = Y$; $L_0 = S_0 = [0]_{n \times K}$; iteration counter $i = 1$; maximum number of iterations $i_{\max} = 10^{3}$; relative error threshold $\delta = 10^{-3}$;

2. Update the low-rank matrix with non-negative matrix factorization: $(W, H) = \mathrm{NMF}(Y_{i-1})$, $L_i = WH$; $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times K}$;

where $L_i$ is the estimate of $L$ after the $i$-th iteration, NMF denotes non-negative matrix factorization, and $W$ and $H$ are the factors of the rank-$r$ NMF. Because $W$ and $H$ are non-negative, $L_i$ is necessarily a non-negative matrix as well. The cost function of the NMF algorithm can be the Euclidean distance, the Kullback-Leibler divergence, or the Itakura-Saito divergence. In comparative tests, the Itakura-Saito divergence gave the best results. The present invention therefore uses NMF based on the Itakura-Saito divergence to compute $L$.

3. Update the sparse matrix $S_i$ with the soft-thresholding operator: $S_i = (Y_{i-1} - L_i + S_{i-1} > \lambda) \odot (Y_{i-1} - L_i + S_{i-1} - \lambda)$;

where the symbol $\odot$ denotes the element-wise product of matrices and $\lambda$ is the threshold; the value of $\lambda$ depends on the noise intensity, and the recommended value is $\lambda = \sigma$, where $\sigma$ is the standard deviation of the noise.

4. Update the superposition matrix: $Y_i = L_i + S_i$;

5. If $i$ reaches the maximum number of iterations, $i = i_{\max}$, or the relative error falls below the threshold $\delta$, stop iterating and output $L_i$ and $S_i$ as the estimates of $L$ and $S$; otherwise set $i = i + 1$, return to step 2, and continue iterating;
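
The decomposition loop can be sketched in Python with NumPy as shown below. This is a simplified illustration and not the patented procedure itself. It keeps the two ingredients named in the steps, rank-r NMF under the Itakura-Saito divergence and one-sided soft-thresholding, but it uses the plainer alternation L ← NMF(Y − S), S ← threshold(Y − L) instead of the superposition-matrix bookkeeping of steps 3 and 4. The multiplicative NMF updates, the stopping test on the change in relative error, the inner iteration count, the function names, and the synthetic test data are all assumptions of this sketch.

```python
import numpy as np

EPS = 1e-12  # guards divisions; an implementation detail of this sketch

def is_nmf_updates(V, W, H, n_inner=20):
    """Multiplicative updates for NMF under the Itakura-Saito divergence."""
    for _ in range(n_inner):
        WH = W @ H + EPS
        H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + EPS)
        WH = W @ H + EPS
        W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + EPS)
    return W, H

def nlsmd(Y, r=2, lam=1.0, max_iter=1000, delta=1e-3, seed=0):
    """Split non-negative Y into non-negative low-rank L and sparse S."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    n, K = Y.shape
    W = rng.random((n, r)) + 0.1          # strictly positive random start
    H = rng.random((r, K)) + 0.1
    S = np.zeros_like(Y)
    prev_err = np.inf
    for _ in range(max_iter):
        # Low-rank update: fit the part of Y not yet explained by S
        W, H = is_nmf_updates(Y - S + EPS, W, H)
        L = W @ H
        # Sparse update: one-sided soft threshold keeps S non-negative
        R = Y - L
        S = np.where(R > lam, R - lam, 0.0)
        # Assumed stopping rule: relative error has stopped changing
        err = np.linalg.norm(Y - L - S) ** 2 / np.linalg.norm(Y) ** 2
        if abs(prev_err - err) < delta:
            break
        prev_err = err
    return L, S

# Synthetic check: rank-1 "noise" background plus a few large "speech" spikes
rng = np.random.default_rng(1)
background = np.outer(rng.uniform(1, 2, 40), rng.uniform(1, 2, 60))
spikes = rng.random((40, 60)) < 0.05
Y = background + 10.0 * spikes

L, S = nlsmd(Y, r=1, lam=1.0, max_iter=30)
print(L.min() >= 0, S.min() >= 0)
```

In this alternation the matrix passed to the NMF stays non-negative: wherever S is non-zero, Y − S equals the previous L plus the threshold, and elsewhere it equals Y. The threshold `lam` plays the role of λ and would be set from the noise level in practice.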

5) Reconstruct the enhanced speech spectrum from the sparse matrix S and the phase spectrum of the noisy speech. Because the human ear is insensitive to the phase of sound, the phase $\angle Y(n,k)$ of the noisy speech spectrum can be used in place of the phase of the enhanced speech, which gives the complex spectrum of the enhanced speech:

$$\hat{S}(n,k) = S(n,k)\, e^{j \angle Y(n,k)}$$

6) Unfold the complex spectrum matrix of the enhanced speech into a vector and apply the IDFT (inverse discrete Fourier transform) 106 to it to obtain the discrete-time representation of the enhanced speech, where the vec function denotes the operation of concatenating the column vectors of a matrix into a one-dimensional vector in time-frame order.
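
Steps 5) and 6) can be sketched in Python with NumPy. The text describes unfolding the spectrum matrix and applying the IDFT; because the frames overlap, this sketch swaps in the standard per-frame inverse FFT followed by window-normalized overlap-add, which is an assumption of the illustration and not a detail confirmed by the text. The function name `reconstruct`, the one-sided spectrum layout, and the 120-point hop are likewise assumed.

```python
import numpy as np

def reconstruct(S_mag, Y_phase, frame_len=200, overlap=80):
    """Combine magnitude S_mag with the noisy phase and return a time signal.

    S_mag, Y_phase: arrays of shape (frame_len // 2 + 1, n_frames).
    """
    hop = frame_len - overlap
    spectrum = S_mag * np.exp(1j * Y_phase)            # magnitude with noisy phase
    frames = np.fft.irfft(spectrum, n=frame_len, axis=0)
    n_frames = frames.shape[1]
    window = np.hamming(frame_len)
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += frames[:, i]   # overlap-add
        norm[i * hop : i * hop + frame_len] += window        # summed analysis window
    return out / norm   # the Hamming window is never zero, so norm > 0

# Round trip on an unmodified spectrum as a sanity check
rng = np.random.default_rng(0)
y = rng.standard_normal(200 + 120 * 9)
window = np.hamming(200)
frames = np.stack([y[i * 120 : i * 120 + 200] * window for i in range(10)], axis=1)
spec = np.fft.rfft(frames, axis=0)
y_hat = reconstruct(np.abs(spec), np.angle(spec))
print(np.allclose(y_hat, y))
```

Dividing by the summed window undoes the Hamming weighting applied during framing, so an unmodified spectrum returns the original samples; with an enhanced magnitude in place of `np.abs(spec)`, the same routine yields the enhanced waveform.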

Claims

1. A speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition, characterized in that the speech signal is separated from the noisy speech by non-negative low-rank and sparse matrix decomposition, implemented in the following steps:

$$Y = L + S + E \quad \text{subject to} \quad \operatorname{rank}(L) \le r,\ \|S\|_0 \le h,\ L \ge 0,\ S \ge 0;$$

2. The speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition according to claim 1, characterized in that, in step (1), the discrete noisy speech signal is pre-processed as follows:

3. The non-negative low-rank and sparse matrix decomposition algorithm according to claim 1, characterized in that the low-rank matrix $L$ and the sparse matrix $S$ are computed as follows:

(1) Initialization: $Y_0 = Y$; $L_0 = S_0 = [0]_{n \times K}$;

(4) Update the superposition matrix: $Y_i = L_i + S_i$;

(5) If $i$ reaches the maximum number of iterations, $i = i_{\max}$, or the relative error falls below the threshold $\delta$, stop iterating and output $L_i$ and $S_i$ as the estimates of $L$ and $S$; otherwise set $i = i + 1$, return to step (2), and continue iterating.

4. The speech enhancement method based on the principle of non-negative low-rank and sparse matrix decomposition according to claim 1, characterized in that, in step (5), the enhanced speech spectrum is reconstructed from the sparse matrix and the phase spectrum of the noisy speech:

where $\angle Y(n,k)$ is the spectral phase of the noisy speech, $S$ is the sparse matrix, $|S(n,k)|$ is the spectral magnitude value in the sparse matrix,