CN108574911A - Unsupervised single-microphone speech denoising method and system - Google Patents

Unsupervised single-microphone speech denoising method and system

Info

Publication number
CN108574911A
Authority
CN
China
Prior art keywords
voice
noise
matrix
dictionary
present frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710137778.0A
Other languages
Chinese (zh)
Other versions
CN108574911B (en)
Inventor
李军锋
李煦
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN201710137778.0A
Publication of CN108574911A
Application granted
Publication of CN108574911B
Legal status: Active
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R3/04 - Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 - Signal processing covered by H04R, not provided for in its groups

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses an unsupervised single-microphone speech denoising method. The method includes: step 1) extracting spectra from collected speech training data that covers all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S; step 2) applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; step 3) reconstructing the denoised speech signal of the current frame from the estimated noise matrix and noise weight vector.

Description

Unsupervised single-microphone speech denoising method and system
Technical field
The present invention relates to the field of speech signal processing, and more particularly to an unsupervised single-microphone speech denoising method and system.
Background technology
In many application scenarios (such as voice communication, automatic speech recognition and hearing aids), the speech signal is inevitably corrupted by ambient noise, for example road noise, wind noise and circuit noise, so algorithms must be designed to denoise the noisy signal picked up by the device. Many hearing devices (or instruments) have only one microphone to pick up the speech signal, so the algorithm has to remove the noise from a single noisy signal, which further increases the difficulty of the problem.
A traditional single-microphone speech denoising algorithm mainly consists of two parts: noise estimation and gain computation (a minimal sketch is given below). Such algorithms usually assume that the noise is stationary, so they suppress stationary noise well; in many cases, however, the noise is non-stationary and hard to estimate accurately, which leads to poor denoising performance. In recent years, data-driven single-microphone denoising algorithms, such as those based on non-negative matrix factorization (NMF), have attracted wide attention because they suppress non-stationary noise better.
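For illustration only, a minimal sketch of the traditional two-part scheme in Python, assuming the noisy magnitude spectrogram is already available as a NumPy array and that the first few frames contain only noise; the function name, the number of noise-only frames and the gain floor are illustrative choices, not taken from the patent.

```python
import numpy as np

def classic_gain_denoise(noisy_mag, noise_frames=10, floor=0.05):
    """Traditional two-part denoiser: (1) estimate a stationary noise
    spectrum from the first few frames, (2) compute a per-bin gain and
    apply it to every frame.  noisy_mag has shape (freq_bins, frames)."""
    noise_est = noisy_mag[:, :noise_frames].mean(axis=1, keepdims=True)
    snr = np.maximum(noisy_mag**2 - noise_est**2, 0.0) / (noise_est**2 + 1e-12)
    gain = np.maximum(snr / (1.0 + snr), floor)   # Wiener-type gain with a floor
    return gain * noisy_mag
```

Because the noise estimate is computed once and assumed stationary, this kind of scheme degrades quickly when the noise changes over time, which is exactly the limitation discussed above.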
In NMF-based algorithms, non-negative matrix factorization is first applied to speech and noise training data to obtain the corresponding dictionary matrices, which describe the spectral structure of speech and noise. In the denoising stage, the noisy signal is decomposed into the product of the dictionary matrix and a weight matrix, and the denoised speech signal is finally reconstructed by multiplying the speech dictionary with its corresponding weights (a sketch of this supervised baseline follows this paragraph). Because this class of algorithms does not rely on a stationary-noise assumption, it can suppress non-stationary noise well, which is favourable for practical use. It also has limitations: it needs training data for the specific speaker and the specific noise type, and matched training data are hard to obtain in advance in many scenarios, which limits its applicability. Moreover, such algorithms typically denoise a whole utterance at once, whereas practical applications require real-time processing.
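A minimal sketch of this conventional supervised NMF baseline, assuming the speech and noise dictionaries W_speech and W_noise have been trained beforehand on matched data and that magnitude spectrograms are available as NumPy arrays; all names and parameter values here are illustrative.

```python
import numpy as np

def estimate_weights(X, W, n_iter=100, eps=1e-12, seed=0):
    """Fit non-negative weights H (atoms x frames) to the magnitude
    spectrogram X (freq x frames) for a FIXED dictionary W, using
    KL-divergence multiplicative updates."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        H *= (W.T @ (X / (W @ H + eps))) / (W.T @ ones + eps)
    return H

def supervised_nmf_denoise(X, W_speech, W_noise, eps=1e-12):
    """Batch supervised NMF denoising: decompose the whole noisy
    spectrogram against pre-trained speech and noise dictionaries and
    rebuild the speech part with a Wiener-style mask."""
    W = np.hstack([W_speech, W_noise])   # combined dictionary [W_S W_N]
    H = estimate_weights(X, W)
    k = W_speech.shape[1]
    S = W_speech @ H[:k]                 # reconstructed speech spectrogram
    N = W_noise @ H[k:]                  # reconstructed noise spectrogram
    return X * S / (S + N + eps)         # masked (denoised) magnitude
```

Note that both dictionaries are fixed here and the whole utterance is processed at once; the method of the invention instead learns the noise dictionary online, frame by frame, and needs no noise training data.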
Summary of the invention
The object of the present invention is to overcome the limitation of traditional NMF-based denoising algorithms, which depend on training data for a specific speaker and specific noise types, by proposing an unsupervised single-microphone speech denoising method that is convenient to apply in real scenarios.
To achieve the above goal, the present invention provides an unsupervised single-microphone speech denoising method, the method including:
Step 1) extract spectra from the collected speech training data covering all phonemes, then apply k-means clustering to the magnitude spectra to obtain a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary W_S;
Step 2) apply a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, then combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t]; combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; perform iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame;
Step 3) reconstruct the denoised speech signal of the current frame from the estimated noise matrix and noise weight vector.
In the above technical solution, step 1) specifically includes:
Step 101) collect a large amount of clean speech as training data; the collected training data should cover all phonemes;
Step 102) pre-process the collected speech training data and then extract its spectrum;
Step 103) apply k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
Step 104) apply non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering, yielding a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary.
In the above technical solution, the specific implementation of step 104) is:
First, apply the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
All class-specific speech dictionaries are then combined into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
In the above technical solution, step 2) specifically includes:
Step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, output it directly without processing; otherwise combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
Step 2-2) combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; the initial value of W_N is a random non-negative matrix; initialize the current-frame weight vector h_t with random non-negative values and combine it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
Step 2-3) once X, W and H are determined, compute the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
Step 2-4) take the last column v_t of V and update the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
Step 2-5) apply a sparsity penalty to the speech part h_{S,t} of h_t and update it, where λ and ε are coefficients;
Step 2-6) update the current-frame noise matrix W_N and normalize it;
Step 2-7) check whether the W_N obtained in step 2-6) has converged; if so, go to step 3); otherwise replace W_N with the newly updated noise matrix and return to step 2-3).
In the above technical solution, step 3) specifically includes:
Step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstruct the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
Step 3-2) obtain the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
Step 3-3) combine the denoised speech spectrum with the phase of the current noisy frame and recover the denoised time-domain waveform by the inverse Fourier transform.
The present invention also provides an unsupervised single-microphone speech denoising system, the system including:
a speech dictionary generation module for extracting spectra from the collected speech training data covering all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
a noise dictionary generation module for applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; and
a denoising module for reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the current frame's speech and noise weight vector.
In the above technical solution, the specific implementation of the speech dictionary generation module is:
Step 101) collect a large amount of clean speech as training data; the collected training data should cover all phonemes;
Step 102) pre-process the collected speech training data and then extract its spectrum;
Step 103) apply k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
Step 104) apply non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering, yielding a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary:
First, apply the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
All class-specific speech dictionaries are then combined into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
In the above technical solution, the specific implementation of the noise dictionary generation module is:
Step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, output it directly without processing; otherwise combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
Step 2-2) combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; the initial value of W_N is a random non-negative matrix; initialize the current-frame weight vector h_t with random non-negative values and combine it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
Step 2-3) once X, W and H are determined, compute the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
Step 2-4) take the last column v_t of V and update the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
Step 2-5) apply a sparsity penalty to the speech part h_{S,t} of h_t and update it, where λ and ε are coefficients;
Step 2-6) update the current-frame noise matrix W_N and normalize it;
Step 2-7) check whether the W_N obtained in step 2-6) has converged; if so, go to step 3); otherwise replace W_N with the newly updated noise matrix and return to step 2-3).
In the above technical solution, the specific implementation of the denoising module is:
Step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstruct the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
Step 3-2) obtain the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
Step 3-3) combine the denoised speech spectrum with the phase of the current noisy frame and recover the denoised time-domain waveform by the inverse Fourier transform.
The advantages of the invention are:
1. The method of the invention removes the dependence on speaker-specific and noise-specific training data, which widens the range of applications of the algorithm;
2. The invention realizes an online denoising algorithm, which gives it strong practicality.
Description of the drawings
Fig. 1 is a flow chart of the unsupervised single-microphone speech denoising method of the present invention;
Fig. 2 illustrates the online NMF decomposition algorithm proposed by the present invention.
Detailed description of the embodiments
The present invention is described in more detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, an unsupervised single-microphone speech denoising method includes:
Step 1) extract spectra from the collected speech training data covering all phonemes, then apply k-means clustering to the magnitude spectra to obtain a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary; this specifically includes:
Step 101) collect a large amount of clean speech as training data;
The training data can be obtained from the many open-source speech corpora, and the collected data should cover all phonemes;
Step 102) pre-process the collected speech training data and then extract the spectrum of the speech signal;
Pre-processing of the speech signal includes: first zero-padding each frame of the speech signal to N points, where N = 2^i, i is an integer and i ≥ 8; then windowing or pre-emphasizing each frame, using a Hamming or Hanning window as the window function. A minimal sketch of this pre-processing is given below.
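A minimal pre-processing sketch in Python, assuming a 16 kHz sampling rate; the frame length, hop size and FFT size (N = 512 = 2^9) are illustrative choices, since the patent only requires N = 2^i with i ≥ 8.

```python
import numpy as np

def frame_magnitude_spectra(signal, frame_len=400, hop=160, n_fft=512):
    """Split a waveform into frames, apply a Hamming window (a Hanning
    window works equally well), zero-pad each frame to n_fft points and
    return magnitude spectra (freq_bins x frames) plus the phases needed
    later to resynthesize the denoised waveform."""
    window = np.hamming(frame_len)
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    mags, phases = [], []
    for k in range(n_frames):
        frame = signal[k * hop:k * hop + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)   # zero-pads the frame to n_fft
        mags.append(np.abs(spec))
        phases.append(np.angle(spec))
    return np.array(mags).T, np.array(phases).T
```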
Step 103) apply k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
Step 104) apply non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering, yielding a speech dictionary for each class; then combine all class-specific speech dictionaries into one complete speech dictionary;
First, apply the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure (for example, each sub-dictionary describes one phoneme);
All class-specific speech dictionaries are then combined into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
In this way, the speech dictionary obtained by the above clustering and decomposition describes the speech components in noisy speech; a minimal sketch of this dictionary-learning step follows.
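A minimal dictionary-learning sketch, assuming scikit-learn's KMeans for the clustering step; the number of clusters G and the number of atoms per class are illustrative values, and the plain KL-divergence NMF used here stands in for whatever NMF variant an implementation might prefer.

```python
import numpy as np
from sklearn.cluster import KMeans

def nmf_kl(V, rank, n_iter=100, eps=1e-12, seed=0):
    """Plain KL-divergence NMF with Lee-Seung multiplicative updates:
    V (freq x frames) is approximated by W (freq x rank) @ H (rank x frames)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

def build_speech_dictionary(S_train, n_clusters=32, atoms_per_class=5):
    """Cluster training magnitude spectra (freq x frames) with k-means,
    run NMF on each cluster separately, and concatenate the per-class
    dictionaries into one complete speech dictionary W_S."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(S_train.T)  # cluster frames
    dictionaries = []
    for g in range(n_clusters):
        S_g = S_train[:, labels == g]          # all frames assigned to class g
        if S_g.shape[1] == 0:
            continue                           # skip empty clusters
        W_g, _ = nmf_kl(S_g, atoms_per_class)
        W_g /= W_g.sum(axis=0, keepdims=True) + 1e-12   # normalize each atom
        dictionaries.append(W_g)
    return np.hstack(dictionaries)             # W_S = [W_S^(1) ... W_S^(G)]
```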
Step 2) apply a short-time Fourier transform (STFT) to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, then combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], and perform iterative non-negative matrix factorization (NMF) on the output speech spectrum to obtain the noise dictionary and the speech and noise weight vector of the current frame;
As shown in Fig. 2, step 2) specifically includes:
Step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, output it directly without processing; otherwise combine it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
Step 2-2) combine the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N]; the initial value of W_N is a random non-negative matrix; initialize the current-frame weight vector h_t with random non-negative values and combine it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the sub-dictionaries, and h_{N,t} is the noise weight vector;
Step 2-3) once X, W and H are determined, compute the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
Step 2-4) take the last column v_t of V and update the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
Step 2-5) apply a sparsity penalty to the speech part h_{S,t} of h_t and update it, where λ and ε are coefficients;
Step 2-6) update the noise matrix W_N and normalize it;
In the above steps, ./ denotes point-wise division and .* denotes point-wise multiplication.
Step 2-7) check whether the W_N obtained in step 2-6) has converged; if so, go to step 2-8); otherwise replace W_N with the newly updated noise matrix and return to step 2-3);
In this example, W_N is generally taken to have converged after 50 iterations, and the iteration then stops.
Step 2-8) update the buffer of the L previously processed speech spectra: add the newly processed frame and remove the earliest one, and store the noise matrix W_N and the weight vector h_t for processing the next frame. A sketch of this per-frame online NMF procedure is given below.
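A sketch of one pass of this online NMF stage, assuming Python with NumPy. The exact update formulas for h_t, the sparsity penalty (λ, ε) and W_N appear as equations in the original patent document and are not reproduced here; the code below uses standard KL-divergence multiplicative updates with a simple L1-style penalty on the speech weights as a plausible stand-in, so it should be read as illustrative rather than as the patented update rules.

```python
import numpy as np

def online_nmf_step(x_t, X_buf, W_S, W_N, H_buf, n_iter=50,
                    lam=0.1, eps=1e-12, seed=None):
    """One frame of the online NMF stage (steps 2-1 .. 2-8).
    x_t   : current-frame magnitude spectrum, shape (freq_bins,)
    X_buf : list of the L previously processed magnitude spectra
    W_S   : fixed speech dictionary; W_N: noise dictionary from the last frame
    H_buf : list of the weight vectors estimated for the previous L frames
    Returns the updated noise dictionary and the current frame's
    speech and noise weight vectors."""
    rng = np.random.default_rng(seed)
    X = np.column_stack(list(X_buf) + [x_t])        # X = [x_{t-L} ... x_t]
    W = np.hstack([W_S, W_N])                        # total dictionary [W_S W_N]
    k = W_S.shape[1]                                 # number of speech atoms
    h_t = rng.random(W.shape[1]) + eps               # random init for frame t
    H = np.column_stack(list(H_buf) + [h_t])         # past weights are reused
    ones_X = np.ones_like(X)
    for _ in range(n_iter):
        V = X / (W @ H + eps)                        # point-wise ratio X ./ (WH)
        v_t = V[:, -1]                               # last column: current frame
        h_t = H[:, -1] * (W.T @ v_t) / (W.T @ np.ones_like(v_t) + eps)
        h_t[:k] /= (1.0 + lam)                       # sparsity penalty on speech part
        H[:, -1] = h_t
        V = X / (W @ H + eps)                        # recompute with new weights
        W_N = W[:, k:] * ((V @ H[k:].T) / (ones_X @ H[k:].T + eps))
        W_N /= W_N.sum(axis=0, keepdims=True) + eps  # normalize noise atoms
        W[:, k:] = W_N                               # speech dictionary stays fixed
    return W_N, h_t[:k], h_t[k:]
```

In use, the caller keeps X_buf and H_buf as sliding windows of length L (for the first L frames the patent simply outputs the noisy frame unprocessed), appends the newly processed frame and h_t after each call, and carries W_N over to the next frame.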
Step 3) reconstruct the denoised speech signal of the current frame from the estimated noise matrix and weight vector; this specifically includes:
Step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstruct the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
Step 3-2) obtain the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
Step 3-3) combine the denoised speech spectrum with the phase of the current noisy frame and recover the denoised time-domain waveform by the inverse Fourier transform; a minimal sketch of this reconstruction step follows.
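A minimal reconstruction sketch matching the pre-processing sketch above; the magnitude-domain mask s / (s + n) is one common Wiener-type gain used with NMF (a power-domain variant s² / (s² + n²) is equally plausible), and the frame parameters are the same illustrative values as before.

```python
import numpy as np

def reconstruct_frame(x_t, phase_t, W_S, W_N_hat, h_S, h_N,
                      frame_len=400, n_fft=512, eps=1e-12):
    """Steps 3-1 .. 3-3: rebuild the separated speech and noise spectra,
    apply a Wiener-type gain to the noisy magnitude, re-attach the noisy
    phase and return the denoised time-domain frame."""
    s_hat = W_S @ h_S                         # reconstructed speech spectrum
    n_hat = W_N_hat @ h_N                     # reconstructed noise spectrum
    gain = s_hat / (s_hat + n_hat + eps)      # Wiener-type mask
    clean_mag = gain * x_t                    # denoised magnitude spectrum
    spec = clean_mag * np.exp(1j * phase_t)   # combine with the noisy phase
    return np.fft.irfft(spec, n=n_fft)[:frame_len]

# The denoised waveform is obtained by overlap-adding the returned frames
# (dividing by the summed squared window if a perfect-reconstruction
# window/hop pair is used).
```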
In addition, the present invention also provides an unsupervised single-microphone speech denoising system, the system including:
a speech dictionary generation module for extracting spectra from the collected speech training data covering all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
a noise dictionary generation module for applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; and
a denoising module for reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the current frame's speech and noise weight vector.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art will understand that modifications or equivalent substitutions of the technical solution of the present invention that do not depart from its spirit and scope shall all fall within the scope of the claims of the present invention.

Claims (9)

1. An unsupervised single-microphone speech denoising method, the method comprising:
step 1) extracting spectra from the collected speech training data covering all phonemes, then applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
step 2) applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, then combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame;
step 3) reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the speech and noise weight vector.
2. The unsupervised single-microphone speech denoising method according to claim 1, wherein step 1) specifically comprises:
step 101) collecting a large amount of clean speech as training data, the collected training data covering all phonemes;
step 102) pre-processing the collected speech training data and then extracting its spectrum;
step 103) applying k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
step 104) applying non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary.
3. The unsupervised single-microphone speech denoising method according to claim 2, wherein the specific implementation of step 104) is:
first, applying the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
then combining all class-specific speech dictionaries into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
4. The unsupervised single-microphone speech denoising method according to claim 3, wherein step 2) specifically comprises:
step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, outputting it directly without processing; otherwise combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
step 2-2) combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], the initial value of W_N being a random non-negative matrix; initializing the current-frame weight vector h_t with random non-negative values and combining it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
step 2-3) once X, W and H are determined, computing the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
step 2-4) taking the last column v_t of V and updating the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
step 2-5) applying a sparsity penalty to the speech part h_{S,t} of h_t and updating it, where λ and ε are coefficients;
step 2-6) updating the current-frame noise matrix W_N and normalizing it;
step 2-7) checking whether the W_N obtained in step 2-6) has converged; if so, going to step 3); otherwise replacing W_N with the newly updated noise matrix and returning to step 2-3).
5. The unsupervised single-microphone speech denoising method according to claim 4, wherein step 3) specifically comprises:
step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstructing the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
step 3-2) obtaining the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
step 3-3) combining the denoised speech spectrum with the phase of the current noisy frame and recovering the denoised time-domain waveform by the inverse Fourier transform.
6. An unsupervised single-microphone speech denoising system, the system comprising:
a speech dictionary generation module for extracting spectra from the collected speech training data covering all phonemes, applying k-means clustering to the magnitude spectra to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary W_S;
a noise dictionary generation module for applying a short-time Fourier transform to the noisy signal arriving at the current time to obtain the current-frame magnitude spectrum x_t, combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t], combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], and performing iterative non-negative matrix factorization on the output speech spectrum X to obtain the estimated noise matrix and the speech and noise weight vector of the current frame; and
a denoising module for reconstructing the denoised speech signal of the current frame from the estimated noise matrix and the speech and noise weight vector of the current frame.
7. The unsupervised single-microphone speech denoising system according to claim 6, wherein the specific implementation of the speech dictionary generation module is:
step 101) collecting a large amount of clean speech as training data, the collected training data covering all phonemes;
step 102) pre-processing the collected speech training data and then extracting its spectrum;
step 103) applying k-means clustering to the magnitude spectra of the training data so that speech frames with similar spectral structure are grouped into one class, obtaining the class magnitude spectra S^(g), g = 1, ..., G, where G is the total number of clusters;
step 104) applying non-negative matrix factorization separately to each class magnitude spectrum S^(g), g = 1, ..., G, obtained by clustering to obtain a speech dictionary for each class, and then combining all class-specific speech dictionaries into one complete speech dictionary:
first, applying the following non-negative matrix factorization to each class of clustered magnitude spectra:
S^(g) ≈ W_S^(g) H^(g), g = 1, ..., G
where S^(g) is the speech spectrum belonging to class g and W_S^(g) is the dictionary matrix of class g obtained from the decomposition, which describes one kind of spectral structure;
then combining all class-specific speech dictionaries into one complete speech dictionary:
W_S = [W_S^(1), W_S^(2), ..., W_S^(G)].
8. The unsupervised single-microphone speech denoising system according to claim 7, wherein the specific implementation of the noise dictionary generation module is:
step 2-1) for the current-frame noisy magnitude spectrum x_t: if x_t arrives before the L-th frame, outputting it directly without processing; otherwise combining it with the L previously processed magnitude spectra into the output speech spectrum X = [x_{t-L}, ..., x_{t-1}, x_t];
step 2-2) combining the noise matrix W_N estimated in the previous frame with the speech dictionary W_S into the total dictionary matrix W = [W_S W_N], the initial value of W_N being a random non-negative matrix; initializing the current-frame weight vector h_t with random non-negative values and combining it with the weight vectors estimated for the previous L frames into the weight matrix H = [h_{t-L}, ..., h_{t-1}, h_t], where h_t = [h_{S,t}^T h_{N,t}^T]^T, h_{S,t} contains the current frame's speech weights for the G sub-dictionaries, and h_{N,t} is the noise weight vector;
step 2-3) once X, W and H are determined, computing the similarity value V between X and WH:
V = X ./ (WH)
where ./ denotes point-wise division;
step 2-4) taking the last column v_t of V and updating the current frame's weight vector h_t with a multiplicative rule, where .* denotes point-wise multiplication;
step 2-5) applying a sparsity penalty to the speech part h_{S,t} of h_t and updating it, where λ and ε are coefficients;
step 2-6) updating the current-frame noise matrix W_N and normalizing it;
step 2-7) checking whether the W_N obtained in step 2-6) has converged; if so, going to step 3); otherwise replacing W_N with the newly updated noise matrix and returning to step 2-3).
9. The unsupervised single-microphone speech denoising system according to claim 8, wherein the specific implementation of the denoising module is:
step 3-1) after the noise matrix and the current-frame weight vector have been estimated in step 2), reconstructing the separated current-frame speech spectrum s_t = W_S h_{S,t} and noise spectrum n_t = W_N h_{N,t};
step 3-2) obtaining the final denoised speech spectrum by Wiener filtering, i.e. by applying a gain computed from the reconstructed speech and noise spectra to the noisy spectrum;
step 3-3) combining the denoised speech spectrum with the phase of the current noisy frame and recovering the denoised time-domain waveform by the inverse Fourier transform.
CN201710137778.0A 2017-03-09 2017-03-09 Unsupervised single-microphone speech denoising method and system Active CN108574911B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710137778.0A CN108574911B (en) 2017-03-09 2017-03-09 Unsupervised single-microphone speech denoising method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710137778.0A CN108574911B (en) 2017-03-09 2017-03-09 Unsupervised single-microphone speech denoising method and system

Publications (2)

Publication Number Publication Date
CN108574911A true CN108574911A (en) 2018-09-25
CN108574911B CN108574911B (en) 2019-10-22

Family

ID=63577827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710137778.0A Active CN108574911B (en) Unsupervised single-microphone speech denoising method and system

Country Status (1)

Country Link
CN (1) CN108574911B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011107951A1 (en) * 2010-03-02 2011-09-09 Nokia Corporation Method and apparatus for upmixing a two-channel audio signal
US20130132085A1 (en) * 2011-02-21 2013-05-23 Gautham J. Mysore Systems and Methods for Non-Negative Hidden Markov Modeling of Signals
CN105657535A (en) * 2015-12-29 2016-06-08 北京搜狗科技发展有限公司 Audio recognition method and device
CN105957537A (en) * 2016-06-20 2016-09-21 安徽大学 Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545240A (en) * 2018-11-19 2019-03-29 清华大学 Sound separation method for human-computer interaction
CN109545240B (en) * 2018-11-19 2022-12-09 清华大学 Sound separation method for man-machine interaction
CN113823305A (en) * 2021-09-03 2021-12-21 深圳市芒果未来科技有限公司 Method and system for suppressing noise of metronome in audio

Also Published As

Publication number Publication date
CN108574911B (en) 2019-10-22

Similar Documents

Publication Publication Date Title
Michelsanti et al. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification
CN106971741B (en) Method and system for voice noise reduction for separating voice in real time
CN110634502B (en) Single-channel voice separation algorithm based on deep neural network
JP5230103B2 (en) Method and system for generating training data for an automatic speech recognizer
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN109767756B (en) Sound characteristic extraction algorithm based on dynamic segmentation inverse discrete cosine transform cepstrum coefficient
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
Yu et al. Adversarial network bottleneck features for noise robust speaker verification
GB2560174A (en) A feature extraction system, an automatic speech recognition system, a feature extraction method, an automatic speech recognition method and a method of train
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
Soe Naing et al. Discrete Wavelet Denoising into MFCC for Noise Suppressive in Automatic Speech Recognition System.
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN108574911B (en) Unsupervised single-microphone speech denoising method and system
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Hossain et al. Dual-transform source separation using sparse nonnegative matrix factorization
Wang et al. Robust speech recognition from ratio masks
Adam et al. Wavelet based Cepstral Coefficients for neural network speech recognition
Nataraj et al. Single channel speech enhancement using adaptive filtering and best correlating noise identification
Rodomagoulakis et al. Improved frequency modulation features for multichannel distant speech recognition
CN108573698B (en) Voice noise reduction method based on gender fusion information
Chowdhury et al. Speech enhancement using k-sparse autoencoder techniques
KR100329596B1 (en) Text-Independent Speaker Identification Using Telephone Speech
Oh et al. Preprocessing of independent vector analysis using feed-forward network for robust speech recognition
Srinivasarao Speech signal analysis and enhancement using combined wavelet Fourier transform with stacked deep learning architecture
Lee et al. Speech coding and noise reduction using ica-based speech features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant