CN105957537A - Voice denoising method and system based on L1/2 sparse constraint convolution non-negative matrix decomposition - Google Patents


Info

Publication number
CN105957537A
Authority
CN
China
Prior art keywords
voice
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610452012.7A
Other languages
Chinese (zh)
Other versions
CN105957537B (en)
Inventor
周健
路成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN201610452012.7A
Publication of CN105957537A
Application granted
Publication of CN105957537B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a speech denoising method and system based on convolutive non-negative matrix factorization (CNMF) with an L1/2 sparsity constraint. For single-channel speech enhancement, the noisy speech signal v(i) is assumed to be the additive mixture of an uncorrelated noise signal n(i) and speech signal s(i), i.e. v(i) = n(i) + s(i). Noise-basis information is first obtained by training on a specific noise with the CNMF method; then, taking the noise basis as prior information, a speech basis is obtained by decomposing the noisy speech with the CNMF_L1/2 method, and the denoised speech is finally synthesized. The method better captures the inter-frame correlation of speech, and the L1/2 regularization term imposes a strong sparsity constraint on the speech-basis coefficient matrix, so the separated speech contains less residual noise. Compared with conventional methods such as spectral subtraction, Wiener filtering and minimum mean-square error log-spectral estimation, the enhanced speech is more intelligible.

Description

A speech denoising method and system based on L1/2 sparse-constrained convolutive non-negative matrix factorization
Technical field
The invention belongs to the field of acoustic signal processing, and specifically relates to a speech denoising method and system based on L1/2 sparse-constrained convolutive non-negative matrix factorization.
Background art
Speech is an important carrier of daily communication, but in real environments it is often corrupted by various noises, making the content hard to understand. Speech enhancement suppresses or removes this noise interference, extracting speech that is as clean as possible from the contaminated signal so that intelligible speech is obtained. Speech enhancement techniques are widely used in speech recognition, speech coding and intelligent communication.
Speech enhancement based on non-negative matrix factorization (NMF) is a parts-based representation: by decomposing the speech signal, a set of basis vectors and a coefficient matrix that represent speech characteristics are obtained. NMF is currently a focus of many researchers. Its basic principle is to compute, according to a cost function, the basis matrix and the corresponding coefficient matrix of each source component, thereby separating the signals. According to the available prior knowledge of the audio signals, NMF can be divided into blind, supervised and semi-blind models, in which, respectively, no source basis matrix is known in advance, the basis matrices of all mixed components are known, or only the basis matrices of some components are known. The cost function mainly comprises two parts: the similarity between the signals before and after separation, and restrictive conditions added according to the characteristics of the processed signal.
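As a concrete illustration of the basic factorization V ≈ WH mentioned above, the following sketch implements the classical Lee-Seung multiplicative updates for the Euclidean cost on a toy matrix. It is an illustrative aid under our own naming, not the patent's implementation.

```python
import numpy as np

def nmf_euclidean(V, r, n_iter=500, seed=0):
    """Plain NMF, V (n x m) ~= W (n x r) @ H (r x m), Euclidean cost,
    Lee-Seung multiplicative updates. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 1e-9
    H = rng.random((r, m)) + 1e-9
    eps = 1e-9
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # coefficient update
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # basis update
    return W, H

# toy check: an exactly rank-2 non-negative matrix is reconstructed closely
V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 1.0, 1.0]])
W, H = nmf_euclidean(V, r=2)
err = np.linalg.norm(V - W @ H)
```

The multiplicative form keeps W and H non-negative automatically, which is why it is the standard workhorse for NMF-style decompositions.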
Chinese patent CN201220541700.2 discloses an audio separation method based on NMF, comprising an auxiliary music/speech discrimination module and an NMF decomposition module. By introducing NMF and exploiting the different audio characteristics of speech and music, the method separates speech audio from music audio within a mixture, obtaining comparatively clean music and speech audio; combining NMF with machine-learning methods, the audio is then classified.
Convolutive non-negative matrix factorization (CNMF) uses a sum of NMF terms with successive shifts to represent the temporally continuous information in a signal; this decomposition describes the time-varying characteristics of speech signals well.
Summary of the invention
The object of the present invention is to provide a speech denoising method and system in which an Lq sparsity constraint is added when the speech signal is decomposed by CNMF, so as to obtain denoised speech of better quality and higher intelligibility.
To achieve this object, the present invention adopts the following technical solution: a method for denoising noisy speech based on L1/2 sparse-constrained convolutive non-negative matrix factorization, characterized in that the noisy speech signal v(i) is assumed to be the additive mixture of a noise signal n(i) and a speech signal s(i) that are mutually uncorrelated, i.e. v(i) = n(i) + s(i); the method comprises the following steps:
Step 1: train on a specific noise with the CNMF method to obtain noise-basis information.
Step 2: taking the noise basis as prior information, decompose the noisy speech with the CNMF_L1/2 method to obtain a speech basis, and finally synthesize the denoised speech.
Said step 1 specifically comprises the following steps:
Step 1.1: apply the short-time Fourier transform (STFT) to the noise to obtain its magnitude spectrum N;
Step 1.2: apply CNMF to the noise magnitude spectrum to obtain the noise basis $W_n$ and the corresponding coefficient matrix $H_n$; the objective function of the decomposition is

$$ D(V\,|\,\Lambda) = \frac{1}{2}\,\lVert V - \Lambda \rVert^{2} \qquad (1) $$

where V is the noise magnitude-spectrum matrix to be decomposed and Λ is the convolutive estimate of V:

$$ \Lambda = \sum_{t=0}^{T_0-1} W_n(t)\,\overset{t\rightarrow}{H_n} \qquad (2) $$

In formula (2), $W(t)$ and $H$ denote the basis matrix at shift t and the coefficient matrix, respectively, and $\overset{t\rightarrow}{H}$ denotes shifting the columns of the matrix t steps to the right, filling the vacated columns on the left with zeros.
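The column-shift operator of formula (2) can be stated concretely. The sketch below (the function name is ours) shifts the columns of H right by t steps, zero-filling the vacated left-hand columns; it assumes 0 ≤ t ≤ the number of columns.

```python
import numpy as np

def shift_right(H, t):
    """The 't->' operator of formula (2): move the columns of H right by
    t steps and fill the vacated left-hand columns with zeros."""
    out = np.zeros_like(H)
    if t == 0:
        out[:] = H
    else:
        out[:, t:] = H[:, :H.shape[1] - t]
    return out

H = np.array([[1, 2, 3],
              [4, 5, 6]])
# shift_right(H, 1) moves each column one step right and zero-fills the left
```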
In said step 1.2, since objective function (1) is convex in W with H fixed and convex in H with W fixed, W and H can be updated alternately; gradient descent yields the update equations:

$$ W^{n}_{ik}(t) \leftarrow W^{n}_{ik}(t)\cdot \frac{\sum_{j=1}^{T}\bigl(V_{ij} + \overline{W}^{n}_{ik}(t)\,\overline{W}^{n}_{ki}(t)\,\Lambda_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{n}_{jk}}{\sum_{j=1}^{T}\bigl(\Lambda_{ij} + \overline{W}^{n}_{ik}(t)\,\overline{W}^{n}_{ki}(t)\,V_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{n}_{jk}} \qquad (3) $$

$$ H^{n}_{kj} \leftarrow H^{n}_{kj}\cdot \frac{\sum_{i=1}^{M} \overline{W}^{n}_{ki}(t)\,V_{ij}}{\sum_{i=1}^{M} \overline{W}^{n}_{ki}(t)\,\overset{\leftarrow t}{\Lambda}_{ij}} \qquad (4) $$
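The alternating scheme of step 1.2 can be sketched as follows. This uses the standard Euclidean multiplicative update rules for convolutive NMF (Smaragdis-style), which follow the same alternating W(t)/H scheme but are not guaranteed to match equations (3)-(4) term for term; all names and shapes are illustrative.

```python
import numpy as np

def shift_cols(M, t):
    """Shift columns right (t > 0) or left (t < 0), zero-filling."""
    out = np.zeros_like(M)
    m = M.shape[1]
    if t == 0:
        out[:] = M
    elif t > 0:
        out[:, t:] = M[:, :m - t]
    else:
        out[:, :m + t] = M[:, -t:]
    return out

def cnmf_train(V, r, T0, n_iter=150, seed=0):
    """Convolutive NMF, V ~= sum_t W[t] @ shift_cols(H, t), fitted with
    standard Euclidean multiplicative updates (a hedged sketch)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((T0, n, r)) + 1e-9
    H = rng.random((r, m)) + 1e-9
    eps = 1e-9
    for _ in range(n_iter):
        Lam = sum(W[t] @ shift_cols(H, t) for t in range(T0))
        for t in range(T0):                      # update each basis slice W(t)
            Ht = shift_cols(H, t)
            W[t] *= (V @ Ht.T) / (Lam @ Ht.T + eps)
        Lam = sum(W[t] @ shift_cols(H, t) for t in range(T0))
        num = sum(W[t].T @ shift_cols(V, -t) for t in range(T0))
        den = sum(W[t].T @ shift_cols(Lam, -t) for t in range(T0))
        H *= num / (den + eps)                   # one shared coefficient matrix
    return W, H

# toy check: fit a small exact two-shift convolutive model
rng = np.random.default_rng(2)
W0 = rng.random((2, 5, 3))
H0 = rng.random((3, 12))
V = sum(W0[t] @ shift_cols(H0, t) for t in range(2))
W, H = cnmf_train(V, r=3, T0=2)
Lam = sum(W[t] @ shift_cols(H, t) for t in range(2))
rel_err = np.linalg.norm(V - Lam) / np.linalg.norm(V)
```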
Said step 2 specifically comprises the following steps:

Step 2.1: apply the STFT to the noisy speech; in the time-frequency domain the following non-negative matrix sum is obtained:

$$ V = S + N \qquad (5) $$

where V, S and N are the magnitude-spectrum matrices of the noisy speech, the clean speech and the noise, respectively; the phase information of the speech spectrum is obtained at the same time.

Applying convolutive non-negative matrix factorization to the right-hand side of (5) gives

$$ V = \sum_{t=0}^{T-1} \begin{bmatrix} W^{n}_{t} & W^{s}_{t} \end{bmatrix} \begin{bmatrix} \overset{t\rightarrow}{H^{n}} \\ \overset{t\rightarrow}{H^{s}} \end{bmatrix} = \sum_{t=0}^{T-1}\sum_{k=1}^{R} W^{n}_{ik}(t)\,\overset{t\rightarrow}{H}{}^{n}_{kj} + \sum_{t=0}^{T-1}\sum_{k=1}^{R} W^{s}_{ik}(t)\,\overset{t\rightarrow}{H}{}^{s}_{kj} \qquad (6) $$

where $W_s$ and $H_s$ denote the speech basis and its corresponding coefficient matrix, and $W_n$ and $H_n$ the noise basis and its corresponding coefficient matrix.

Step 2.2: combining the noise basis $W_n$ obtained in step 1.2, apply CNMF_L1/2 to the magnitude-spectrum matrix of the noisy speech to obtain the speech basis $W_s$, the speech coefficient matrix $H_s$ and a new noise-basis coefficient matrix $\overline{H}_n$; the objective function of the CNMF_L1/2 decomposition is

$$ D(V\,|\,W_n, W_s, H_n, H_s) = \frac{1}{2}\,\Bigl\lVert V - \sum_{t=0}^{T-1} W_n(t)\,\overset{t\rightarrow}{H_n} - \sum_{t=0}^{T-1} W_s(t)\,\overset{t\rightarrow}{H_s} \Bigr\rVert^{2} + \lambda\,\lVert H_s \rVert_{1/2}^{1/2} \qquad (7) $$

Step 2.3: synthesize the magnitude spectrum S of the denoised speech from the speech basis $W_s$ and coefficient matrix $H_s$ obtained in step 2.2 together with the phase information, as

$$ S = \sum_{t=0}^{T-1} W_s(t)\,\overset{t\rightarrow}{H_s} \qquad (8) $$

Step 2.4: apply the inverse STFT to the magnitude spectrum S of the denoised speech (with the phase information) to obtain the enhanced speech signal.
In said step 2.2, objective function (7) is solved by alternating updates, namely:

1st step: fix $W_n$ and $H_s$, and update $W_s$;
2nd step: fix $W_s$, $W_n$ and $\overline{H}_n$, and update $H_s$;
3rd step: fix $W_s$, $H_s$ and $W_n$, and update $\overline{H}_n$.

Since (7) is convex in the variable updated at each of these steps, gradient descent yields the update rules:

$$ W^{s}_{ik}(t) \leftarrow W^{s}_{ik}(t)\cdot \frac{\sum_{j=1}^{T}\bigl(V_{ij} + \overline{W}^{s}_{ik}(t)\,\overline{W}^{s}_{ki}(t)\,\Lambda_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{s}_{jk}}{\sum_{j=1}^{T}\bigl(\Lambda_{ij} + \overline{W}^{s}_{ik}(t)\,\overline{W}^{s}_{ki}(t)\,V_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{s}_{jk}} \qquad (9) $$

$$ H^{s}_{kj} \leftarrow H^{s}_{kj}\cdot \frac{\sum_{i=1}^{M} \overline{W}^{s}_{ki}(t)\,V_{ij}}{\sum_{i=1}^{M} \Bigl( \overline{W}^{s}_{ki}(t)\,\overset{\leftarrow t}{\Lambda}_{ij} + \frac{\lambda}{2}\,(H^{s}_{kj})^{-1/2} \Bigr)} \qquad (10) $$

$$ \overline{H}^{n}_{kj} \leftarrow \overline{H}^{n}_{kj}\cdot \frac{\sum_{i=1}^{M} \overline{W}^{n}_{ik}(t)\,V_{ij}}{\sum_{i=1}^{M} \overline{W}^{n}_{ik}(t)\,\overset{\leftarrow t}{\Lambda}_{ij}} \qquad (11) $$
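The effect of the L1/2 term in the $H_s$ update can be seen in a single step: relative to the unpenalized rule, the denominator gains an extra (λ/2)·H_s^(-1/2) term, so every coefficient is shrunk at least as hard. The data and shapes below are made up for illustration only.

```python
import numpy as np

# One multiplicative update of the speech coefficients H_s, with and without
# the L1/2 penalty term in the denominator (cf. the penalized update above).
rng = np.random.default_rng(1)
M, R, N = 6, 3, 8
V   = rng.random((M, N)) + 0.1   # "noisy speech" magnitude (toy)
Ws  = rng.random((M, R)) + 0.1   # speech basis (toy)
Hs  = rng.random((R, N)) + 0.1   # current speech coefficients (toy)
Lam = Ws @ Hs                    # current model estimate

def update_Hs(Hs, lam):
    num = Ws.T @ V
    den = Ws.T @ Lam + (lam / 2.0) * Hs ** (-0.5)   # L1/2 penalty term
    return Hs * num / den

H_plain  = update_Hs(Hs, lam=0.0)   # unpenalized multiplicative step
H_sparse = update_Hs(Hs, lam=0.5)   # L1/2-penalized step
```

Because the penalty only enlarges the denominator, the penalized step is elementwise smaller than the plain one, which is exactly the shrinkage that drives $H_s$ toward sparsity over many iterations.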
A noisy-speech denoising system based on L1/2 sparse-constrained convolutive non-negative matrix factorization, comprising:

an STFT module, for applying the short-time Fourier transform to the specific noise and to the noisy speech to obtain their magnitude spectra;

a noise training module, for training on the specific noise with the CNMF method to obtain noise-basis information;

a speech decomposition module, for decomposing the noisy speech with the CNMF_L1/2 method, taking the trained noise basis as prior information, to obtain a speech basis;

a speech synthesis module, for synthesizing the magnitude spectrum of the denoised speech from the obtained speech basis and the phase information;

a spectrum conversion module, for applying the inverse STFT to the magnitude spectrum of the denoised speech to obtain the enhanced speech signal.
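The module decomposition above can be sketched structurally. Only the two transform modules are fleshed out here, with a trivial non-overlapping frame-wise DFT standing in for the STFT; the training, decomposition and synthesis modules correspond to the update rules given earlier. All function names are our own.

```python
import numpy as np

def stft_module(x, frame=8):
    """STFT module: frame the signal and return magnitude and phase
    (a toy frame-wise DFT stands in for a real overlapped STFT)."""
    n = len(x) // frame
    frames = x[:n * frame].reshape(n, frame)
    spec = np.fft.rfft(frames, axis=1).T       # bins x frames
    return np.abs(spec), np.angle(spec)

def spectrum_conversion_module(mag, phase, frame=8):
    """Spectrum conversion module: magnitude + phase back to a waveform."""
    spec = (mag * np.exp(1j * phase)).T
    return np.fft.irfft(spec, n=frame, axis=1).ravel()

# round trip: with an unmodified magnitude spectrum the waveform is recovered,
# mirroring how the system reuses the noisy phase at synthesis time
x = np.sin(np.linspace(0.0, 20.0, 64))
mag, phase = stft_module(x)
y = spectrum_conversion_module(mag, phase)
```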
Non-negative matrix factorization (NMF) is a special basis decomposition in which all matrix elements are required to be non-negative: a given non-negative matrix $V \in \mathbb{R}_{\ge 0}^{n\times m}$ is decomposed by NMF into two non-negative matrices $W \in \mathbb{R}_{\ge 0}^{n\times r}$ and $H \in \mathbb{R}_{\ge 0}^{r\times m}$ such that $V \approx WH$. NMF generally treats the columns of V as mutually independent, without taking into account the time-varying characteristics of signals such as speech, in which adjacent columns (frames) are often correlated. To describe this inter-frame relation, a sum of NMF terms with successive shifts is used to represent the temporally continuous information in the signal, which gives rise to convolutive non-negative matrix factorization (CNMF), an extension of NMF.

The bases produced by NMF and CNMF when decomposing the speech magnitude-spectrum matrix V tend to be sparse. Adding a sparsity constraint on W or H not only yields sparser bases but also allows the degree of sparsity to be traded off against the reconstruction error. Since each row of the coefficient matrix H corresponds to a column of the basis matrix W, imposing a sparsity constraint on H correspondingly produces a sparser basis W; according to sparse-representation theory, the original signal can then be represented with fewer bases from the dictionary.
The objective function of CNMF with a sparsity constraint on H can be expressed as

$$ L(V\,\|\,\Lambda) = \frac{1}{2}\,\lVert V - \Lambda \rVert^{2} + \lambda\,\lVert H \rVert_{q}^{q} \qquad (12) $$

where $\lambda \in \mathbb{R}_{\ge 0}$ is a regularization parameter balancing the degree of sparsity against the reconstruction error, and q = 0, 1/2, 1, 2 corresponds to $L_0$, $L_{1/2}$, $L_1$ and $L_2$ regularization, respectively.
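The behaviour of the $\lVert H \rVert_q^q$ penalty for different q can be checked numerically: for two coefficient vectors with the same L1 mass, the L1 penalty cannot tell them apart, L2 actually prefers the dense one, and L1/2 favours the sparse one. A small sketch:

```python
import numpy as np

def lq_penalty(h, q):
    """The sparsity penalty ||h||_q^q = sum_i |h_i|^q used in (12)."""
    return float(np.sum(np.abs(h) ** q))

dense  = np.array([0.5, 0.5, 0.5, 0.5])  # mass spread over four entries
sparse = np.array([2.0, 0.0, 0.0, 0.0])  # the same L1 mass on one entry

l1_dense, l1_sparse = lq_penalty(dense, 1.0), lq_penalty(sparse, 1.0)
lh_dense, lh_sparse = lq_penalty(dense, 0.5), lq_penalty(sparse, 0.5)
l2_dense, l2_sparse = lq_penalty(dense, 2.0), lq_penalty(sparse, 2.0)
# L1 ties the two vectors; L1/2 penalizes the dense one more heavily,
# while L2 penalizes the sparse one more heavily
```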
The constraint on H in (12) that requires as many elements of H as possible to be 0 is called $L_0$ regularization. However, $L_0$ regularization is a non-convex NP-hard problem whose optimum cannot be found directly. To address this, $L_1$ regularization has been proposed as an approximation to $L_0$ regularization. CNMF with an $L_1$ sparsity constraint (hereinafter "$L_1$_CNMF") is expressed as

$$ L(V\,\|\,\Lambda) = \frac{1}{2}\,\lVert V - \Lambda \rVert^{2} + \lambda\,\lVert H \rVert_{1} \qquad (13) $$

Although $L_1$ and $L_2$ regularization can be turned into convex optimization problems, their solutions are not necessarily sufficiently sparse. In particular, for 0 < q < 1, $L_q$ regularization produces sparser solutions than $L_1$; for 1/2 ≤ q < 1, the smaller q is, the sparser the solution, while for 0 < q ≤ 1/2 the degree of sparsity differs little from that of $L_{1/2}$.

In the denoising process, if the original speech signal can be represented with a sparser basis, less noise is carried in the denoising stage, so the denoised speech is clearer and easier to understand. The sparser the coefficient matrix H of the speech basis, the fewer speech bases are needed to reconstruct the original speech signal. The present invention therefore applies the sparser $L_{1/2}$ regularization constraint to H, and objective function (12) is rewritten as

$$ L(V\,\|\,\Lambda) = \frac{1}{2}\,\lVert V - \Lambda \rVert^{2} + \lambda\,\lVert H \rVert_{1/2}^{1/2} \qquad (14) $$

By adding the $L_q$ sparsity constraint when the speech signal is decomposed by CNMF, a sparser speech-basis representation of the original speech is obtained. When the enhanced speech is synthesized from the obtained speech basis, fewer noise bases are carried, yielding speech of better quality and higher intelligibility.

The method of the invention better captures the inter-frame correlation of speech, and the $L_{1/2}$ regularization term imposes a strong sparsity constraint on the speech-basis coefficient matrix, so the separated speech contains less residual noise. Compared with conventional methods such as spectral subtraction, Wiener filtering and minimum mean-square error log-spectral amplitude estimation, the intelligibility of the enhanced speech is improved.
Brief description of the drawings
Fig. 1 is the flow chart of denoising a noisy speech signal according to the invention.
Fig. 2 is the flow chart of the noise training process of step 1.
Fig. 3 is the flow chart of the speech enhancement performed on the noisy speech in step 2.
Fig. 4 is the convergence curve of the L1/2 sparse-constrained CNMF denoising method on speech enhancement.
Fig. 5 shows the coefficient matrix H and speech basis W of the speech signal in the enhancement stage; (a) and (b) show H and W when the speech signal is decomposed by CNMF_L1 and CNMF_L1/2, respectively.
Fig. 6 shows the STOI values of six different methods under different noise environments.
Fig. 7 shows the SegSNR improvements of the six methods under different noise environments.
Detailed description of the invention
The present invention is further described below with reference to specific embodiments and the accompanying drawings.
The present invention is a speech denoising method based on L1/2 sparse-constrained convolutive non-negative matrix factorization (hereinafter "CNMF_L1/2"). Fig. 1 is the overall flow chart of the denoising of the invention. The overall input is a particular type of noise and the speech mixed with that noise, where the noise may be of different types (e.g. stationary or non-stationary); the output is the denoised speech.
Fig. 2 is the flow chart of the noise training process of step 1.
Step 1.1: apply the short-time Fourier transform (STFT) to the noise to obtain its magnitude spectrum N.

Step 1.2: apply CNMF to the noise magnitude spectrum to obtain the noise basis $W_n$ and the corresponding coefficient matrix $H_n$; the objective function of the decomposition is

$$ D(V\,|\,\Lambda) = \frac{1}{2}\,\lVert V - \Lambda \rVert^{2} \qquad (1) $$

where V is the noise magnitude-spectrum matrix to be decomposed and Λ is the convolutive estimate of V:

$$ \Lambda = \sum_{t=0}^{T_0-1} W_n(t)\,\overset{t\rightarrow}{H_n} \qquad (2) $$

In formula (2), $W(t)$ and $H$ denote the basis matrix at shift t and the coefficient matrix, respectively, and $\overset{t\rightarrow}{H}$ denotes shifting the columns of the matrix t steps to the right, filling the vacated columns on the left with zeros.

Since (1) is convex in W with H fixed and convex in H with W fixed, W and H can be updated alternately; gradient descent yields the update equations:

$$ W^{n}_{ik}(t) \leftarrow W^{n}_{ik}(t)\cdot \frac{\sum_{j=1}^{T}\bigl(V_{ij} + \overline{W}^{n}_{ik}(t)\,\overline{W}^{n}_{ki}(t)\,\Lambda_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{n}_{jk}}{\sum_{j=1}^{T}\bigl(\Lambda_{ij} + \overline{W}^{n}_{ik}(t)\,\overline{W}^{n}_{ki}(t)\,V_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{n}_{jk}} \qquad (3) $$

$$ H^{n}_{kj} \leftarrow H^{n}_{kj}\cdot \frac{\sum_{i=1}^{M} \overline{W}^{n}_{ki}(t)\,V_{ij}}{\sum_{i=1}^{M} \overline{W}^{n}_{ki}(t)\,\overset{\leftarrow t}{\Lambda}_{ij}} \qquad (4) $$
Fig. 3 is the flow chart of obtaining the denoised speech from the noisy speech in step 2. Single-channel speech enhancement usually assumes that the noisy speech signal v(i) is the additive mixture of a noise signal n(i) and a speech signal s(i) that are mutually uncorrelated, i.e. v(i) = n(i) + s(i).

Step 2.1: apply the STFT to the noisy speech; in the time-frequency domain the following non-negative matrix sum is obtained:

$$ V = S + N \qquad (5) $$

where V, S and N are the magnitude-spectrum matrices of the noisy speech, the clean speech and the noise, respectively. The phase information of the speech spectrum is obtained at the same time.

Applying convolutive non-negative matrix factorization to the right-hand side of (5) gives

$$ V = \sum_{t=0}^{T-1} \begin{bmatrix} W^{n}_{t} & W^{s}_{t} \end{bmatrix} \begin{bmatrix} \overset{t\rightarrow}{H^{n}} \\ \overset{t\rightarrow}{H^{s}} \end{bmatrix} = \sum_{t=0}^{T-1}\sum_{k=1}^{R} W^{n}_{ik}(t)\,\overset{t\rightarrow}{H}{}^{n}_{kj} + \sum_{t=0}^{T-1}\sum_{k=1}^{R} W^{s}_{ik}(t)\,\overset{t\rightarrow}{H}{}^{s}_{kj} \qquad (6) $$

where $W_s$ and $H_s$ denote the speech basis and its corresponding coefficients, and $W_n$ and $H_n$ the noise basis and its corresponding coefficients.
Step 2.2: combining the noise basis $W_n$ obtained by the training module, apply CNMF_L1/2 to the magnitude-spectrum matrix of the noisy speech to obtain the speech basis $W_s$, the speech coefficient matrix $H_s$ and a new noise-basis coefficient matrix $\overline{H}_n$; the objective function of the CNMF_L1/2 decomposition is

$$ D(V\,|\,W_n, W_s, H_n, H_s) = \frac{1}{2}\,\Bigl\lVert V - \sum_{t=0}^{T-1} W_n(t)\,\overset{t\rightarrow}{H_n} - \sum_{t=0}^{T-1} W_s(t)\,\overset{t\rightarrow}{H_s} \Bigr\rVert^{2} + \lambda\,\lVert H_s \rVert_{1/2}^{1/2} \qquad (7) $$

Formula (7) requires solving for $W_s$, $H_s$ and the new noise-basis coefficient matrix $\overline{H}_n$, which is done by alternating updates, namely:

1st step: fix $W_n$ and $H_s$, and update $W_s$;
2nd step: fix $W_s$, $W_n$ and $\overline{H}_n$, and update $H_s$;
3rd step: fix $W_s$, $H_s$ and $W_n$, and update $\overline{H}_n$.

Since (7) is convex in the variable updated at each of these steps, gradient descent yields the update rules:

$$ W^{s}_{ik}(t) \leftarrow W^{s}_{ik}(t)\cdot \frac{\sum_{j=1}^{T}\bigl(V_{ij} + \overline{W}^{s}_{ik}(t)\,\overline{W}^{s}_{ki}(t)\,\Lambda_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{s}_{jk}}{\sum_{j=1}^{T}\bigl(\Lambda_{ij} + \overline{W}^{s}_{ik}(t)\,\overline{W}^{s}_{ki}(t)\,V_{ij}\bigr)\,\overset{t\rightarrow}{H}{}^{s}_{jk}} \qquad (8) $$

$$ H^{s}_{kj} \leftarrow H^{s}_{kj}\cdot \frac{\sum_{i=1}^{M} \overline{W}^{s}_{ki}(t)\,V_{ij}}{\sum_{i=1}^{M} \Bigl( \overline{W}^{s}_{ki}(t)\,\overset{\leftarrow t}{\Lambda}_{ij} + \frac{\lambda}{2}\,(H^{s}_{kj})^{-1/2} \Bigr)} \qquad (9) $$

$$ \overline{H}^{n}_{kj} \leftarrow \overline{H}^{n}_{kj}\cdot \frac{\sum_{i=1}^{M} \overline{W}^{n}_{ik}(t)\,V_{ij}}{\sum_{i=1}^{M} \overline{W}^{n}_{ik}(t)\,\overset{\leftarrow t}{\Lambda}_{ij}} \qquad (10) $$

Step 2.3: synthesize the magnitude spectrum S of the denoised speech from the speech basis $W_s$ and coefficient matrix $H_s$ obtained in step 2.2 together with the phase information, as

$$ S = \sum_{t=0}^{T-1} W_s(t)\,\overset{t\rightarrow}{H_s} \qquad (11) $$

Step 2.4: apply the inverse STFT to the magnitude spectrum S of the denoised speech to obtain the enhanced speech signal.
A noisy-speech denoising system based on L1/2 sparse-constrained convolutive non-negative matrix factorization, characterized by comprising: an STFT module, for applying the short-time Fourier transform to the specific noise and the noisy speech to obtain their magnitude spectra; a noise training module, for training on the specific noise with the CNMF method to obtain noise-basis information; a speech decomposition module, for decomposing the noisy speech with the CNMF_L1/2 method, taking the trained noise basis as prior information, to obtain a speech basis; a speech synthesis module, for synthesizing the magnitude spectrum of the denoised speech from the obtained speech basis and the phase information; and a spectrum conversion module, for applying the inverse STFT to the magnitude spectrum of the denoised speech to obtain the enhanced speech signal.
The beneficial effects of the denoising method of the invention on noisy speech are analyzed by simulation experiments.

The experimental hardware is a Core i5 at 3.2 GHz with 4 GB of memory; the simulation software is Matlab 2013a. To verify the effectiveness of the proposed method, utterances from the TIMIT corpus are chosen as clean speech: 25 male and 25 female sentences, each about 3 s long. Noises from the NOISEX-92 database are chosen as experimental data, including both stationary and non-stationary noises: Babble, F16, White and M109. Both the clean speech and the noise are sampled at 8 kHz with 16-bit precision. In the experiments, clean speech and noise are mixed at several signal-to-noise ratios (SNR = -5 dB, 0 dB, 5 dB, 10 dB). When computing the noise spectrum and the noisy-speech spectrum, all signals are divided into frames of 512 samples with 50% overlap between frames, and a 512-point discrete Fourier transform is applied to each frame.
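The framing described above (512-sample frames, 50% overlap, 512-point DFT, 8 kHz input) can be sketched as follows; the windowing choice (Hann) is our assumption, as the text does not state the analysis window.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split x into 512-sample frames with 50% overlap (hop = 256)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def magnitude_spectrogram(x, frame_len=512, hop=256):
    """Windowed 512-point DFT per frame; returns bins x frames magnitudes."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, n=frame_len, axis=1)).T

fs = 8000                                              # 8 kHz, as stated
x = np.random.default_rng(0).standard_normal(3 * fs)   # ~3 s test signal
V = magnitude_spectrogram(x)                           # 257 bins x frames
```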
To demonstrate the advantages of the invention, three related methods are compared: CNMF without sparsity constraint, CNMF with an L1 sparsity constraint, and the proposed CNMF with an L1/2 sparsity constraint (hereinafter CNMF, CNMF_L1 and CNMF_L1/2, respectively). To verify the practicality of the proposed speech enhancement method, three conventional speech enhancement methods are also included for comparison: spectral subtraction, Wiener filtering based on the a priori SNR, and minimum mean-square error log-spectral amplitude estimation (hereinafter PS, Wiener and logMMSE), and the differences in their speech-enhancement performance are compared.

Three parameters mainly affect the performance of the method: the number R of time-frequency bases, the number of iterations Iter, and the sparsity coefficient λ. The time-frequency bases represent the features of the speech signal, and R is chosen empirically so that the number of bases corresponds to the phonemes in the speech. Since each clean utterance used in this experiment contains about 12 phonemes, R is set to 12. As for the number of iterations, Fig. 4 shows the convergence curve of the proposed L1/2 sparse-constrained CNMF method on speech enhancement; the objective value is essentially stable after 200 iterations, experimentally confirming convergence, so Iter is set to 200 in subsequent experiments. The sparsity coefficient λ is an important parameter balancing the cost function against the degree of sparsity; by adjusting λ, the enhancement stage can trade off between removing residual noise and reducing speech distortion. A speech basis that is either under-sparse or over-sparse degrades the enhanced speech; experiments on speech intelligibility under different values of λ yield the empirical value 0.01.
Experimental results
Two evaluation criteria are used to objectively evaluate the speech-enhancement performance of the invention. The first concerns speech intelligibility: the short-time objective intelligibility (STOI) score of the denoised speech. Intelligibility evaluation mainly measures the degree to which the information carried by the speech can be understood. STOI is an intelligibility metric used to measure the intelligibility performance of speech enhancement methods; its value lies in (0, 1), and the larger the value, the higher the intelligibility of the enhanced speech.

The second concerns speech quality: the segmental signal-to-noise ratio (SegSNR) of the denoised speech. Quality evaluation mainly measures the listening comfort, naturalness and pleasantness of the speech. SegSNR is a common objective evaluation for speech enhancement methods, mainly used to measure the waveform distortion of the enhanced speech relative to the clean speech.
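A minimal SegSNR computation along these lines is sketched below; the frame length and the conventional [-10, 35] dB clipping range are our assumptions, not taken from the text.

```python
import numpy as np

def seg_snr(clean, enhanced, frame=256, lo=-10.0, hi=35.0):
    """Segmental SNR in dB: frame-wise SNR of the clean signal against the
    enhancement error, clipped to [lo, hi] dB and averaged over frames."""
    n = min(len(clean), len(enhanced)) // frame * frame
    c = clean[:n].reshape(-1, frame)
    e = enhanced[:n].reshape(-1, frame)
    num = np.sum(c ** 2, axis=1)
    den = np.sum((c - e) ** 2, axis=1) + 1e-12   # guard against zero error
    snr = 10.0 * np.log10(num / den)
    return float(np.mean(np.clip(snr, lo, hi)))

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 100.0, 4096))
noisy = clean + 0.1 * rng.standard_normal(4096)
score = seg_snr(clean, noisy)   # a lightly corrupted signal scores high
```

The per-frame clipping keeps silent frames (near-zero clean energy) and perfect frames from dominating the average, which is why SegSNR is preferred over a single global SNR for speech.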
The invention uses an L1/2 regularization constraint. Compared with the L1 constraint, imposing the L1/2 sparsity restriction on the coefficient matrix H in the objective function when decomposing the noisy speech produces a sparser H, and hence a sparser speech basis. To verify the effect of CNMF_L1/2 on sparsity, the coefficient matrix H and speech basis W of the speech signal in the enhancement stage are shown in Fig. 5; Fig. 5(a) and (b) show H and W when the speech signal is decomposed by CNMF_L1 and CNMF_L1/2, respectively, with the coefficient matrix H in the upper half of each figure and the 12 speech bases in the lower half. Fig. 5 shows that, compared with CNMF_L1, the H of CNMF_L1/2 is sparser.

Fig. 6 shows the STOI values of the six different methods under different noise environments, where UN denotes the original noisy speech. Fig. 6 shows that, compared with the two similar methods CNMF and CNMF_L1, the CNMF_L1/2 method of the invention has an advantage in STOI, indicating that using a sparser speech basis to represent clean speech carries fewer noise bases during denoising, so the enhanced speech is more intelligible. Meanwhile, compared with the classical speech enhancement methods PS, Wiener and logMMSE, the STOI of the proposed method improves markedly, with notably higher intelligibility below 0 dB. This is because the proposed method first obtains a noise dictionary through supervised dictionary learning; when decomposing the noisy speech in the enhancement stage it is more sensitive to the characteristics of the noise and insensitive to the noise energy, so more noise bases can be separated out and the intelligibility of the enhanced speech is better.

Fig. 7 shows the SegSNR improvement of the six methods under different noise environments (computed by subtracting the SegSNR of the speech before enhancement from the SegSNR after enhancement); the larger the improvement, the better the speech quality. Fig. 7 shows that, in terms of the quality of the enhanced speech, the CNMF_L1/2 method outperforms the similar CNMF and CNMF_L1 methods. Compared with the classical methods it has a slight advantage at low SNR, while at high SNR its effect is slightly inferior. This is probably because at high SNR there are fewer noise bases in the noisy speech; although the trained noise dictionary is used, some speech bases are dragged away when the noise is separated, degrading the quality of the enhanced speech.
Performance evaluation
The CNMF_L1/2 speech enhancement method needs to train on the noise in advance to obtain the noise dictionary as prior information; during noise training one iteration takes about 0.1 s. In the enhancement stage, the noisy speech is decomposed by CNMF with the L1/2 sparsity constraint; the solution uses multiplicative update rules, and one iteration takes about 0.15 s.

In addition, since CNMF_L1/2 is a dictionary-learning-based method that first obtains a noise dictionary for speech enhancement, it is more sensitive to the characteristic information of the noise and can separate the noise bases more effectively when decomposing the noisy speech. At low SNR, more noise bases can be separated out; at high SNR, some speech bases may be lost.
Conclusion
The present invention proposes a single-channel speech enhancement method based on L1/2 sparse-constrained CNMF. The CNMF_L1/2 method makes full use of the advantage of the convolutive non-negative matrix model in describing the temporal characteristics of speech signals, and applies an L1/2 regularization term as a sparsity constraint on the coefficient matrix H. Using the better-performing L1/2 sparsity constraint in the decomposition yields a sparser speech basis to represent clean speech; moreover, the supervised enhancement approach used here, with the noise basis obtained in the noise training stage, helps separate the speech signal from the noise signal more effectively in the enhancement stage. Simulation experiments under four different noise types and several SNRs demonstrate that the CNMF_L1/2 method performs well in speech enhancement. The experiments show that the method works well in low-SNR environments; future work will further study the applicability of CNMF_L1/2 under low-SNR conditions, and the empirically set sparsity coefficient λ will be studied in more depth.
The above are only preferred embodiments of the present invention and do not limit the invention in any form. Any simple modification, equivalent variation, or alteration made to the above embodiments according to the technical and methodological essence of the present invention still falls within the scope of the technical solution of the present invention.

Claims (6)

1. A method for denoising noisy speech based on L1/2 sparse-constrained convolutive non-negative matrix factorization, characterized in that: assuming the noisy speech signal v(i) is the sum of a noise signal n(i) and a speech signal s(i) that are additive and uncorrelated, i.e. v(i) = n(i) + s(i), the method for denoising noisy speech comprises the following steps:
Step 1: train on a specific noise using the CNMF method to obtain noise-basis information;
Step 2: using the noise bases as prior information, decompose the noisy speech with the CNMF_L1/2 method to obtain speech bases, and finally synthesize the denoised speech.
2. The method for denoising noisy speech according to claim 1, characterized in that step 1 specifically comprises the following steps:
Step 1.1: apply the short-time Fourier transform (STFT) to the noise to obtain its magnitude spectrum N;
Step 1.2: decompose the noise magnitude spectrum by CNMF to obtain the noise bases $W_n$ and the corresponding coefficient matrix $H_n$; the objective function of the decomposition is as follows:
$$D(V \mid \Lambda) = \frac{1}{2} \lVert V - \Lambda \rVert^{2} \qquad (1)$$
where V is the noise magnitude-spectrum matrix to be decomposed, and $\Lambda$ is the convolutive estimate of V:
$$\Lambda = \sum_{t=0}^{T_{0}-1} W_{n}(t)\, \overset{t\rightarrow}{H_{n}} \qquad (2)$$
In formula (2), $W(t)$ and $H$ denote the basis matrix for shift $t$ and the coefficient matrix, respectively, and $\overset{t\rightarrow}{(\cdot)}$ shifts the columns of a matrix $t$ steps to the right, filling the vacated columns on the left with zeros.
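The column-shift operator and the convolutive estimate of formula (2) can be sketched in a few lines of NumPy; this is an illustrative helper, not code from the patent (the function names `shift_right` and `conv_estimate` are our own):

```python
import numpy as np

def shift_right(H, t):
    """The t-> operator of formula (2): shift the columns of H t steps
    to the right, filling the vacated columns on the left with zeros."""
    out = np.zeros_like(H)
    out[:, t:] = H[:, :H.shape[1] - t]
    return out

def conv_estimate(W, H):
    """Convolutive estimate Lambda = sum_t W(t) @ (H shifted right by t),
    formula (2). W has shape (T0, M, R); H has shape (R, N)."""
    return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))
```

For t = 0 the shift is the identity, so a one-frame model (T0 = 1) reduces to ordinary NMF, V ≈ W(0)H.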
3. The method for denoising noisy speech according to claim 2, characterized in that: since objective function (1) in step 1.2 is convex with respect to W and to H separately, W and H can be updated alternately, and gradient descent yields the update equations:
$$W^{n}_{ik}(t) \leftarrow W^{n}_{ik}(t)\cdot\frac{\sum_{j=1}^{T}\left(V_{ij}+\bar{W}^{n}_{ik}(t)\,\bar{W}^{n}_{ki}(t)\,\Lambda_{ij}\right)\overset{t\rightarrow}{H}{}^{n}_{jk}}{\sum_{j=1}^{T}\left(\Lambda_{ij}+\bar{W}^{n}_{ik}(t)\,\bar{W}^{n}_{ki}(t)\,V_{ij}\right)\overset{t\rightarrow}{H}{}^{n}_{jk}} \qquad (3)$$

$$H^{n}_{kj} \leftarrow H^{n}_{kj}\cdot\frac{\sum_{i=1}^{M}\bar{W}^{n}_{ki}(t)\,V_{ij}}{\sum_{i=1}^{M}\bar{W}^{n}_{ki}(t)\,\overset{\leftarrow t}{\Lambda}_{ij}} \qquad (4)$$
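A minimal noise-training loop in the spirit of step 1.2 can be sketched as follows. For simplicity it uses the standard multiplicative CNMF updates rather than the patent's exact rules (3)-(4); `R`, `T0`, and `n_iter` are illustrative parameters:

```python
import numpy as np

def shift_right(H, t):
    """Shift columns of H t steps right, zero-filling on the left."""
    out = np.zeros_like(H)
    out[:, t:] = H[:, :H.shape[1] - t]
    return out

def shift_left(X, t):
    """Shift columns of X t steps left, zero-filling on the right."""
    out = np.zeros_like(X)
    out[:, :X.shape[1] - t] = X[:, t:]
    return out

def train_noise_cnmf(V, R, T0, n_iter=100, eps=1e-9):
    """Factor the noise magnitude spectrum V (M x N) into T0 basis
    matrices W[t] (M x R) and a shared coefficient matrix H (R x N)
    under the squared-error objective (1), alternating W(t) and H
    as in claim 3 (standard multiplicative CNMF updates)."""
    M, N = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((T0, M, R)) + eps
    H = rng.random((R, N)) + eps
    for _ in range(n_iter):
        Lam = sum(W[t] @ shift_right(H, t) for t in range(T0)) + eps
        for t in range(T0):                       # update each W(t), H fixed
            Ht = shift_right(H, t)
            W[t] *= (V @ Ht.T) / (Lam @ Ht.T + eps)
        Lam = sum(W[t] @ shift_right(H, t) for t in range(T0)) + eps
        num = sum(W[t].T @ shift_left(V, t) for t in range(T0))
        den = sum(W[t].T @ shift_left(Lam, t) for t in range(T0)) + eps
        H *= num / den                            # update H, all W(t) fixed
    return W, H
```

The multiplicative form keeps W and H non-negative throughout, which is why no projection step is needed.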
4. The method for denoising noisy speech according to claim 1, characterized in that step 2 specifically comprises the following steps:
Step 2.1: apply the STFT to the noisy speech to obtain, in the time-frequency domain, the following sum of non-negative matrices:

V = S + N    (5)

where V, S, and N are the magnitude-spectrum matrices of the noisy speech, the clean speech, and the noise, respectively; the phase information of the speech spectrum is obtained at the same time;
Applying convolutive non-negative matrix factorization to the right-hand side of formula (5) gives:
$$V = \sum_{t=0}^{T-1} \begin{bmatrix} W^{n}_{t} & W^{s}_{t} \end{bmatrix} \begin{bmatrix} \overset{t\rightarrow}{H^{n}} \\ \overset{t\rightarrow}{H^{s}} \end{bmatrix} = \sum_{t=0}^{T-1} \sum_{k=1}^{R} W^{n}_{ik}(t)\, \overset{t\rightarrow}{H}{}^{n}_{kj} + \sum_{t=0}^{T-1} \sum_{k=1}^{R} W^{s}_{ik}(t)\, \overset{t\rightarrow}{H}{}^{s}_{kj} \qquad (6)$$
where $W_s$ and $H_s$ denote the speech bases and the corresponding coefficient matrix, and $W_n$ and $H_n$ denote the noise bases and the corresponding coefficient matrix;
Step 2.2: with the noise bases $W_n$ obtained in step 1.2 held fixed, apply the CNMF_L1/2 decomposition to the magnitude-spectrum matrix of the noisy speech to obtain the speech bases $W_s$, the speech coefficient matrix $H_s$, and a new noise-basis coefficient matrix $\bar{H}_n$; the objective function of the CNMF_L1/2 decomposition is as follows:
$$D(V \mid W^{n}, W^{s}, H^{n}, H^{s}) = \frac{1}{2} \left\lVert V - \sum_{t=0}^{T-1} W^{n}(t)\, \overset{t\rightarrow}{H^{n}} - \sum_{t=0}^{T-1} W^{s}(t)\, \overset{t\rightarrow}{H^{s}} \right\rVert^{2} + \lambda \lVert H^{s} \rVert_{1/2}^{1/2} \qquad (7)$$
Step 2.3: synthesize the denoised speech magnitude spectrum S from the speech bases $W_s$, the speech coefficient matrix $H_s$, and the phase information obtained in step 2.2, as follows:
$$S = \sum_{t=0}^{T-1} W^{s}(t)\, \overset{t\rightarrow}{H^{s}} \qquad (8)$$
Step 2.4: apply the inverse STFT to the denoised speech magnitude spectrum S, combined with the phase information, to obtain the enhanced speech signal.
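Steps 2.3-2.4 (synthesis per formula (8), then inverse STFT using the noisy phase) can be sketched as follows; `fs` and `nperseg` are illustrative values, and `scipy.signal.istft` stands in for whatever inverse-STFT routine an implementation uses:

```python
import numpy as np
from scipy.signal import istft

def shift_right(H, t):
    """Shift columns of H t steps right, zero-filling on the left."""
    out = np.zeros_like(H)
    out[:, t:] = H[:, :H.shape[1] - t]
    return out

def synthesize_denoised(Ws, Hs, phase, fs=16000, nperseg=512):
    """Rebuild the denoised magnitude spectrum S per formula (8) from the
    speech bases Ws (shape (T, M, R)) and coefficients Hs (shape (R, N)),
    attach the noisy phase (shape (M, N)), and invert with the inverse
    STFT to get a time-domain signal."""
    T = Ws.shape[0]
    S = sum(Ws[t] @ shift_right(Hs, t) for t in range(T))
    _, x_hat = istft(S * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return S, x_hat
```

Reusing the noisy-speech phase is the usual shortcut in magnitude-domain enhancement; only the magnitude is denoised.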
5. The method for denoising noisy speech according to claim 4, characterized in that: objective function (7) in step 2.2 is solved by an alternating update scheme, namely:
Step 1: fix $W_n$ and $H_s$, and update $W_s$;
Step 2: fix $W_s$, $W_n$, and $\bar{H}_n$, and update $H_s$;
Step 3: fix $W_s$, $H_s$, and $W_n$, and update $\bar{H}_n$;
Since formula (7) is convex with respect to the variable updated in each of the above steps, gradient descent yields the update rules:
$$W^{s}_{ik}(t) \leftarrow W^{s}_{ik}(t)\cdot\frac{\sum_{j=1}^{T}\left(V_{ij}+\bar{W}^{s}_{ik}(t)\,\bar{W}^{s}_{ki}(t)\,\Lambda_{ij}\right)\overset{t\rightarrow}{H}{}^{s}_{jk}}{\sum_{j=1}^{T}\left(\Lambda_{ij}+\bar{W}^{s}_{ik}(t)\,\bar{W}^{s}_{ki}(t)\,V_{ij}\right)\overset{t\rightarrow}{H}{}^{s}_{jk}} \qquad (9)$$

$$H^{s}_{kj} \leftarrow H^{s}_{kj}\cdot\frac{\sum_{i=1}^{M}\bar{W}^{s}_{ki}(t)\,V_{ij}}{\sum_{i=1}^{M}\bar{W}^{s}_{ki}(t)\,\overset{\leftarrow t}{\Lambda}_{ij} + \frac{\lambda}{2}\left(H^{s}_{kj}\right)^{-1/2}} \qquad (10)$$

$$\bar{H}^{n}_{kj} \leftarrow \bar{H}^{n}_{kj}\cdot\frac{\sum_{i=1}^{M}\bar{W}^{n}_{ik}(t)\,V_{ij}}{\sum_{i=1}^{M}\bar{W}^{n}_{ik}(t)\,\overset{\leftarrow t}{\Lambda}_{ij}} \qquad (11)$$
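The effect of the L1/2 penalty in rule (10) — it enters only the denominator, as (λ/2)·(H^s)^(−1/2) — can be illustrated with a simplified single-shift (t = 0) update; this is a sketch of the rule's structure, not the patent's full convolutive update:

```python
import numpy as np

def update_Hs(Hs, Ws, V, Lam, lam=0.1, eps=1e-9):
    """Single multiplicative update of the speech coefficient matrix Hs,
    mimicking rule (10) with only the t = 0 basis: the data term
    W0^T V over W0^T Lam, with the L1/2 penalty (lam/2) * Hs^(-1/2)
    added to the denominator so that a larger lam shrinks Hs harder."""
    W0 = Ws[0]                                  # (M, R) basis at shift t = 0
    num = W0.T @ V                              # (R, N) numerator
    den = W0.T @ Lam + 0.5 * lam * np.power(Hs + eps, -0.5)
    return Hs * num / (den + eps)
```

Because the penalty only inflates the denominator, the update stays multiplicative and non-negative while pushing small coefficients toward zero, which is the sparsity mechanism the claim relies on.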
6. A system for denoising noisy speech based on L1/2 sparse-constrained convolutive non-negative matrix factorization, characterized by comprising:
an STFT module, for applying the short-time Fourier transform to the specific noise and the noisy speech to obtain their magnitude spectra;
a noise training module, for training on the specific noise with the CNMF method to obtain noise-basis information;
a speech decomposition module, for decomposing the noisy speech with the CNMF_L1/2 method, using the trained noise bases as prior information, to obtain speech bases;
a speech synthesis module, for synthesizing the denoised speech magnitude spectrum from the obtained speech bases and the phase information;
a spectrum conversion module, for applying the inverse STFT to the denoised speech magnitude spectrum to obtain the enhanced speech.
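Under strong simplifying assumptions (convolutive depth T = 1, so both factorizations reduce to plain multiplicative NMF, and no L1/2 penalty), the five claimed modules can be wired into a minimal end-to-end sketch; all parameter values here are illustrative, not values from the patent:

```python
import numpy as np
from scipy.signal import stft, istft

def denoise_pipeline(noise, noisy, fs=8000, nperseg=256, R=8, n_iter=60):
    """End-to-end sketch of the claimed system: STFT module -> noise
    training module -> speech decomposition module (noise bases fixed)
    -> speech synthesis module -> spectrum conversion module."""
    eps = 1e-9
    rng = np.random.default_rng(1)

    # STFT module: magnitude spectra, plus the noisy phase for resynthesis
    _, _, Zn = stft(noise, fs=fs, nperseg=nperseg)
    _, _, Zv = stft(noisy, fs=fs, nperseg=nperseg)
    Nmag, V = np.abs(Zn), np.abs(Zv)
    phase = np.angle(Zv)

    # noise training module: Nmag ~ Wn @ Hn (multiplicative NMF)
    M = Nmag.shape[0]
    Wn = rng.random((M, R)) + eps
    Hn = rng.random((R, Nmag.shape[1])) + eps
    for _ in range(n_iter):
        Wn *= (Nmag @ Hn.T) / (Wn @ Hn @ Hn.T + eps)
        Hn *= (Wn.T @ Nmag) / (Wn.T @ Wn @ Hn + eps)

    # speech decomposition module: V ~ Wn @ Hn2 + Ws @ Hs, with Wn fixed
    F = V.shape[1]
    Ws = rng.random((M, R)) + eps
    Hs = rng.random((R, F)) + eps
    Hn2 = rng.random((R, F)) + eps
    for _ in range(n_iter):
        Lam = Wn @ Hn2 + Ws @ Hs + eps
        Ws *= (V @ Hs.T) / (Lam @ Hs.T + eps)
        Lam = Wn @ Hn2 + Ws @ Hs + eps
        Hs *= (Ws.T @ V) / (Ws.T @ Lam + eps)
        Lam = Wn @ Hn2 + Ws @ Hs + eps
        Hn2 *= (Wn.T @ V) / (Wn.T @ Lam + eps)

    # speech synthesis + spectrum conversion modules
    S = Ws @ Hs
    _, x_hat = istft(S * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return x_hat
```

Keeping Wn fixed during the decomposition stage is what makes the scheme supervised: the speech bases Ws must explain whatever the pre-trained noise bases cannot.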
CN201610452012.7A (filed 2016-06-20, granted as CN105957537B): Speech denoising method and system based on L1/2 sparse-constrained convolutive non-negative matrix factorization — Active

Publications (2)

Publication Number | Publication Date
CN105957537A | 2016-09-21
CN105957537B | 2019-10-08

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant