CN103594093A

CN103594093A - Method for enhancing voice based on signal to noise ratio soft masking

Info

Publication number: CN103594093A
Application number: CN201210290074.4A
Authority: CN
Inventors: 王景芳
Original assignee: 王景芳
Current assignee: Hunan International Economics University
Priority date: 2012-08-15
Filing date: 2012-08-15
Publication date: 2014-02-19

Abstract

The invention discloses a speech enhancement method based on signal-to-noise ratio (SNR) soft masking. The method comprises: updating the noise power spectrum with sub-band time-varying coefficients, applying different thresholds at different frequency bins when updating the smoothed spectrum, so as to emphasize speech and suppress noise; determining the a posteriori SNR from the noise power spectrum, iteratively computing the a priori SNR of the current frame from the a posteriori SNR and the a priori SNR of the previous frame, and determining from the value of the a priori SNR whether each frequency bin lies in the masking region or the target-signal region. The masking value is computed from the probability distribution obtained by hypothesis testing. The method exploits the correlation between adjacent frames to realize a smoothed iterative estimate for speech spectrum enhancement. For non-stationary noise and strong background noise, a speech enhancement algorithm based on SNR soft masking is provided. A fast noise-tracking algorithm smoothly updates the non-stationary noise frame by frame, giving a better estimate of the noise spectrum. The algorithm effectively suppresses background noise and improves speech quality and intelligibility after denoising.

Description

Speech enhancement method based on signal-to-noise ratio soft masking

Technical field

The invention belongs to the field of speech processing technology, and in particular relates to a speech enhancement method based on signal-to-noise ratio (SNR) soft masking.

Background technology

Speech is a basic means of human communication. The range of human social activities and behaviors has raised many new problems for speech signal research, and at the same time the development of speech processing technology is constantly changing everyday life. For example, speech coding lets people hear distant speakers over limited communication bandwidth, and the recent development of wideband speech coding makes speech in communication more natural and more intelligible, reducing misunderstandings. Breakthroughs in large-vocabulary continuous speech recognition have given people new modes of speech input and interaction: users can free their hands and dictate directly, instructing machines or having them understand spoken language, which greatly improves working efficiency.

Speech processing technologies used in daily life, such as speech coding and speech recognition, inevitably face interference from all kinds of background noise. Noise greatly degrades the performance of these applications, or directly causes users to find them intolerable and abandon them. Environmental noise, such as background conversation at the scene, machine vibration in a car cabin, engine noise at high speed, and reverberation from indoor walls, all contaminates the original speech signal. Background noise and its characteristics have an especially serious effect on parametric speech processing techniques that model human speech characteristics, because it violates the assumed parametric models and auditory properties. Existing speech recognition systems work well in noise-free environments, but their recognition performance drops sharply in noisy ones. Evidently, under noise interference the distinctions between the speech features used by the recognition system are weakened, so recognition errors increase.

As mobile communication has become widespread, it has brought people unconstrained and convenient voice communication, but it has also brought voice communication into application environments full of complex noise; the speech coder of a mobile phone inevitably makes more coding errors in noisy environments.

Speech enhancement can reduce or remove the adverse effects of additive noise. It is typically used as a front-end processing module in practical speech processing systems. By filtering the noisy speech to approximately recover the clean speech signal, it spares subsequent processing from facing the noisy signal directly and strengthens the robustness of the speech system; a highly robust speech enhancement technique can effectively widen the range of places where speech processing systems can be used.

Summary of the invention

(1) Technical problem to be solved

In view of this, the main purpose of the present invention is to propose a speech enhancement method based on SNR soft masking. For listeners, the aim is chiefly to improve speech quality and intelligibility and to reduce fatigue; for speech processing systems (recognizers, vocoders, mobile phones), the aim is to improve the recognition rate and noise immunity of the system. The key scientific problems to be solved are: enhancing noisy speech, improving the SNR, minimizing distortion and loss of speech information after enhancement, and remaining applicable to a wide range of noise environments. Specifically, these comprise the noise power spectrum update, the a priori SNR computation, the determination of the masking region and the target-signal region, and the computation of the masking value.

(2) Technical solution

To achieve the above purpose, the invention provides a speech enhancement method based on SNR soft masking, the method comprising:

1) Noise power spectrum update with sub-band time-varying coefficients: the power spectrum of the l-th frame of noisy speech is |Y(l, k)|², where k is the frequency-bin index; the estimated noise power spectrum of frame l is

Figure DEST_PATH_RE-142992DEST_PATH_IMAGE001

, P(l, k) is the smoothed speech power spectrum, smoothing factor,

Figure DEST_PATH_RE-380255DEST_PATH_IMAGE003

, the minimum of the noisy speech power spectrum,

If

Figure DEST_PATH_RE-610565DEST_PATH_IMAGE005

, then

Figure DEST_PATH_RE-572705DEST_PATH_IMAGE006

,

Otherwise

Figure DEST_PATH_RE-669974DEST_PATH_IMAGE007

;

Figure DEST_PATH_RE-780099DEST_PATH_IMAGE008

,

If

Figure DEST_PATH_RE-937410DEST_PATH_IMAGE009

,

Figure DEST_PATH_RE-121267DEST_PATH_IMAGE010

,

Otherwise

Figure DEST_PATH_RE-22227DEST_PATH_IMAGE011

;

Figure DEST_PATH_RE-240719DEST_PATH_IMAGE012

,

Here, different thresholds are applied at different frequency bins when updating the smoothed spectrum, which emphasizes speech and suppresses noise;

Figure DEST_PATH_RE-568932DEST_PATH_IMAGE013

;

Figure DEST_PATH_RE-240085DEST_PATH_IMAGE014

;
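The update formulas for step 1) are carried by equation images that are not reproduced in this text, so the following is only a minimal sketch of a minimum-tracking noise estimator with a per-bin threshold and a time-varying smoothing coefficient, in the spirit of the description above. The recursions themselves and every name and constant (`update_noise`, `eta`, `gamma`, `alpha_p`, `alpha_d`, the threshold values) are assumptions chosen for illustration, not values taken from the patent.

```python
def update_noise(y_pow, state, thresholds,
                 eta=0.7, gamma=0.998, alpha_p=0.2, alpha_d=0.85):
    """One frame of an illustrative minimum-tracking noise power update.

    y_pow      : list of |Y(l,k)|^2 for the current frame
    state      : dict of per-bin lists 'P', 'P_min', 'p', 'noise' (previous frame)
    thresholds : per-bin speech-presence thresholds (hypothetical values)
    """
    new = {"P": [], "P_min": [], "p": [], "noise": []}
    for k, yk in enumerate(y_pow):
        # Smoothed power spectrum of the noisy speech (assumed first-order recursion).
        P = eta * state["P"][k] + (1.0 - eta) * yk
        # Minimum tracking: follow drops immediately, rise only slowly.
        if P < state["P_min"][k]:
            P_min = P
        else:
            P_min = gamma * state["P_min"][k] + (1.0 - gamma) * P
        # Per-bin threshold decides whether speech is judged present.
        present = 1.0 if P > thresholds[k] * P_min else 0.0
        p = alpha_p * state["p"][k] + (1.0 - alpha_p) * present
        # Time-varying smoothing coefficient: update slowly where speech is likely.
        alpha_s = alpha_d + (1.0 - alpha_d) * p
        noise = alpha_s * state["noise"][k] + (1.0 - alpha_s) * yk
        new["P"].append(P)
        new["P_min"].append(P_min)
        new["p"].append(p)
        new["noise"].append(noise)
    return new
```

With these assumed recursions, a stationary input lets the noise estimate converge to the input power, while a sudden loud frame raises the speech-presence value and so slows the noise update for that bin.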

2) Based on the value of the a priori SNR, determine whether each frequency bin lies in the masking region or the target-signal region;

Time domain: y(n) = x(n) + d(n), where y(n) denotes the noisy speech signal, and x(n) and d(n) denote the clean speech and the noise signal, respectively. The short-time Fourier transform of y(n):

Figure DEST_PATH_RE-752155DEST_PATH_IMAGE016

,

In polar form:

Figure DEST_PATH_RE-251269DEST_PATH_IMAGE017

,

Figure DEST_PATH_RE-878559DEST_PATH_IMAGE018

and

Figure DEST_PATH_RE-652480DEST_PATH_IMAGE019

denote, respectively, the phase and the magnitude at the k-th frequency bin;

By the independence assumption:

Figure DEST_PATH_RE-845564DEST_PATH_IMAGE020

Instantaneous a priori SNR:

Figure DEST_PATH_RE-515580DEST_PATH_IMAGE021

,

Figure DEST_PATH_RE-895746DEST_PATH_IMAGE022

;

3) The masking value is computed from the probability distribution obtained by hypothesis testing:

Ideal binary mask (IdBM):

The gain function G_k can be regarded as a random variable, because it depends on the instantaneous SNR

Figure DEST_PATH_RE-989790DEST_PATH_IMAGE024

. In binary masking, G_k is a Bernoulli-distributed random variable taking the value 0 or 1, whose parameter p is the probability of the hypothesis. G_k is difficult to estimate, because it depends on an accurate estimate of the instantaneous SNR. Its expectation, however, can be obtained more reliably. Taking the expectation gives the following weighted-average expected magnitude-squared spectrum, which incorporates the two hypotheses:

Figure DEST_PATH_RE-830707DEST_PATH_IMAGE025

Here P(H₁) denotes the probability that hypothesis H₁ is correct; E(G_k|H₁) denotes the expected gain function when hypothesis H₁ is true (the target signal dominates), and E(G_k|H₀) denotes the expected gain function when hypothesis H₀ is true (the masker dominates). For the ideal binary mask, E(G_k|H₁)=1 and E(G_k|H₀)=0. In practice, using a very small nonzero value for E(G_k|H₀) gives better quality, with the enhanced speech containing a small amount of residual noise. In our study we replace E(G_k|H₀) with G_f = -20 dB, to reduce the residual noise as far as possible;
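The weighted average described above follows from the law of total expectation. The patent's own equation is an image that is not reproduced here, so the form below is a sketch: the symbols $\hat{X}_k^2$ (estimated clean magnitude-squared spectrum) and $Y_k^2$ (noisy magnitude-squared spectrum), and the choice to apply the expected gain to the magnitude-squared spectrum, are notational assumptions.

```latex
\begin{aligned}
E(G_k) &= P(H_1)\,E(G_k \mid H_1) + P(H_0)\,E(G_k \mid H_0),
\qquad P(H_0) = 1 - P(H_1), \\
\hat{X}_k^2 &= E(G_k)\,Y_k^2 .
\end{aligned}
```

With E(G_k|H₁)=1 and E(G_k|H₀) replaced by the floor G_f, the first line reduces to E(G_k) = P(H₁) + G_f·(1 − P(H₁)), so the soft mask always lies between G_f and 1.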

Assume that the real and imaginary parts of the discrete Fourier transform (DFT) coefficients are independent Gaussian random variables with equal variance; for example, the real and imaginary parts of the speech DFT coefficients

Figure DEST_PATH_RE-167010DEST_PATH_IMAGE026

both follow the normal distribution

Figure DEST_PATH_RE-282734DEST_PATH_IMAGE027

,

Figure DEST_PATH_RE-653672DEST_PATH_IMAGE028

both follow the

Figure DEST_PATH_RE-196649DEST_PATH_IMAGE029

distribution, and their sum:

Figure DEST_PATH_RE-285828DEST_PATH_IMAGE030

follows the

Figure DEST_PATH_RE-408504DEST_PATH_IMAGE031

distribution, with density function:

Figure DEST_PATH_RE-430687DEST_PATH_IMAGE032

,

Figure DEST_PATH_RE-613407DEST_PATH_IMAGE033

,

Figure DEST_PATH_RE-455461DEST_PATH_IMAGE034

follows an exponential distribution:

Figure DEST_PATH_RE-381829DEST_PATH_IMAGE035

Similarly,

Figure DEST_PATH_RE-258518DEST_PATH_IMAGE036

follows an exponential distribution:

Figure DEST_PATH_RE-612139DEST_PATH_IMAGE037
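The exponential-distribution claim can be checked numerically. The sketch below draws the real and imaginary parts as independent zero-mean Gaussians, each with variance σ²/2 (this split of the variance is an assumption about the missing formulas), and compares the squared magnitude against an exponential distribution with mean σ². The function names are illustrative.

```python
import math
import random

def squared_magnitudes(sigma2, n, seed=0):
    """Draw n samples of Re^2 + Im^2 with Re, Im ~ N(0, sigma2 / 2), independent."""
    rng = random.Random(seed)
    s = math.sqrt(sigma2 / 2.0)
    return [rng.gauss(0.0, s) ** 2 + rng.gauss(0.0, s) ** 2 for _ in range(n)]

def empirical_survival(samples, t):
    """Fraction of samples strictly greater than t."""
    return sum(1 for v in samples if v > t) / len(samples)

sigma2 = 2.0
samples = squared_magnitudes(sigma2, n=100_000)
sample_mean = sum(samples) / len(samples)

# For an exponential distribution with mean sigma2, P(V > t) = exp(-t / sigma2).
print("sample mean:", sample_mean, "expected:", sigma2)
print("P(V > sigma2):", empirical_survival(samples, sigma2), "expected:", math.exp(-1.0))
```

The same check applies to the noise coefficients with the noise variance in place of the speech variance.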

By Bayes' rule:

Figure DEST_PATH_RE-941489DEST_PATH_IMAGE038

where, when

Figure DEST_PATH_RE-671547DEST_PATH_IMAGE039

,

Figure DEST_PATH_RE-402743DEST_PATH_IMAGE040

If

Figure DEST_PATH_RE-927265DEST_PATH_IMAGE041

,

Figure DEST_PATH_RE-743912DEST_PATH_IMAGE042

then

Figure DEST_PATH_RE-277661DEST_PATH_IMAGE043

is always positive;

Figure DEST_PATH_RE-863363DEST_PATH_IMAGE044

Figure DEST_PATH_RE-558787DEST_PATH_IMAGE045

, where

Figure DEST_PATH_RE-331571DEST_PATH_IMAGE046

Figure DEST_PATH_RE-934590DEST_PATH_IMAGE047

；

4) A priori SNR update:

Figure DEST_PATH_RE-843641DEST_PATH_IMAGE048

。
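The update equation for step 4) is an image that is not reproduced here. A common way to obtain the current frame's a priori SNR from the a posteriori SNR and the previous frame's estimate is a decision-directed-style recursion; the sketch below assumes that form. The smoothing weight `a`, the floor `xi_min`, and the function names are hypothetical.

```python
def posterior_snr(y_pow, noise_pow):
    """A posteriori SNR per bin: noisy power over estimated noise power."""
    return [y / max(n, 1e-12) for y, n in zip(y_pow, noise_pow)]

def update_prior_snr(gamma, prev_clean_pow, noise_pow, a=0.98, xi_min=10 ** (-25 / 10)):
    """Smoothed a priori SNR for the current frame (assumed decision-directed form).

    gamma          : a posteriori SNR of the current frame, per bin
    prev_clean_pow : estimated clean-speech power of the previous frame, per bin
    noise_pow      : estimated noise power, per bin
    """
    xi = []
    for g, xp, n in zip(gamma, prev_clean_pow, noise_pow):
        recursive = xp / max(n, 1e-12)        # carried over from the previous frame
        instantaneous = max(g - 1.0, 0.0)     # from the current a posteriori SNR
        xi.append(max(a * recursive + (1.0 - a) * instantaneous, xi_min))
    return xi
```

A weight `a` close to 1 leans on the previous frame, which is one way to exploit the inter-frame correlation the method relies on; the floor keeps the estimate strictly positive.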

Preferably, the parameters are initialized as follows: the noisy speech signal is divided into frames with frame length N=[0.25fs] points, where fs is the sampling frequency of the signal, and a frame shift of N/2; the initial noise spectrum is determined from the first several frames, which contain no speech.
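The initialization can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame length is passed in by the caller, no analysis window is applied, the DFT is a naive one so that the block is self-contained, and the function names and the choice to average the leading frames are assumptions.

```python
import cmath
import math

def frame_signal(x, frame_len):
    """Split x into frames of frame_len samples with a frame shift of frame_len // 2."""
    hop = frame_len // 2
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def power_spectrum(frame):
    """|DFT|^2 for bins 0 .. n/2 of a real frame (naive O(n^2) DFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def initial_noise_spectrum(frames, n_init):
    """Average the power spectra of the first n_init frames, assumed speech-free."""
    spectra = [power_spectrum(f) for f in frames[:n_init]]
    return [sum(col) / len(spectra) for col in zip(*spectra)]
```

For example, `initial_noise_spectrum(frame_signal(y, N), 20)` would use the first 20 frames of a signal `y` as the noise reference.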

Preferably, the parameters of the noise power spectrum update with sub-band time-varying coefficients are:

Figure DEST_PATH_RE-506703DEST_PATH_IMAGE049

.

Preferably, the parameters for determining, from the value of the a priori SNR, whether each frequency bin lies in the masking region or the target-signal region are:

Figure DEST_PATH_RE-501204DEST_PATH_IMAGE050

.

Preferably, the a priori SNR update parameter is: .

Preferably, the implementation process of the invention is shown in Fig. 1, and the speech enhancement process is illustrated in Fig. 2.

Preferably, the noisy speech signal is processed frame by frame in real time, as shown in Fig. 3.

(3) Beneficial effects

1. The speech enhancement method based on SNR soft masking provided by the invention suppresses noise effectively, markedly improves speech recognition system performance and intelligibility, and is robust across different noise environments and SNR conditions. The algorithm runs in real time, achieving both effectiveness and real-time operation;

2. Advantages and characteristics of the speech enhancement method based on SNR soft masking provided by the invention:

1) it realizes real-time noise power spectrum updating with sub-band time-varying coefficients;

2) it proposes the SNR soft-masking principle;

3) it makes full use of the correlation between adjacent frames to extract information, realizing a smoothed iterative estimate of the a priori SNR;

4) the algorithm has low complexity and can meet real-time requirements;

3. For non-stationary environmental noise, the speech enhancement method provided by the invention approaches the problem from the perspective of SNR soft masking. A fast noise-tracking algorithm smoothly updates the non-stationary noise frame by frame, giving a better estimate of the noise spectrum; the method offers a new approach to denoising under strong background noise and to weak-signal detection.

Brief description of the drawings

Fig. 1 is a flowchart of the speech enhancement method based on SNR soft masking provided by the invention;

Fig. 2 is a schematic diagram of the noisy speech enhancement process provided by the invention;

Fig. 3 is a schematic diagram of short-time-spectrum speech enhancement of noisy speech provided by the invention;

Fig. 4 is a simulation of the noise power spectrum update with sub-band time-varying coefficients provided by the invention;

Panel (a) is the English utterance (The birch canoe slid on the smooth planks); the speech with babble noise is taken from the AURORA database; panel (b) has SNR = 5 dB and panel (c) has SNR = 0 dB; panels (b1) and (c1) show the true babble noise power spectrum at frequency f = 250 Hz and the estimated, tracked power spectrum.

Fig. 5 compares the results before and after enhancement in simulations on speech containing babble noise.

The left-hand panels are time-domain signals and the right-hand panels are spectrograms; the noise is babble noise from the Noisex-92 database; panel (a) is the original speech ('language, sound, increasing, strong'); panel (b1) is the enhancement result for panel (b) (SNR = 5 dB); panel (c1) is the enhancement result for panel (c) (SNR = 0 dB).

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.

The core of the invention is: it realizes a noise power spectrum update with sub-band time-varying coefficients; it proposes the SNR soft-masking principle; and it makes full use of the correlation between adjacent frames to extract information, realizing a smoothed iterative estimate of the a priori SNR and thereby achieving speech enhancement.

As shown in Fig. 1, which is a flowchart of the speech enhancement method based on SNR soft masking provided by the invention, the method comprises the following steps:

Step 101: parameter initialization: the noisy speech signal is divided into frames with frame length N=[0.25fs] points, where fs is the sampling frequency of the signal, and a frame shift of N/2; the noise spectrum is given its initial value;

Step 102: framing: take the m-th frame of noisy speech and compute its short-time Fourier transform;

Step 103: compute the noise power spectrum update with sub-band time-varying coefficients for the m-th frame;

Step 104: perform the smoothed iteration of the a priori SNR for the m-th frame;

Step 105: compute the SNR soft-masking factor for the m-th frame, then apply the inverse Fourier transform to obtain the enhanced time-domain speech signal:

Figure DEST_PATH_RE-517154DEST_PATH_IMAGE053

;

Step 106: process the next frame in real time: go to step 102.
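Steps 101 to 106 amount to a frame-by-frame loop of analysis, gain, and synthesis. The skeleton below is a sketch under stated assumptions: a periodic Hann analysis window with 50% overlap-add (the text does not name a window), a naive DFT so that the block is self-contained, and a caller-supplied `gain_fn` standing in for steps 103 to 105. All names are illustrative.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def enhance(y, frame_len, gain_fn):
    """Frame-by-frame enhancement; frame_len must be even.

    gain_fn maps a list of |Y(k)|^2 (all frame_len bins) to per-bin amplitude gains.
    The noisy phase is kept; only the magnitude is scaled.
    """
    hop = frame_len // 2
    # Periodic Hann window: shifted copies at 50% overlap sum to one.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / frame_len) for t in range(frame_len)]
    out = [0.0] * len(y)
    for start in range(0, len(y) - frame_len + 1, hop):       # step 102: next frame
        frame = [y[start + t] * window[t] for t in range(frame_len)]
        Y = dft(frame)
        gains = gain_fn([abs(v) ** 2 for v in Y])             # steps 103 to 105
        x_hat = idft([g * v for g, v in zip(gains, Y)])       # inverse transform
        for t in range(frame_len):                            # overlap-add
            out[start + t] += x_hat[t]
    return out
```

With a gain of one in every bin, the loop returns the input unchanged wherever two frames overlap, which is a convenient sanity check before plugging in a real mask.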

The noise power spectrum update with sub-band time-varying coefficients in step 103 above is computed as follows:

The power spectrum of the l-th frame of noisy speech is |Y(l, k)|², where k is the frequency-bin index; the estimated noise power spectrum of frame l is , P(l, k) is the smoothed speech power spectrum,

smoothing factor,

Figure DEST_PATH_RE-827415DEST_PATH_IMAGE003

,

Figure DEST_PATH_RE-832280DEST_PATH_IMAGE004

the minimum of the noisy speech power spectrum;

If

Figure DEST_PATH_RE-801373DEST_PATH_IMAGE005

, then

Figure DEST_PATH_RE-18728DEST_PATH_IMAGE006

,

Otherwise

Figure DEST_PATH_RE-288035DEST_PATH_IMAGE007

;

If

Figure DEST_PATH_RE-920191DEST_PATH_IMAGE009

,

Figure DEST_PATH_RE-675657DEST_PATH_IMAGE010

, otherwise

Figure DEST_PATH_RE-65050DEST_PATH_IMAGE011

;

Figure DEST_PATH_RE-614980DEST_PATH_IMAGE012

；

Figure DEST_PATH_RE-89824DEST_PATH_IMAGE013

；

Figure DEST_PATH_RE-648981DEST_PATH_IMAGE014

。

The smoothed iteration of the a priori SNR in step 104 above comprises:

Figure DEST_PATH_RE-361722DEST_PATH_IMAGE048

，

Figure DEST_PATH_RE-879291DEST_PATH_IMAGE051

。

The computation of the SNR soft-masking factor in step 105 above comprises:

y(n) = x(n) + d(n), where y(n) denotes the noisy speech signal, and x(n) and d(n) denote the clean speech and the noise signal, respectively. The short-time Fourier transform of y(n):

Figure DEST_PATH_RE-673121DEST_PATH_IMAGE016

,

In polar form:

Figure DEST_PATH_RE-505948DEST_PATH_IMAGE017

,

and

Figure DEST_PATH_RE-847116DEST_PATH_IMAGE019

denote, respectively, the phase and the magnitude at the k-th frequency bin.

By the independence assumption:

Figure DEST_PATH_RE-279235DEST_PATH_IMAGE020

Instantaneous a priori SNR:

Figure DEST_PATH_RE-497726DEST_PATH_IMAGE021

, .

The masking value is computed from the probability distribution obtained by hypothesis testing:

, where

Figure DEST_PATH_RE-936164DEST_PATH_IMAGE046

Figure DEST_PATH_RE-478004DEST_PATH_IMAGE047

。

Building on the flowchart of the speech enhancement method based on SNR soft masking shown in Fig. 1, Figs. 2 and 3 further illustrate the signal flow of the speech enhancement process, and Fig. 4 shows a simulation experiment of the noise power spectrum update with sub-band time-varying coefficients.

The speech enhancement method based on a posteriori SNR soft masking provided by the invention is described further below with a specific embodiment. In the experiment, the background noise of the noisy speech is babble noise taken from the Noisex-92 database, with sampling frequency fs = 19.98 kHz. Using the same sampling frequency fs, we recorded the utterance 'language, sound, end, point' in an environment with computer noise and room noise; see Fig. 1(a). When the speech is divided into frames, the frame length is 25 ms, i.e. frame length M=[0.25fs] points, and the frame shift is

Figure DEST_PATH_RE-135567DEST_PATH_IMAGE055

, and the first N = 20 frames are taken as the initial noise frames;

The performance of the algorithm was analysed objectively and comprehensively in terms of speech waveform, spectrogram, and SNR improvement. The SNR

Figure DEST_PATH_RE-909488DEST_PATH_IMAGE056

is used to analyse the denoising effect of the algorithm quantitatively; for the comparison of results before and after enhancement in simulations on speech containing babble noise, see the figures.
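The SNR formula itself is an image that is not reproduced here. The sketch below assumes the usual waveform definition, ten times the base-10 logarithm of the clean-signal energy over the energy of the residual error; the function and argument names are illustrative.

```python
import math

def snr_db(clean, estimate):
    """SNR in dB of an estimate against a time-aligned clean reference (assumed definition)."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    if error_energy == 0.0:
        return math.inf       # perfect reconstruction
    if signal_energy == 0.0:
        return -math.inf      # silent reference
    return 10.0 * math.log10(signal_energy / error_energy)
```

Evaluating the function once with the noisy input and once with the enhanced output as `estimate` gives the SNR improvement for an utterance.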

The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing is only a specific embodiment of the invention and does not limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A speech enhancement method based on SNR soft masking, characterized in that the method comprises:

To address the practical problem that speech signals are difficult to extract under non-stationary noise and strong background noise, the algorithm designs a noise power spectrum update with sub-band time-varying coefficients, and determines from the value of the a priori SNR whether each frequency bin lies in the masking region or the target-signal region; the masking value is computed from the probability distribution obtained by hypothesis testing, together with specific embodiments.

2. The speech enhancement method based on SNR soft masking according to claim 1, characterized in that, in said noise power spectrum update with sub-band time-varying coefficients, the power spectrum of the l-th frame of noisy speech is |Y(l, k)|², where k is the frequency-bin index, and the estimated noise power spectrum of frame l is

, P(l, k) is the smoothed speech power spectrum,

smoothing factor, ,

is the minimum of the noisy speech power spectrum; if

, then

, otherwise

;

if,

,

, otherwise

;

, where different thresholds are applied at different frequency bins when updating the smoothed spectrum, emphasizing speech and suppressing noise;

;

.

3. The speech enhancement method based on SNR soft masking according to claim 1, characterized in that whether each frequency bin lies in the masking region or the target-signal region is determined from the value of the a priori SNR; time domain:

,

y(n) denotes the noisy speech signal, and x(n) and d(n) denote the clean speech and the noise signal, respectively; the short-time Fourier transform of y(n):

, in polar form:

,

and

By the independence assumption:

Instantaneous a priori SNR: ,

.

4. The speech enhancement method based on SNR soft masking according to claim 1, characterized in that said masking value is computed from the probability distribution obtained by hypothesis testing:

, where

。