CN111261197B - Real-time speech paragraph tracking method under complex noise scene - Google Patents

Real-time speech paragraph tracking method under complex noise scene

Info

Publication number
CN111261197B
CN111261197B CN202010029721.0A
Authority
CN
China
Prior art keywords
noise
calculating
frame
signal
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010029721.0A
Other languages
Chinese (zh)
Other versions
CN111261197A (en)
Inventor
马翼平
张玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Avic East China Photoelectric Shanghai Co ltd
Original Assignee
Avic East China Photoelectric Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Avic East China Photoelectric Shanghai Co ltd filed Critical Avic East China Photoelectric Shanghai Co ltd
Priority to CN202010029721.0A priority Critical patent/CN111261197B/en
Publication of CN111261197A publication Critical patent/CN111261197A/en
Application granted granted Critical
Publication of CN111261197B publication Critical patent/CN111261197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques in which the extracted parameters are power information
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a real-time speech paragraph tracking method for complex noise scenes, comprising the following steps: A. preprocessing; B. calculating the discrete Fourier transform coefficients of the input audio frame; C. assuming the first L frames to be noise frames, calculating the initial noise power as the arithmetic mean of the Fourier transform magnitude spectrum, and treating the data after the L-th frame as a noisy signal whose power is then calculated; D. calculating the posterior signal-to-noise ratio; E. calculating the prior signal-to-noise ratio; F. voice activity detection; G. updating the noise spectrum; H. calculating a gain coefficient. The spectral properties of the stationary noise in the scene are estimated from the noise between speech paragraphs, and a gain function is designed to enhance the speech and suppress the stationary noise. Voiced-sound detection is then performed on this basis to track the speech paragraphs and mask the various noises between them. The method thereby improves the accuracy of voice detection, suppresses the noise superimposed on speech segments, and completely masks the inter-segment noise that degrades the listening experience.

Description

Real-time voice paragraph tracking method under complex noise scene
Technical Field
The invention relates to the technical field of voice processing, in particular to a real-time voice paragraph tracking method in a complex noise scene.
Background
Engineering practice in speech signal processing must contend with complex noise scenes that include stationary noise, transient noise, time-varying noise, strong noise, and so on, each with different statistical properties. When a close-talking pickup device is used for voice acquisition, voice communication, or speech recognition, background noise is easily picked up by the microphone; it directly degrades voice communication from a listening standpoint and further impairs back-end processing modules such as speech recognition. In a complex noise scene, suppressing the steady-state noise mixed into the speech, masking the other noise types that fall between speech paragraphs, and tracking out clean speech paragraphs can effectively improve the listening quality of voice communication and the performance of back-end modules such as speech recognition. Speech tracking in a single noise scene with known statistical properties is comparatively easy to handle; speech paragraph tracking in a complex noise scene remains a difficult problem.
Disclosure of Invention
The present invention aims to provide a real-time speech paragraph tracking method for complex noise scenes, so as to solve the problems mentioned in the background art.
To achieve this aim, the invention provides the following technical scheme:
a real-time speech paragraph tracking method under a complex noise scene is characterized by comprising the following steps:
A. preprocessing: framing and windowing the input audio signal; taking 16 ms of data as one frame x_i(n), where i is the frame index;
B. calculating the discrete Fourier transform coefficients Y_i(ω_k) of the windowed frame x_i(n), where k is the index of the spectral component;
C. assuming the first L frames to be noise frames, calculating the initial noise power as the arithmetic mean of the Fourier transform magnitude spectrum, λ_d(k) = (1/L)·Σ_{i=1..L} |Y_i(ω_k)|²; assuming the data after the L-th frame to be a noisy signal, calculating the noisy-signal power |Y_i(ω_k)|²;
D. calculating the posterior signal-to-noise ratio γ_k = |Y_i(ω_k)|²/λ_d(k);
E. calculating the prior signal-to-noise ratio with the decision-directed estimate ξ_k = α·|Â(i-1,k)|²/λ_d(k) + (1-α)·max(γ_k - 1, 0), where α is a smoothing constant and Â(i-1,k) is the enhanced amplitude spectrum of the previous frame;
F. voice activity detection;
G. updating a noise spectrum;
H. calculating a gain coefficient;
I. signal reconstruction: calculating the amplitude spectrum and the power spectrum of the enhanced voice of the current frame, and performing inverse Fourier transform on the spectrum of the enhanced voice to obtain a reconstructed signal;
J. calculating the autocorrelation function of the reconstructed signal x̂_i(n): r_t(τ) = Σ_{n=1..N-τ} x̂_i(n)·x̂_i(n+τ), where r_t(τ) is the autocorrelation at delay τ, N is the window length, and 1 ≤ n ≤ N;
K. calculating the difference function d_t(τ) = r_t(0) + r_{t+τ}(0) - 2·r_t(τ), and its cumulative-mean normalization: d'(τ) = 1 for τ = 0, and d'(τ) = d(τ)/[(1/τ)·Σ_{j=1..τ} d(j)] otherwise;
L. judging voiced sound as follows: calculating p = 1 - d'(τ), where p characterizes the probability that the frame clearly contains a fundamental-frequency component; since d'(τ) lies in [0,1], p also lies in [0,1]; with p_th as a threshold, speech frames with p > p_th are retained as voiced;
m, unvoiced sound compensation and noise masking.
As a further scheme of the invention: in step A, the input audio signal is framed and windowed, the window function being a Hamming window: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
as a further scheme of the invention: and the step F is specifically to carry out voice activation detection on the input frame and select out the noise frame. According to the posterior signal-to-noise ratio gamma k And a priori signal-to-noise ratio
Figure BDA0002363854900000028
And solving a judgment parameter v for activating voice detection, judging as voice if v is greater than a judgment threshold eta, and judging as noise if v is less than eta, so as to update a noise spectrum. The calculation method of the decision parameter v is as follows.
As a further scheme of the invention: step G specifically comprises, after the noise frames are selected, updating the noise spectrum by recursive smoothing: λ_d(k) = μ·λ_d(k) + (1-μ)·|Y_i(ω_k)|², where μ is a smoothing constant.
as a further scheme of the invention: the step H is specifically as follows: calculating the weighting coefficient of the amplitude spectrum of the current frame according to the posterior signal-to-noise ratio and the prior signal-to-noise ratio:
Figure BDA0002363854900000032
as a further scheme of the invention: the function established in the step I is as follows:
Figure BDA0002363854900000033
as a further scheme of the invention: in the step M, if a certain frame is determined to be voiced and a signal frame within 400 milliseconds after the certain frame is determined to be not voiced, compensation is performed, that is, the signal frame is directly output without being processed; and (4) masking the non-voiced sound frame which does not meet the compensation condition, namely performing amplitude limiting processing and outputting.
Compared with the prior art, the invention has the following beneficial effects: it tracks the speech paragraphs completely, masks the noise outside the speech paragraphs, suppresses the noise superimposed on the speech, and enhances the listening quality of the speech.
Drawings
FIG. 1 is the time-domain waveform of an audio signal in which stationary noise and transient noise are superimposed on the speech, with noise peaks exceeding 60 dB;
FIG. 2 is the time-domain waveform of the signal of FIG. 1 after processing by the present invention;
FIG. 3 is the time-domain waveform of an audio signal in which stationary noise and transient noise are superimposed on the speech, with noise peaks exceeding 110 dB;
FIG. 4 is the time-domain waveform of the signal of FIG. 3 after processing by the present invention;
FIG. 5 is a flowchart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, example 1: in an embodiment of the present invention, a real-time speech paragraph tracking method in a complex noise scene includes the following steps:
A. Preprocessing. The input audio signal is framed and windowed, taking 16 ms (256 samples) of data as one frame x_i(n), where i is the frame index. The data are windowed with a Hamming window: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
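The framing and windowing of step A can be sketched as follows. The 16 kHz sampling rate (so that 16 ms gives 256 samples) matches the text; the 50% frame shift is an assumption, since the method does not state one:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping frames and apply the Hamming window
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) named in step A."""
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window  # each row is one windowed frame x_i(n)

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of audio at 16 kHz
frames = frame_signal(x)
```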
B. The discrete Fourier transform coefficients Y_i(ω_k) of the windowed frame x_i(n) are computed, where k is the index of the spectral component; in polar form, Y_i(ω_k) = Y_k·exp(jθ_y(k)).
C. The first L frames are assumed to be noise frames, and the initial noise power is computed as the arithmetic mean of the Fourier transform magnitude spectrum: λ_d(k) = (1/L)·Σ_{i=1..L} |Y_i(ω_k)|². The data after the L-th frame are assumed to be a noisy signal, whose power |Y_i(ω_k)|² is computed.
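Steps B and C can be sketched together: transform each frame and average the power of the first L frames to seed the noise estimate. The value L = 10 is an assumption, since the text leaves L unspecified:

```python
import numpy as np

def initial_noise_power(frames, L=10):
    """DFT each frame (step B), then average the power of the first L
    frames, assumed pure noise, to obtain lambda_d(k) (step C)."""
    Y = np.fft.rfft(frames, axis=1)                # Y_i(omega_k)
    lambda_d = np.mean(np.abs(Y[:L]) ** 2, axis=0)
    return Y, lambda_d

frames = np.random.default_rng(1).standard_normal((50, 256))
Y, noise_power = initial_noise_power(frames)
```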
D. The posterior signal-to-noise ratio γ_k = |Y_i(ω_k)|²/λ_d(k) is calculated.
E. The prior signal-to-noise ratio is calculated with the decision-directed estimate ξ_k = α·|Â(i-1,k)|²/λ_d(k) + (1-α)·max(γ_k - 1, 0), where α is a smoothing constant and Â(i-1,k) is the enhanced amplitude spectrum of the previous frame.
F. Voice activity detection. Since the noise may be only short-time stationary, the noise spectrum must be updated in real time to maintain the noise-suppression effect. Voice activity detection is performed on the input frame to select the noise frames. From the posterior signal-to-noise ratio γ_k and the prior signal-to-noise ratio ξ_k, a decision parameter v is derived; if v exceeds the decision threshold η the frame is judged as speech, otherwise it is judged as noise and used to update the noise spectrum. The decision parameter is calculated as v_k = γ_k·ξ_k/(1+ξ_k).
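Steps D through F can be sketched as below. The decision-directed form of ξ and the parameter v_k = γ_k·ξ_k/(1+ξ_k) follow the Ephraim-Malah framework that the exp/expint gain of step H suggests; the constants alpha and eta are assumed values, not taken from the text:

```python
import numpy as np

def snr_and_vad(noisy_power, lambda_d, prev_enh_power, alpha=0.98, eta=0.15):
    """Posterior SNR (step D), decision-directed prior SNR (step E),
    and a frame-level speech/noise decision (step F)."""
    gamma = noisy_power / lambda_d                 # posterior SNR
    xi = alpha * prev_enh_power / lambda_d + (1 - alpha) * np.maximum(gamma - 1, 0)
    v = gamma * xi / (1 + xi)                      # decision parameter
    is_speech = np.mean(v) > eta
    return gamma, xi, v, is_speech

lambda_d = np.ones(129)
gamma, xi, v, is_speech = snr_and_vad(np.full(129, 10.0), lambda_d, np.zeros(129))
```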
G. Noise spectrum update. After a noise frame is selected, the noise spectrum is updated by recursive smoothing: λ_d(k) = μ·λ_d(k) + (1-μ)·|Y_i(ω_k)|², where μ is a smoothing constant.
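A minimal sketch of the step G update for frames the detector judged as noise. The original formula survives only as an image, so this recursive-smoothing form and the constant mu are assumptions:

```python
import numpy as np

def update_noise(lambda_d, noisy_power, mu=0.95):
    # Blend the previous noise estimate with the current noise-frame power.
    return mu * lambda_d + (1 - mu) * noisy_power

updated = update_noise(np.ones(4), np.full(4, 3.0))
```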
H. Gain calculation. The weighting coefficient for the amplitude spectrum of the current frame is computed from the posterior and prior signal-to-noise ratios: G(k) = (ξ_k/(1+ξ_k))·exp((1/2)·expint(v_k)), where exp(·) is the exponential function with base e and expint(·) is the exponential integral.
I. Signal reconstruction. The amplitude spectrum and power spectrum of the enhanced speech of the current frame are computed, the enhanced spectrum being Â_i(ω_k) = G(k)·Y_i(ω_k); the inverse Fourier transform of the enhanced spectrum yields the reconstructed signal x̂_i(n).
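Steps H and I can be sketched as follows. The text names exp(·) and expint(·), which matches the MMSE log-spectral-amplitude gain; treating the gain as exactly that form is an assumption, since the formula itself is shown only as an image:

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1

def lsa_gain(gamma, xi):
    """Step H sketch: log-spectral-amplitude gain from the posterior
    and prior SNRs."""
    v = gamma * xi / (1 + xi)
    return xi / (1 + xi) * np.exp(0.5 * exp1(v))

def enhance_frame(Y, gain):
    """Step I: weight the noisy spectrum and invert to the time domain."""
    return np.fft.irfft(gain * Y)

gain = lsa_gain(np.full(129, 10.0), np.full(129, 9.0))
x_hat = enhance_frame(np.ones(129, dtype=complex), gain)
```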
J. The autocorrelation function of the reconstructed signal x̂_i(n) is computed: r_t(τ) = Σ_{n=1..N-τ} x̂_i(n)·x̂_i(n+τ), where r_t(τ) is the autocorrelation at delay τ, N is the window length, and 1 ≤ n ≤ N.
K. The difference function is computed as d_t(τ) = r_t(0) + r_{t+τ}(0) - 2·r_t(τ), followed by its cumulative-mean normalization: d'(τ) = 1 for τ = 0, and d'(τ) = d(τ)/[(1/τ)·Σ_{j=1..τ} d(j)] otherwise.
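Steps J and K can be sketched with the difference function and its cumulative-mean normalization, as in the YIN pitch detector these formulas appear to follow; the search range tau_max is an assumed value:

```python
import numpy as np

def cmnd(x, tau_max=200):
    """Difference function d(tau) and cumulative-mean-normalized d'(tau).
    d'(tau) dips toward 0 at the fundamental period."""
    N = len(x) - tau_max
    d = np.array([np.sum((x[:N] - x[tau:tau + N]) ** 2) for tau in range(tau_max)])
    dprime = np.ones(tau_max)                       # d'(0) = 1 by definition
    dprime[1:] = d[1:] * np.arange(1, tau_max) / np.cumsum(d[1:])
    return dprime

fs = 16000
x = np.sin(2 * np.pi * 200 * np.arange(1024) / fs)  # 200 Hz tone: 80-sample period
dprime = cmnd(x)
tau_hat = int(np.argmin(dprime[20:]) + 20)          # skip very small lags
```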
L. Voiced sound is judged as follows: p = 1 - d'(τ) is calculated, where p characterizes the probability that the frame clearly contains a fundamental-frequency component. Since d'(τ) lies in [0,1], p also lies in [0,1]. With p_th as a threshold, speech frames with p > p_th are retained as voiced.
M. Unvoiced compensation and noise masking. If a frame is judged voiced, non-voiced signal frames within the following 400 milliseconds are compensated, i.e. output directly without further processing; non-voiced frames that do not meet the compensation condition are masked, i.e. amplitude-limited before output.
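The paragraph-tracking logic of step M can be sketched as a hold window: a voiced frame opens a 400 ms window in which non-voiced frames pass through (compensation), while frames outside any window are masked. Converting 400 ms to 25 frames assumes the 16 ms frame length with no overlap, and masking is shown here as limiting to zero; both are assumptions about details the text leaves open:

```python
import numpy as np

def track_paragraphs(frames, voiced, hold_frames=25):
    """Keep voiced frames and unvoiced frames inside the hold window;
    amplitude-limit everything else (inter-paragraph noise)."""
    out = []
    since_voiced = hold_frames + 1
    for frame, is_voiced in zip(frames, voiced):
        since_voiced = 0 if is_voiced else since_voiced + 1
        if since_voiced <= hold_frames:
            out.append(frame)                  # voiced, or compensated unvoiced
        else:
            out.append(np.zeros_like(frame))   # masked inter-paragraph frame
    return np.array(out)

out = track_paragraphs(np.ones((5, 4)), [True, False, False, False, False],
                       hold_frames=2)
```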
FIG. 2 and FIG. 4 show the audio time-domain waveforms after processing by the method of the invention. Comparison with the original waveforms shows that, against a complex noise background, the method tracks the speech paragraphs completely, masks the noise outside the speech paragraphs, and also suppresses the noise superimposed on the speech, thereby enhancing the listening quality of the speech itself.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims (5)

1. A real-time speech paragraph tracking method under a complex noise scene is characterized by comprising the following steps:
A. preprocessing: framing and windowing the input audio signal; taking 16 ms of data as one frame x_i(n), where i is the frame index;
B. calculating the discrete Fourier transform coefficients Y_i(ω_k) of the windowed frame x_i(n), where k is the index of the spectral component;
C. assuming the first L frames to be noise frames, calculating the initial noise power as the arithmetic mean of the Fourier transform magnitude spectrum, λ_d(k) = (1/L)·Σ_{i=1..L} |Y_i(ω_k)|²; assuming the data after the L-th frame to be a noisy signal, calculating the noisy-signal power |Y_i(ω_k)|²;
D. calculating the posterior signal-to-noise ratio γ_k = |Y_i(ω_k)|²/λ_d(k);
E. calculating the prior signal-to-noise ratio with the decision-directed estimate ξ_k = α·|Â(i-1,k)|²/λ_d(k) + (1-α)·max(γ_k - 1, 0);
F. voice activity detection; step F specifically comprises: performing voice activity detection on the input frame and selecting the noise frames; from the posterior signal-to-noise ratio γ_k and the prior signal-to-noise ratio ξ_k, deriving a decision parameter v; judging the frame as speech if v is greater than a decision threshold η, and as noise if v is less than η, in which case it is used to update the noise spectrum; the decision parameter is calculated as v_k = γ_k·ξ_k/(1+ξ_k);
G. updating the noise spectrum; step G specifically comprises: after the noise frames are selected, updating the noise spectrum by recursive smoothing: λ_d(k) = μ·λ_d(k) + (1-μ)·|Y_i(ω_k)|²;
H. calculating a gain coefficient;
I. signal reconstruction: calculating the amplitude spectrum and the power spectrum of the enhanced voice of the current frame, and performing inverse Fourier transform on the spectrum of the enhanced voice to obtain a reconstructed signal;
J. calculating the autocorrelation function of the reconstructed signal x̂_i(n): r_t(τ) = Σ_{n=1..N-τ} x̂_i(n)·x̂_i(n+τ), where r_t(τ) is the autocorrelation at delay τ, N is the window length, and 1 ≤ n ≤ N;
K. calculating the difference function d_t(τ) = r_t(0) + r_{t+τ}(0) - 2·r_t(τ), and its cumulative-mean normalization: d'(τ) = 1 for τ = 0, and d'(τ) = d(τ)/[(1/τ)·Σ_{j=1..τ} d(j)] otherwise;
L. judging voiced sound as follows: calculating p = 1 - d'(τ), where p characterizes the probability that the frame clearly contains a fundamental-frequency component; since d'(τ) lies in [0,1], p also lies in [0,1]; with p_th as a threshold, speech frames with p > p_th are retained as voiced;
m, unvoiced sound compensation and noise masking.
2. The real-time speech paragraph tracking method under a complex noise scene according to claim 1, wherein in step A the input audio signal is framed and windowed, the window function being a Hamming window: w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1.
3. The real-time speech paragraph tracking method under a complex noise scene according to claim 1, wherein step H specifically comprises calculating the weighting coefficient of the current frame's amplitude spectrum from the posterior and prior signal-to-noise ratios: G(k) = (ξ_k/(1+ξ_k))·exp((1/2)·expint(v_k)).
4. The real-time speech paragraph tracking method under a complex noise scene according to claim 1, wherein the function established in step I is Â_i(ω_k) = G(k)·Y_i(ω_k), the reconstructed signal being obtained by inverse Fourier transform of the enhanced spectrum.
5. The real-time speech paragraph tracking method under a complex noise scene according to claim 1, wherein in step M, if a frame is judged voiced, non-voiced signal frames within the following 400 milliseconds are compensated, i.e. output directly without further processing; non-voiced frames that do not meet the compensation condition are masked, i.e. amplitude-limited before output.
CN202010029721.0A 2020-01-13 2020-01-13 Real-time speech paragraph tracking method under complex noise scene Active CN111261197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010029721.0A CN111261197B (en) 2020-01-13 2020-01-13 Real-time speech paragraph tracking method under complex noise scene


Publications (2)

Publication Number Publication Date
CN111261197A CN111261197A (en) 2020-06-09
CN111261197B true CN111261197B (en) 2022-11-25

Family

ID=70950451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010029721.0A Active CN111261197B (en) 2020-01-13 2020-01-13 Real-time speech paragraph tracking method under complex noise scene

Country Status (1)

Country Link
CN (1) CN111261197B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1130952A (en) * 1993-09-14 1996-09-11 英国电讯公司 Voice activity detector
CN105845150A (en) * 2016-03-21 2016-08-10 福州瑞芯微电子股份有限公司 Voice enhancement method and system adopting cepstrum to correct
CN107452363A (en) * 2017-07-03 2017-12-08 福建天泉教育科技有限公司 Musical instrument tuner method and system
CN108831504A (en) * 2018-06-13 2018-11-16 西安蜂语信息科技有限公司 Determination method, apparatus, computer equipment and the storage medium of pitch period
CN108831499A (en) * 2018-05-25 2018-11-16 西南电子技术研究所(中国电子科技集团公司第十研究所) Utilize the sound enhancement method of voice existing probability
CN110322898A (en) * 2019-05-28 2019-10-11 平安科技(深圳)有限公司 Vagitus detection method, device and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101136199B (en) * 2006-08-30 2011-09-07 纽昂斯通讯公司 Voice data processing method and equipment
FR3014237B1 (en) * 2013-12-02 2016-01-08 Adeunis R F METHOD OF DETECTING THE VOICE


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A New Pitch Period Detection Method Based on the Hilbert-Huang Transform; Yang Zhihua et al.; Chinese Journal of Computers; 2006-01-12 (No. 01); full text *
Noise-Robust Speaker Recognition Based on Weighted Sub-band Reconstruction of the Voiced-Speech Harmonic Spectrum; Zeng Yumin et al.; Journal of Southeast University (Natural Science Edition); 2008-11-20 (No. 06); full text *

Also Published As

Publication number Publication date
CN111261197A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
EP2360685B1 (en) Noise suppression
Nakatani et al. Robust and accurate fundamental frequency estimation based on dominant harmonic components
CN108831499A (en) Utilize the sound enhancement method of voice existing probability
EP1065656B1 (en) Method for reducing noise in an input speech signal
US20070255535A1 (en) Method of Processing a Noisy Sound Signal and Device for Implementing Said Method
Verteletskaya et al. Noise reduction based on modified spectral subtraction method
CN113539285B (en) Audio signal noise reduction method, electronic device and storage medium
Wolfe et al. Towards a perceptually optimal spectral amplitude estimator for audio signal enhancement
CN114694670A (en) Multi-task network-based microphone array speech enhancement system and method
CN103295580A (en) Method and device for suppressing noise of voice signals
CN112185405B (en) Bone conduction voice enhancement method based on differential operation and combined dictionary learning
Ambikairajah et al. Wavelet transform-based speech enhancement
CN111261197B (en) Real-time speech paragraph tracking method under complex noise scene
Cao et al. Research on noise reduction algorithm based on combination of LMS filter and spectral subtraction
Bahadur et al. Performance measurement of a hybrid speech enhancement technique
Hamid et al. Speech enhancement using EMD based adaptive soft-thresholding (EMD-ADT)
Rao et al. Speech enhancement using sub-band cross-correlation compensated Wiener filter combined with harmonic regeneration
Graupe et al. Blind adaptive filtering of speech from noise of unknown spectrum using a virtual feedback configuration
Srinivas et al. A classification-based non-local means adaptive filtering for speech enhancement and its FPGA prototype
Islam et al. Speech enhancement in adverse environments based on non-stationary noise-driven spectral subtraction and snr-dependent phase compensation
CN112750451A (en) Noise reduction method for improving voice listening feeling
Upadhyay et al. Recursive noise estimation-based Wiener filtering for monaural speech enhancement
Zengyuan et al. A speech denoising algorithm based on harmonic regeneration
CN117995215B (en) Voice signal processing method and device, computer equipment and storage medium
Gbadamosi et al. Development of non-parametric noise reduction algorithm for GSM voice signal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant