CN109119096B

CN109119096B - Method and device for correcting the current number of active-tone hold frames in a voice activity detection (VAD) decision

Info

Publication number: CN109119096B
Application number: CN201810622976.0A
Authority: CN
Inventors: 袁浩; 江东平; 朱长宝
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2012-12-25
Filing date: 2012-12-25
Publication date: 2021-01-22
Anticipated expiration: 2032-12-25
Also published as: CN103903634A; CN109119096A; CN112992188A; CN103903634B; CN112992188B

Abstract

Embodiments of the invention provide a method and a device for correcting the current number of active-tone hold frames in a voice activity detection (VAD) decision, and a method and a device for adjusting a signal-to-noise ratio (SNR) threshold. The correction method comprises: obtaining the subband signals and the spectral amplitudes of the current frame; calculating a long-term signal-to-noise ratio lt_snr and an average full-band signal-to-noise ratio SNR2_lt_ave from the subband signals; and correcting the current number of active-tone hold frames according to the VAD decision results of several previous frames, the long-term signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame. The technical solution of the invention improves the accuracy of VAD under non-stationary noise and the accuracy of music detection.

Description

Method and device for correcting the current number of active-tone hold frames in a voice activity detection (VAD) decision

This application is a divisional application of invention patent application No. 201210570563.5, filed on December 25, 2012 and entitled "Active tone detection and method and device for the same".

Technical Field

The invention relates to voice activity detection (VAD) technology, and in particular to a method and a device for correcting the current number of active-tone hold frames in a VAD decision, and to a method and a device for adjusting a signal-to-noise ratio threshold.

Background

In a normal voice call, each user alternates between speaking and listening, so inactive-tone periods occur during the call; the total inactive-tone period of the two parties normally exceeds 50% of their total speech coding duration. During an inactive-tone period there is only background noise, which generally carries no useful information. Audio signal processing exploits this fact: a voice activity detection (VAD) algorithm distinguishes active tones from inactive tones, and the two are processed in different ways. Many modern speech coding standards, such as AMR and AMR-WB, support VAD. However, the VAD of these encoders does not perform well under all typical background noises; under non-stationary noise in particular, its efficiency is low. For music signals, these VADs sometimes make false detections, which significantly degrades the quality of the corresponding processing algorithms.

Disclosure of Invention

In view of this, an embodiment of the present invention provides a method for correcting the current number of active-tone hold frames in a voice activity detection (VAD) decision, comprising:

obtaining the subband signals and the spectral amplitudes of the current frame;

calculating a long-term signal-to-noise ratio lt_snr and an average full-band signal-to-noise ratio SNR2_lt_ave from the subband signals;

correcting the current number of active-tone hold frames according to the VAD decision results of several previous frames, the long-term signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame.

An embodiment of the invention also provides a device for correcting the current number of active-tone hold frames in a VAD decision, comprising:

a long-term signal-to-noise ratio calculation unit, configured to calculate a long-term signal-to-noise ratio lt_snr;

an average full-band signal-to-noise ratio calculation unit, configured to calculate an average full-band signal-to-noise ratio SNR2_lt_ave;

and an active-tone hold frame number correction unit, configured to correct the current number of active-tone hold frames according to the VAD decision results of several previous frames, the long-term signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame.

An embodiment of the invention also provides a method for adjusting the signal-to-noise ratio threshold in a VAD decision, comprising:

calculating the spectral centroid feature parameter of the current frame from the subband signals;

calculating a long-term signal-to-noise ratio lt_snr as the ratio of the average long-term active-tone signal energy to the average long-term background noise energy calculated for the previous frame;

and adjusting the SNR threshold of the VAD decision according to the spectral centroid feature parameter, the long-term signal-to-noise ratio lt_snr, the number of preceding consecutive active-tone frames continuous_speech_num and the number of preceding consecutive noise frames continuous_noise_num.

An embodiment of the invention also provides a device for adjusting the signal-to-noise ratio threshold in a VAD decision, comprising:

a feature parameter acquisition unit, configured to calculate the spectral centroid feature parameter of the current frame from the subband signals;

a long-term signal-to-noise ratio calculation unit, configured to calculate a long-term signal-to-noise ratio lt_snr as the ratio of the average long-term active-tone signal energy to the average long-term background noise energy calculated for the previous frame;

and a signal-to-noise ratio threshold adjustment unit, configured to adjust the SNR threshold of the VAD decision according to the spectral centroid feature parameter, the long-term signal-to-noise ratio lt_snr, the number of preceding consecutive active-tone frames continuous_speech_num and the number of preceding consecutive noise frames continuous_noise_num.

The technical solution provided by the embodiments of the invention overcomes the shortcomings of existing VAD algorithms: it improves the efficiency of VAD under non-stationary noise and at the same time improves the accuracy of music detection. Audio signal processing algorithms that use this VAD can therefore achieve better performance.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, and/or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a schematic diagram of embodiment 1 of the active tone detection method of the present invention;

FIG. 2 is a schematic diagram of embodiment 2 of the active tone detection method of the present invention;

FIG. 3 is a schematic diagram of the process of obtaining the VAD decision result in embodiments 1 and 2 of the present invention;

FIG. 4 is a block diagram of embodiment 1 of the active tone detection (VAD) apparatus of the present invention;

FIG. 5 is a block diagram of embodiment 2 of the active tone detection (VAD) apparatus of the present invention;

FIG. 6 is a schematic block diagram of the VAD decision unit in the VAD apparatus of the present invention;

FIG. 7 is a schematic diagram of an embodiment of the background noise detection method of the present invention;

FIG. 8 is a block diagram of the background noise detection apparatus of the present invention;

FIG. 9 is a schematic diagram of an embodiment of the tonal signal detection method of the present invention;

FIG. 10 is a schematic block diagram of the tonal signal detection apparatus of the present invention;

FIG. 11 is a schematic block diagram of the tonal signal determining unit of the tonal signal detection apparatus of the present invention;

FIG. 12 is a schematic diagram of an embodiment of the method of the present invention for correcting the current number of active-tone hold frames in a VAD decision;

FIG. 13 is a schematic block diagram of the apparatus of the present invention for correcting the current number of active-tone hold frames in a VAD decision;

FIG. 14 is a schematic diagram of an embodiment of the method of the present invention for adjusting the signal-to-noise ratio threshold in a VAD decision;

FIG. 15 is a schematic diagram of the specific process of adjusting the signal-to-noise ratio threshold according to the present invention;

FIG. 16 is a schematic block diagram of the apparatus of the present invention for adjusting the signal-to-noise ratio threshold in a VAD decision.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Embodiment 1 of a Voice Activity Detection (VAD) method according to the present invention is shown in fig. 1, and the method includes:

step 101: obtaining a sub-band signal and a frequency spectrum amplitude of a current frame;

in this embodiment, an audio stream with a frame length of 20ms and a sampling rate of 32kHz is taken as an example for specific description. The method of the present invention is equally applicable under other frame lengths and sampling rates.

Inputting the current frame time domain signal into a filter bank, and performing sub-band filtering calculation to obtain a sub-band signal of the filter bank;

in this embodiment, a 40-channel filter bank is used, but the present invention is also applicable to filter banks using other channel numbers.

The time-domain signal of the current frame is input into the 40-channel filter bank, and subband filtering yields the filter bank subband signals X[k, l] of 40 subbands at 16 time samples, where 0 ≤ k < 40 and 0 ≤ l < 16; k is the filter bank subband index, identifying the subband to which a coefficient belongs, and l is the time sample index within each subband. The implementation steps are as follows:

101 a: the last 640 audio signal samples are stored in a data buffer.

101 b: the data in the data buffer is shifted 40 locations, the oldest 40 samples are shifted out of the data buffer, and the 40 new samples are stored in locations 0 through 39.

The data x in the buffer is multiplied by the window coefficients to obtain the array z:

z[n] = x[n]·W_qmf[n]; 0 ≤ n < 640;

where W_qmf denotes the filter bank window coefficients.

The following pseudocode computes the 80-point array u:

for (n = 0; n < 80; n++) {
    u[n] = 0;
    for (j = 0; j < 8; j++) {
        u[n] += z[n + j * 80];
    }
}

The arrays r and i are calculated using the following equations:

The 40 complex subband samples at the l-th time sample are then calculated as X[k, l] = r(k) + j·i(k), 0 ≤ k < 40, where j is the imaginary unit and r(k) and i(k) are respectively the real part and the imaginary part of the coefficient at the l-th time sample of the filter bank subband signal X; the calculation equation is as follows:

101 c: The calculation of 101b is repeated until all data of the current frame have been filtered by the filter bank; the final output is the filter bank subband signal X[k, l].

101 d: After the above calculation, the filter bank subband signals X[k, l] of 40 subbands at 16 time samples are obtained, where 0 ≤ k < 40 and 0 ≤ l < 16.
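The buffer update of 101b, the windowing and the 80-point folding can be sketched in C as follows. This is only an illustrative sketch: the function names are hypothetical, the window coefficients W_qmf are passed in by the caller because their values are not given here, and storing the 40 new samples in their incoming order is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_LEN   640  /* samples held in the data buffer             */
#define HOP        40  /* new samples per filter bank time slot       */
#define FOLD_LEN   80  /* length of the folded array u                */
#define FOLD_TAPS   8  /* windowed samples accumulated per entry of u */

/* 101b: shift the buffer by 40 positions (dropping the oldest 40 samples)
 * and store the 40 new samples at positions 0..39.
 * Assumption: new samples are stored in their incoming order. */
void qmf_push_samples(double x[BUF_LEN], const double new_samples[HOP])
{
    assert(x != NULL && new_samples != NULL);
    memmove(x + HOP, x, (BUF_LEN - HOP) * sizeof x[0]);
    memcpy(x, new_samples, HOP * sizeof x[0]);
}

/* z[n] = x[n] * W_qmf[n], then u[n] = sum over j of z[n + 80*j]. */
void qmf_window_and_fold(const double x[BUF_LEN],
                         const double w_qmf[BUF_LEN],
                         double u[FOLD_LEN])
{
    assert(x != NULL && w_qmf != NULL && u != NULL);
    for (size_t n = 0; n < FOLD_LEN; n++) {
        double acc = 0.0;
        for (size_t j = 0; j < FOLD_TAPS; j++) {
            size_t idx = n + j * FOLD_LEN;
            acc += x[idx] * w_qmf[idx];
        }
        u[n] = acc;
    }
}
```

Calling qmf_push_samples followed by qmf_window_and_fold 16 times per 20 ms frame consumes the 640 new samples of the frame, one time slot l per call.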

Time-frequency transformation is performed on the filter bank subband signals, and the spectral amplitude is calculated.

In embodiments of the invention, the time-frequency transformation and the spectral amplitude calculation may be applied to all filter bank subbands or to only some of them. The time-frequency transformation may be a DFT, FFT, DCT or DST. This embodiment uses the DFT as an example. The calculation process is as follows:

A 16-point DFT is performed on the 16 time samples of each filter bank subband with index 0 to 9, which further improves the spectral resolution; the amplitude at each frequency point is then calculated to obtain the spectral amplitude X_{DFT_AMP}.

The time-frequency transformation calculation equation is as follows:

the amplitude calculation process of each frequency point is as follows:

First, the energy at each point of the array X_DFT[k][j] is calculated as follows:

X_{DFT_POW}[k, j] = (real(X_DFT[k, j]))^2 + (image(X_DFT[k, j]))^2; 0 ≤ k < 10; 0 ≤ j < 16;

where real(X_DFT[k, j]) and image(X_DFT[k, j]) denote respectively the real part and the imaginary part of the spectral coefficient X_DFT[k, j].

If k is an even number, calculating the spectrum amplitude on each frequency point by adopting the following equation:

if k is an odd number, calculating the spectrum amplitude on each frequency point by adopting the following equation:

X_{DFT_AMP} is the spectral amplitude after time-frequency transformation.
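The per-point energy step of this amplitude calculation can be sketched in C as below. The cplx type and the function name are hypothetical; the sketch covers only the energy X_DFT_POW, not the amplitude mapping for even and odd k.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex sample type holding one DFT coefficient. */
typedef struct {
    double re;
    double im;
} cplx;

/* Energy of each DFT coefficient: pow = re^2 + im^2.
 * For this embodiment, count would be 10 subbands * 16 points = 160,
 * with the [k][j] arrays stored row by row. */
void dft_power(const cplx *x_dft, double *x_dft_pow, size_t count)
{
    assert(x_dft != NULL && x_dft_pow != NULL);
    for (size_t i = 0; i < count; i++) {
        x_dft_pow[i] = x_dft[i].re * x_dft[i].re
                     + x_dft[i].im * x_dft[i].im;
    }
}
```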

Step 102: calculating values of a frame energy parameter and a spectrum gravity center characteristic parameter of the current frame according to the sub-band signals;

the values of the frame energy parameter, the spectral center of gravity characteristic parameter and the tonal characteristic parameter can be obtained by the prior art method, and preferably, the parameters are obtained by the following method:

the frame energy parameter is a weighted superposition value or a direct superposition value of each sub-band signal energy; specifically, the method comprises the following steps:

a) The energy of each filter bank subband is calculated from the filter bank subband signal X[k, l]; the calculation equation is as follows:

b) The energies of some or all of the auditorily sensitive filter bank subbands are accumulated to obtain the frame energy parameter.

According to the psychoacoustic model, the human ear is relatively insensitive to very low frequencies (for example, below 100 Hz) and to high frequencies (for example, above 20 kHz). With the filter bank subbands ordered from low to high frequency, the invention treats the second subband through the second-to-last subband as the main auditorily sensitive filter bank subbands. The energies of some or all of these subbands are accumulated to obtain frame energy parameter 1; the calculation equation is as follows:

where e_sb_start is the starting subband index, with a value range of [0, 6], and e_sb_end is the ending subband index, whose value is greater than 6 and less than the total number of subbands.

Frame energy parameter 2 is obtained by adding, to frame energy parameter 1, the weighted energies of the filter bank subbands not used in calculating frame energy parameter 1; the calculation equation is as follows:

where e_scale1 and e_scale2 are weighting scale factors, each with a value range of [0, 1], and num_band is the total number of subbands.
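A minimal C sketch of the two frame energy parameters follows. The exact equations are not reproduced in the text above, so the sketch rests on stated assumptions: e_sb_end is treated as inclusive, e_scale1 is applied to the unused subbands below e_sb_start, and e_scale2 to the unused subbands above e_sb_end. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Frame energy parameter 1: direct sum of subband energies over
 * [e_sb_start, e_sb_end] (inclusive end is an assumption). */
double frame_energy1(const double *e_sb, size_t num_band,
                     size_t e_sb_start, size_t e_sb_end)
{
    assert(e_sb != NULL);
    assert(e_sb_start <= e_sb_end && e_sb_end < num_band);
    double sum = 0.0;
    for (size_t k = e_sb_start; k <= e_sb_end; k++) {
        sum += e_sb[k];
    }
    return sum;
}

/* Frame energy parameter 2: parameter 1 plus the weighted energy of the
 * subbands it did not use. Assumption: e_scale1 weights the low unused
 * subbands and e_scale2 weights the high unused subbands. */
double frame_energy2(const double *e_sb, size_t num_band,
                     size_t e_sb_start, size_t e_sb_end,
                     double e_scale1, double e_scale2)
{
    double low = 0.0;
    double high = 0.0;
    double e_t1 = frame_energy1(e_sb, num_band, e_sb_start, e_sb_end);
    for (size_t k = 0; k < e_sb_start; k++) {
        low += e_sb[k];
    }
    for (size_t k = e_sb_end + 1; k < num_band; k++) {
        high += e_sb[k];
    }
    return e_t1 + e_scale1 * low + e_scale2 * high;
}
```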

The spectral centroid feature parameter is obtained as the ratio of the weighted sum of the filter bank subband energies to their direct sum, or by smoothing another spectral centroid feature parameter value.

The spectral centroid feature parameters can be obtained by the following substeps:

a: The subband intervals used to calculate the spectral centroid feature parameters are divided as follows:

b: Using the interval division of substep a and the following formula, two spectral centroid feature parameter values are calculated: the first-interval spectral centroid feature parameter and the second-interval spectral centroid feature parameter.

Delta1 and Delta2 are small offset values, each with a value range of (0, 1), and k is the spectral centroid index.

c: The first-interval spectral centroid feature parameter sp_center[0] is smoothed to obtain the smoothed spectral centroid feature parameter value sp_center[2]; the calculation is as follows:

sp_center[2] = sp_center_{-1}[2]·spc_sm_scale + sp_center[0]·(1 - spc_sm_scale)

where spc_sm_scale is the smoothing scale factor of the spectral centroid parameter, and sp_center_{-1}[2] is the smoothed spectral centroid feature parameter value of the previous frame, with an initial value of 1.6.
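The centroid ratio and the smoothing of substep c can be sketched in C as below. The centroid formula itself is not reproduced in the text, so the form used here (subband weights 1, 2, 3, ... within the interval, Delta1 added to the numerator and Delta2 to the denominator) is an assumption, as are the function names; only the smoothing recursion is taken directly from the equation above.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the weighted sum to the direct sum of subband energies over
 * the interval [start, end]. Assumed form: weight (k - start + 1),
 * delta1 in the numerator, delta2 in the denominator. */
double spectral_centroid(const double *e_sb, size_t start, size_t end,
                         double delta1, double delta2)
{
    assert(e_sb != NULL && start <= end);
    assert(delta2 > 0.0);  /* keeps the denominator positive */
    double weighted = 0.0;
    double direct = 0.0;
    for (size_t k = start; k <= end; k++) {
        weighted += (double)(k - start + 1) * e_sb[k];
        direct += e_sb[k];
    }
    return (weighted + delta1) / (direct + delta2);
}

/* sp_center[2] = sp_center_prev[2]*spc_sm_scale
 *              + sp_center[0]*(1 - spc_sm_scale) */
double smooth_centroid(double prev_smoothed, double sp_center0,
                       double spc_sm_scale)
{
    assert(spc_sm_scale >= 0.0 && spc_sm_scale <= 1.0);
    return prev_smoothed * spc_sm_scale
         + sp_center0 * (1.0 - spc_sm_scale);
}
```

For the first frame, prev_smoothed would be the initial value 1.6 given in the text; the value of spc_sm_scale is not specified here and is left to the caller.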

Step 103: calculating to obtain a signal-to-noise ratio parameter of the current frame according to the background noise energy estimated from the previous frame, the frame energy parameter of the current frame and the signal-to-noise ratio sub-band energy;

the background noise energy of the previous frame can be obtained by existing methods.

If the current frame is the starting frame, the background noise energy of each SNR subband takes a default initial value. The SNR subband background noise energy of the previous frame is estimated on the same principle as that of the current frame; for the latter, see step 207 of embodiment 2 below. The SNR parameters of the current frame can be calculated using an existing SNR calculation method. Preferably, the following method is employed:

First, the filter bank subbands are regrouped into several SNR subbands, with the division indices given in the table below.

Second, the energy of each SNR subband of the current frame is calculated according to this division. The calculation equation is as follows:

Third, the subband average signal-to-noise ratio SNR1 is calculated from the energy of each SNR subband of the current frame and the background noise energy of each SNR subband of the previous frame. The calculation equation is as follows:

where E_{sb2_bg} is the estimated background noise energy of each SNR subband of the previous frame and num_band is the number of SNR subbands. The background noise energy of the SNR subbands of the previous frame is obtained on the same principle as that of the current frame; for the latter, see step 207 of embodiment 2 below.

Finally, the full-band signal-to-noise ratio SNR2 is calculated from the estimated full-band background noise energy of the previous frame and the frame energy parameter of the current frame:

where E_{t_bg} is the estimated full-band background noise energy of the previous frame, which is obtained on the same principle as the full-band background noise energy of the current frame; for the latter, see step 207 of embodiment 2 below.

The SNR parameters in this embodiment include the subband average signal-to-noise ratio SNR1 and the full-band signal-to-noise ratio SNR2. The full-band background noise energy and the background noise energy of each subband are collectively referred to as the background noise energy.

Step 104: A VAD decision result is calculated according to the tonality flag, the SNR parameters, the spectral centroid feature parameter and the frame energy parameter.

Embodiment 2

In embodiment 2 of the voice activity detection (VAD) method of the present invention, the input audio signal is processed frame by frame: polyphase filtering yields the filter bank subband signals, time-frequency transformation of the filter bank subband signals yields the spectral amplitudes, and signal feature extraction on the filter bank subband signals and the spectral amplitudes yields the feature parameter values. The background noise flag and the tonality flag of the current frame are calculated from the feature parameter values. The SNR parameters of the current frame are calculated from the frame energy parameter of the current frame and the background noise energy, and whether the current frame is an active-tone frame is decided according to these SNR parameters, the VAD decision result of the previous frame and the feature parameters. The background noise flag is then corrected according to the active-tone decision result to obtain a new background noise flag, which determines whether the background noise is updated. The specific VAD process is as follows:

as shown in fig. 2, this method embodiment 2 includes:

step 201: obtaining a sub-band signal and a frequency spectrum amplitude of a current frame;

step 202: calculating values of a current frame energy parameter, a spectrum gravity center characteristic parameter and a time domain stability characteristic parameter according to the sub-band signals; calculating values of a spectral flatness characteristic parameter and a tonal characteristic parameter according to the spectral amplitude;

the frame energy parameter is a weighted superposition value or a direct superposition value of each sub-band signal energy;

the spectral barycentric characteristic parameter is the ratio of a weighted accumulated value and an unweighted accumulated value of all or part of subband signal energy;

Specifically, the spectral centroid feature parameter is calculated from the energy of each filter bank subband, either as the ratio of the weighted sum of the filter bank subband energies to their direct sum, or by smoothing another spectral centroid feature parameter value:

sp_center[2] = sp_center_{-1}[2]·spc_sm_scale + sp_center[0]·(1 - spc_sm_scale)

The time-domain stability feature parameter is the ratio of the variance of the amplitude superposition values to the expectation of the squares of the amplitude superposition values, or this ratio multiplied by a coefficient.

Specifically, the time-domain stability feature parameter is calculated from the frame energy parameters of the most recent frames. In this embodiment, the frame energy parameters of the most recent 40 frames are used. The calculation steps are as follows:

First, the energy amplitudes of the most recent 40 frames are calculated; the calculation equation is as follows:

where e_offset is an offset value with a value range of [0, 0.1].

Second, starting from the current frame and going back to the 40th most recent frame, the energy amplitudes of each pair of adjacent frames are added to obtain 20 amplitude superposition values. The calculation equation is as follows:

Amp_t2(n) = Amp_t1(-2n) + Amp_t1(-2n-1); 0 ≤ n < 20;

where Amp_t1(0) is the energy amplitude of the current frame, and Amp_t1(n) with n < 0 is the energy amplitude of the frame |n| frames before the current frame.

Finally, the ratio of the variance of the most recent 20 amplitude superposition values to their average energy is calculated to obtain the time-domain stability feature parameter lt_stable_rate0. The calculation equation is as follows:
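The pairing and ratio steps can be sketched in C as follows. The energy amplitudes Amp_t1 are taken as inputs because their equation is not reproduced above. The ratio is computed as the sum of squared deviations over the sum of squares, which equals variance over mean square for the 20 values; returning 0 for an all-zero history is an added guard. The function name and both of these details are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FRAMES 40  /* energy amplitudes kept in the history */
#define NUM_PAIRS  20  /* amplitude superposition values        */

/* amp_t1[i] is the energy amplitude i frames before the current frame,
 * i.e. Amp_t1(-i), so amp_t1[0] belongs to the current frame. */
double time_domain_stability(const double amp_t1[NUM_FRAMES])
{
    assert(amp_t1 != NULL);
    double amp_t2[NUM_PAIRS];
    double mean = 0.0;

    /* Amp_t2(n) = Amp_t1(-2n) + Amp_t1(-2n-1) */
    for (size_t n = 0; n < NUM_PAIRS; n++) {
        amp_t2[n] = amp_t1[2 * n] + amp_t1[2 * n + 1];
        mean += amp_t2[n];
    }
    mean /= NUM_PAIRS;

    double dev_sq = 0.0;  /* sum of squared deviations from the mean */
    double sq = 0.0;      /* sum of squares                          */
    for (size_t n = 0; n < NUM_PAIRS; n++) {
        double d = amp_t2[n] - mean;
        dev_sq += d * d;
        sq += amp_t2[n] * amp_t2[n];
    }
    if (sq <= 0.0) {
        return 0.0;  /* guard for an all-zero history (assumption) */
    }
    return dev_sq / sq;
}
```

A perfectly steady signal gives 0, and larger frame-to-frame fluctuations give larger values, which is why a small value of this parameter is one of the conditions for marking a frame as background noise.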

The spectral flatness feature parameter is the ratio of the geometric mean to the arithmetic mean of certain spectral amplitudes, or this ratio multiplied by a coefficient.

Specifically, the spectral amplitude X_{DFT_AMP} is divided into several frequency bands, and the spectral flatness of each band of the current frame is calculated to obtain the spectral flatness feature parameters of the current frame.

In this embodiment, the spectral amplitude is divided into 3 frequency bands and the spectral flatness of each is calculated, as follows:

First, X_{DFT_AMP} is divided into 3 frequency bands according to the indices in the table below.

Second, the spectral flatness of each band is calculated to obtain the spectral flatness feature parameters of the current frame. The calculation equation for each spectral flatness feature parameter value of the current frame is as follows:

Finally, the spectral flatness feature parameters of the current frame are smoothed to obtain the final spectral flatness feature parameters of the current frame:

sSMR(k) = smr_scale·sSMR_{-1}(k) + (1 - smr_scale)·SMR(k); 0 ≤ k < 3

where smr_scale is a smoothing factor with a value range of [0.6, 1], and sSMR_{-1}(k) is the value of the k-th spectral flatness feature parameter of the previous frame.

The tonality feature parameter is obtained by calculating the correlation between the intra-frame spectral difference coefficients of two consecutive frames, or by further smoothing this correlation value.

Specifically, the correlation between the intra-frame spectral difference coefficients of two consecutive frames is calculated as follows:

The tonality feature parameter is calculated from the spectral amplitudes, using either all of them or only some of them.

The calculation steps are as follows:

a: For some (not fewer than 8) or all of the spectral amplitudes, the difference between each spectral amplitude and its adjacent spectral amplitude is calculated, and any difference less than 0 is set to 0, giving a set of non-negative spectral difference coefficients.

In this embodiment, the frequency point coefficients with position indices 3 to 61 are selected as an example for calculating the tonality feature parameter. The specific process is as follows:

The difference between adjacent spectral amplitudes is calculated from frequency point 3 to frequency point 61, using the following equation:

spec_dif[n-3] = X_{DFT_AMP}(n+1) - X_{DFT_AMP}(n); 3 ≤ n < 62;

Elements of spec_dif that are less than 0 are set to zero.

b: The correlation coefficient between the non-negative spectral difference coefficients of the current frame, obtained in step a, and those of the previous frame is calculated to obtain the first tonality feature parameter value. The calculation equation is as follows:

where pre_spec_dif denotes the non-negative spectral difference coefficients of the previous frame.

c: The first tonality feature parameter value is smoothed to obtain the second tonality feature parameter value. The calculation equation is as follows:

tonality_rate2 = tonal_scale·tonality_rate2_{-1} + (1 - tonal_scale)·tonality_rate1

where tonal_scale is the smoothing factor of the tonality feature parameter, with a value range of [0.1, 1], and tonality_rate2_{-1} is the second tonality feature parameter value of the previous frame, whose initial value lies in the range [0, 1].
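Step a of the tonality calculation, for the frequency points 3 to 61 used in this embodiment, can be sketched in C as follows. The function name and the amp_len parameter are hypothetical; the sketch covers only the non-negative spectral differences and leaves the correlation of step b and the smoothing of step c to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define SPEC_DIF_FIRST 3   /* first frequency point used           */
#define SPEC_DIF_END   62  /* one past the last frequency point    */
#define SPEC_DIF_LEN   (SPEC_DIF_END - SPEC_DIF_FIRST)  /* 59 values */

/* spec_dif[n-3] = X_DFT_AMP(n+1) - X_DFT_AMP(n) for 3 <= n < 62,
 * with negative differences set to zero. The amplitude array must hold
 * at least 63 values because X_DFT_AMP(62) is read for n = 61. */
void nonneg_spec_dif(const double *x_dft_amp, size_t amp_len,
                     double spec_dif[SPEC_DIF_LEN])
{
    assert(x_dft_amp != NULL && spec_dif != NULL);
    assert(amp_len > SPEC_DIF_END);
    for (size_t n = SPEC_DIF_FIRST; n < SPEC_DIF_END; n++) {
        double d = x_dft_amp[n + 1] - x_dft_amp[n];
        spec_dif[n - SPEC_DIF_FIRST] = (d > 0.0) ? d : 0.0;
    }
}
```

Keeping only the rising edges of the spectrum means that stable spectral peaks, as found in tonal and music signals, produce similar coefficient patterns in consecutive frames and therefore a high correlation in step b.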

Step 203: calculating to obtain a signal-to-noise ratio parameter of the current frame according to the background noise energy estimated from the previous frame, the frame energy parameter of the current frame and the signal-to-noise ratio sub-band energy;

Step 204: The initial background noise flag and the tonality flag of the current frame are calculated according to the frame energy parameter, the spectral centroid feature parameter, the time-domain stability feature parameter, the spectral flatness feature parameter and the tonality feature parameter of the current frame;

Step 205: A VAD decision result is calculated according to the tonality flag, the SNR parameters, the spectral centroid feature parameter and the frame energy parameter;

The specific implementation of step 205 is described below in conjunction with FIG. 3.

It will be understood that the order of the steps preceding the VAD decision of step 205 may be changed, provided the parameters involved have no causal dependency on one another; for example, step 204, which obtains the initial background noise flag and the tonality flag, may be performed before the SNR calculation of step 203.

The initial background noise flag of the current frame must be corrected before it is used in calculating the SNR parameters of the next frame, so the operation of obtaining the initial background noise flag of the current frame may also be performed after the VAD decision.

Step 206: The initial background noise flag is corrected according to the VAD decision result of the current frame, the tonality feature parameter, the SNR parameters, the tonality flag and the time-domain stability feature parameter;

The background noise flag is set to 1 if SNR2 is less than the set threshold SNR2_redec_thr1, SNR1 is less than SNR1_redec_thr1, the VAD flag vad_flag is equal to 0, the tonality feature parameter tonality_rate2 is less than tonality_rate2_thr1, the tonality flag tonality_flag is equal to 0, and the time-domain stability feature parameter lt_stable_rate0 is less than lt_stable_rate0_redec_thr1 (set to 0.1).
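The correction rule of step 206 is a conjunction of six conditions and can be sketched in C as below. The struct and function names are hypothetical, and apart from the 0.1 stability threshold the threshold values are not given in the text, so they are left to the caller. Leaving the flag unchanged when the conditions are not all met is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Thresholds of the correction rule; only the last value (0.1) is
 * stated in the text, the others must be supplied by the caller. */
typedef struct {
    double snr2_redec_thr1;
    double snr1_redec_thr1;
    double tonality_rate2_thr1;
    double lt_stable_rate0_redec_thr1;
} bg_redec_thresholds;

/* Per-frame quantities consulted by the rule. */
typedef struct {
    double snr1;
    double snr2;
    double tonality_rate2;
    double lt_stable_rate0;
    int vad_flag;
    int tonality_flag;
} frame_features;

/* Returns the corrected background noise flag: 1 when every condition
 * holds, otherwise the initial flag unchanged (assumption). */
int correct_background_flag(int initial_flag, const frame_features *f,
                            const bg_redec_thresholds *t)
{
    assert(f != NULL && t != NULL);
    if (f->snr2 < t->snr2_redec_thr1 &&
        f->snr1 < t->snr1_redec_thr1 &&
        f->vad_flag == 0 &&
        f->tonality_rate2 < t->tonality_rate2_thr1 &&
        f->tonality_flag == 0 &&
        f->lt_stable_rate0 < t->lt_stable_rate0_redec_thr1) {
        return 1;
    }
    return initial_flag;
}
```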

Step 207: obtaining the background noise energy of the current frame according to the corrected value of the background noise mark, the frame energy parameter of the current frame and the full-band background noise energy of the previous frame; and the background noise energy of the current frame is used for calculating the signal-to-noise ratio parameter of the next frame.

Whether to update the background noise is decided according to the background noise flag: if the flag is 1, the background noise is updated according to the ratio of the estimated full-band background noise energy to the energy of the current frame signal. The background noise energy estimate comprises a sub-band background noise energy estimate and a full-band background noise energy estimate.

a, the subband background noise energy estimation equation is as follows:

E_{sb2_bg}(k)＝E_{sb2_bg_pre}(k)·α_{bg_e}+E_{sb2_bg}(k)·(1-α_{bg_e})；0≤k＜num_sb

where num_sb is the number of frequency domain sub-bands, and E_{sb2_bg_pre}(k) represents the sub-band background noise energy of the k-th signal-to-noise ratio sub-band of the previous frame.

α_{bg_e} is the background noise update factor, whose value is determined by the full-band background noise energy of the previous frame and the frame energy parameter of the current frame. The calculation process is as follows:

if the full-band background noise energy E_{t_bg} of the previous frame is smaller than the frame energy parameter E_{t1} of the current frame, α_{bg_e} takes the value 0.96; otherwise, it takes the value 0.95.

b, estimating the energy of the full-band background noise:

if the background noise flag of the current frame is 1, the background noise energy accumulated value E_{t_sum} and the number of background noise energy accumulation frames N_{Et_counter} are updated as follows:

E_{t_sum}＝E_{t_sum_-1}+E_{t1}；

N_{Et_counter}＝N_{Et_counter_-1}+1；

where E_{t_sum_-1} is the background noise energy accumulated value of the previous frame, and N_{Et_counter_-1} is the number of background noise energy accumulation frames calculated for the previous frame.

c, the full-band background noise energy is the ratio of the background noise energy accumulated value E_{t_sum} to the number of accumulation frames N_{Et_counter}:

E_{t_bg}＝E_{t_sum}/N_{Et_counter}；

It is then determined whether N_{Et_counter} is equal to 64; if so, both the background noise energy accumulated value E_{t_sum} and the number of accumulation frames N_{Et_counter} are multiplied by 0.75.

d, the sub-band background noise energy and the background noise energy accumulated value are adjusted according to the tonality flag, the frame energy parameter and the full-band background noise energy. The calculation process is as follows:

if the tonality flag tonality_flag is equal to 1 and the frame energy parameter E_{t1} is less than the background noise energy characteristic parameter E_{t_bg} multiplied by a gain factor gain,

then E_{t_sum}＝E_{t_sum}·gain+delta；E_{sb2_bg}(k)＝E_{sb2_bg}(k)·gain+delta；

where the value range of gain is [0.3, 1].
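Items a to d of step 207 can be sketched in Python as below. This is a sketch under stated assumptions, not the definitive procedure: the function names are hypothetical; the update factor is read as 0.96 when the previous full-band noise energy is below the current frame energy and 0.95 otherwise; the argument `current` stands for the per-sub-band term that the smoothing equation blends with the previous estimate (assumed here to be the current frame's value for that sub-band); and delta is passed in by the caller because the text does not define it.

```python
def update_subband_noise(prev_bg, current, e_t_bg_prev, e_t1):
    """Item a: blend the previous sub-band noise energies with `current`."""
    # Update factor chosen from the previous full-band noise energy and
    # the current frame energy (reading assumed in the lead-in).
    alpha = 0.96 if e_t_bg_prev < e_t1 else 0.95
    return [p * alpha + c * (1.0 - alpha) for p, c in zip(prev_bg, current)]


def update_fullband_noise(e_t_sum, n_counter, e_t1, background_flag):
    """Items b and c: accumulate, take the ratio, decay at 64 frames."""
    if background_flag == 1:
        e_t_sum += e_t1
        n_counter += 1
    # Assumption: report 0.0 while nothing has been accumulated yet.
    e_t_bg = e_t_sum / n_counter if n_counter > 0 else 0.0
    if n_counter == 64:
        # Scaling both terms keeps the ratio but shortens the memory.
        e_t_sum *= 0.75
        n_counter *= 0.75
    return e_t_sum, n_counter, e_t_bg


def adjust_for_tonal_frame(e_t_sum, sb_bg, tonality_flag, e_t1, e_t_bg,
                           gain, delta):
    """Item d: rescale the noise estimates on a weak tonal frame."""
    if tonality_flag == 1 and e_t1 < e_t_bg * gain:
        e_t_sum = e_t_sum * gain + delta
        sb_bg = [e * gain + delta for e in sb_bg]
    return e_t_sum, sb_bg
```

Multiplying the accumulated value and the frame count by the same factor leaves their ratio unchanged, so the decay at 64 frames only limits how far back the estimate looks.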

In embodiments 1 and 2, the process of calculating the VAD decision result according to the tonality flag, the signal-to-noise ratio parameters, the spectral gravity center characteristic parameter and the frame energy parameter includes the following steps, as shown in fig. 3:

step 301: calculating the ratio of the average long-term active tone signal energy to the average long-term background noise energy obtained for the previous frame, to obtain the long-term signal-to-noise ratio lt_snr;

for the calculation and definition of the average long-term active tone signal energy E_{fg} and the average long-term background noise energy E_{bg}, see step 307. The long-term signal-to-noise ratio lt_snr is the ratio E_{fg}/E_{bg} expressed logarithmically.

Step 302: calculating the average of the full-band signal-to-noise ratio SNR2 over a number of most recent frames to obtain the average full-band signal-to-noise ratio SNR2_lt_ave;

the calculation equation is as follows:

SNR2(n) represents the value of the full-band signal-to-noise ratio SNR2 of the n-th frame before the current frame, and F_num is the total number of frames over which the average is calculated, with a value range of [8, 64].
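Steps 301 and 302 can be sketched in Python as follows. The sketch rests on assumptions that the text does not settle: the logarithm is taken to base 10, the names long_term_snr and Snr2Average are hypothetical, and the caller decides whether the current frame's SNR2 is pushed before or after the average is read.

```python
import math
from collections import deque


def long_term_snr(e_fg, e_bg):
    """Step 301: logarithm of the long-term signal-to-noise energy ratio."""
    # Assumption: base-10 logarithm; both energies must be positive.
    return math.log10(e_fg / e_bg)


class Snr2Average:
    """Step 302: running mean of SNR2 over the last f_num frames."""

    def __init__(self, f_num=32):
        if not 8 <= f_num <= 64:
            raise ValueError("f_num must lie in [8, 64]")
        # A bounded deque drops the oldest value automatically.
        self._history = deque(maxlen=f_num)

    def push(self, snr2):
        self._history.append(snr2)

    def value(self):
        # Assumption: 0.0 before any frame has been pushed.
        if not self._history:
            return 0.0
        return sum(self._history) / len(self._history)
```

Until F_num frames have been pushed, the sketch averages over the frames available so far, which is one possible start-up behaviour among several.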

Step 303: obtaining the signal-to-noise ratio threshold snr_thr of the VAD decision according to the spectral gravity center characteristic parameter, the long-term signal-to-noise ratio lt_snr, the number of previous continuous active tone frames continuous_speech_num and the number of previous continuous noise frames continuous_noise_num;

the specific steps are as follows:

first, an initial value of the signal-to-noise ratio threshold snr_thr is set, in the range [0.1, 2], preferably 1.06.

Secondly, the value of snr_thr is adjusted for the first time according to the spectral gravity center characteristic parameter, as follows: if the value of the spectral gravity center characteristic parameter sp_center[2] is greater than a set threshold spc_vad_dec_thr1, an offset value, preferably 0.05, is added to snr_thr; otherwise, if sp_center[1] is greater than spc_vad_dec_thr2, an offset value, preferably 0.10, is added to snr_thr; otherwise, an offset value, preferably 0.40, is added to snr_thr. The thresholds spc_vad_dec_thr1 and spc_vad_dec_thr2 have a value range of [1.2, 2.5].

Thirdly, the value of snr_thr is adjusted for the second time according to the number of previous continuous active tone frames continuous_speech_num, the number of previous continuous noise frames continuous_noise_num, the average full-band signal-to-noise ratio SNR2_lt_ave and the long-term signal-to-noise ratio lt_snr. If continuous_speech_num is greater than a set threshold cpn_vad_dec_thr1, 0.2 is subtracted from snr_thr; otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr2 and SNR2_lt_ave is greater than an offset value plus the long-term signal-to-noise ratio lt_snr multiplied by the coefficient lt_tsnr_scale, an offset value, preferably 0.1, is added to snr_thr; otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr3, an offset value, preferably 0.2, is added to snr_thr; otherwise, if continuous_noise_num is greater than a set threshold cpn_vad_dec_thr4, an offset value, preferably 0.1, is added to snr_thr. The thresholds cpn_vad_dec_thr1, cpn_vad_dec_thr2, cpn_vad_dec_thr3 and cpn_vad_dec_thr4 have a value range of [2, 500], and the coefficient lt_tsnr_scale has a value range of [0, 2]. This step is optional: the invention can also be implemented by skipping it and proceeding directly to the last step.

Finally, the signal-to-noise ratio threshold snr_thr is adjusted according to the value of the long-term signal-to-noise ratio lt_snr to obtain the signal-to-noise ratio threshold snr_thr of the current frame.

The correction equation is as follows:

snr_thr＝snr_thr+(lt_tsnr-thr_offset)·thr_scale；

where thr_offset is an offset value with a value range of [0.5, 3], and thr_scale is a gain coefficient with a value range of [0.1, 1].
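The four stages of step 303 can be sketched in Python as below. This is an illustrative sketch only: the function name, the parameter dictionary and all numeric values in ILLUSTRATIVE_PARAMS are hypothetical picks inside the stated ranges, the key snr2_offset names the unnamed offset of the second adjustment, and lt_tsnr in the correction equation is assumed to be the same quantity as lt_snr.

```python
# Hypothetical values chosen inside the ranges given in the text.
ILLUSTRATIVE_PARAMS = {
    "snr_thr_init": 1.06,
    "spc_vad_dec_thr1": 2.0, "spc_vad_dec_thr2": 1.5,
    "cpn_vad_dec_thr1": 20, "cpn_vad_dec_thr2": 200,
    "cpn_vad_dec_thr3": 100, "cpn_vad_dec_thr4": 50,
    "snr2_offset": 1.0, "lt_tsnr_scale": 0.5,
    "thr_offset": 1.5, "thr_scale": 0.2,
}


def snr_threshold(sp_center, lt_snr, snr2_lt_ave,
                  continuous_speech_num, continuous_noise_num, p):
    """Return snr_thr for the current frame."""
    snr_thr = p["snr_thr_init"]

    # First adjustment: spectral gravity center.
    if sp_center[2] > p["spc_vad_dec_thr1"]:
        snr_thr += 0.05
    elif sp_center[1] > p["spc_vad_dec_thr2"]:
        snr_thr += 0.10
    else:
        snr_thr += 0.40

    # Second adjustment (optional in the text): run lengths of
    # active tone frames and noise frames.
    if continuous_speech_num > p["cpn_vad_dec_thr1"]:
        snr_thr -= 0.2
    elif (continuous_noise_num > p["cpn_vad_dec_thr2"]
          and snr2_lt_ave > p["snr2_offset"] + lt_snr * p["lt_tsnr_scale"]):
        snr_thr += 0.1
    elif continuous_noise_num > p["cpn_vad_dec_thr3"]:
        snr_thr += 0.2
    elif continuous_noise_num > p["cpn_vad_dec_thr4"]:
        snr_thr += 0.1

    # Final adjustment: long-term signal-to-noise ratio.
    snr_thr += (lt_snr - p["thr_offset"]) * p["thr_scale"]
    return snr_thr
```

For example, with the illustrative parameters a frame with sp_center[2] = 3.0, no run of speech or noise, and lt_snr equal to thr_offset receives only the 0.05 offset on top of the initial value.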

Step 304: calculating an initial VAD decision according to the decision threshold snr_thr of the VAD and the signal-to-noise ratio parameters SNR1 and SNR2 calculated for the current frame;

the calculation process is as follows:

the VAD flag vad_flag indicates whether the current frame is an active tone frame; in this embodiment, the value 1 indicates an active tone frame and the value 0 indicates an inactive tone frame. If SNR1 is greater than the decision threshold snr_thr, the current frame is judged to be an active tone frame and vad_flag is set to 1; otherwise, the current frame is judged to be an inactive tone frame and vad_flag is set to 0.

If SNR2 is greater than a set threshold snr2_thr, the current frame is judged to be an active tone frame and vad_flag is set to 1. The value range of snr2_thr is [1.2, 5.0].
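The initial decision of step 304 reduces to two comparisons, sketched here in Python. The function name is hypothetical, and the thresholds are supplied by the caller.

```python
def initial_vad(snr1, snr2, snr_thr, snr2_thr):
    """Return the initial vad_flag (1 = active tone frame, 0 = inactive)."""
    # First test: SNR1 against the adaptive threshold snr_thr.
    vad_flag = 1 if snr1 > snr_thr else 0
    # Second test: a high full-band SNR2 forces an active decision.
    if snr2 > snr2_thr:
        vad_flag = 1
    return vad_flag
```

The second test can only raise the flag, so a frame rejected by the SNR1 test is still accepted when its full-band SNR2 is high enough.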

Step 305: correcting the VAD decision result according to the tonality flag, the average full-band signal-to-noise ratio SNR2_lt_ave, the spectral gravity center and the long-term signal-to-noise ratio lt_snr;

the specific steps are as follows:

if the tonality flag indicates that the current frame is a tonal signal, that is, tonality_flag is 1, the current frame is judged to be an active tone signal and vad_flag is set to 1.

If the average full-band signal-to-noise ratio SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr1 plus the long-term signal-to-noise ratio lt_snr multiplied by the coefficient lt_tsnr_tscale, the current frame is judged to be an active tone frame and vad_flag is set to 1.

In this embodiment, SNR2_lt_ave_t_thr1 has a value range of [1, 4], and lt_tsnr_tscale has a value range of [0.1, 0.6].

If the average full-band signal-to-noise ratio SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr2, the spectral centroid characteristic parameter sp_center[2] is greater than a set threshold sp_center_t_thr1 and the long-term signal-to-noise ratio lt_snr is less than a set threshold lt_tsnr_t_thr1, the current frame is judged to be an active tone frame and vad_flag is set to 1. SNR2_lt_ave_t_thr2 has a value range of [1.0, 2.5], sp_center_t_thr1 has a value range of [2.0, 4.0], and lt_tsnr_t_thr1 has a value range of [2.5, 5.0].

If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr3, the spectral centroid characteristic parameter sp_center[2] is greater than a set threshold sp_center_t_thr2 and the long-term signal-to-noise ratio lt_snr is less than a set threshold lt_tsnr_t_thr2, the current frame is judged to be an active tone frame and vad_flag is set to 1. SNR2_lt_ave_t_thr3 has a value range of [0.8, 2.0], sp_center_t_thr2 has a value range of [2.0, 4.0], and lt_tsnr_t_thr2 has a value range of [2.5, 5.0].

If SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_t_thr4, the spectral centroid characteristic parameter sp_center[2] is greater than a set threshold sp_center_t_thr3 and the long-term signal-to-noise ratio lt_snr is less than a set threshold lt_tsnr_t_thr3, the current frame is judged to be an active tone frame and vad_flag is set to 1. SNR2_lt_ave_t_thr4 has a value range of [0.6, 2.0], sp_center_t_thr3 has a value range of [3.0, 6.0], and lt_tsnr_t_thr3 has a value range of [2.5, 5.0].
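The correction rules of step 305 can be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and the numbers in ILLUSTRATIVE_PARAMS are arbitrary picks inside the stated ranges rather than values from the embodiment. The three rules that share the same shape are stored as (SNR2_lt_ave threshold, sp_center threshold, lt_snr threshold) triples.

```python
# Hypothetical values chosen inside the ranges given in the text.
ILLUSTRATIVE_PARAMS = {
    "snr2_lt_ave_t_thr1": 2.0,
    "lt_tsnr_tscale": 0.3,
    # (SNR2_lt_ave_t_thr, sp_center_t_thr, lt_tsnr_t_thr) for rules 2 to 4.
    "triples": [(1.5, 3.0, 3.0), (1.2, 3.0, 4.0), (1.0, 4.0, 5.0)],
}


def correct_vad(vad_flag, tonality_flag, snr2_lt_ave, sp_center2, lt_snr, p):
    """Return the corrected vad_flag; every rule can only raise it to 1."""
    # A tonal frame is always treated as an active tone frame.
    if tonality_flag == 1:
        return 1
    # Average full-band SNR well above a level that grows with lt_snr.
    if snr2_lt_ave > p["snr2_lt_ave_t_thr1"] + lt_snr * p["lt_tsnr_tscale"]:
        return 1
    # Three rules combining SNR2_lt_ave, sp_center[2] and lt_snr.
    for snr_thr, spc_thr, lt_thr in p["triples"]:
        if snr2_lt_ave > snr_thr and sp_center2 > spc_thr and lt_snr < lt_thr:
            return 1
    return vad_flag
```

None of the rules clears the flag, so the function returns the initial decision unchanged when no rule fires.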

Step 306: correcting the active tone hold frame number according to the decision results of several previous frames, the long-term signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame;

the specific calculation steps are as follows:

the precondition for correcting the current active tone hold frame number is that the active tone flag indicates that the current frame is an active tone frame; if this condition is not met, the value of the current active tone hold frame number num_speech_hangover is not corrected, and the process proceeds directly to step 307.

The active tone hold frame number is corrected as follows:

if the number of previous continuous active tone frames continuous_speech_num is less than a set threshold continuous_speech_num_thr1 and lt_tsnr is less than a set threshold lt_tsnr_h_thr1, the current active tone hold frame number num_speech_hangover is set to the minimum number of continuous active tone frames minus continuous_speech_num. Otherwise, if SNR2_lt_ave is greater than a set threshold SNR2_lt_ave_thr1 and continuous_speech_num is greater than a set threshold continuous_speech_num_thr2, the value of num_speech_hangover is set according to the magnitude of the long-term signal-to-noise ratio lt_tsnr. Otherwise, the value of num_speech_hangover is not modified. In this embodiment, the minimum number of continuous active tone frames is 8; it may take a value in [6, 20].

In the second case, the value is set as follows:

if the long-term signal-to-noise ratio lt_snr is greater than 2.6, num_speech_hangover is set to 3; otherwise, if lt_snr is greater than 1.6, num_speech_hangover is set to 4; otherwise, num_speech_hangover is set to 5.
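The correction of the active tone hold frame number in step 306 can be sketched in Python as below. The function name and the dictionary keys are hypothetical, the four thresholds in the dictionary are not given in this section and must be chosen by the caller, and lt_tsnr is assumed to denote the same quantity as lt_snr.

```python
def correct_hold_frames(hold_frames, vad_flag, continuous_speech_num,
                        lt_snr, snr2_lt_ave, p, min_active_frames=8):
    """Return the corrected active tone hold frame number."""
    # Precondition: only an active tone frame may change the count.
    if vad_flag != 1:
        return hold_frames
    # Short run of active frames at a low long-term SNR: pad the run up
    # to the minimum number of continuous active tone frames.
    if (continuous_speech_num < p["continuous_speech_num_thr1"]
            and lt_snr < p["lt_tsnr_h_thr1"]):
        return min_active_frames - continuous_speech_num
    # Long run at a good average SNR: fewer hold frames when the
    # long-term SNR is higher.
    if (snr2_lt_ave > p["snr2_lt_ave_thr1"]
            and continuous_speech_num > p["continuous_speech_num_thr2"]):
        if lt_snr > 2.6:
            return 3
        if lt_snr > 1.6:
            return 4
        return 5
    # Neither case applies: leave the count as it was.
    return hold_frames
```

With a hypothetical continuous_speech_num_thr1 of 8, a run of 3 active frames at a low long-term SNR yields 8 - 3 = 5 hold frames.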

Step 307: adding active tone hold according to the decision result of the current frame and the active tone hold frame number num_speech_hangover, to obtain the VAD decision result of the current frame.

The method is as follows:

if the current frame is judged to be an inactive tone frame, that is, the active tone flag is 0, and the active tone hold frame number num_speech_hangover is greater than 0, active tone hold is added: the active tone flag is set to 1 and the value of num_speech_hangover is decremented by 1.

The final VAD decision result of the current frame is thus obtained.

Preferably, after step 304, the method further includes calculating the average long-term active tone signal energy E_{fg} according to the initial VAD decision result; and after step 307, calculating the average long-term background noise energy E_{bg} according to the VAD decision result, the calculated values being used for the VAD decision of the next frame.
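The hold logic of step 307 can be sketched in Python as follows; the function name is hypothetical and the sketch covers only the rule stated above.

```python
def apply_hold(vad_flag, hold_frames):
    """Return (final vad_flag, remaining active tone hold frames)."""
    # An inactive frame is promoted to active while hold frames remain,
    # and each promotion consumes one hold frame.
    if vad_flag == 0 and hold_frames > 0:
        return 1, hold_frames - 1
    return vad_flag, hold_frames
```

As a usage example, feeding three consecutive inactive frames with two hold frames available produces the flags 1, 1, 0, so the decision stays active for two frames after the raw decision drops.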

The average long-term active tone signal energy E_{fg} is calculated as follows:

a) if the initial VAD decision result indicates that the current frame is an active tone frame, that is, the value of the VAD flag is 1, and E_{t1} is greater than several times E_{bg} (6 times in this embodiment), the average long-term active tone energy accumulated value fg_energy and the average long-term active tone energy accumulated frame number fg_energy_count are updated. The update adds E_{t1} to fg_energy to obtain the new fg_energy, and adds 1 to fg_energy_count to obtain the new fg_energy_count.

b) To ensure that the average long-term active tone signal energy reflects the most recent active tone signal energy, if the average long-term active tone energy accumulated frame number is equal to a set value fg_max_frame_num, both the accumulated frame number and the accumulated value are multiplied by an attenuation coefficient attenu_coef1. In this embodiment, fg_max_frame_num is 512 and attenu_coef1 is 0.75.

c) The average long-term active tone signal energy is obtained by dividing the accumulated value fg_energy by the accumulated frame number fg_energy_count:

E_{fg}＝fg_energy/fg_energy_count；

The average long-term background noise energy E_{bg} is calculated as follows:

let bg_energy_count be the number of background noise energy accumulation frames, which records how many frames of energy are contained in the accumulated value of the most recent background noise energy, and let bg_energy be the accumulated value of the most recent background noise energy.

a) If the current frame is judged to be an inactive tone frame, that is, the value of the VAD flag is 0, and SNR2 is less than 1.0, the background noise energy accumulated value bg_energy and the background noise energy accumulation frame number bg_energy_count are updated. The update adds E_{t1} to bg_energy to obtain the new bg_energy, and adds 1 to bg_energy_count to obtain the new bg_energy_count.

b) If bg_energy_count is equal to the maximum count frame number of the average long-term background noise energy calculation, both the accumulation frame number and the accumulated value are multiplied by the attenuation coefficient attenu_coef2. In this embodiment, the maximum count frame number is 512 and attenu_coef2 is 0.75.

c) The average long-term background noise energy is obtained by dividing the background noise energy accumulated value bg_energy by the background noise energy accumulation frame number bg_energy_count:

E_{bg}＝bg_energy/bg_energy_count；
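Both long-term energies follow the same accumulate, decay and divide pattern, which can be sketched in Python as below. The class and function names are hypothetical, the factor 6 and the limits 512 and 0.75 are the values of this embodiment, the active tone condition is read as E_t1 greater than 6 times E_bg, and returning 0.0 before any frame has been accumulated is an assumption of the sketch.

```python
class LongTermEnergy:
    """Accumulated energy and frame count with periodic decay."""

    def __init__(self, max_frames=512, decay=0.75):
        self.energy = 0.0
        self.count = 0.0
        self.max_frames = max_frames
        self.decay = decay

    def add(self, e_t1):
        self.energy += e_t1
        self.count += 1
        # At the frame limit, shrink both terms so that newer frames
        # weigh more while the current mean is unchanged.
        if self.count == self.max_frames:
            self.energy *= self.decay
            self.count *= self.decay

    def mean(self):
        # Assumption: 0.0 until at least one frame has been added.
        return self.energy / self.count if self.count else 0.0


def update_active_tone_energy(fg, vad_init_flag, e_t1, e_bg, factor=6.0):
    """Update E_fg after the initial VAD decision."""
    if vad_init_flag == 1 and e_t1 > factor * e_bg:
        fg.add(e_t1)


def update_background_energy(bg, vad_flag, snr2, e_t1):
    """Update E_bg after the final VAD decision."""
    if vad_flag == 0 and snr2 < 1.0:
        bg.add(e_t1)
```

The two means would then feed the long-term signal-to-noise ratio of the next frame.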

In order to implement embodiments 1 and 2 of the active tone detection method described above, the present invention further provides embodiment 1 of an active tone detection (VAD) apparatus. As shown in fig. 4, the apparatus includes:

a filter bank for obtaining a subband signal of a current frame;

the frequency spectrum amplitude calculation unit is used for obtaining the frequency spectrum amplitude of the current frame;

the characteristic parameter acquisition unit is used for calculating values of a frame energy parameter and a spectrum gravity center characteristic parameter of the current frame according to the sub-band signals;

the signal-to-noise ratio calculating unit is used for calculating the signal-to-noise ratio parameter of the current frame according to the background noise energy estimated from the previous frame, the frame energy parameter of the current frame and the signal-to-noise ratio sub-band energy;

and the VAD decision unit is used for calculating a VAD decision result according to the tonality flag, the signal-to-noise ratio parameters, the spectral gravity center characteristic parameter and the frame energy parameter.

Corresponding to method embodiment 2, the characteristic parameter obtaining unit is further configured to calculate the value of the time domain stability characteristic parameter according to the sub-band signals, and to calculate the values of the spectral flatness characteristic parameter and the tonality characteristic parameter according to the spectral amplitude;

The characteristic parameters can be obtained by the existing method or the following method:

the spectral barycentric characteristic parameter is the ratio of the weighted accumulated value and the unweighted accumulated value of all or part of subband signal energy, or the value obtained by performing smooth filtering on the ratio;

the tonality characteristic parameter is obtained by calculating the correlation value of the intra-frame spectral difference coefficients of the signals of two consecutive frames, or by further smoothing that correlation value.

As shown in fig. 5, embodiment 2 of the active tone detection (VAD) apparatus of the present invention differs from embodiment 1 in that the apparatus further includes a flag calculation unit and a background noise energy processing unit, wherein:

the flag calculation unit is used for calculating the tonality flag of the current frame according to the energy parameter, the spectral gravity center characteristic parameter, the time domain stability characteristic parameter, the spectral flatness characteristic parameter and the tonality characteristic parameter of the current frame;

a background noise energy processing unit, comprising:

the identification calculation module is used for calculating to obtain an initial background noise identification of the current frame according to the energy parameter, the spectrum gravity center characteristic parameter, the time domain stability characteristic parameter, the spectrum flatness characteristic parameter and the tonal characteristic parameter of the current frame;

the identification correction module is used for correcting the initial background noise identification according to the VAD decision result of the current frame, the tonality characteristic parameter, the signal-to-noise ratio parameters, the tonality flag and the time domain stability characteristic parameter;

and the background noise energy acquisition module is used for acquiring the background noise energy of the current frame according to the corrected value of the background noise identifier, the frame energy parameter of the current frame and the full-band background noise energy of the previous frame, wherein the background noise energy of the current frame is used for calculating the signal-to-noise ratio parameter of the next frame.

Corresponding to method embodiments 1 and 2, as shown in fig. 6, the VAD decision unit includes:

the long-term signal-to-noise ratio calculation module is used for calculating the ratio of the average long-term active tone signal energy to the average long-term background noise energy obtained for the previous frame, to obtain the long-term signal-to-noise ratio lt_snr;

the average full-band signal-to-noise ratio calculation module is used for calculating the average of the full-band signal-to-noise ratio SNR2 over a number of most recent frames to obtain the average full-band signal-to-noise ratio SNR2_lt_ave;

the signal-to-noise ratio threshold calculation module is used for obtaining the signal-to-noise ratio threshold snr_thr of the VAD decision according to the spectral gravity center characteristic parameter, the long-term signal-to-noise ratio lt_snr, the number of previous continuous active tone frames continuous_speech_num and the number of previous continuous noise frames continuous_noise_num;

the initial VAD decision module is used for calculating an initial VAD decision according to the decision threshold snr_thr of the VAD and the signal-to-noise ratio parameters SNR1 and SNR2 calculated for the current frame;

the VAD result correction module is used for correcting the VAD decision result according to the tonality flag, the average full-band signal-to-noise ratio SNR2_lt_ave, the spectral gravity center and the long-term signal-to-noise ratio lt_snr;

the active tone hold frame correction module is used for correcting the active tone hold frame number according to the decision results of several previous frames, the long-term signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave, the signal-to-noise ratio of the current frame and the VAD decision result of the current frame;

and the VAD decision module is used for adding active tone hold according to the decision result of the current frame and the active tone hold frame number num_speech_hangover, to obtain the VAD decision result of the current frame.

More preferably, the VAD decision unit further includes an energy calculation module, which is used for calculating the average long-term active tone signal energy E_{fg} according to the initial VAD decision result, and for updating the average long-term background noise energy E_{bg} according to the VAD decision result, the updated values being used for the VAD decision of the next frame.

The present invention further provides an embodiment of a method for detecting background noise, as shown in fig. 7, the method includes:

step 701: obtaining a sub-band signal and a frequency spectrum amplitude of a current frame;

step 702: calculating values of a frame energy parameter, a spectrum gravity center characteristic parameter and a time domain stability characteristic parameter according to the subband signals, and calculating values of a spectrum flatness characteristic parameter and a tonal characteristic parameter according to a frequency spectrum amplitude;

preferably, the frame energy parameter is a weighted or direct superposition of the individual subband-signal energies.

The spectral barycentric characteristic parameter is the ratio of the weighted accumulated value and the unweighted accumulated value of all or part of subband signal energy, or the value obtained by performing smooth filtering on the ratio.

The time domain stability characteristic parameter is the ratio of the variance of the frame energy amplitude to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient.

The spectral flatness parameter is the ratio of the geometric mean and the arithmetic mean of certain spectral amplitudes, or the ratio multiplied by a coefficient.

Specifically, step 701 and step 702 may adopt the same method as above, and are not described herein again.
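Two of the characteristic parameters defined above can be sketched directly from their definitions. This Python sketch makes assumptions that the text leaves open: the function names are hypothetical, the sub-band weights of the spectral gravity center are taken as k + 1 for sub-band index k, no smoothing or scaling coefficient is applied, and a small eps guards against the logarithm of zero and division by zero.

```python
import math


def spectral_flatness(amplitudes, eps=1e-12):
    """Ratio of the geometric mean to the arithmetic mean of amplitudes."""
    n = len(amplitudes)
    # Geometric mean computed in the log domain to avoid underflow.
    log_mean = sum(math.log(a + eps) for a in amplitudes) / n
    arithmetic_mean = sum(amplitudes) / n
    return math.exp(log_mean) / (arithmetic_mean + eps)


def spectral_gravity_center(subband_energies, eps=1e-12):
    """Ratio of the weighted to the unweighted sum of sub-band energies."""
    # Assumption: weight k + 1 for the sub-band with index k.
    weighted = sum((k + 1) * e for k, e in enumerate(subband_energies))
    return weighted / (sum(subband_energies) + eps)
```

A flat spectrum gives a flatness close to 1, while a spectrum dominated by a single peak gives a value close to 0, which is why low flatness is used as evidence against background noise.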

Step 703: and detecting background noise according to the spectrum gravity center characteristic parameter, the time domain stability characteristic parameter, the spectrum flatness characteristic parameter, the tone characteristic parameter and the current frame energy parameter, and judging whether the current frame is background noise.

Preferably, when any one of the following conditions is satisfied, it is determined that the current frame is not a noise signal:

the time domain stability parameter lt _ stable _ rate0 is greater than a set threshold value;

the smooth filtering value of the first interval spectrum gravity center characteristic parameter value is larger than a set threshold value, and the time domain stability characteristic parameter value is also larger than a certain set threshold value;

the tonal characteristic parameter or the value after smooth filtering is larger than a set threshold value, and the value of the time domain stability characteristic parameter lt _ stable _ rate0 is larger than the set threshold value;

the spectral flatness characteristic parameters of each sub-band or the values after each smooth filtering are all smaller than the respective corresponding set threshold values;

or, the frame energy parameter E_{t1} is greater than a set threshold E_thr1.

Specifically, the current frame is first assumed to be background noise.

In this embodiment, a background noise flag background_flag is used to indicate whether the current frame is background noise: if the current frame is judged to be background noise, background_flag is set to 1; otherwise, background_flag is set to 0.

Whether the current frame is a noise signal is then detected according to the time domain stability characteristic parameter, the spectral gravity center characteristic parameter, the spectral flatness characteristic parameter, the tonality characteristic parameter and the current frame energy parameter. If it is not a noise signal, background_flag is set to 0.

The specific process is as follows:

It is determined whether the time domain stability characteristic parameter lt_stable_rate0 is greater than a set threshold lt_stable_rate_thr1. If so, the current frame is judged not to be a noise signal and background_flag is set to 0. In this embodiment, the value range of the threshold lt_stable_rate_thr1 is [0.8, 1.6];

It is determined whether the smoothed spectral gravity center characteristic parameter value is greater than a set threshold sp_center_thr1 and the time domain stability characteristic parameter value is greater than a set threshold lt_stable_rate_thr2. If so, the current frame is judged not to be a noise signal and background_flag is set to 0. The value range of sp_center_thr1 is [1.6, 4]; the value range of lt_stable_rate_thr2 is (0, 0.1%).

It is determined whether the value of the tonality characteristic parameter tonality_rate2 is greater than a set threshold tonality_rate_thr1 and the value of the time domain stability characteristic parameter lt_stable_rate0 is greater than a set threshold lt_stable_rate_thr3. If both conditions are met, the current frame is judged not to be background noise and background_flag is set to 0. The value range of the threshold tonality_rate_thr1 is [0.4, 0.66], and the value range of the threshold lt_stable_rate_thr3 is [0.06, 0.3].

It is determined whether the value of the spectral flatness characteristic parameter sSMR[0] is smaller than a set threshold sSMR_thr1, the value of sSMR[1] is smaller than a set threshold sSMR_thr2, and the value of sSMR[2] is smaller than a set threshold sSMR_thr3. If all of these conditions are met, the current frame is judged not to be background noise and background_flag is set to 0. The thresholds sSMR_thr1, sSMR_thr2 and sSMR_thr3 have a value range of [0.88, 0.98]. It is further determined whether the value of sSMR[0] is smaller than a set threshold sSMR_thr4, the value of sSMR[1] is smaller than a set threshold sSMR_thr5, or the value of sSMR[2] is smaller than a set threshold sSMR_thr6. If any one of these conditions is met, the current frame is judged not to be background noise and background_flag is set to 0. The thresholds sSMR_thr4, sSMR_thr5 and sSMR_thr6 have a value range of [0.80, 0.92].

It is determined whether the frame energy parameter E_{t1} is greater than a set threshold E_thr1; if so, the current frame is judged not to be background noise and background_flag is set to 0. E_thr1 takes its value according to the dynamic range of the frame energy parameter.

If none of the above checks judges the current frame not to be background noise, the current frame is background noise.
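The background noise decision described above starts from the assumption that the frame is noise and clears the flag as soon as any check fires. A Python sketch of that structure follows. The function name, the dictionary keys and every number in ILLUSTRATIVE_PARAMS are hypothetical; in particular lt_stable_rate_thr2 and e_thr1 are placeholders, and the last flatness check is assumed to apply to sSMR[0], sSMR[1] and sSMR[2] in turn.

```python
# Hypothetical values; thresholds with stated ranges are picked inside them.
ILLUSTRATIVE_PARAMS = {
    "lt_stable_rate_thr1": 1.0,
    "sp_center_thr1": 2.0,
    "lt_stable_rate_thr2": 0.05,   # placeholder value
    "tonality_rate_thr1": 0.5,
    "lt_stable_rate_thr3": 0.1,
    "ssmr_all_thr": (0.95, 0.95, 0.95),   # sSMR_thr1..3
    "ssmr_any_thr": (0.85, 0.85, 0.85),   # sSMR_thr4..6
    "e_thr1": 1.0e6,               # placeholder value
}


def detect_background(lt_stable_rate0, sp_center_smoothed, tonality_rate2,
                      ssmr, e_t1, p):
    """Return background_flag: 1 for background noise, 0 otherwise."""
    background_flag = 1  # start by assuming background noise
    if lt_stable_rate0 > p["lt_stable_rate_thr1"]:
        background_flag = 0
    if (sp_center_smoothed > p["sp_center_thr1"]
            and lt_stable_rate0 > p["lt_stable_rate_thr2"]):
        background_flag = 0
    if (tonality_rate2 > p["tonality_rate_thr1"]
            and lt_stable_rate0 > p["lt_stable_rate_thr3"]):
        background_flag = 0
    # All three flatness values below the looser thresholds.
    if all(s < t for s, t in zip(ssmr, p["ssmr_all_thr"])):
        background_flag = 0
    # Any single flatness value below the stricter thresholds.
    if any(s < t for s, t in zip(ssmr, p["ssmr_any_thr"])):
        background_flag = 0
    if e_t1 > p["e_thr1"]:
        background_flag = 0
    return background_flag
```

Because each check can only clear the flag, the order of the checks does not affect the result.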

Corresponding to the above method, the present invention further provides a background noise detection apparatus, as shown in fig. 8, the apparatus including:

a filter bank for obtaining a subband signal of a current frame;

the characteristic parameter calculation unit is used for calculating values of a frame energy parameter, a spectrum gravity center characteristic parameter and a time domain stability characteristic parameter according to the subband signals and calculating values of a spectrum flatness characteristic parameter and a tonal characteristic parameter according to a spectrum amplitude;

And the background noise judgment unit is used for detecting background noise according to the spectrum gravity center characteristic parameter, the time domain stability characteristic parameter, the spectrum flatness characteristic parameter, the tone characteristic parameter and the current frame energy parameter and judging whether the current frame is background noise.

Preferably, the background noise determination unit determines that the current frame is not a noise signal if any of the following conditions is satisfied:

The invention also provides a method for detecting a tonal signal, as shown in fig. 9, the method comprises:

step 901: obtaining a sub-band signal and a frequency spectrum amplitude of a current frame;

step 902: calculating values of a spectral gravity center characteristic parameter and a time domain stability characteristic parameter of the current frame according to the sub-band signals, and calculating values of a spectral flatness characteristic parameter and a tonal characteristic parameter according to a frequency spectrum amplitude;

preferably, the spectral centroid characteristic parameter is the ratio of a weighted accumulated value to an unweighted accumulated value of all or part of the sub-band signal energies, or a value obtained by smoothing that ratio; the time domain stability characteristic parameter is the ratio of the variance of the amplitude superposition value to the expectation of the square of the amplitude superposition value, or that ratio multiplied by a coefficient;

Step 903: judging whether the current frame is a tonal signal according to the tonality characteristic parameter, the time domain stability characteristic parameter, the spectral flatness characteristic parameter and the spectral centroid characteristic parameter.
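The two sub-band features defined above can be illustrated with a short Python sketch. It is not the formula of the embodiment: the weighting by sub-band index plus one, the `eps` guard against division by zero, and the function names are assumptions made only for illustration.

```python
from statistics import fmean, pvariance


def spectral_centroid(subband_energy, eps=1e-12):
    """Ratio of the weighted to the unweighted accumulated sub-band energy.

    The weight (sub-band index + 1) is an assumed choice.
    """
    weighted = sum((k + 1) * e for k, e in enumerate(subband_energy))
    unweighted = sum(subband_energy)
    return weighted / (unweighted + eps)


def time_domain_stability(amp_sums, eps=1e-12):
    """Variance of the amplitude sums divided by the mean of their squares."""
    variance = pvariance(amp_sums)
    mean_square = fmean([a * a for a in amp_sums])
    return variance / (mean_square + eps)
```

A perfectly steady sequence of amplitude sums gives a stability value of 0, and the value grows as the amplitude fluctuates more from frame to frame.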

When step 903 determines whether the current frame is a tonal signal, the following operations are performed:

A) assuming that the current frame signal is a non-tonal signal, and using a tonal frame flag tonality_frame to indicate whether the current frame is a tonal frame.

In this embodiment, a tonality_frame value of 1 indicates that the current frame is a tonal frame, and 0 indicates that the current frame is a non-tonal frame;

B) judging whether the smoothed value of the tonality characteristic parameter tonality_rate1 or tonality_rate2 is larger than the corresponding set threshold tonality_decision_thr1 or tonality_decision_thr2; if either condition is satisfied, executing step C), otherwise executing step D);

wherein the value range of tonality_decision_thr1 is [0.5, 0.7], and the value range of tonality_rate1 is [0.7, 0.99].

C) If the time domain stability characteristic parameter value lt_stable_rate0 is smaller than a set threshold lt_stable_decision_thr1, the spectral centroid characteristic parameter value sp_center[1] is larger than a set threshold spc_decision_thr1, and the spectral flatness characteristic parameter of each sub-band is smaller than its corresponding preset threshold (specifically, the spectral flatness characteristic parameter sSMR[0] is smaller than a set threshold sSMF_decision_thr1, or sSMR[1] is smaller than a set threshold sSMF_decision_thr2, or sSMR[2] is smaller than a set threshold sSMF_decision_thr3), the current frame is judged to be a tonal frame and the tonal frame flag tonality_frame is set to 1; otherwise, the current frame is judged to be a non-tonal frame and tonality_frame is set to 0. The process then proceeds to step D).

Wherein the threshold lt_stable_decision_thr1 has a value range of [0.01, 0.25], spc_decision_thr1 has a value range of [1.0, 1.8], sSMF_decision_thr1 has a value range of [0.6, 0.9], sSMF_decision_thr2 has a value range of [0.6, 0.9], and sSMF_decision_thr3 has a value range of [0.7, 0.98].

D) Updating the tonality degree characteristic parameter tonality_degree according to the tonal frame flag tonality_frame, wherein the initial value of tonality_degree is set when the active tone detection device starts to work and lies in the range [0, 1]. The calculation method of tonality_degree differs under different conditions:

if the current tonal frame flag indicates that the current frame is a tonal frame, the tonality degree characteristic parameter tonality_degree is updated using the following equation:

tonality_degree = tonality_degree_{-1} · td_scale_A + td_scale_B;

wherein tonality_degree_{-1} is the tonality degree characteristic parameter of the previous frame, whose initial value lies in the range [0, 1]; td_scale_A is an attenuation coefficient with a value range of [0, 1]; and td_scale_B is an accumulation coefficient with a value range of [0, 1].

E) Judging whether the current frame is a tonal signal according to the updated tonality degree characteristic parameter tonality_degree, and setting the value of the tonal flag tonality_flag.

Specifically, if tonality_degree is greater than a set threshold, the current frame is determined to be a tonal signal; otherwise, the current frame is determined to be a non-tonal signal.
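Steps A) to E) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: every threshold is an arbitrary value picked from the ranges quoted above, the second tonality threshold is assumed to lie in [0.7, 0.99], td_scale_A and td_scale_B are assumed values, and because the text gives an update formula only for tonal frames, the sketch leaves tonality_degree unchanged for non-tonal frames.

```python
from dataclasses import dataclass

# Assumed thresholds, chosen inside the value ranges given in the text.
TONALITY_DECISION_THR1 = 0.6
TONALITY_DECISION_THR2 = 0.86   # assumed to lie in [0.7, 0.99]
LT_STABLE_DECISION_THR1 = 0.1
SPC_DECISION_THR1 = 1.4
SSMF_DECISION_THR = (0.75, 0.75, 0.85)


@dataclass
class TonalState:
    tonality_degree: float = 0.5  # initial value in [0, 1] (assumed)


def detect_tonal(state, tonality_rate1, tonality_rate2, lt_stable_rate0,
                 sp_center1, sSMR, td_scale_A=0.97, td_scale_B=0.03,
                 degree_thr=0.5):
    """Return (tonality_frame, tonality_flag) and update state in place."""
    # A) start from the assumption that the frame is non-tonal
    tonality_frame = 0
    # B) either smoothed tonality rate must exceed its threshold
    if (tonality_rate1 > TONALITY_DECISION_THR1
            or tonality_rate2 > TONALITY_DECISION_THR2):
        # C) flatness test, combined with "or" as in the specific wording
        flat = any(s < t for s, t in zip(sSMR, SSMF_DECISION_THR))
        if (lt_stable_rate0 < LT_STABLE_DECISION_THR1
                and sp_center1 > SPC_DECISION_THR1 and flat):
            tonality_frame = 1
    # D) update the tonality degree (only the tonal-frame case is specified)
    if tonality_frame == 1:
        state.tonality_degree = (state.tonality_degree * td_scale_A
                                 + td_scale_B)
    # E) compare the tonality degree with the decision threshold
    tonality_flag = 1 if state.tonality_degree > degree_thr else 0
    return tonality_frame, tonality_flag
```

Keeping tonality_degree in a small state object reflects that the value is carried from frame to frame, so a single tonal frame only nudges the final decision.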

Corresponding to the foregoing method for detecting a tonal signal, the present invention further provides a device for detecting a tonal signal, as shown in fig. 10, the device comprising:

a filter bank for obtaining a subband signal of a current frame;

the characteristic parameter calculation unit is used for calculating values of current spectrum gravity center characteristic parameters and time domain stability characteristic parameters according to the sub-band signals and calculating values of spectrum flatness characteristic parameters and tonal characteristic parameters according to the spectrum amplitude;

as mentioned above, the spectral centroid characteristic parameter is the ratio of the weighted accumulated value and the unweighted accumulated value of all or part of subband signal energy, or the value obtained by performing smooth filtering on the ratio;

And the tonal signal judging unit is used for judging whether the current frame is a tonal signal according to the tonal characteristic parameter, the time domain stability characteristic parameter, the spectrum flatness characteristic parameter and the spectrum gravity center characteristic parameter.

As shown in fig. 11, the tonality signal determination unit includes:

a tonal signal initialization module, used for assuming that the current frame signal is a non-tonal signal and indicating whether the current frame is a tonal frame with a tonal frame flag tonality_frame;

a tonality characteristic parameter judging module, used for judging whether the smoothed value of the tonality characteristic parameter tonality_rate1 or tonality_rate2 is larger than the corresponding set threshold;

a tonal signal judging module, used for, when the tonality characteristic parameter judging module judges that the value is larger than the threshold, judging the current frame to be a tonal frame if the time domain stability characteristic parameter value is smaller than a set threshold, the spectral centroid characteristic parameter value is larger than a set threshold, and the spectral flatness characteristic parameter of each sub-band is smaller than its corresponding preset threshold, and judging whether the current frame is a tonal signal according to the calculated tonality degree characteristic parameter tonality_degree; and, when the tonality characteristic parameter judging module judges that the value is not larger than the threshold, judging whether the current frame is a tonal signal according to the updated tonality degree characteristic parameter tonality_degree, and setting the value of the tonal flag tonality_flag;

and a tonality degree parameter updating module, used for updating the tonality degree characteristic parameter tonality_degree according to the tonal frame flag when the smoothed value of the tonality characteristic parameter tonality_rate1 or tonality_rate2 is smaller than the corresponding set threshold, wherein the initial value of tonality_degree is set when the active tone detection device starts to work.

Specifically, if the current tonal frame flag indicates that the current frame is a tonal frame, the tonality degree parameter updating module updates the tonality degree characteristic parameter tonality_degree using the following equation:

tonality_degree = tonality_degree_{-1} · td_scale_A + td_scale_B;

If tonality_degree is greater than a set threshold, the tonal signal judging module judges that the current frame is a tonal signal; otherwise, it judges that the current frame is a non-tonal signal.

Specifically, if tonality_degree is greater than the threshold 0.5, the current frame is determined to be a tonal signal and the value of the tonal flag tonality_flag is set to 1; otherwise, the current frame is determined to be a non-tonal signal and tonality_flag is set to 0. The threshold for the tonal signal judgment lies in the interval [0.3, 0.7].

The present invention also provides a method for correcting the number of active tone hold frames in VAD decision, as shown in fig. 12, the method includes:

step 1201: calculating a long-time signal-to-noise ratio lt_snr according to the sub-band signals;

specifically, the long-time signal-to-noise ratio lt_snr is calculated as the ratio of the average long-time active tone signal energy to the average long-time background noise energy calculated for the previous frame; lt_snr can be expressed logarithmically.

Step 1202: calculating an average full-band signal-to-noise ratio SNR2_lt_ave;

specifically, the average of the full-band signal-to-noise ratios SNR2 of the most recent several frames is calculated to obtain the average full-band signal-to-noise ratio SNR2_lt_ave;

step 1203: correcting the current active tone hold frame number according to the VAD decision results of the previous several frames, the long-time signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame.

Understandably, the precondition for correcting the current active tone hold frame number is that the active tone flag indicates that the current frame is an active tone frame.

Preferably, when the current active tone hold frame number is corrected: if the previous continuous speech frame number is smaller than a set threshold 1 and the long-time signal-to-noise ratio lt_snr is smaller than a set threshold 2, the current active tone hold frame number is set equal to the minimum continuous active tone frame number minus the previous continuous speech frame number; otherwise, if the average full-band signal-to-noise ratio SNR2_lt_ave is greater than a set threshold 3 and the previous continuous speech frame number is greater than a set threshold 4, the active tone hold frame number is set according to the magnitude of the long-time signal-to-noise ratio lt_snr; otherwise, the value of the current active tone hold frame number num_speech_hangover is not modified.
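The correction rule can be sketched in Python as below. The text does not give values for thresholds 1 to 4, for the minimum continuous active tone frame number, or for the mapping from lt_snr to a hold frame number, so every number here, the helper `hold_from_lt_snr`, and all function and parameter names are hypothetical placeholders chosen only so that the control flow can be run.

```python
import math


def long_term_snr(lt_active_energy, lt_noise_energy, eps=1e-12):
    """Logarithmic ratio of long-time active tone energy to noise energy."""
    return math.log10((lt_active_energy + eps) / (lt_noise_energy + eps))


def average_full_band_snr(recent_snr2):
    """Mean of the full-band SNR2 values of the most recent frames."""
    return sum(recent_snr2) / len(recent_snr2)


def hold_from_lt_snr(lt_snr):
    """Hypothetical mapping: a cleaner signal gets a shorter hold period."""
    if lt_snr > 4.0:
        return 1
    if lt_snr > 2.5:
        return 2
    return 3


def correct_hold_frames(hold_frames, vad_flag, continuous_speech_num,
                        lt_snr, snr2_lt_ave, *, thr1=8, thr2=4.0,
                        thr3=1.5, thr4=20, min_active_frames=8):
    """Return the corrected active tone hold frame number."""
    if not vad_flag:
        # Precondition: only active tone frames are corrected.
        return hold_frames
    if continuous_speech_num < thr1 and lt_snr < thr2:
        return min_active_frames - continuous_speech_num
    if snr2_lt_ave > thr3 and continuous_speech_num > thr4:
        return hold_from_lt_snr(lt_snr)
    return hold_frames  # neither condition holds: leave unchanged
```

With these placeholder values, a short speech burst at low long-time SNR is padded up to the minimum active length, while a long burst in clean conditions gets a short hold period.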

Corresponding to the foregoing method for correcting the number of active tone hold frames, the present invention further provides a device for correcting the number of active tone hold frames in VAD decision, as shown in fig. 13, the device includes:

specifically, the long-time signal-to-noise ratio calculating unit calculates the long-time signal-to-noise ratio lt_snr as the ratio of the average long-time active tone signal energy to the average long-time background noise energy calculated for the previous frame;

specifically, the average full-band signal-to-noise ratio calculating unit calculates the average of the full-band signal-to-noise ratios SNR2 of the most recent several frames to obtain the average full-band signal-to-noise ratio SNR2_lt_ave.

As described above, the precondition for correcting the current active tone hold frame number is that the active tone flag indicates that the current frame is an active tone frame.

Preferably, when the active tone hold frame number correcting unit corrects the current active tone hold frame number: if the previous continuous speech frame number is smaller than a set threshold 1 and the long-time signal-to-noise ratio lt_snr is smaller than a set threshold 2, the current active tone hold frame number is set equal to the minimum continuous active tone frame number minus the previous continuous speech frame number; otherwise, if the average full-band signal-to-noise ratio SNR2_lt_ave is greater than a set threshold 3 and the previous continuous speech frame number is greater than a set threshold 4, the active tone hold frame number is set according to the magnitude of the long-time signal-to-noise ratio lt_snr; otherwise, the value of the current active tone hold frame number num_speech_hangover is not modified.

The present invention also provides a method for adjusting the signal-to-noise ratio threshold in VAD decision, as shown in fig. 14, the adjusting method includes:

step 1401: calculating to obtain a spectral gravity center characteristic parameter of the current frame according to the sub-band signal;

specifically, the spectral centroid characteristic parameter is a ratio of a weighted accumulated value and an unweighted accumulated value of all or part of subband signal energy, or a value obtained by performing smooth filtering on the ratio.

Step 1402: calculating the ratio of the average long-time active tone signal energy to the average long-time background noise energy calculated for the previous frame, to obtain a long-time signal-to-noise ratio lt_snr;

step 1403: adjusting the signal-to-noise ratio threshold of the VAD decision according to the spectral centroid characteristic parameter, the long-time signal-to-noise ratio, the previous continuous active tone frame number and the previous continuous noise frame number continuous_noise_num.

Specifically, as shown in fig. 15, the step of adjusting the signal-to-noise ratio threshold includes:

step 1501: setting an initial value of the signal-to-noise ratio threshold snr_thr;

step 1502: making a first adjustment to the value of snr_thr according to the spectral centroid parameter;

step 1503: making a second adjustment to the value of snr_thr according to the previous continuous active tone frame number continuous_speech_num, the previous continuous noise frame number continuous_noise_num, the average full-band signal-to-noise ratio SNR2_lt_ave and the long-time signal-to-noise ratio lt_snr;

step 1504: making a final correction to snr_thr according to the value of the long-time signal-to-noise ratio lt_snr, to obtain the signal-to-noise ratio threshold snr_thr of the current frame.
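The order of the four steps can be sketched in Python as follows. Only the sequence of adjustments and their inputs come from the text; the initial value, every comparison threshold, every increment and the direction of each adjustment are hypothetical placeholders, since no adjustment formulas are given here.

```python
def adjust_snr_threshold(sp_center, continuous_speech_num,
                         continuous_noise_num, snr2_lt_ave, lt_snr,
                         initial_thr=1.0):
    """Return snr_thr for the current frame (all constants are assumed)."""
    # Step 1501: start from an initial threshold.
    snr_thr = initial_thr

    # Step 1502: first adjustment, driven by the spectral centroid.
    if sp_center > 2.0:
        snr_thr += 0.2

    # Step 1503: second adjustment, driven by the run lengths of speech
    # and noise frames and by the two SNR statistics.
    if continuous_speech_num > 20 and snr2_lt_ave > 1.5:
        snr_thr -= 0.1
    elif continuous_noise_num > 20 and lt_snr > 1.0:
        snr_thr += 0.1

    # Step 1504: final correction from the long-time SNR.
    snr_thr += 0.05 * lt_snr
    return snr_thr
```

Writing the stages as successive updates of one variable mirrors the description: each stage refines the threshold left by the previous one rather than recomputing it from scratch.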

Corresponding to the foregoing method for adjusting the snr threshold, the present invention further provides an apparatus for adjusting the snr threshold in VAD decision, as shown in fig. 16, the apparatus includes:

preferably, the spectral centroid characteristic parameter is a ratio of a weighted accumulated value and an unweighted accumulated value of all or part of subband signal energy, or a value obtained by performing smooth filtering on the ratio.

The long-time signal-to-noise ratio calculation unit is used for calculating the ratio of the average long-time active tone signal energy to the average long-time background noise energy calculated for the previous frame, to obtain a long-time signal-to-noise ratio lt_snr;

and the signal-to-noise ratio threshold adjusting unit is used for adjusting the signal-to-noise ratio threshold of the VAD decision according to the spectral centroid characteristic parameter, the long-time signal-to-noise ratio, the previous continuous active tone frame number and the previous continuous noise frame number continuous_noise_num.

Specifically, when the signal-to-noise ratio threshold adjusting unit adjusts the signal-to-noise ratio threshold, it sets an initial value of the signal-to-noise ratio threshold snr_thr; makes a first adjustment to the value of snr_thr according to the spectral centroid parameter; makes a second adjustment to the value of snr_thr according to the previous continuous active tone frame number continuous_speech_num, the previous continuous noise frame number continuous_noise_num, the average full-band signal-to-noise ratio SNR2_lt_ave and the long-time signal-to-noise ratio lt_snr; and makes a final correction to snr_thr according to the value of the long-time signal-to-noise ratio lt_snr, to obtain the signal-to-noise ratio threshold snr_thr of the current frame.

Many modern speech coding standards, such as AMR and AMR-WB, support VAD functions. In terms of efficiency, the VADs of these encoders do not achieve good performance under all typical background noises. Under non-stationary noise in particular, such as office noise, the VAD efficiency of these encoders is low. For music signals, these VADs sometimes make false detections, resulting in a significant quality degradation of the corresponding processing algorithms.

The method of the invention overcomes these defects of existing VAD algorithms: it improves the efficiency of the VAD in detecting non-stationary noise and, at the same time, improves the accuracy of music detection. Audio signal processing algorithms that adopt this VAD can therefore obtain better performance.

The background noise detection method provided by the invention can make the estimation of the background noise more accurate and stable, and is beneficial to improving the accuracy of VAD detection. The tonal signal detection method provided by the invention improves the accuracy of tonal music detection. The invention also provides a correction method of the number of the active tone hold frames, which can ensure that the VAD algorithm can obtain better balance in performance and efficiency under different noises and signal-to-noise ratios. The invention also provides a method for adjusting the signal-to-noise ratio threshold in VAD judgment, which can ensure that the VAD judgment algorithm can achieve better accuracy under different signal-to-noise ratios, and further improve the efficiency under the condition of ensuring the quality.

It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.

Claims

1. A method for correcting the number of current active tone hold frames in an active tone detection (VAD) decision comprises the following steps:

correcting the number of the current active tone hold frames according to the VAD decision results of the previous frames, the long-time signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame;

when the current active tone hold frame number is corrected,

if the previous continuous speech frame number is less than a set threshold 1 and the long-time signal-to-noise ratio lt_snr is less than a set threshold 2, the current active tone hold frame number is equal to the minimum continuous active tone frame number minus the previous continuous speech frame number,

otherwise, if the average full-band signal-to-noise ratio SNR2_lt_ave is greater than a set threshold 3 and the previous continuous speech frame number is greater than a set threshold 4, then the value of the active tone hold frame number is set according to the value of the long-time signal-to-noise ratio lt_snr,

otherwise, the value of the current active tone hold frame number is not corrected.

2. The method of claim 1, wherein the long-time signal-to-noise ratio lt_snr is calculated from the ratio of the average long-time active tone signal energy to the average long-time background noise energy of a previous frame; and the average of the full-band signal-to-noise ratios SNR2 of the most recent several frames is calculated to obtain the average full-band signal-to-noise ratio SNR2_lt_ave.

3. The method of claim 1, wherein the current active tone hold frame number modification is conditioned on an active tone flag indicating that the current frame is an active tone frame.

4. An apparatus for modifying a number of current active tone hold frames in an active tone detection (VAD) decision, comprising:

the active tone hold frame number correcting unit is used for correcting the current active tone hold frame number according to the VAD decision results of a plurality of previous frames, the long-time signal-to-noise ratio lt_snr, the average full-band signal-to-noise ratio SNR2_lt_ave and the VAD decision result of the current frame;

when the active tone hold frame number correcting unit corrects the current active tone hold frame number, if the previous continuous speech frame number is less than a set threshold 1 and the long-time signal-to-noise ratio lt_snr is less than a set threshold 2, the current active tone hold frame number is equal to the minimum continuous active tone frame number minus the previous continuous speech frame number; otherwise, if the average full-band signal-to-noise ratio SNR2_lt_ave is greater than a set threshold 3 and the previous continuous speech frame number is greater than a set threshold 4, the value of the active tone hold frame number is set according to the value of the long-time signal-to-noise ratio lt_snr; otherwise, the value of the current active tone hold frame number is not corrected.

5. The correction apparatus as claimed in claim 4, wherein a long-time signal-to-noise ratio calculation unit calculates the long-time signal-to-noise ratio lt_snr from the ratio of the average long-time active tone signal energy of a previous frame to the average long-time background noise energy; and an average full-band signal-to-noise ratio calculation unit calculates the average of the full-band signal-to-noise ratios SNR2 of a plurality of recent frames to obtain the average full-band signal-to-noise ratio SNR2_lt_ave.

6. The correction apparatus as claimed in claim 4, wherein the condition under which the active tone hold frame number correcting unit corrects the current active tone hold frame number is that the active tone flag indicates that the current frame is an active tone frame.