CN112102818B - Signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation - Google Patents
- Publication number
- CN112102818B (application CN202011297932.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- energy
- entropy
- activity detection
- voice activity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
  - G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
  - G10L19/02—using spectral analysis, e.g. transform vocoders or subband vocoders
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
  - G10L25/03—characterised by the type of extracted parameters
  - G10L25/18—the extracted parameters being spectral information of each sub-band
  - G10L25/45—characterised by the type of analysis window
  - G10L25/48—specially adapted for particular use
  - G10L25/51—for comparison or discrimination
  - G10L25/60—for measuring the quality of voice signals
  - G10L25/78—Detection of presence or absence of voice signals
  - G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Abstract
The signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation comprises the following steps: S1, processing the input noisy speech frame by frame; S2, setting a sliding window and continuously updating the minimum value of each frequency bin of the spectrum within the window; S3, computing the frame energy frame_energy and the frame spectral entropy frame_entropy of each frame; S4, judging whether the signal is in the voice activity detection state according to whether the frame energy and the frame spectral entropy simultaneously exceed their respective thresholds; and S5, when in the voice activity detection state, computing and updating the frame signal-to-noise ratio. By controlling the update timing of the frame signal-to-noise ratio through the voice activity detection state, the invention judges the instantaneous state of the environment and thus updates the frame signal-to-noise ratio more effectively and accurately.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to speech recognition, and in particular to a signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation.
Background
Speech application scenarios are becoming increasingly rich, and different scenarios are often accompanied by noise. Speech-related applications require tools such as decibel meters, and speech technologies such as speech recognition and array signal processing may also require the signal-to-noise ratio or use it to optimize the user experience. An accurate SNR estimate is therefore needed; obtaining one first requires a reasonably accurate real-time estimate of the background noise, and second a decision on when to update the SNR.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses a signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation.
The invention discloses a signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation, which comprises the following steps of:
S1, processing the input noisy speech frame by frame and applying a short-time Fourier transform to each frame of data to obtain the spectrum Y(k,l), where k is the frequency-bin index and l is the frame index;
s2, setting a sliding window, and continuously updating the minimum value of each frequency point of the frequency spectrum in the window;
the continuous updating method comprises the following specific steps:
Take the sum of the squared spectral magnitudes |Y(k,1)|² over all frequency bins of the first frame as the initial value of the background energy. Starting from the second frame, compare each frequency bin of the current frame with the same bin of all preceding frames inside the window and select the minimum; after the minimum of each individual bin is obtained, update the frame bin by bin to obtain the minimum background energy over the full band of the frame;
S3, computing the frame energy frame_energy and the frame spectral entropy frame_entropy of each frame;
S4, judging whether the signal is in the voice activity detection state according to whether frame_energy and frame_entropy simultaneously exceed their respective thresholds;
and S5, when in the voice activity detection state, computing and updating the frame signal-to-noise ratio.
Preferably: in step S2, the per-bin minimum is updated as min(k,l) = min{min(k,l−1), |Y(k,l)|²}, and the update equation of the minimum value of the background energy is:

back_energy(l) = α · back_energy(l−1) + (1 − α) · Σ_{k=0}^{N/2} min(k,l)

where min(k,l) is the pre-update minimum of frequency bin k and frame_energy is the frame energy; back_energy(l) is the background energy of the l-th frame; α is the background-energy smoothing parameter; and N is the number of Fourier transform points.
Preferably: the frame energy of the l-th frame is

frame_energy(l) = Σ_{k=0}^{N/2} |Y(k,l)|²

and the frame spectral entropy frame_entropy is estimated with the following formula, where N is the number of Fourier transform points:

frame_entropy(l) = − Σ_{k=0}^{N/2} p(k,l) · log p(k,l),  with p(k,l) = |Y(k,l)|² / Σ_{m=0}^{N/2} |Y(m,l)|²

where p(k,l) is the proportion of the power spectrum of bin k in the power spectrum of the whole frame, k is the frequency-bin index and l is the frame index.
Preferably: the thresholds set in step S4 are linearly related to the background spectral entropy. The background spectral entropy back_entropy(l) of the l-th frame is calculated as

back_entropy(l) = β · back_entropy(l−1) + (1 − β) · frame_entropy(l)

where β is the background spectral-entropy smoothing parameter and l is the frame index.
Preferably: the step S4 specifically includes:
when frame_energy and frame_entropy are simultaneously greater than their respective thresholds, the frame is in state 1; otherwise it is in state 2;
in state 1, the voice counter voice_frame is incremented by 1 and the silence counter silence_frame is reset to 0;
in state 2, the silence counter silence_frame is incremented by 1 and the voice counter voice_frame is reset to 0;
only when the number of consecutive occurrences of state 1 reaches the set state-1 count threshold is the voice activity detection state judged to be 1, i.e. the signal is considered to be in the voice activity detection state.
Preferably: when the voice activity detection state is 1, the frame signal-to-noise ratio in step S5 is obtained according to the following formula. The frame SNR of frame l is:

snr(l) = γ · snr(l−1) + (1 − γ) · frame_energy(l) / back_energy(l)

where γ is the frame SNR smoothing parameter, frame_energy(l) is the frame energy of the l-th frame, and back_energy(l) is the background energy of the l-th frame.
By controlling the update timing of the frame signal-to-noise ratio through the voice activity detection state, the invention judges the instantaneous state of the environment and thus updates the frame signal-to-noise ratio more effectively and accurately.
Drawings
Fig. 1 is a schematic flow chart of an embodiment of the signal-to-noise ratio calculation method according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The invention discloses a signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation, which comprises the following steps of:
S1, processing the input noisy speech frame by frame and applying a short-time Fourier transform to each frame of data to obtain the spectrum Y(k,l), where k is the frequency-bin index and l is the frame index;
S2, setting a sliding window and continuously updating the minimum value of each frequency bin of the spectrum within the window;
the continuous updating method comprises the following specific steps:
take the sum of the squared spectral magnitudes |Y(k,1)|² over all frequency bins of the first frame as the initial value of the background energy; starting from the second frame, compare each frequency bin of the current frame with the same bin of all preceding frames inside the window and select the minimum; after the minimum of each individual bin is obtained, update the frame bin by bin to obtain the minimum background energy over the full band of the frame;
S3, computing the frame energy frame_energy and the frame spectral entropy frame_entropy of each frame;
S4, judging whether frame_energy and frame_entropy simultaneously exceed their respective thresholds; if so, the signal is judged to be in the voice activity detection state, otherwise it is judged not to be;
and S5, when in the voice activity detection state, computing and updating the frame signal-to-noise ratio.
As shown in fig. 1, the input noisy speech y is processed frame by frame, the minimum of the background energy is updated through a sliding window, the frame energy and frame spectral entropy of each frame are computed, and the frame signal-to-noise ratio snr is updated according to whether the voice activity detection state is reached.
In the frame-by-frame processing, a short-time Fourier transform is applied to each frame of data to obtain the spectrum Y(k,l) and its magnitude |Y(k,l)|, where k denotes the frequency-bin index and l the frame index.
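As a concrete illustration of this frame-by-frame processing, the sketch below frames a signal and computes |Y(k,l)| with NumPy. The frame length, hop size, and Hann window are assumptions; the text itself specifies none of them.

```python
import numpy as np

def stft_magnitude(y, frame_len=256, hop=128):
    """Frame the noisy input y and return the magnitude spectrum |Y(k, l)|.

    Row l is frame l; column k is frequency bin k = 0 .. frame_len/2.
    Only bins up to N/2 are kept (rfft), matching the conjugate symmetry
    the text invokes for the entropy summation.
    """
    n_frames = 1 + (len(y) - frame_len) // hop
    win = np.hanning(frame_len)  # window choice is an assumption
    frames = np.stack([y[l * hop: l * hop + frame_len] * win
                       for l in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

mag = stft_magnitude(np.zeros(1024))  # 7 frames of 129 bins, all silent
```

With 1024 input samples, a 256-sample frame, and a 128-sample hop, the sketch yields 7 frames of 129 bins each.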
A sliding window is set, and the minimum of each frequency bin of the spectrum is continuously updated through the window; a smoothing strategy keeps the estimated background energy back_energy as smooth as possible, so that it is not changed abruptly by burst noise.
the method can obtain the frame energy frame _ energy and the frame spectrum entropy frame _ entropy at the same time, after obtaining the frame energy and the frame spectrum entropy respectively in each frame, compare the values of the two with respective threshold values,
obtaining the voice activity detection state according to the frame energy frame _ energy and the frame spectrum entropy frame _ entropy, if the frame energy frame _ energy and the frame spectrum entropy are larger than the threshold value, adding 1 to the voice counting frame, otherwise adding 1 to the quiet counting frame,
for example, setting the voice count frame to be greater than 5 or the quiet count frame to be greater than 10 may be used to determine whether to turn on or off the voice, and then output the voice activity detection status 0/1 to determine whether to update the snr.
One specific flow of sliding window noise estimation is given below:
according to the frequency spectrum amplitude of each frequency point in the first frameThe sum of the squared values is used as the background energy initialization value and recorded as the initial background energy data min (k, l). The sliding window frame length may be set to L =80, that is, the length of the frame can be set to cover the pronunciation of the monosyllabic word in chinese, but the length is not limited to L =80 and varies depending on the speech speed and the language type.
That is, each frequency bin of each frame is compared with the same bin of the previous L−1 frames inside the window of length L, the minimum of each bin is selected, and this minimum is written back to the background energy data min(k,l). After the background energy data min(k,l) of each individual bin is obtained, it is updated bin by bin to obtain the minimum background energy over the full band of the frame; subsequent operations use this minimum to update the background energy.
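A minimal sketch of this per-bin minimum tracking, assuming the per-frame power spectra are already available as arrays; the function name and window handling are ours, with L = 80 taken from the example above.

```python
import numpy as np

def min_track(power_frames, window_len=80):
    """Per-bin minimum of |Y(k, l)|^2 over the last `window_len` frames.

    power_frames: list of 1-D arrays of per-bin power for frames 1..l.
    Returns the per-bin minima min(k, l) and their full-band sum.
    """
    window = np.stack(power_frames[-window_len:])  # frames inside the window
    minima = window.min(axis=0)                    # minimum per frequency bin
    return minima, float(minima.sum())

# A loud burst in the newest frame must not raise the tracked floor.
frames = [np.array([1.0, 2.0]), np.array([0.5, 3.0]), np.array([9.0, 9.0])]
minima, floor = min_track(frames)
```

This is what makes the estimate robust to burst noise: a transient raises the current frame's power but not the minimum held inside the window.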
The background energy of the l-th frame is updated as

back_energy(l) = α · back_energy(l−1) + (1 − α) · Σ_{k=0}^{N/2} min(k,l)

where frame_energy is the frame energy value. In the present invention, the background energy smoothing parameter α can be set to 0.9, and l is the frame index.
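The recursive smoothing just described can be sketched as follows; the formula image is not reproduced in this text, so the exact combination of terms (first-order smoothing toward the summed per-bin minima, α = 0.9) is our reading.

```python
def smooth_background(back_prev, min_sum, alpha=0.9):
    """back_energy(l) = alpha * back_energy(l-1) + (1 - alpha) * min_sum,
    where min_sum is the full-band sum of the tracked per-bin minima."""
    return alpha * back_prev + (1.0 - alpha) * min_sum

back = smooth_background(100.0, 50.0)  # moves only 10% of the way per frame
```

With α = 0.9 the estimate moves a tenth of the gap each frame, so a sudden drop or rise in the tracked floor reaches the background estimate only gradually.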
One specific procedure for voice activity detection is given below:
the frame energy frame _ energy of each frame is first obtained according to a time domain method, the specific time domain method provided in the present invention is used for reference, and the obtaining method is not limited to the time domain method, but can also be a frequency domain method. In particular to spectral amplitudeThe square of the frame is added frequency point by frequency point to obtain the frame energy of the l frame:
Next the frame spectral entropy frame_entropy of the frame is computed. The simplest estimate uses the following formula for the spectral entropy of the current frame, where N is the number of Fourier transform points; because of conjugate symmetry the summation runs to N/2:

p(k,l) = |Y(k,l)|² / Σ_{m=0}^{N/2} |Y(m,l)|²,  frame_entropy(l) = − Σ_{k=0}^{N/2} p(k,l) · log p(k,l)
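Under the formulas above, frame energy and spectral entropy can be sketched as below; the FFT size and the small epsilon guarding log(0) are assumptions.

```python
import numpy as np

def frame_features(frame, n_fft=256):
    """Frame energy and frame spectral entropy from one frame of samples."""
    power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2  # |Y(k, l)|^2, k = 0..N/2
    frame_energy = float(power.sum())
    p = power / power.sum()                           # p(k, l): per-bin power share
    frame_entropy = float(-np.sum(p * np.log(p + 1e-12)))  # -sum p log p
    return frame_energy, frame_entropy

# A pure tone concentrates power in one bin (low entropy);
# white noise spreads it across bins (high entropy).
t = np.arange(256) / 8000.0
_, tone_entropy = frame_features(np.sin(2 * np.pi * 1000.0 * t))
_, noise_entropy = frame_features(np.random.default_rng(0).standard_normal(256))
```

The contrast between the two entropies is exactly what the detector exploits: speech and tonal signals give low spectral entropy, broadband noise gives high entropy.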
after the frame energy frame _ energy and the frame spectral entropy frame _ entropy are obtained, a background spectral entropy back _ entropy needs to be obtained, and the update timing of the background spectral entropy is performed according to whether the background spectral entropy is in a voice activity detection state or not, that is, when the background spectral entropy is in the voice activity detection state, the background spectral entropy is updated.
The background spectral entropy back_entropy is smoothed as follows, i.e. the background spectral entropy of the l-th frame is

back_entropy(l) = β · back_entropy(l−1) + (1 − β) · frame_entropy(l)

where the background spectral-entropy smoothing parameter β can be chosen as 0.95 and l is the frame index.
The step S4 may specifically be:
when frame_energy and frame_entropy are simultaneously greater than their respective thresholds, the frame is in state 1; otherwise it is in state 2;
in state 1, the voice counter voice_frame is incremented by 1 and the silence counter silence_frame is reset to 0;
in state 2, the silence counter silence_frame is incremented by 1 and the voice counter voice_frame is reset to 0.
In frame-by-frame detection, when the number of consecutive occurrences of state 1 reaches the set state-1 count threshold, the voice activity detection state is judged to be 1, meaning the signal is in the voice activity detection state; otherwise it is not. If the number of consecutive occurrences of state 2 reaches the set state-2 count threshold, the voice activity detection state is 0; no speech is judged to be present, and the system may enter a power-saving standby mode.
Let th_energy be the threshold of the frame energy and th_entropy the threshold of the frame spectral entropy. The voice-activity-detection thresholds can be set with reference to, but are not limited to, the following; the settings in the present invention are one specific implementation:
when the current frame energy frame_energy and frame spectral entropy frame_entropy are simultaneously greater than their respective thresholds, the voice counter voice_frame is incremented by 1 and the silence counter silence_frame is set to 0;
otherwise, when a silence-like frame appears, the silence counter is incremented by 1 and the voice counter is reset to 0. In other words, the accumulation of the voice counter and the silence counter must not be interrupted: only consecutive occurrences accumulate, and after an interruption the counter is cleared and accumulation restarts from zero.
For example, if the voice counter voice_frame > 5, i.e. the state-1 count threshold is 5, the voice activity detection state vad_state is judged to be 1 and the signal is in the voice activity detection state;
if the silence counter silence_frame > 10, i.e. the state-2 count threshold is 10, the voice activity detection state vad_state is considered to be 0, and the system may enter the power-saving state.
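The counting logic above can be sketched as a small state machine. The dictionary-based state and the function name are ours; the example thresholds 5 and 10 follow the text.

```python
def vad_step(above_both, st):
    """One frame of the counting logic: state 1 when frame energy AND spectral
    entropy both exceed their thresholds, state 2 otherwise. Only uninterrupted
    runs accumulate; an interruption clears the opposing counter."""
    if above_both:                      # state 1: speech-like frame
        st["voice_frame"] += 1
        st["silence_frame"] = 0
    else:                               # state 2: silence-like frame
        st["silence_frame"] += 1
        st["voice_frame"] = 0
    if st["voice_frame"] > 5:           # example state-1 count threshold
        st["vad_state"] = 1
    elif st["silence_frame"] > 10:      # example state-2 count threshold
        st["vad_state"] = 0
    return st["vad_state"]

st = {"voice_frame": 0, "silence_frame": 0, "vad_state": 0}
# Six consecutive speech-like frames switch the detector on; a single
# silence-like frame resets the run but does not switch the detector off.
history = [vad_step(f, st) for f in [True] * 6 + [False] + [True] * 3]
```

The asymmetric thresholds act as hangover: the detector turns on after 6 consecutive speech-like frames but needs 11 consecutive silence-like frames to turn off.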
When the voice activity detection state vad_state is 1, the signal is considered to be in the voice activity detection state, and the frame signal-to-noise ratio can be computed and updated.
When in the voice activity detection state, i.e. when vad_state is 1, the frame signal-to-noise ratio of the l-th frame can be obtained according to the following formula:

snr(l) = γ · snr(l−1) + (1 − γ) · frame_energy(l) / back_energy(l)

When vad_state is 0, the signal is considered not to be in the voice activity detection state and no update is performed. The frame SNR smoothing parameter γ may be set to 0.8; frame_energy(l) and back_energy(l) denote the frame energy and the background energy of the l-th frame, respectively.
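A sketch of this gated, smoothed update. Whether the SNR is expressed as a plain energy ratio or in decibels is not visible in this text, so the plain ratio is used here, with γ = 0.8 as in the example.

```python
def update_snr(prev_snr, frame_energy, back_energy, vad_state, gamma=0.8):
    """snr(l) = gamma * snr(l-1) + (1 - gamma) * frame_energy / back_energy,
    applied only while vad_state == 1; otherwise the previous value is held."""
    if vad_state != 1:
        return prev_snr                 # no update outside speech activity
    return gamma * prev_snr + (1.0 - gamma) * (frame_energy / back_energy)

snr = update_snr(0.0, frame_energy=1000.0, back_energy=10.0, vad_state=1)
held = update_snr(snr, frame_energy=1.0, back_energy=10.0, vad_state=0)
```

Gating on vad_state is the point of the method: during silence the frame energy is itself background noise, and updating the SNR there would drag the estimate toward 1.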
By controlling the update timing of the frame signal-to-noise ratio through the voice activity detection state, the invention judges the instantaneous state of the environment and thus updates the frame signal-to-noise ratio more effectively and accurately.
The foregoing describes preferred embodiments of the present invention. The preferred features of these embodiments may be combined in any manner, provided the combination is not contradictory or mutually exclusive. The specific parameters in the examples and embodiments serve only to illustrate the inventors' verification process and are not intended to limit the scope of patent protection of the present invention, which is defined by the claims; equivalent structural changes made according to the content of this description also fall within the protection scope of the present invention.
Claims (2)
1. The signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation is characterized by comprising the following steps of:
S1, processing the input noisy speech frame by frame and applying a short-time Fourier transform to each frame of data to obtain the spectrum Y(k,l), where k is the frequency-bin index and l is the frame index;
S2, setting a sliding window and continuously updating the minimum value of each frequency bin of the spectrum within the window;
the continuous updating method comprises the following specific steps:
take the sum of the squared spectral magnitudes |Y(k,1)|² over all frequency bins of the first frame as the initial value of the background energy; starting from the second frame, compare each frequency bin of the current frame with the same bin of all preceding frames inside the window and select the minimum; after the minimum of each individual bin is obtained, update the frame bin by bin to obtain the minimum background energy over the full band of the frame;
S3, computing the frame energy frame_energy and the frame spectral entropy frame_entropy of each frame;
S4, judging whether the signal is in the voice activity detection state according to whether frame_energy and frame_entropy simultaneously exceed their respective thresholds;
S5, when in the voice activity detection state, computing and updating the frame signal-to-noise ratio;
in said step S3, the frame energy of the l-th frame is

frame_energy(l) = Σ_{k=0}^{N/2} |Y(k,l)|²

and the frame spectral entropy frame_entropy is estimated with the following formula, where N is the number of Fourier transform points:

frame_entropy(l) = − Σ_{k=0}^{N/2} p(k,l) · log p(k,l)

where p(k,l) = |Y(k,l)|² / Σ_{m=0}^{N/2} |Y(m,l)|² is the proportion of the power spectrum of bin k in the power spectrum of the whole frame, k is the frequency-bin index and l is the frame index;
the thresholds set in said step S4 are linearly related to the background spectral entropy;
the background spectral entropy back_entropy(l) of the l-th frame is calculated as

back_entropy(l) = β · back_entropy(l−1) + (1 − β) · frame_entropy(l)

where β is the background spectral-entropy smoothing parameter and l is the frame index;
the step S4 specifically includes:
when frame_energy and frame_entropy are simultaneously greater than their respective thresholds, the frame is in state 1; otherwise it is in state 2;
in state 1, the voice counter voice_frame is incremented by 1 and the silence counter silence_frame is reset to 0;
in state 2, the silence counter silence_frame is incremented by 1 and the voice counter voice_frame is reset to 0;
only when the number of consecutive occurrences of state 1 reaches the set state-1 count threshold is the voice activity detection state judged to be 1, i.e. the signal is considered to be in the voice activity detection state;
when the voice activity detection state is 1, the frame signal-to-noise ratio in step S5 is obtained according to the following formula; the frame SNR of frame l is

snr(l) = γ · snr(l−1) + (1 − γ) · frame_energy(l) / back_energy(l)

where γ is the frame SNR smoothing parameter, frame_energy(l) is the frame energy of the l-th frame, and back_energy(l) is the background energy of the l-th frame.
2. The signal-to-noise ratio calculation method according to claim 1, characterized in that in step S2 the update equation of the minimum value of the background energy is

back_energy(l) = α · back_energy(l−1) + (1 − α) · Σ_{k=0}^{N/2} min(k,l)

where min(k,l) is the pre-update minimum of frequency bin k and frame_energy is the frame energy; back_energy(l) is the background energy of the l-th frame; α is the background-energy smoothing parameter; and N is the number of Fourier transform points.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011297932.9A CN112102818B (en) | 2020-11-19 | 2020-11-19 | Signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011297932.9A CN112102818B (en) | 2020-11-19 | 2020-11-19 | Signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112102818A CN112102818A (en) | 2020-12-18 |
CN112102818B true CN112102818B (en) | 2021-01-26 |
Family
ID=73785304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011297932.9A Active CN112102818B (en) | 2020-11-19 | 2020-11-19 | Signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112102818B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798514B (en) * | 2023-02-06 | 2023-04-21 | 成都启英泰伦科技有限公司 | Knock detection method |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1212603C (en) * | 2003-08-08 | 2005-07-27 | 中国科学院声学研究所 | Non linear spectrum reduction and missing component estimation method |
US7660713B2 (en) * | 2003-10-23 | 2010-02-09 | Microsoft Corporation | Systems and methods that detect a desired signal via a linear discriminative classifier that utilizes an estimated posterior signal-to-noise ratio (SNR) |
CN1322488C (en) * | 2004-04-14 | 2007-06-20 | 华为技术有限公司 | Method for strengthening sound |
CN101802909B (en) * | 2007-09-12 | 2013-07-10 | 杜比实验室特许公司 | Speech enhancement with noise level estimation adjustment |
JP4950930B2 (en) * | 2008-04-03 | 2012-06-13 | 株式会社東芝 | Apparatus, method and program for determining voice / non-voice |
CN102044243B (en) * | 2009-10-15 | 2012-08-29 | 华为技术有限公司 | Method and device for voice activity detection (VAD) and encoder |
CN104021796B (en) * | 2013-02-28 | 2017-06-20 | 华为技术有限公司 | Speech enhan-cement treating method and apparatus |
CN105023572A (en) * | 2014-04-16 | 2015-11-04 | 王景芳 | Noised voice end point robustness detection method |
CN104125579B (en) * | 2014-08-07 | 2017-07-11 | 桂林电子科技大学 | A kind of frequency spectrum sensing method and device based on time domain energy Yu frequency domain spectra entropy |
CN105741849B (en) * | 2016-03-06 | 2019-03-22 | 北京工业大学 | The sound enhancement method of phase estimation and human hearing characteristic is merged in digital deaf-aid |
CN107331393B (en) * | 2017-08-15 | 2020-05-12 | 成都启英泰伦科技有限公司 | Self-adaptive voice activity detection method |
CN110706693B (en) * | 2019-10-18 | 2022-04-19 | 浙江大华技术股份有限公司 | Method and device for determining voice endpoint, storage medium and electronic device |
- 2020-11-19: application CN202011297932.9A filed in CN; patent CN112102818B granted, status Active
Also Published As
Publication number | Publication date |
---|---|
CN112102818A (en) | 2020-12-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | Adaptive noise estimation algorithm for speech enhancement | |
US6289309B1 (en) | Noise spectrum tracking for speech enhancement | |
US9142221B2 (en) | Noise reduction | |
CN103456310B (en) | Transient noise suppression method based on spectrum estimation | |
EP1065657B1 (en) | Method for detecting a noise domain | |
US7302388B2 (en) | Method and apparatus for detecting voice activity | |
US9349384B2 (en) | Method and system for object-dependent adjustment of levels of audio objects | |
CN105023572A (en) | Noised voice end point robustness detection method | |
EP1887559B1 (en) | Yule walker based low-complexity voice activity detector in noise suppression systems | |
JP5752324B2 (en) | Single channel suppression of impulsive interference in noisy speech signals. | |
Ma et al. | Perceptual Kalman filtering for speech enhancement in colored noise | |
Nelke et al. | Single microphone wind noise PSD estimation using signal centroids | |
CN103544961A (en) | Voice signal processing method and device | |
CN112102818B (en) | Signal-to-noise ratio calculation method combining voice activity detection and sliding window noise estimation | |
KR101295727B1 (en) | Apparatus and method for adaptive noise estimation | |
US11183172B2 (en) | Detection of fricatives in speech signals | |
EP4128225A1 (en) | Noise supression for speech enhancement | |
KR100784456B1 (en) | Voice Enhancement System using GMM | |
US8788265B2 (en) | System and method for babble noise detection | |
KR102718917B1 (en) | Detection of fricatives in speech signals | |
Hendriks et al. | Speech reinforcement in noisy reverberant conditions under an approximation of the short-time SII | |
CN118398022B (en) | Improved speech enhancement noise reduction method | |
He et al. | Codebook-based speech enhancement using Markov process and speech-presence probability. | |
CN113409812B (en) | Processing method and device of voice noise reduction training data and training method | |
Verteletskaya et al. | Enhanced spectral subtraction method for noise reduction with minimal speech distortion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||