CN112885375A - Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network - Google Patents

Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network Download PDF

Info

Publication number
CN112885375A
Authority
CN
China
Prior art keywords
sub-band
noise
energy
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110025619.8A
Other languages
Chinese (zh)
Inventor
王龙标
李楠
党建武
张苏林
于波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110025619.8A priority Critical patent/CN112885375A/en
Publication of CN112885375A publication Critical patent/CN112885375A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a global signal-to-noise ratio (SNR) estimation method based on an auditory filter bank and a convolutional neural network, which comprises the following steps: 1) for noisy speech, dividing the audio into different sub-bands with high-pass and low-pass filters according to the Bark scale, and calculating the energy of each sub-band; 2) constructing a convolutional neural network, calculating the noise proportion in each sub-band, and from it the noise energy within the sub-band; 3) calculating the global SNR. For global SNR estimation in noisy environments, the invention mainly provides a dynamic noise estimation method that combines a human-ear filter bank with a multi-subband convolutional neural network. For the energies of the different sub-bands, a convolutional-neural-network-based noise ratio estimation method is proposed that dynamically estimates the noise energy ratio of each sub-band. The dynamic sub-band noise energies are then fused from the sub-band level to the full band, further improving the accuracy of the global SNR calculation.

Description

Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Technical Field
The invention relates to the field of speech signal processing, and in particular to a global signal-to-noise ratio estimation method based on an auditory filter bank and a convolutional neural network, addressing the problem of inaccurate noise estimation in environments with a relatively low signal-to-noise ratio.
Background
In recent years, emerging industries such as smart homes, conversational robots, and smart speakers have developed rapidly, greatly changing people's way of life and the way people interact with machines; speech interaction, as a new interaction mode, has been widely applied in these emerging fields. With the application of deep learning to speech recognition, recognition performance has improved greatly: the recognition rate exceeds 95%, essentially reaching the level of human listening. However, this holds only under near-field conditions, where the noise and room reverberation are very small; achieving a good recognition effect in complex scenes (heavy noise or heavy reverberation) has therefore become critical to the user experience.
Noise estimation is an important research direction for far-field speech recognition. In a noisy environment, the influence of noise on clean speech can generally be expressed as the signal-to-noise ratio (SNR), defined as the ratio of signal power to noise power expressed in decibels (dB). Accurate SNR estimation can help in designing algorithms and systems that compensate for noise effects, such as robust speech recognition systems, speech enhancement, and noise suppression. SNR estimation is nevertheless a challenging task, since we usually do not know how different kinds of noise affect the original audio in a given environment.
Generally, SNR estimation is divided into two categories: local SNR estimation, which usually focuses on frame-level SNR, and global SNR estimation, which typically focuses on the overall distribution of noise over a period of time. In this work we mainly address the global SNR estimation problem.
The main methods for global SNR estimation fall into two types. The first are classical signal-processing methods, a typical example being waveform amplitude distribution analysis (WADA), which assumes that speech and noise follow Gamma and Gaussian distributions, respectively. The problem with this method is its assumption of Gaussian background noise: in daily life there are many different kinds of noise besides Gaussian noise, so it is severely limited in practical applications. The second are deep-learning-based methods, which usually perform noise estimation from various kinds of input speech features, but their performance drops sharply as the SNR decreases in real environments and under the influence of non-stationary noise. Proposing a global SNR estimation method for real scenes therefore remains a challenging topic.
Disclosure of Invention
The invention aims to explore a method that computes the noise energy ratio of each sub-band of an auditory filter bank with a convolutional neural network, so as to improve the accuracy of global SNR estimation.
The technical scheme of the invention is a global signal-to-noise ratio estimation method based on an auditory filter bank and a convolutional neural network, which specifically comprises the following steps: 1) for noisy speech, dividing the audio into different sub-bands with high-pass and low-pass filters according to the Bark scale, and calculating the energy of each sub-band; 2) constructing a convolutional neural network, calculating the noise proportion in each sub-band, and from it the noise energy within the sub-band; 3) calculating the global SNR.
The method comprises the following specific steps:
1) filter bank based on Bark scale
It is difficult to distinguish noise from noisy speech with a global full-band approach. To overcome this, this study uses a multi-subband approach. Noise has a different distribution at different frequencies. Under high-SNR conditions, noise is distributed mainly in the high band, so it can easily be identified from the high-frequency subbands. Under low-SNR conditions it is difficult to separate the energies of speech and noise. Using subbands divides the noise into multiple frequency bands, making the speech and noise portions easier to determine; dividing the noisy speech into subbands of different frequencies thus improves the ability to distinguish noise from speech.
As shown in fig. 1, a listener usually pays more attention to the low and mid frequencies of a speech segment, so in this study a hearing-based filter bank is used to segment the original speech waveform into different subbands. The hearing-based filter bank used here is a Bark-scale filter bank consisting of band-pass filters whose bandwidth is constant on the Bark scale. The cut-off frequencies of the filters were set according to the Bark scale to [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700] Hz, and the sampling frequency of the speech was reduced to 8000 Hz in this experiment. The filtering can be expressed as the following function:
y(k,n) = BFB(y(n))
where n is the sample index, k indexes the k-th of the K subbands into which the audio is split, and BFB represents the Bark filter bank. After splitting into the different subbands, we also need to compute the energy of each subband as follows:
E_total(k,n) = |y(k,n)|²
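As an illustration of this step, the following Python sketch splits a waveform into the Bark-scale subbands listed above and computes the per-sample subband energies. The description does not specify the filter family, so the Butterworth band-pass design and the helper names bark_filter_bank and subband_energy are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import butter, sosfilt

# Bark-scale cut-off frequencies (Hz) from the description; fs = 8000 Hz.
BARK_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700]

def bark_filter_bank(y, fs=8000, order=4):
    """Split waveform y(n) into K = 17 subbands y(k, n).

    Each subband is isolated with a Butterworth band-pass filter
    (equivalently a high-pass/low-pass cascade); the last band runs
    from 3700 Hz to just below the Nyquist frequency.
    """
    edges = BARK_EDGES + [fs // 2 - 1]
    subbands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
        subbands.append(sosfilt(sos, y))
    return np.stack(subbands)            # shape (K, n)

def subband_energy(subbands):
    """E_total(k, n) = |y(k, n)|^2 for every sample of every subband."""
    return np.abs(subbands) ** 2
```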
2) computation of sub-band noise energy
Fig. 2 shows the proposed subband noise energy estimation method. In the training phase, the subband energies are input into the proposed subband noise estimation network (SNENet) to estimate the subband noise energy ratios; the labels used in training are calculated by the following formula:
r(k) = Σ_{n=1}^{N} |noise(k,n)|² / Σ_{n=1}^{N} |y(k,n)|²
where R = [r(1), r(2), ..., r(K)], N is the total number of samples in a frame of speech, and r(k) is the ratio of the noise energy of the k-th subband to its total energy. In the training process a neural network g_θ is trained so that the error
‖g_θ(E_total) − R‖²
is minimal;
wherein R is the set of noise energy ratios of the subbands and g_θ is the proposed subband noise energy estimation network (SNENet);
in the decoding (estimation) stage, we directly apply the subband energy E of the test datak,totalThe estimated sub-band noise energy ratio can be obtained by inputting the noise energy ratio into the trained network, and the final sub-band noise energy can be obtained by multiplying the sub-band noise energy ratio and the total sub-band energy, as shown in the following formula:
Figure BDA0002890217370000034
wherein the content of the first and second substances,
Figure BDA0002890217370000035
for the estimated noise ratio of the kth subband, ET(k) For the found magnitude of the noise energy in each subband.
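Under the label definition reconstructed above, the training targets and the decoding-stage product can be sketched in a few lines; the helper names noise_ratio_labels and subband_noise_energy are hypothetical.

```python
import numpy as np

def noise_ratio_labels(noise_subbands, noisy_subbands):
    """Training targets r(k): per-subband ratio of noise energy to total energy.

    Both inputs have shape (K, N): K subbands, N samples per frame.
    """
    noise_e = np.sum(np.abs(noise_subbands) ** 2, axis=-1)
    total_e = np.sum(np.abs(noisy_subbands) ** 2, axis=-1)
    return noise_e / np.maximum(total_e, 1e-12)   # R = [r(1), ..., r(K)]

def subband_noise_energy(ratio_hat, e_total):
    """Decoding stage: E_T(k) = r_hat(k) * E_total(k)."""
    return ratio_hat * e_total
```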
Fig. 3 shows the proposed structure of SNENet, in which a CNN encoder-decoder is used to obtain more accurate local context patterns from the given subband speech energies. Beyond fully connected layers, we also use another convolutional network structure, a CNN encoder-decoder (C-ED) network. As shown in FIG. 3, the C-ED consists of convolution, average pooling, batch normalization, and ReLU layers. The numbers of encoder and decoder filters correspond: the number of encoder filters gradually increases while the number of decoder filters gradually decreases. The channels of the convolutional layers correspond to the different subbands, the average pooling layers reduce the number of parameters, and, to improve the generalization ability of the model, convolution kernels of different sizes are used in the CNN model to learn different context patterns.
To estimate the noise more accurately, a fully-connected-layer-based network is also used in SNENet. Its deeper nonlinear operations allow the network to predict more detailed information, which helps in learning the subband noise ratios. This mapping network consists of two fully connected layers with ReLU activation functions; finally, the subband noise energy ratios are obtained through one more fully connected layer whose activation function is a Sigmoid.
3) Calculation of global signal-to-noise ratio
In this method, the power of the speech waveform is calculated as the sum of the powers of all subbands. Since the threshold is designed separately for each subband, the estimates of noise and speech power are much more accurate than a direct estimate in the time domain. The final global SNR is obtained by fusing the powers of all subbands as follows:
SNR = 10 · log10( Σ_{k=1}^{K} P_S(k) / Σ_{k=1}^{K} P_N(k) )
where P_S(k) is the sum of the energies of all clean speech in the k-th subband and P_N(k) is the energy of all noise in the k-th subband; adding up these subband energy sums yields the final estimated global SNR.
P_N(k) is obtained by accumulating the estimated subband noise energies E_T(k) over the L_N frames used for the noise estimate. The estimated noise ratio is not completely correct, so the global SNR is calculated most accurately when the number of speech frames is larger than a certain value, where L is the total number of speech frames. Finally, P_S(k) is obtained by subtracting the total noise energy from the total energy.
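The subband-to-full-band fusion can be sketched as follows; the helper name global_snr_db is hypothetical, and recovering P_S(k) by subtracting the estimated noise energy from the total energy follows the description above.

```python
import numpy as np

def global_snr_db(e_total, e_noise):
    """Fuse per-subband energies into a global SNR in dB.

    e_total: total energy per subband, accumulated over frames, shape (K,)
    e_noise: estimated noise energy per subband E_T(k), shape (K,)
    """
    p_n = np.sum(e_noise)                 # sum of P_N(k) over all subbands
    p_s = np.sum(e_total) - p_n           # P_S(k) = total minus noise, summed
    return 10.0 * np.log10(np.maximum(p_s, 1e-12) / np.maximum(p_n, 1e-12))
```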
Advantageous effects
For global SNR estimation in noisy environments, the invention mainly provides a dynamic noise estimation method that combines a human-ear filter bank with a multi-subband convolutional neural network.
1. Through the human-ear filter bank, the auditory mechanism of the human ear in noisy environments is exploited to divide the noisy speech into several subbands, with higher resolution assigned to the low and mid frequency bands, improving the estimation of subband energy in those bands.
2. For the energies of the different subbands, a convolutional-neural-network-based noise ratio estimation method is proposed that dynamically estimates the noise energy ratio of each subband.
3. Using the dynamic subband noise energies, the subband estimates are further fused into a full-band SNR estimate, further improving the accuracy of the global SNR calculation.
Drawings
FIG. 1 is a flow diagram of a global SNR estimation system;
FIG. 2 is a flow chart of the computation of subband noise energy;
FIG. 3 is a framework of the estimation of subband noise ratios;
FIG. 4 shows the MAE between the estimated global SNR and the true global SNR.
Detailed Description
The operation and effects of the present invention are shown below with reference to the accompanying drawings and tables.
This example gives an embodiment of the invention based on the speech dataset AURORA-2J and the noise dataset NOISEX-92. The overall system flow is shown in fig. 1 and comprises four steps: dataset preparation, subband feature extraction, SNENet model training, and global SNR calculation.
The method comprises the following specific steps:
1) Dataset preparation
We used the AURORA-2J and NOISEX-92 datasets for evaluation. 8440 clean AURORA-2J utterances were selected as the clean speech of the training dataset. From NOISEX-92, white noise, pink noise, factory noise, and babble noise were used as background noise. The SNRs were set to 20, 15, 10, 5, 0, -5, and -10 dB; the noisy signals were generated by adding the noise to the clean speech at each target SNR, and these clean and noisy speech pairs were then used to derive the SNR labels. The sampling frequency is 8 kHz and the number of subbands is 17.
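One common recipe for mixing clean speech and noise at a target SNR is sketched below; the description does not give the exact mixing procedure, so the mix_at_snr helper and its details are assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals
    `snr_db` decibels, then add it to `clean`."""
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]    # match utterance length
    p_clean = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```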
For the test set, we added different noise types at different SNRs to 1001 clean AURORA-2J sentences to test the proposed method.
2) Extraction of sub-band features
We resample all audio to 8 kHz, with the number of subbands set to 17; the subband settings are the same as in the technical scheme above.
3) Training SNENet model
The structure of SNENet is shown in FIG. 3; the network is trained as a CNN encoder-decoder model in TensorFlow on the subband energy features. All hidden layers use ReLU as the activation function, and we use the Adam algorithm as the optimizer. The numbers of convolutional filters are 17, 40, 64, 128, 64, 40, and 17, and the kernel lengths and widths are set to 2, 3, 5, 7, 5, 3, and 2, respectively. In the mapping network, the hidden size is set to 512. After the SNENet model is trained, the subband noise energy calculation is performed, as shown in fig. 2.
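A minimal Keras sketch consistent with these hyperparameters is given below. The use of 1-D convolutions over the frame axis, the placement of pooling in the encoder half only, and the global pooling before the mapping network are assumptions not fixed by the description; build_snenet is a hypothetical name.

```python
import tensorflow as tf
from tensorflow.keras import layers

K_SUBBANDS = 17
FILTERS = [17, 40, 64, 128, 64, 40, 17]   # encoder grows, decoder shrinks
KERNELS = [2, 3, 5, 7, 5, 3, 2]           # per-layer kernel sizes

def build_snenet(n_frames=100):
    """C-ED sketch: conv / batch-norm / ReLU stacks with average pooling
    in the encoder, then a two-layer ReLU mapping network (hidden 512)
    and a Sigmoid output of K per-subband noise energy ratios."""
    x_in = layers.Input(shape=(n_frames, K_SUBBANDS))   # subband energies
    x = x_in
    for i, (f, k) in enumerate(zip(FILTERS, KERNELS)):
        x = layers.Conv1D(f, k, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        if i < len(FILTERS) // 2:          # pool only in the encoder half
            x = layers.AveragePooling1D(2)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    r_hat = layers.Dense(K_SUBBANDS, activation="sigmoid")(x)
    return tf.keras.Model(x_in, r_hat)

model = build_snenet()
model.compile(optimizer="adam", loss="mse")   # Adam, as in the description
```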
4) Global signal-to-noise ratio calculation
After calculating the subband noise energies, we need to estimate the global SNR. We evaluate the proposed method with the mean absolute error (MAE), as shown in the following equation:
MAE = (1/N) Σ_{i=1}^{N} |G_i − R_i|
where G_i is the estimated global SNR, R_i is the true global SNR, and N is the number of test utterances; here N = 1001.
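For completeness, the metric is a one-liner in NumPy:

```python
import numpy as np

def mae(snr_est, snr_true):
    """Mean absolute error between estimated and true global SNRs (dB)."""
    return np.mean(np.abs(np.asarray(snr_est) - np.asarray(snr_true)))
```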
Fig. 4 shows the MAE results, where the previously proposed method is a signal-processing-based multi-subband global SNR estimation method. The results show that under stationary noise conditions the proposed method essentially matches the true values, as shown in panels a and b of fig. 4, but degrades slightly under factory noise and babble noise at low SNR, as shown in panels c and d of fig. 4. This weakness is shared by all methods: under such non-stationary noise conditions the distributions of noise and signal are very similar, and it is difficult to devise a method that estimates the noise perfectly.

Claims (3)

1. The global signal-to-noise ratio estimation method based on the auditory filter bank and the convolutional neural network is characterized by comprising the following steps of:
1) for noisy speech, dividing audio into different sub-bands by utilizing a high-pass filter and a low-pass filter according to a bark scale, and calculating the energy of each sub-band;
2) constructing a convolutional neural network, calculating the noise proportion in each sub-band, and further calculating the noise energy in the sub-band;
3) calculating a global SNR;
the method comprises the following specific steps:
1) filter bank based on Bark scale
Dividing the noisy speech into sub-bands of different frequencies by using a multi-sub-band method;
using a Bark-scale filter bank consisting of band-pass filters whose bandwidth is constant on the Bark scale, the cut-off frequencies of the filters being set according to the Bark scale to [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700] Hz, the sampling frequency of the speech being reduced to 8000 Hz in this experiment, which can be expressed as the function y(k,n) = BFB(y(n)),
wherein n is the sample index, k is the k-th of the K subbands into which the audio is divided, and BFB represents the Bark filter bank;
after division into the different subbands, the energy of each subband needs to be calculated as follows: E_total(k,n) = |y(k,n)|²;
2) Computation of sub-band noise energy
In the training stage, the subband energies are input into the proposed subband noise estimation network to estimate the subband noise energy ratios, and the labels in the training process are calculated by the following formula:
r(k) = Σ_{n=1}^{N} |noise(k,n)|² / Σ_{n=1}^{N} |y(k,n)|²
wherein R = [r(1), r(2), ..., r(K)], N is the total number of sampling points in a frame of speech, and r(k) is the noise energy ratio of the k-th subband; in the training process a neural network g_θ is trained so that the error ‖g_θ(E_total) − R‖² is minimal;
wherein R is the set of noise energy ratios of the subbands and g_θ is the proposed subband noise energy estimation network (SNENet);
in the decoding/estimating stage, the sub-band energy E of the test data is directly usedk,totalThe estimated sub-band noise energy ratio is obtained after the sub-band noise energy ratio is input into a trained network, and the final sub-band noise energy can be obtained by multiplying the sub-band noise energy ratio and the total sub-band energy, wherein the following formula is shown:
Figure FDA0002890217360000022
wherein the content of the first and second substances,
Figure FDA0002890217360000023
for the estimated noise ratio of the kth subband, ET(k) Obtaining the magnitude of noise energy in each sub-band;
3) calculation of global signal-to-noise ratio
the power of the speech waveform is calculated as the sum of the powers of all the subbands, and finally the global SNR is obtained by fusing the powers of all the subbands as follows:
SNR = 10 · log10( Σ_{k=1}^{K} P_S(k) / Σ_{k=1}^{K} P_N(k) )
wherein P_S(k) is the sum of the energies of all clean speech in the k-th subband and P_N(k) is the energy of all noise in the k-th subband, and adding these subband energy sums yields the final estimated global SNR;
wherein P_N(k) is calculated by accumulating the estimated subband noise energies E_T(k) over the L_N frames used for the noise estimate, the global SNR being calculated most accurately when the number of speech frames is larger than a certain value, wherein L is the total number of speech frames;
finally, P_S(k) is obtained by subtracting the total noise energy from the total energy.
2. The auditory filter bank and convolutional neural network based global SNR estimation method according to claim 1, wherein a CNN encoder-decoder is used in SNENet: in addition to the fully connected layers, another convolutional network structure, a CNN encoder-decoder (C-ED) network, is used, wherein the C-ED consists of convolution, average pooling, batch normalization, and ReLU layers;
the numbers of encoder and decoder filters correspond, the number of encoder filters gradually increasing and the number of decoder filters gradually decreasing;
the channels of the convolutional layers correspond to the different subbands, the average pooling layer reduces the number of parameters, and convolution kernels of different sizes are arranged in the CNN model to learn different context patterns.
3. The auditory filter bank and convolutional neural network based global SNR estimation method according to claim 1, wherein a fully-connected-layer-based network is used in SNENet; the mapping network consists of two fully connected layers, wherein the activation function is a ReLU; finally, the final subband noise energy ratios are obtained through one more fully connected layer whose activation function is a Sigmoid.
CN202110025619.8A 2021-01-08 2021-01-08 Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network Pending CN112885375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110025619.8A CN112885375A (en) 2021-01-08 2021-01-08 Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110025619.8A CN112885375A (en) 2021-01-08 2021-01-08 Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network

Publications (1)

Publication Number Publication Date
CN112885375A true CN112885375A (en) 2021-06-01

Family

ID=76047452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110025619.8A Pending CN112885375A (en) 2021-01-08 2021-01-08 Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network

Country Status (1)

Country Link
CN (1) CN112885375A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113496698A (en) * 2021-08-12 2021-10-12 云知声智能科技股份有限公司 Method, device and equipment for screening training data and storage medium
CN113506581A (en) * 2021-07-08 2021-10-15 京东科技控股股份有限公司 Voice enhancement method and device
CN113555028A (en) * 2021-07-19 2021-10-26 首约科技(北京)有限公司 Processing method for voice noise reduction of Internet of vehicles
CN117198290A (en) * 2023-11-06 2023-12-08 深圳市金鼎胜照明有限公司 Acoustic control-based multi-mode LED intelligent control method and apparatus

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105679330A (en) * 2016-03-16 2016-06-15 南京工程学院 Digital hearing aid noise reduction method based on improved sub-band signal-to-noise ratio estimation
US20190318755A1 (en) * 2018-04-13 2019-10-17 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved real-time audio processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105679330A (en) * 2016-03-16 2016-06-15 南京工程学院 Digital hearing aid noise reduction method based on improved sub-band signal-to-noise ratio estimation
US20190318755A1 (en) * 2018-04-13 2019-10-17 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved real-time audio processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI NAN: "Study on Robust Voice Activity Detection Using CNN Encoder-decoder Based on MTF Concept Under Noisy Conditions", JAIST Repository *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506581A (en) * 2021-07-08 2021-10-15 京东科技控股股份有限公司 Voice enhancement method and device
CN113506581B (en) * 2021-07-08 2024-04-05 京东科技控股股份有限公司 Voice enhancement method and device
CN113555028A (en) * 2021-07-19 2021-10-26 首约科技(北京)有限公司 Processing method for voice noise reduction of Internet of vehicles
CN113496698A (en) * 2021-08-12 2021-10-12 云知声智能科技股份有限公司 Method, device and equipment for screening training data and storage medium
CN113496698B (en) * 2021-08-12 2024-01-23 云知声智能科技股份有限公司 Training data screening method, device, equipment and storage medium
CN117198290A (en) * 2023-11-06 2023-12-08 深圳市金鼎胜照明有限公司 Acoustic control-based multi-mode LED intelligent control method and apparatus

Similar Documents

Publication Publication Date Title
CN107845389B (en) Speech enhancement method based on multi-resolution auditory cepstrum coefficient and deep convolutional neural network
CN110619885B (en) Method for generating confrontation network voice enhancement based on deep complete convolution neural network
Bhat et al. A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN110428849B (en) Voice enhancement method based on generation countermeasure network
Kleijn et al. Optimizing speech intelligibility in a noisy environment: A unified view
Ma et al. Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN113744749B (en) Speech enhancement method and system based on psychoacoustic domain weighting loss function
Swami et al. Speech enhancement by noise driven adaptation of perceptual scales and thresholds of continuous wavelet transform coefficients
Braun et al. Effect of noise suppression losses on speech distortion and ASR performance
CN112992121A (en) Voice enhancement method based on attention residual error learning
Li et al. Multi-resolution auditory cepstral coefficient and adaptive mask for speech enhancement with deep neural network
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
Pirhosseinloo et al. A new feature set for masking-based monaural speech separation
Elshamy et al. DNN-based cepstral excitation manipulation for speech enhancement
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
Sose et al. Sound Source Separation Using Neural Network
CN114566179A (en) Time delay controllable voice noise reduction method
Sivapatham et al. Gammatone Filter Bank-Deep Neural Network-based Monaural speech enhancement for unseen conditions
Zhao Control system and speech recognition of exhibition hall digital media based on computer technology
Seyedin et al. New features using robust MVDR spectrum of filtered autocorrelation sequence for robust speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210601