CN112599145A - Bone conduction speech enhancement method based on a generative adversarial network - Google Patents

Bone conduction speech enhancement method based on a generative adversarial network

Info

Publication number
CN112599145A
CN112599145A
Authority
CN
China
Prior art keywords
network
bone conduction
voice
data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011427512.8A
Other languages
Chinese (zh)
Inventor
魏建国
周秋闰
何宇清
路文焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011427512.8A priority Critical patent/CN112599145A/en
Publication of CN112599145A publication Critical patent/CN112599145A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the fields of speech signal processing and deep learning, and aims to give bone conduction devices better performance in extremely high-noise industries such as firefighting, special operations, mining and emergency rescue, and to provide good voice communication quality against a strong noise background. To this end, the technical scheme adopted by the invention is a bone conduction speech enhancement method based on a generative adversarial network: first, the collected bone conduction speech and air conduction speech are preprocessed, including short-time Fourier transform and cropping; next, the preprocessed speech data are fed into the constructed generative adversarial network for training; finally, the bone conduction speech to be enhanced is fed into the trained generator G of the network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech. The method is mainly applied to the processing of bone conduction speech signals.

Description

Bone conduction speech enhancement method based on a generative adversarial network
Technical Field
The invention relates to the fields of speech signal processing and deep learning, and in particular to a speech enhancement method based on a generative adversarial network that enhances bone conduction speech and facilitates communication with bone conduction devices.
Background
Speech is a primary and important carrier of communication between people and between people and machines, conveys all kinds of information, and is widely used in scenarios such as voice communication and command issuing. However, the environments we live in are often filled with various noises, and people or machines are inevitably disturbed by environmental noise while receiving speech signals. Speech enhancement is the process of extracting, as far as possible, the clean useful signal from a speech signal contaminated by noise using prediction and estimation methods; research on speech enhancement technology therefore has important value in real life and production. Although modern speech enhancement techniques have made significant progress, existing speech enhancement algorithms still degrade markedly in complex, strongly noisy environments.
Unlike a conventional air conduction microphone, which picks up sound through the air, a bone conduction microphone collects the vibration of bone through a vibration sensor and then converts the vibration into an audio signal. Because this pickup path bypasses the air, most ambient noise is shielded, so the useful communication signal can be transmitted well even in a strong noise environment. Although bone conduction speech effectively resists interference from environmental noise, the changed sound transmission path and the limitations of current sensor technology mix noises such as skin friction, sensor friction and strong wind friction into the bone conduction speech, and its quality is noticeably lower than that of air conduction speech. Research on bone conduction speech enhancement algorithms therefore has important theoretical significance and practical value for further improving voice communication quality in strong noise environments and for expanding the range of applications of bone conduction microphones.
Disclosure of Invention
To overcome the deficiencies of the prior art, the invention aims to provide a bone conduction speech enhancement method based on a generative adversarial network, which addresses the poor communication quality in today's complex, strongly noisy environments and the unsatisfactory performance of current speech enhancement techniques, gives bone conduction devices better performance in extremely high-noise industries such as firefighting, special operations, mining and emergency rescue, and provides good voice communication quality against a strong noise background. To this end, the technical scheme adopted by the invention is a bone conduction speech enhancement method based on a generative adversarial network: first, the collected bone conduction speech and air conduction speech are preprocessed, including short-time Fourier transform and cropping; next, the preprocessed speech data are fed into the constructed generative adversarial network for training; finally, the bone conduction speech to be enhanced is fed into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech.
Step one, speech data preprocessing:
first, bone conduction and air conduction data are recorded with a bone conduction microphone and an air conduction microphone; the data are then windowed and divided into frames, with 10-30 ms taken as one frame and an appropriate frame shift set when cutting speech frames, i.e. adjacent frames overlap by no more than half the frame length; a window is selected, and the speech signal is weighted with a Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
the short-time Fourier transform is used: the non-stationary signal f(t) is stationary within a short interval of the analysis window w(t); by shifting the analysis window function so that f(t)w(t − τ) is stationary within different finite time periods, the power spectrum of the non-stationary signal at each time instant can be calculated, and the short-time Fourier transform of the non-stationary signal f(t) is expressed as:
\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, \mathrm{d}t \qquad (2)
then the STFT is applied to the original speech data;
step two: constructing and training the generative adversarial network:
probabilistic generative model: in a continuous or discrete high-dimensional space χ there exists a random vector X obeying an unknown data distribution p_r(x), x ∈ χ; a generative model is a parameterized model p_θ(x) learned from some observable samples x(1), x(2), ..., x(N) to approximate the unknown distribution p_r(x), and this model is used to generate samples so that the "generated" samples are as similar as possible to the "real" samples;
the deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x); assume a random vector Z obeys a simple distribution p(z), z ∈ Z, in the low-dimensional space Z, where p(z) is usually the standard multivariate normal distribution N(0, I), which is simple and easy to sample; a mapping function G: Z → χ, called the generator network, is constructed by a neural network, and the fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x);
the discriminator (Discriminator) D(x; φ) aims to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is

p(y=1 \mid x) = D(x; \phi),

and the probability that the sample was generated by the model is then

p(y=0 \mid x) = 1 - D(x; \phi);
given a sample (x, y), where y ∈ {1, 0} indicates whether it comes from p_r(x) or from p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log likelihood:

\max_{\phi}\ \mathbb{E}_{(x,y)}\left[ y \log p(y=1 \mid x) + (1-y) \log p(y=0 \mid x) \right] \qquad (3)

= \max_{\phi}\ \mathbb{E}_{x \sim p_r(x)}\left[ \log D(x;\phi) \right] + \mathbb{E}_{x' \sim p_\theta(x')}\left[ \log\left(1 - D(x';\phi)\right) \right] \qquad (4)

where θ and φ are the parameters of the generator network and the discriminator network, respectively;
the goal of the generator (Generator) is the opposite of the discriminator network's, i.e. to make the discriminator network classify its generated samples as real samples:

\max_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log D\!\left(G(z;\theta); \phi\right) \right] \qquad (5)

= \min_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D\!\left(G(z;\theta); \phi\right)\right) \right] \qquad (6)
combining the discriminator network and the generator network, the overall objective function of the whole generative adversarial network is regarded as a minimax game:

\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right] \qquad (7)
the short-time Fourier transform spectral data of the air conduction speech and of the bone conduction speech are used as the inputs of the generative adversarial network; the spectral data of the air conduction speech are regarded as samples from the real distribution, and the enhanced bone conduction spectral data produced by the generator G as data generated by the model; through adversarial training of the generator G and the discriminator D, enhanced bone conduction spectral data similar to the air conduction speech are obtained, achieving the goal of enhancing the bone conduction speech;
the generator G is a fully convolutional neural network with skip connections; it has 8 layers in total, the number of convolution kernels is set to 64, and it is divided into an Encoder part and a Decoder part;
the discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total; the air conduction speech spectral data and the enhanced bone conduction speech spectral data are fed into the discriminator D for training, and the discriminator is trained to recognize the air conduction speech data as real data and give them a high score close to 1, and to recognize the enhanced bone conduction speech data produced by the generator G and give them a low score.
The generative adversarial network is trained with a stochastic gradient descent algorithm and optimized with an Adam optimizer; the network weights are initialized from a normal distribution with zero mean and standard deviation 0.02, and an L1 loss term with coefficient 100 is added to the loss function of the generative adversarial network.
Characteristics and beneficial effects of the invention:
The invention uses a generative adversarial network to enhance bone conduction speech of poor sound quality, and the enhanced bone conduction speech shows a clear improvement in sound quality and intelligibility. The invention has important guiding significance and practical value for improving voice communication quality in strong noise environments and for further expanding the range of applications of bone conduction microphones.
Description of the drawings:
Fig. 1 is a flow chart of the bone conduction speech enhancement method based on a generative adversarial network.
Fig. 2 is a flow diagram of the generative adversarial network.
Fig. 3 is a flow diagram of the enhanced speech reconstruction.
Fig. 4 compares spectrograms of bone conduction speech and enhanced bone conduction speech.
Detailed Description
The method of the invention first enhances bone conduction speech using a generative adversarial network. The collected bone conduction speech and air conduction speech are preprocessed by short-time Fourier transform, cropping and similar operations. Next, the preprocessed speech data are fed into the constructed generative adversarial network for training. Finally, the bone conduction speech to be enhanced is fed into the trained generator G, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech.
Overall, the invention comprises three functional modules: a bone conduction and air conduction speech preprocessing module, a generative adversarial network training module, and a bone conduction speech enhancement module. The preprocessing module applies short-time Fourier transform, size cropping and similar preprocessing to the collected bone conduction and air conduction speech; the training module performs adversarial training of the generator G and the discriminator D on the input bone conduction and air conduction data; the enhancement module enhances the bone conduction speech with the trained generator G and then applies the inverse short-time Fourier transform to the enhanced data to obtain the final enhanced bone conduction speech.
The specific implementation steps of the bone conduction speech enhancement method based on a generative adversarial network are as follows:
step one, voice data preprocessing:
First, the bone conduction and air conduction data recorded by a bone conduction microphone and an air conduction microphone are saved as wav files at a sampling rate of 16 kHz and named with corresponding file names. The speech signal then needs to be windowed and divided into frames; generally 10-30 ms is taken as one frame, because the speech signal can be regarded as stationary over such a period. An appropriate frame shift needs to be set when cutting speech frames, i.e. adjacent frames overlap by no more than half the frame length, mainly to ensure a smooth transition between speech frames. The window (its shape and length) must be chosen to reduce the truncation effect on the speech frame as much as possible, i.e. the slopes at the two ends of the time window are reduced so that the edges of the window do not change abruptly but transition smoothly. The method weights the speech signal with a finite-length window function such as the Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
After windowing, the original speech signal is cut into a number of short-time speech frames with stable characteristics, and further speech analysis can then proceed by extracting speech feature parameters.
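By way of illustration only (the patent itself prescribes no code), the framing and Hamming-window weighting just described can be sketched in Python as follows; the function name and the 512-sample frame with 256-sample shift (32 ms and 16 ms at 16 kHz) are assumptions consistent with the parameters given below:

    import numpy as np

    def frame_and_window(signal, frame_len=512, frame_shift=256):
        """Split a speech signal into overlapping frames and weight each
        frame with the Hamming window of equation (1)."""
        n = np.arange(frame_len)
        hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # eq. (1)
        num_frames = 1 + (len(signal) - frame_len) // frame_shift
        frames = np.stack([
            signal[i * frame_shift : i * frame_shift + frame_len]
            for i in range(num_frames)
        ])
        return frames * hamming  # broadcast: window every frame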
Traditional signal analysis is based on the Fourier transform, but Fourier analysis is a global transform, carried out either entirely in the time domain or entirely in the frequency domain, so it cannot express the local time-frequency behavior of a signal, which is the most fundamental and critical property of a non-stationary signal. To analyze and process non-stationary signals, researchers have generalized and even fundamentally modified Fourier analysis, proposing and developing a series of new signal analysis theories. The method uses the short-time Fourier transform, whose basic idea is as follows: assume the non-stationary signal f(t) is stationary within a short interval of the analysis window w(t); by shifting the analysis window function so that f(t)w(t − τ) is stationary within different finite time periods, the power spectrum of the non-stationary signal at each time instant can be calculated. The short-time Fourier transform of the non-stationary signal f(t) can be expressed as:
\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, \mathrm{d}t \qquad (2)
and performing STFT on the original voice data, setting parameters to select a sampling rate of 16kHz, setting FFT sampling points to be 512, setting the length of a Hamming window to be 32ms, and setting frame overlapping to be 16 ms. With such parameter settings, the frequency resolution is 16 kHz/512-31.25 Hz. Due to symmetry, only the STFT magnitude vector covering 257 points of positive frequency needs to be considered. While ignoring the last line in the STFT spectrum, i.e. the frequency bin representing the signal up to 31.25 Hz. The impact of such data loss is essentially negligible, but may allow later design of the 2-exponent size generator and discriminator inputs, making training of the generation countermeasure network more efficient thereafter.
Step two: constructing and training the generative adversarial network:
probabilistic generative model, generative model for short(Generation Model), which is an important Model in probability statistics and machine learning, refers to a series of models for randomly generating observable data. Assuming that in a continuous or discrete high-dimensional space χ, there is a random vector X obeying an unknown data distribution pr(x) And x ∈ χ. The generative model is a parameterized model p learned from a number of observable samples x (1), x (2), … …, x (N)θ(x) To approximate the unknown distribution pr(x) And some samples can be generated with this model so that the "generated" samples and the "real" samples are as similar as possible.
The deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x). Assume a random vector Z obeys a simple distribution p(z), z ∈ Z (such as the standard normal distribution); a deep neural network G: Z → χ can then be used to make G(z) obey p_r(x). Assume that in the low-dimensional space Z there is a simple, easily sampled distribution p(z), usually the standard multivariate normal distribution N(0, I). A mapping function G: Z → χ, called the generator network, is constructed by a neural network, and the powerful fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x).
A generative adversarial network (GAN) makes the samples produced by the generator network obey the real data distribution by means of adversarial training. In a GAN there are two networks under adversarial training. One is the discriminator network, whose goal is to judge as accurately as possible whether a sample comes from the real data or was produced by the generator; the other is the generator network, whose goal is to produce samples whose origin the discriminator cannot determine. These two networks, with opposite goals, are trained alternately. At convergence, if the discriminator can no longer tell where a sample comes from, the generator network can produce samples that follow the real data distribution.
The discriminator (Discriminator) D(x; φ) aims to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the discriminator network is thus in fact a binary classifier. The label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is

p(y=1 \mid x) = D(x; \phi),

and the probability that the sample was generated by the model is then

p(y=0 \mid x) = 1 - D(x; \phi).
given a sample (x, y), y ═ {1,0} indicates that it is from pr(x) Or pθ(x) And judging the objective function of the network as the minimum cross entropy, namely, the maximum log likelihood:
Figure BDA0002819583830000054
Figure BDA0002819583830000055
wherein θ and
Figure BDA0002819583830000056
parameters for generating a network and discriminating a network, respectively.
The goal of the generator network (Generator Network) is exactly the opposite of the discriminator network's, i.e. to make the discriminator network classify its generated samples as real samples:

\max_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log D\!\left(G(z;\theta); \phi\right) \right] \qquad (5)

= \min_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D\!\left(G(z;\theta); \phi\right)\right) \right] \qquad (6)
Combining the discriminator network and the generator network, the overall objective function of the whole generative adversarial network is regarded as a minimax game (Minimax Game):

\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right] \qquad (7)
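By way of illustration (the patent prescribes no framework), the two sides of objective (7) can be written as binary cross-entropy losses; PyTorch is an assumption, the generator term uses the common non-saturating form rather than the log(1 − D) form of (6), and folding in the L1 term with coefficient 100 from the training description below, with the air conduction spectrum as its target, is one plausible reading:

    import torch
    import torch.nn.functional as F

    def discriminator_loss(d_real, d_fake):
        """max_D log D(x) + log(1 - D(G(z))), written as a minimized
        binary cross entropy; d_real and d_fake are probabilities."""
        real_term = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
        fake_term = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        return real_term + fake_term

    def generator_loss(d_fake, g_out, target, l1_weight=100.0):
        """Non-saturating generator objective, max_G log D(G(z)), plus
        the L1 term with coefficient 100 described for this method."""
        adversarial = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
        return adversarial + l1_weight * F.l1_loss(g_out, target)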
In the present invention, the short-time Fourier transform spectral data of the air conduction speech and of the bone conduction speech are used as the inputs of the generative adversarial network. The spectral data of the air conduction speech are regarded as samples from the real distribution, and the enhanced bone conduction spectral data produced by the generator G as data generated by the model. Through adversarial training of the generator G and the discriminator D, enhanced bone conduction spectral data similar to the air conduction speech are obtained, achieving the goal of enhancing the bone conduction speech.
The generator G consists of a fully convolutional neural network with skip connections. It has 8 layers in total, the number of convolution kernels is set to 64, and it can be divided into an Encoder part and a Decoder part. The input size of the generator G is 256×256, and the Encoder part first performs down-sampling convolutions: the first layer has stride 2 and 64 convolution kernels, giving an output of size 128×128×64; the second layer has stride 2 and 128 kernels, giving 64×64×128; the third layer has stride 2 and 256 kernels, giving 32×32×256; the fourth layer has stride 2 and 512 kernels, giving 16×16×512; the fifth layer has stride 2 and 512 kernels, giving 8×8×512; the sixth layer has stride 2 and 512 kernels, giving 4×4×512; the seventh layer has stride 2 and 512 kernels, giving 2×2×512. After these seven convolution layers the input data reach the bottleneck layer. The data at the bottleneck layer are then up-sampled by the Decoder, which deconvolves with the same parameter settings to restore the 256×256 output size. Meanwhile, skip connections are established between corresponding down-sampling and up-sampling layers, combining deeper layers (richer global information) with shallower layers (more local detail), so that local detail is handled while the global structure is observed, further improving the generator's performance.
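A sketch of this encoder-decoder structure, assuming PyTorch; the 4×4 kernels, LeakyReLU activations, single input channel and final ReLU are assumptions, since the text fixes only the strides, channel counts, 256×256 size and skip connections:

    import torch
    import torch.nn as nn

    class UNetGenerator(nn.Module):
        """Sketch of the 256x256 encoder-decoder generator described above:
        stride-2 convolutions with channel counts 64-128-256-512-512-512-512
        down to a 2x2x512 bottleneck, mirrored transposed convolutions back
        to 256x256, and skip connections between corresponding layers."""
        def __init__(self):
            super().__init__()
            chans = [1, 64, 128, 256, 512, 512, 512, 512]
            self.encoder = nn.ModuleList([
                nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1)
                for i in range(7)
            ])
            # Skip concatenation doubles the decoder input channels,
            # except at the bottleneck, which has no skip partner.
            self.decoder = nn.ModuleList([
                nn.ConvTranspose2d(chans[i] if i == 7 else chans[i] * 2,
                                   chans[i - 1], 4, stride=2, padding=1)
                for i in range(7, 0, -1)
            ])
            self.act = nn.LeakyReLU(0.2)

        def forward(self, x):
            skips = []
            for enc in self.encoder:
                x = self.act(enc(x))
                skips.append(x)
            skips.pop()  # the bottleneck output is passed on directly
            for i, dec in enumerate(self.decoder):
                x = dec(x)
                if i < len(self.decoder) - 1:
                    x = torch.cat([self.act(x), skips.pop()], dim=1)
            return torch.relu(x)  # spectral magnitudes are non-negative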
The discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total. The air conduction speech spectral data and the enhanced bone conduction speech spectral data are fed into the discriminator D for training; the discriminator is trained to recognize the air conduction speech data as real data and give them a high score (close to 1), and to recognize the enhanced bone conduction speech data produced by the generator G and give them a low score (close to 0).
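A matching sketch of the 3-layer discriminator under the same assumptions; the channel counts, kernel sizes and the sigmoid-plus-averaging at the end are illustrative, since the text fixes only the depth and the binary-classification role:

    import torch
    import torch.nn as nn

    class ConvDiscriminator(nn.Module):
        """Sketch of the binary-classification discriminator D with
        3 convolutional layers; it maps a 1x256x256 spectrogram to a
        single score in (0, 1)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 1, 4, stride=2, padding=1),
            )

        def forward(self, x):
            # Average the patch scores into one probability per input.
            return torch.sigmoid(self.net(x)).mean(dim=(1, 2, 3))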
The method trains the generative adversarial network with a stochastic gradient descent algorithm and optimizes it with the Adam optimizer. The number of training epochs is set to 400, the learning rate to 0.0002, and the learning rate starts to decay linearly at half the training period. The network weights are initialized from a normal distribution with zero mean and standard deviation 0.02, and an L1 loss term with coefficient 100 is added to the loss function of the generative adversarial network.
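This training configuration can be sketched as follows; the Adam beta values are left at library defaults because the patent does not give them, and the schedule is one plausible reading of a learning rate that "starts to decay linearly at half the training period":

    import torch
    import torch.nn as nn

    def init_weights(module):
        """N(0, 0.02) initialization for the convolutional weights."""
        if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def make_optimizers(G, D, epochs=400, lr=2e-4):
        """Adam optimizers plus a learning rate that stays constant for
        the first half of training and then decays linearly to zero."""
        G.apply(init_weights)
        D.apply(init_weights)
        opt_g = torch.optim.Adam(G.parameters(), lr=lr)
        opt_d = torch.optim.Adam(D.parameters(), lr=lr)

        def linear_decay(epoch, half=epochs // 2):
            return 1.0 if epoch < half else max(0.0, 1.0 - (epoch - half) / half)

        sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, linear_decay)
        sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, linear_decay)
        return opt_g, opt_d, sched_g, sched_d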
Step three: using the trained network to enhance the bone conduction speech:
First, the configuration file of the generator G trained in step two is read. The bone conduction speech data to be enhanced are processed by short-time Fourier transform; the magnitude and the phase of the resulting short-time Fourier spectrum are computed separately, and the magnitude data are fed into the generator G for enhancement. The enhanced data produced by the generator are then combined with the phase part of the original short-time Fourier spectrum, and the enhanced bone conduction speech is reconstructed by inverse short-time Fourier transform.
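An end-to-end sketch of step three under the earlier assumptions; speech_to_spectrogram is the hypothetical helper sketched in step one, and the cropping or padding needed to match the generator's 256×256 training tiles is omitted for brevity:

    import numpy as np
    import torch
    from scipy.signal import istft

    def enhance(bc_signal, generator, fs=16000):
        """Magnitude spectrum through the trained generator G, original
        phase kept, waveform rebuilt by inverse STFT."""
        magnitude, phase = speech_to_spectrogram(bc_signal, fs)
        with torch.no_grad():
            inp = torch.from_numpy(magnitude).float()[None, None]  # 1x1x256xT
            enhanced = generator(inp)[0, 0].numpy()                # 256xT
        # Restore the dropped 257th frequency bin as zeros, then
        # recombine with the phase of the original bone conduction STFT.
        full_mag = np.vstack([enhanced, np.zeros((1, enhanced.shape[1]))])
        spectrum = full_mag * np.exp(1j * phase[:, :enhanced.shape[1]])
        _, audio = istft(spectrum, fs=fs, window='hamming',
                         nperseg=512, noverlap=256, nfft=512)
        return audio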

Claims (3)

1. A bone conduction speech enhancement method based on a generative adversarial network, characterized in that the collected bone conduction speech and air conduction speech are first preprocessed by short-time Fourier transform and cropping; next, the preprocessed speech data are fed into the constructed generative adversarial network for training; and finally, the bone conduction speech to be enhanced is fed into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech.
2. The bone conduction speech enhancement method based on a generative adversarial network according to claim 1, characterized by step one, speech data preprocessing:
first, bone conduction and air conduction data are recorded with a bone conduction microphone and an air conduction microphone; the data are then windowed and divided into frames, with 10-30 ms taken as one frame and an appropriate frame shift set when cutting speech frames, i.e. adjacent frames overlap by no more than half the frame length; a window is selected, and the speech signal is weighted with a Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
the short-time Fourier transform is used: the non-stationary signal f(t) is stationary within a short interval of the analysis window w(t); by shifting the analysis window function so that f(t)w(t − τ) is stationary within different finite time periods, the power spectrum of the non-stationary signal at each time instant can be calculated, and the short-time Fourier transform of the non-stationary signal f(t) is expressed as:
\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, \mathrm{d}t \qquad (2)
then the STFT is applied to the original speech data;
step two: constructing and training the generative adversarial network:
probabilistic generative model: in a continuous or discrete high-dimensional space χ there exists a random vector X obeying an unknown data distribution p_r(x), x ∈ χ; a generative model is a parameterized model p_θ(x) learned from some observable samples x(1), x(2), ..., x(N) to approximate the unknown distribution p_r(x), and this model is used to generate samples so that the "generated" samples are as similar as possible to the "real" samples;
the deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x); assume a random vector Z obeys a simple distribution p(z), z ∈ Z, in the low-dimensional space Z, where p(z) is usually the standard multivariate normal distribution N(0, I), which is simple and easy to sample; a mapping function G: Z → χ, called the generator network, is constructed by a neural network, and the fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x);
the discriminator (Discriminator) D(x; φ) aims to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is

p(y=1 \mid x) = D(x; \phi),

and the probability that the sample was generated by the model is then

p(y=0 \mid x) = 1 - D(x; \phi);
given a sample (x, y), where y ∈ {1, 0} indicates whether it comes from p_r(x) or from p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log likelihood:

\max_{\phi}\ \mathbb{E}_{(x,y)}\left[ y \log p(y=1 \mid x) + (1-y) \log p(y=0 \mid x) \right] \qquad (3)

= \max_{\phi}\ \mathbb{E}_{x \sim p_r(x)}\left[ \log D(x;\phi) \right] + \mathbb{E}_{x' \sim p_\theta(x')}\left[ \log\left(1 - D(x';\phi)\right) \right] \qquad (4)

where θ and φ are the parameters of the generator network and the discriminator network, respectively;
the goal of the generator (Generator) is the opposite of the discriminator network's, i.e. to make the discriminator network classify its generated samples as real samples:

\max_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log D\!\left(G(z;\theta); \phi\right) \right] \qquad (5)

= \min_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D\!\left(G(z;\theta); \phi\right)\right) \right] \qquad (6)
combining the discriminator network and the generator network, the overall objective function of the whole generative adversarial network is regarded as a minimax game:

\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right] \qquad (7)
the short-time Fourier transform spectral data of the air conduction speech and of the bone conduction speech are used as the inputs of the generative adversarial network; the spectral data of the air conduction speech are regarded as samples from the real distribution, and the enhanced bone conduction spectral data produced by the generator G as data generated by the model; through adversarial training of the generator G and the discriminator D, enhanced bone conduction spectral data similar to the air conduction speech are obtained, achieving the goal of enhancing the bone conduction speech;
the generator G is a fully convolutional neural network with skip connections; it has 8 layers in total, the number of convolution kernels is set to 64, and it is divided into an Encoder part and a Decoder part;
the discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total; the air conduction speech spectral data and the enhanced bone conduction speech spectral data are fed into the discriminator D for training, and the discriminator is trained to recognize the air conduction speech data as real data and give them a high score close to 1, and to recognize the enhanced bone conduction speech data produced by the generator G and give them a low score.
3. The bone conduction speech enhancement method based on a generative adversarial network according to claim 1, characterized in that the generative adversarial network is trained with a stochastic gradient descent algorithm and optimized with an Adam optimizer; the network weights are initialized from a normal distribution with zero mean and standard deviation 0.02, and an L1 loss term with coefficient 100 is added to the loss function of the generative adversarial network.
CN202011427512.8A 2020-12-07 2020-12-07 Bone conduction speech enhancement method based on a generative adversarial network Pending CN112599145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011427512.8A 2020-12-07 2020-12-07 Bone conduction speech enhancement method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011427512.8A 2020-12-07 2020-12-07 Bone conduction speech enhancement method based on a generative adversarial network

Publications (1)

Publication Number Publication Date
CN112599145A true CN112599145A (en) 2021-04-02

Family

ID=75191383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011427512.8A Pending CN112599145A (en) Bone conduction speech enhancement method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN112599145A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314109A (en) * 2021-07-29 2021-08-27 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
CN113420870A (en) * 2021-07-04 2021-09-21 西北工业大学 U-Net structure generation countermeasure network and method for underwater acoustic target recognition
CN114495958A (en) * 2022-04-14 2022-05-13 齐鲁工业大学 Voice enhancement system for generating confrontation network based on time modeling
CN115497496A (en) * 2022-09-22 2022-12-20 东南大学 FirePS convolutional neural network-based voice enhancement method
CN116416963A (en) * 2023-06-12 2023-07-11 深圳市遐拓科技有限公司 Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet
CN117633528A (en) * 2023-11-21 2024-03-01 元始智能科技(南通)有限公司 Manufacturing workshop energy consumption prediction technology based on small sample data restoration and enhancement
WO2024050802A1 (en) * 2022-09-09 2024-03-14 华为技术有限公司 Speech signal processing method, neural network training method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN110718232A (en) * 2019-09-23 2020-01-21 东南大学 Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition
US20200265857A1 (en) * 2019-02-15 2020-08-20 Shenzhen GOODIX Technology Co., Ltd. Speech enhancement method and apparatus, device and storage mediem
CN111968627A (en) * 2020-08-13 2020-11-20 中国科学技术大学 Bone conduction speech enhancement method based on joint dictionary learning and sparse representation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network
US20200265857A1 (en) * 2019-02-15 2020-08-20 Shenzhen GOODIX Technology Co., Ltd. Speech enhancement method and apparatus, device and storage mediem
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN110718232A (en) * 2019-09-23 2020-01-21 东南大学 Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition
CN111968627A (en) * 2020-08-13 2020-11-20 中国科学技术大学 Bone conduction speech enhancement method based on joint dictionary learning and sparse representation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAIKI WATANABE ET AL.: "Speech enhancement for bone-conducted speech based on low-order cepstrum restoration", 2017 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATION SYSTEMS (ISPACS) *
QING PAN ET AL.: "Bone-Conducted Speech to Air-Conducted Speech Conversion Based on Cycle-Consistent Adversarial Networks", 2020 IEEE 3RD INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND SIGNAL PROCESSING (ICICSP) *
ZHANG XIONGWEI ET AL.: "Research status and prospects of blind enhancement technology for bone-conduction microphone speech", JOURNAL OF DATA ACQUISITION AND PROCESSING *
FAN LIANGHUI ET AL.: "Speech enhancement based on conditional generative adversarial networks", COMPUTER AND DIGITAL ENGINEERING *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420870A (en) * 2021-07-04 2021-09-21 西北工业大学 U-Net structure generation countermeasure network and method for underwater acoustic target recognition
CN113420870B (en) * 2021-07-04 2023-12-22 西北工业大学 U-Net structure generation countermeasure network and method for underwater sound target recognition
CN113314109A (en) * 2021-07-29 2021-08-27 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
CN113314109B (en) * 2021-07-29 2021-11-02 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
CN114495958A (en) * 2022-04-14 2022-05-13 齐鲁工业大学 Voice enhancement system for generating confrontation network based on time modeling
CN114495958B (en) * 2022-04-14 2022-07-05 齐鲁工业大学 Speech enhancement system for generating confrontation network based on time modeling
WO2024050802A1 (en) * 2022-09-09 2024-03-14 华为技术有限公司 Speech signal processing method, neural network training method and device
CN115497496A (en) * 2022-09-22 2022-12-20 东南大学 FirePS convolutional neural network-based voice enhancement method
CN115497496B (en) * 2022-09-22 2023-11-14 东南大学 Voice enhancement method based on FirePS convolutional neural network
CN116416963A (en) * 2023-06-12 2023-07-11 深圳市遐拓科技有限公司 Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet
CN116416963B (en) * 2023-06-12 2024-02-06 深圳市遐拓科技有限公司 Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet
CN117633528A (en) * 2023-11-21 2024-03-01 元始智能科技(南通)有限公司 Manufacturing workshop energy consumption prediction technology based on small sample data restoration and enhancement

Similar Documents

Publication Publication Date Title
CN112599145A (en) Bone conduction speech enhancement method based on a generative adversarial network
Yin et al. Phasen: A phase-and-harmonics-aware speech enhancement network
CN109671433B (en) Keyword detection method and related device
CN107452389B (en) Universal single-track real-time noise reduction method
Wang et al. On training targets for supervised speech separation
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN103489446B (en) Based on the twitter identification method that adaptive energy detects under complex environment
CN106504763A (en) Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
TW201248613A (en) System and method for monaural audio processing based preserving speech information
Shahnaz et al. Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme
CN111312275B (en) On-line sound source separation enhancement system based on sub-band decomposition
Wang et al. A structure-preserving training target for supervised speech separation
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN103021405A (en) Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
Ince et al. Ego noise suppression of a robot using template subtraction
CN114041185A (en) Method and apparatus for determining a depth filter
Paikrao et al. Consumer Personalized Gesture Recognition in UAV Based Industry 5.0 Applications
Li et al. A si-sdr loss function based monaural source separation
CN118212929A (en) Personalized Ambiosonic voice enhancement method
CN116994600B (en) Method and system for driving character mouth shape based on audio frequency
Selvi et al. Hybridization of spectral filtering with particle swarm optimization for speech signal enhancement
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
Jamal et al. A comparative study of IBM and IRM target mask for supervised malay speech separation from noisy background

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210402