CN112599145A - Bone conduction speech enhancement method based on a generative adversarial network - Google Patents

Bone conduction speech enhancement method based on a generative adversarial network

Info

Publication number
CN112599145A
CN112599145A
Authority
CN
China
Prior art keywords
network
bone conduction
voice
data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011427512.8A
Other languages
Chinese (zh)
Inventor
魏建国
周秋闰
何宇清
路文焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011427512.8A priority Critical patent/CN112599145A/en
Publication of CN112599145A publication Critical patent/CN112599145A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the fields of speech signal processing and deep learning, and aims to give bone conduction devices better performance in extremely high-noise industries such as firefighting, special operations, mining and emergency rescue, and to provide good voice communication quality against a strong noise background. To this end, the technical scheme adopted by the invention is a bone conduction speech enhancement method based on a generative adversarial network: first, the collected bone conduction speech and air conduction speech are preprocessed, including short-time Fourier transform and cropping; next, the preprocessed speech data are fed into the constructed generative adversarial network for training; finally, the bone conduction speech to be enhanced is fed into the trained generator G of the network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech. The method is mainly applied to the processing of bone conduction speech signals.

Description

Bone conduction speech enhancement method based on a generative adversarial network
Technical Field
The invention relates to the fields of speech signal processing and deep learning, and in particular to a speech enhancement method based on a generative adversarial network that enhances bone conduction speech and facilitates communication with bone conduction devices.
Background
Speech is a primary and important carrier of communication between people and between people and machines, conveys all kinds of information, and is widely used in scenarios such as voice communication and command issuing. However, the environments we live in are often filled with various noises, and people or machines are inevitably disturbed by environmental noise while receiving speech signals. Speech enhancement is the process of extracting, as far as possible, the clean useful signal from a speech signal contaminated by noise using prediction and estimation methods; research on speech enhancement technology therefore has important value in real life and production. Although modern speech enhancement techniques have made significant progress, existing speech enhancement algorithms still degrade markedly in complex, strongly noisy environments.
Unlike a conventional air conduction microphone, which picks up sound through the air, a bone conduction microphone collects the vibration of bone through a vibration sensor and then converts the vibration into an audio signal. Because this pickup path bypasses the air, most ambient noise is shielded, so the useful communication signal can be transmitted well even in a strong noise environment. Although bone conduction speech effectively resists interference from environmental noise, the changed sound transmission path and the limitations of current sensor technology mix noises such as skin friction, sensor friction and strong wind friction into the bone conduction speech, and its quality is noticeably lower than that of air conduction speech. Research on bone conduction speech enhancement algorithms therefore has important theoretical significance and practical value for further improving voice communication quality in strong noise environments and for expanding the range of applications of bone conduction microphones.
Disclosure of Invention
To overcome the deficiencies of the prior art, the invention aims to provide a bone conduction speech enhancement method based on a generative adversarial network, which addresses the poor communication quality in today's complex, strongly noisy environments and the unsatisfactory performance of current speech enhancement techniques, gives bone conduction devices better performance in extremely high-noise industries such as firefighting, special operations, mining and emergency rescue, and provides good voice communication quality against a strong noise background. To this end, the technical scheme adopted by the invention is a bone conduction speech enhancement method based on a generative adversarial network: first, the collected bone conduction speech and air conduction speech are preprocessed, including short-time Fourier transform and cropping; next, the preprocessed speech data are fed into the constructed generative adversarial network for training; finally, the bone conduction speech to be enhanced is fed into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech.
Step one, speech data preprocessing:
first, bone conduction and air conduction data are recorded with a bone conduction microphone and an air conduction microphone; the data are then windowed and divided into frames, with 10-30 ms taken as one frame and an appropriate frame shift set when cutting speech frames, i.e. adjacent frames overlap by no more than half the frame length; a window is selected, and the speech signal is weighted with a Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
the short-time Fourier transform is used: the non-stationary signal f(t) is stationary within a short interval of the analysis window w(t); by shifting the analysis window function so that f(t)w(t − τ) is stationary within different finite time periods, the power spectrum of the non-stationary signal at each time instant can be calculated, and the short-time Fourier transform of the non-stationary signal f(t) is expressed as:
\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, \mathrm{d}t \qquad (2)
then the STFT is applied to the original speech data;
step two: constructing and training the generative adversarial network:
probabilistic generative model: in a continuous or discrete high-dimensional space χ there exists a random vector X obeying an unknown data distribution p_r(x), x ∈ χ; a generative model is a parameterized model p_θ(x) learned from some observable samples x(1), x(2), ..., x(N) to approximate the unknown distribution p_r(x), and this model is used to generate samples so that the "generated" samples are as similar as possible to the "real" samples;
the deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x); assume a random vector Z obeys a simple distribution p(z), z ∈ Z, in the low-dimensional space Z, where p(z) is usually the standard multivariate normal distribution N(0, I), which is simple and easy to sample; a mapping function G: Z → χ, called the generator network, is constructed by a neural network, and the fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x);
the discriminator (Discriminator) D(x; φ) aims to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is

p(y=1 \mid x) = D(x; \phi),

and the probability that the sample was generated by the model is then

p(y=0 \mid x) = 1 - D(x; \phi);
given a sample (x, y), where y ∈ {1, 0} indicates whether it comes from p_r(x) or from p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log likelihood:

\max_{\phi}\ \mathbb{E}_{(x,y)}\left[ y \log p(y=1 \mid x) + (1-y) \log p(y=0 \mid x) \right] \qquad (3)

= \max_{\phi}\ \mathbb{E}_{x \sim p_r(x)}\left[ \log D(x;\phi) \right] + \mathbb{E}_{x' \sim p_\theta(x')}\left[ \log\left(1 - D(x';\phi)\right) \right] \qquad (4)

where θ and φ are the parameters of the generator network and the discriminator network, respectively;
the goal of the generator (Generator) is the opposite of the discriminator network's, i.e. to make the discriminator network classify its generated samples as real samples:

\max_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log D\!\left(G(z;\theta); \phi\right) \right] \qquad (5)

= \min_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D\!\left(G(z;\theta); \phi\right)\right) \right] \qquad (6)
combining the discriminator network and the generator network, the overall objective function of the whole generative adversarial network is regarded as a minimax game:

\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right] \qquad (7)
the short-time Fourier transform spectral data of the air conduction speech and of the bone conduction speech are used as the inputs of the generative adversarial network; the spectral data of the air conduction speech are regarded as samples from the real distribution, and the enhanced bone conduction spectral data produced by the generator G as data generated by the model; through adversarial training of the generator G and the discriminator D, enhanced bone conduction spectral data similar to the air conduction speech are obtained, achieving the goal of enhancing the bone conduction speech;
the generator G is a fully convolutional neural network with skip connections; it has 8 layers in total, the number of convolution kernels is set to 64, and it is divided into an Encoder part and a Decoder part;
the discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total; the air conduction speech spectral data and the enhanced bone conduction speech spectral data are fed into the discriminator D for training, and the discriminator is trained to recognize the air conduction speech data as real data and give them a high score close to 1, and to recognize the enhanced bone conduction speech data produced by the generator G and give them a low score.
The generative adversarial network is trained with a stochastic gradient descent algorithm and optimized with an Adam optimizer; the network weights are initialized from a normal distribution with zero mean and standard deviation 0.02, and an L1 loss term with coefficient 100 is added to the loss function of the generative adversarial network.
Characteristics and beneficial effects of the invention:
The invention uses a generative adversarial network to enhance bone conduction speech of poor sound quality, and the enhanced bone conduction speech shows a clear improvement in sound quality and intelligibility. The invention has important guiding significance and practical value for improving voice communication quality in strong noise environments and for further expanding the range of applications of bone conduction microphones.
Description of the drawings:
Fig. 1 is a flow chart of the bone conduction speech enhancement method based on a generative adversarial network.
Fig. 2 is a flow diagram of the generative adversarial network.
Fig. 3 is a flow diagram of the enhanced speech reconstruction.
Fig. 4 compares spectrograms of bone conduction speech and enhanced bone conduction speech.
Detailed Description
The method of the invention first enhances bone conduction speech using a generative adversarial network. The collected bone conduction speech and air conduction speech are preprocessed by short-time Fourier transform, cropping and similar operations. Next, the preprocessed speech data are fed into the constructed generative adversarial network for training. Finally, the bone conduction speech to be enhanced is fed into the trained generator G, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech.
Overall, the invention comprises three functional modules: a bone conduction and air conduction speech preprocessing module, a generative adversarial network training module, and a bone conduction speech enhancement module. The preprocessing module applies short-time Fourier transform, size cropping and similar preprocessing to the collected bone conduction and air conduction speech; the training module performs adversarial training of the generator G and the discriminator D on the input bone conduction and air conduction data; the enhancement module enhances the bone conduction speech with the trained generator G and then applies the inverse short-time Fourier transform to the enhanced data to obtain the final enhanced bone conduction speech.
The specific implementation steps of the bone conduction speech enhancement method based on a generative adversarial network are as follows:
step one, voice data preprocessing:
First, the bone conduction and air conduction data recorded by a bone conduction microphone and an air conduction microphone are saved as wav files at a sampling rate of 16 kHz and named with corresponding file names. The speech signal then needs to be windowed and divided into frames; generally 10-30 ms is taken as one frame, because the speech signal can be regarded as stationary over such a period. An appropriate frame shift needs to be set when cutting speech frames, i.e. adjacent frames overlap by no more than half the frame length, mainly to ensure a smooth transition between speech frames. The window (its shape and length) must be chosen to reduce the truncation effect on the speech frame as much as possible, i.e. the slopes at the two ends of the time window are reduced so that the edges of the window do not change abruptly but transition smoothly. The method weights the speech signal with a finite-length window function such as the Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
After windowing, the original speech signal is cut into a number of short-time speech frames with stable characteristics, and further speech analysis can then proceed by extracting speech feature parameters.
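By way of illustration only (the patent itself prescribes no code), the framing and Hamming-window weighting just described can be sketched in Python as follows; the function name and the 512-sample frame with 256-sample shift (32 ms and 16 ms at 16 kHz) are assumptions consistent with the parameters given below:

    import numpy as np

    def frame_and_window(signal, frame_len=512, frame_shift=256):
        """Split a speech signal into overlapping frames and weight each
        frame with the Hamming window of equation (1)."""
        n = np.arange(frame_len)
        hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # eq. (1)
        num_frames = 1 + (len(signal) - frame_len) // frame_shift
        frames = np.stack([
            signal[i * frame_shift : i * frame_shift + frame_len]
            for i in range(num_frames)
        ])
        return frames * hamming  # broadcast: window every frame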
Traditional signal analysis is based on the Fourier transform, but Fourier analysis is a global transform, carried out either entirely in the time domain or entirely in the frequency domain, so it cannot express the local time-frequency behavior of a signal, which is the most fundamental and critical property of a non-stationary signal. To analyze and process non-stationary signals, researchers have generalized and even fundamentally modified Fourier analysis, proposing and developing a series of new signal analysis theories. The method uses the short-time Fourier transform, whose basic idea is as follows: assume the non-stationary signal f(t) is stationary within a short interval of the analysis window w(t); by shifting the analysis window function so that f(t)w(t − τ) is stationary within different finite time periods, the power spectrum of the non-stationary signal at each time instant can be calculated. The short-time Fourier transform of the non-stationary signal f(t) can be expressed as:
\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, \mathrm{d}t \qquad (2)
and performing STFT on the original voice data, setting parameters to select a sampling rate of 16kHz, setting FFT sampling points to be 512, setting the length of a Hamming window to be 32ms, and setting frame overlapping to be 16 ms. With such parameter settings, the frequency resolution is 16 kHz/512-31.25 Hz. Due to symmetry, only the STFT magnitude vector covering 257 points of positive frequency needs to be considered. While ignoring the last line in the STFT spectrum, i.e. the frequency bin representing the signal up to 31.25 Hz. The impact of such data loss is essentially negligible, but may allow later design of the 2-exponent size generator and discriminator inputs, making training of the generation countermeasure network more efficient thereafter.
Step two: constructing and training the generative adversarial network:
probabilistic generative model, generative model for short(Generation Model), which is an important Model in probability statistics and machine learning, refers to a series of models for randomly generating observable data. Assuming that in a continuous or discrete high-dimensional space χ, there is a random vector X obeying an unknown data distribution pr(x) And x ∈ χ. The generative model is a parameterized model p learned from a number of observable samples x (1), x (2), … …, x (N)θ(x) To approximate the unknown distribution pr(x) And some samples can be generated with this model so that the "generated" samples and the "real" samples are as similar as possible.
The deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x). Assume a random vector Z obeys a simple distribution p(z), z ∈ Z (such as the standard normal distribution); a deep neural network G: Z → χ can then be used to make G(z) obey p_r(x). Assume that in the low-dimensional space Z there is a simple, easily sampled distribution p(z), usually the standard multivariate normal distribution N(0, I). A mapping function G: Z → χ, called the generator network, is constructed by a neural network, and the powerful fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x).
A generative adversarial network (GAN) makes the samples produced by the generator network obey the real data distribution by means of adversarial training. In a GAN there are two networks under adversarial training. One is the discriminator network, whose goal is to judge as accurately as possible whether a sample comes from the real data or was produced by the generator; the other is the generator network, whose goal is to produce samples whose origin the discriminator cannot determine. These two networks, with opposite goals, are trained alternately. At convergence, if the discriminator can no longer tell where a sample comes from, the generator network can produce samples that follow the real data distribution.
The discriminator (Discriminator) D(x; φ) aims to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the discriminator network is thus in fact a binary classifier. The label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is

p(y=1 \mid x) = D(x; \phi),

and the probability that the sample was generated by the model is then

p(y=0 \mid x) = 1 - D(x; \phi).
given a sample (x, y), y ═ {1,0} indicates that it is from pr(x) Or pθ(x) And judging the objective function of the network as the minimum cross entropy, namely, the maximum log likelihood:
Figure BDA0002819583830000054
Figure BDA0002819583830000055
wherein θ and
Figure BDA0002819583830000056
parameters for generating a network and discriminating a network, respectively.
The goal of the generator network (Generator Network) is exactly the opposite of the discriminator network's, i.e. to make the discriminator network classify its generated samples as real samples:

\max_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log D\!\left(G(z;\theta); \phi\right) \right] \qquad (5)

= \min_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D\!\left(G(z;\theta); \phi\right)\right) \right] \qquad (6)
Combining the discriminator network and the generator network, the overall objective function of the whole generative adversarial network is regarded as a minimax game (Minimax Game):

\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right] \qquad (7)
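By way of illustration (the patent prescribes no framework), the two sides of objective (7) can be written as binary cross-entropy losses; PyTorch is an assumption, the generator term uses the common non-saturating form rather than the log(1 − D) form of (6), and folding in the L1 term with coefficient 100 from the training description below, with the air conduction spectrum as its target, is one plausible reading:

    import torch
    import torch.nn.functional as F

    def discriminator_loss(d_real, d_fake):
        """max_D log D(x) + log(1 - D(G(z))), written as a minimized
        binary cross entropy; d_real and d_fake are probabilities."""
        real_term = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
        fake_term = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        return real_term + fake_term

    def generator_loss(d_fake, g_out, target, l1_weight=100.0):
        """Non-saturating generator objective, max_G log D(G(z)), plus
        the L1 term with coefficient 100 described for this method."""
        adversarial = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
        return adversarial + l1_weight * F.l1_loss(g_out, target)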
In the present invention, the short-time Fourier transform spectral data of the air conduction speech and of the bone conduction speech are used as the inputs of the generative adversarial network. The spectral data of the air conduction speech are regarded as samples from the real distribution, and the enhanced bone conduction spectral data produced by the generator G as data generated by the model. Through adversarial training of the generator G and the discriminator D, enhanced bone conduction spectral data similar to the air conduction speech are obtained, achieving the goal of enhancing the bone conduction speech.
The generator G consists of a fully convolutional neural network with skip connections. It has 8 layers in total, the number of convolution kernels is set to 64, and it can be divided into an Encoder part and a Decoder part. The input size of the generator G is 256×256, and the Encoder part first performs down-sampling convolutions: the first layer has stride 2 and 64 convolution kernels, giving an output of size 128×128×64; the second layer has stride 2 and 128 kernels, giving 64×64×128; the third layer has stride 2 and 256 kernels, giving 32×32×256; the fourth layer has stride 2 and 512 kernels, giving 16×16×512; the fifth layer has stride 2 and 512 kernels, giving 8×8×512; the sixth layer has stride 2 and 512 kernels, giving 4×4×512; the seventh layer has stride 2 and 512 kernels, giving 2×2×512. After these seven convolution layers the input data reach the bottleneck layer. The data at the bottleneck layer are then up-sampled by the Decoder, which deconvolves with the same parameter settings to restore the 256×256 output size. Meanwhile, skip connections are established between corresponding down-sampling and up-sampling layers, combining deeper layers (richer global information) with shallower layers (more local detail), so that local detail is handled while the global structure is observed, further improving the generator's performance.
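A sketch of this encoder-decoder structure, assuming PyTorch; the 4×4 kernels, LeakyReLU activations, single input channel and final ReLU are assumptions, since the text fixes only the strides, channel counts, 256×256 size and skip connections:

    import torch
    import torch.nn as nn

    class UNetGenerator(nn.Module):
        """Sketch of the 256x256 encoder-decoder generator described above:
        stride-2 convolutions with channel counts 64-128-256-512-512-512-512
        down to a 2x2x512 bottleneck, mirrored transposed convolutions back
        to 256x256, and skip connections between corresponding layers."""
        def __init__(self):
            super().__init__()
            chans = [1, 64, 128, 256, 512, 512, 512, 512]
            self.encoder = nn.ModuleList([
                nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1)
                for i in range(7)
            ])
            # Skip concatenation doubles the decoder input channels,
            # except at the bottleneck, which has no skip partner.
            self.decoder = nn.ModuleList([
                nn.ConvTranspose2d(chans[i] if i == 7 else chans[i] * 2,
                                   chans[i - 1], 4, stride=2, padding=1)
                for i in range(7, 0, -1)
            ])
            self.act = nn.LeakyReLU(0.2)

        def forward(self, x):
            skips = []
            for enc in self.encoder:
                x = self.act(enc(x))
                skips.append(x)
            skips.pop()  # the bottleneck output is passed on directly
            for i, dec in enumerate(self.decoder):
                x = dec(x)
                if i < len(self.decoder) - 1:
                    x = torch.cat([self.act(x), skips.pop()], dim=1)
            return torch.relu(x)  # spectral magnitudes are non-negative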
The discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total. The air conduction speech spectral data and the enhanced bone conduction speech spectral data are fed into the discriminator D for training; the discriminator is trained to recognize the air conduction speech data as real data and give them a high score (close to 1), and to recognize the enhanced bone conduction speech data produced by the generator G and give them a low score (close to 0).
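A matching sketch of the 3-layer discriminator under the same assumptions; the channel counts, kernel sizes and the sigmoid-plus-averaging at the end are illustrative, since the text fixes only the depth and the binary-classification role:

    import torch
    import torch.nn as nn

    class ConvDiscriminator(nn.Module):
        """Sketch of the binary-classification discriminator D with
        3 convolutional layers; it maps a 1x256x256 spectrogram to a
        single score in (0, 1)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 1, 4, stride=2, padding=1),
            )

        def forward(self, x):
            # Average the patch scores into one probability per input.
            return torch.sigmoid(self.net(x)).mean(dim=(1, 2, 3))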
The method trains the generative adversarial network with a stochastic gradient descent algorithm and optimizes it with the Adam optimizer. The number of training epochs is set to 400, the learning rate to 0.0002, and the learning rate starts to decay linearly at half the training period. The network weights are initialized from a normal distribution with zero mean and standard deviation 0.02, and an L1 loss term with coefficient 100 is added to the loss function of the generative adversarial network.
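This training configuration can be sketched as follows; the Adam beta values are left at library defaults because the patent does not give them, and the schedule is one plausible reading of a learning rate that "starts to decay linearly at half the training period":

    import torch
    import torch.nn as nn

    def init_weights(module):
        """N(0, 0.02) initialization for the convolutional weights."""
        if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def make_optimizers(G, D, epochs=400, lr=2e-4):
        """Adam optimizers plus a learning rate that stays constant for
        the first half of training and then decays linearly to zero."""
        G.apply(init_weights)
        D.apply(init_weights)
        opt_g = torch.optim.Adam(G.parameters(), lr=lr)
        opt_d = torch.optim.Adam(D.parameters(), lr=lr)

        def linear_decay(epoch, half=epochs // 2):
            return 1.0 if epoch < half else max(0.0, 1.0 - (epoch - half) / half)

        sched_g = torch.optim.lr_scheduler.LambdaLR(opt_g, linear_decay)
        sched_d = torch.optim.lr_scheduler.LambdaLR(opt_d, linear_decay)
        return opt_g, opt_d, sched_g, sched_d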
Step three: using the trained network to enhance the bone conduction speech:
First, the configuration file of the generator G trained in step two is read. The bone conduction speech data to be enhanced are processed by short-time Fourier transform; the magnitude and the phase of the resulting short-time Fourier spectrum are computed separately, and the magnitude data are fed into the generator G for enhancement. The enhanced data produced by the generator are then combined with the phase part of the original short-time Fourier spectrum, and the enhanced bone conduction speech is reconstructed by inverse short-time Fourier transform.
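An end-to-end sketch of step three under the earlier assumptions; speech_to_spectrogram is the hypothetical helper sketched in step one, and the cropping or padding needed to match the generator's 256×256 training tiles is omitted for brevity:

    import numpy as np
    import torch
    from scipy.signal import istft

    def enhance(bc_signal, generator, fs=16000):
        """Magnitude spectrum through the trained generator G, original
        phase kept, waveform rebuilt by inverse STFT."""
        magnitude, phase = speech_to_spectrogram(bc_signal, fs)
        with torch.no_grad():
            inp = torch.from_numpy(magnitude).float()[None, None]  # 1x1x256xT
            enhanced = generator(inp)[0, 0].numpy()                # 256xT
        # Restore the dropped 257th frequency bin as zeros, then
        # recombine with the phase of the original bone conduction STFT.
        full_mag = np.vstack([enhanced, np.zeros((1, enhanced.shape[1]))])
        spectrum = full_mag * np.exp(1j * phase[:, :enhanced.shape[1]])
        _, audio = istft(spectrum, fs=fs, window='hamming',
                         nperseg=512, noverlap=256, nfft=512)
        return audio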

Claims (3)

1. A bone conduction speech enhancement method based on a generative adversarial network, characterized in that the collected bone conduction speech and air conduction speech are first preprocessed by short-time Fourier transform and cropping; next, the preprocessed speech data are fed into the constructed generative adversarial network for training; and finally, the bone conduction speech to be enhanced is fed into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone conduction speech.
2. The bone conduction speech enhancement method based on a generative adversarial network according to claim 1, characterized by step one, speech data preprocessing:
first, bone conduction and air conduction data are recorded with a bone conduction microphone and an air conduction microphone; the data are then windowed and divided into frames, with 10-30 ms taken as one frame and an appropriate frame shift set when cutting speech frames, i.e. adjacent frames overlap by no more than half the frame length; a window is selected, and the speech signal is weighted with a Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)
the short-time Fourier transform is used: the non-stationary signal f(t) is stationary within a short interval of the analysis window w(t); by shifting the analysis window function so that f(t)w(t − τ) is stationary within different finite time periods, the power spectrum of the non-stationary signal at each time instant can be calculated, and the short-time Fourier transform of the non-stationary signal f(t) is expressed as:
\mathrm{STFT}_f(\tau, \omega) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, \mathrm{d}t \qquad (2)
then the STFT is applied to the original speech data;
step two: constructing and training the generative adversarial network:
probabilistic generative model: in a continuous or discrete high-dimensional space χ there exists a random vector X obeying an unknown data distribution p_r(x), x ∈ χ; a generative model is a parameterized model p_θ(x) learned from some observable samples x(1), x(2), ..., x(N) to approximate the unknown distribution p_r(x), and this model is used to generate samples so that the "generated" samples are as similar as possible to the "real" samples;
the deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x); assume a random vector Z obeys a simple distribution p(z), z ∈ Z, in the low-dimensional space Z, where p(z) is usually the standard multivariate normal distribution N(0, I), which is simple and easy to sample; a mapping function G: Z → χ, called the generator network, is constructed by a neural network, and the fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x);
the discriminator (Discriminator) D(x; φ) aims to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is

p(y=1 \mid x) = D(x; \phi),

and the probability that the sample was generated by the model is then

p(y=0 \mid x) = 1 - D(x; \phi);
given a sample (x, y), where y ∈ {1, 0} indicates whether it comes from p_r(x) or from p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log likelihood:

\max_{\phi}\ \mathbb{E}_{(x,y)}\left[ y \log p(y=1 \mid x) + (1-y) \log p(y=0 \mid x) \right] \qquad (3)

= \max_{\phi}\ \mathbb{E}_{x \sim p_r(x)}\left[ \log D(x;\phi) \right] + \mathbb{E}_{x' \sim p_\theta(x')}\left[ \log\left(1 - D(x';\phi)\right) \right] \qquad (4)

where θ and φ are the parameters of the generator network and the discriminator network, respectively;
the goal of the generator (Generator) is the opposite of the discriminator network's, i.e. to make the discriminator network classify its generated samples as real samples:

\max_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log D\!\left(G(z;\theta); \phi\right) \right] \qquad (5)

= \min_{\theta}\ \mathbb{E}_{z \sim p(z)}\left[ \log\left(1 - D\!\left(G(z;\theta); \phi\right)\right) \right] \qquad (6)
combining the discriminator network and the generator network, the overall objective function of the whole generative adversarial network is regarded as a minimax game:

\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_\mathrm{data}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right] \qquad (7)
the short-time Fourier transform spectral data of the air conduction speech and of the bone conduction speech are used as the inputs of the generative adversarial network; the spectral data of the air conduction speech are regarded as samples from the real distribution, and the enhanced bone conduction spectral data produced by the generator G as data generated by the model; through adversarial training of the generator G and the discriminator D, enhanced bone conduction spectral data similar to the air conduction speech are obtained, achieving the goal of enhancing the bone conduction speech;
the generator G is a fully convolutional neural network with skip connections; it has 8 layers in total, the number of convolution kernels is set to 64, and it is divided into an Encoder part and a Decoder part;
the discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total; the air conduction speech spectral data and the enhanced bone conduction speech spectral data are fed into the discriminator D for training, and the discriminator is trained to recognize the air conduction speech data as real data and give them a high score close to 1, and to recognize the enhanced bone conduction speech data produced by the generator G and give them a low score.
3. The bone conduction speech enhancement method based on a generative adversarial network according to claim 1, characterized in that the generative adversarial network is trained with a stochastic gradient descent algorithm and optimized with an Adam optimizer; the network weights are initialized from a normal distribution with zero mean and standard deviation 0.02, and an L1 loss term with coefficient 100 is added to the loss function of the generative adversarial network.
CN202011427512.8A 2020-12-07 2020-12-07 Bone conduction speech enhancement method based on a generative adversarial network Pending CN112599145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011427512.8A 2020-12-07 2020-12-07 Bone conduction speech enhancement method based on a generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011427512.8A 2020-12-07 2020-12-07 Bone conduction speech enhancement method based on a generative adversarial network

Publications (1)

Publication Number Publication Date
CN112599145A true CN112599145A (en) 2021-04-02

Family

ID=75191383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011427512.8A Pending CN112599145A (en) Bone conduction speech enhancement method based on a generative adversarial network

Country Status (1)

Country Link
CN (1) CN112599145A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314109A (en) * 2021-07-29 2021-08-27 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
CN113420870A (en) * 2021-07-04 2021-09-21 西北工业大学 U-Net structure generation countermeasure network and method for underwater acoustic target recognition
CN114495958A (en) * 2022-04-14 2022-05-13 齐鲁工业大学 Voice enhancement system for generating confrontation network based on time modeling
CN115497496A (en) * 2022-09-22 2022-12-20 东南大学 FirePS convolutional neural network-based voice enhancement method
CN116416963A (en) * 2023-06-12 2023-07-11 深圳市遐拓科技有限公司 Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet
CN117633528A (en) * 2023-11-21 2024-03-01 元始智能科技(南通)有限公司 Manufacturing workshop energy consumption prediction technology based on small sample data restoration and enhancement
WO2024050802A1 (en) * 2022-09-09 2024-03-14 华为技术有限公司 Speech signal processing method, neural network training method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN110718232A (en) * 2019-09-23 2020-01-21 东南大学 Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition
US20200265857A1 (en) * 2019-02-15 2020-08-20 Shenzhen GOODIX Technology Co., Ltd. Speech enhancement method and apparatus, device and storage mediem
CN111968627A (en) * 2020-08-13 2020-11-20 中国科学技术大学 Bone conduction speech enhancement method based on joint dictionary learning and sparse representation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network
US20200265857A1 (en) * 2019-02-15 2020-08-20 Shenzhen GOODIX Technology Co., Ltd. Speech enhancement method and apparatus, device and storage mediem
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110648684A (en) * 2019-07-02 2020-01-03 中国人民解放军陆军工程大学 Bone conduction voice enhancement waveform generation method based on WaveNet
CN110718232A (en) * 2019-09-23 2020-01-21 东南大学 Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition
CN111968627A (en) * 2020-08-13 2020-11-20 中国科学技术大学 Bone conduction speech enhancement method based on joint dictionary learning and sparse representation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DAIKI WATANABE ET AL.: "Speech enhancement for bone-conducted speech based on low-order cepstrum restoration", 2017 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATION SYSTEMS (ISPACS) *
QING PAN ET AL.: "Bone-Conducted Speech to Air-Conducted Speech Conversion Based on Cycle-Consistent Adversarial Networks", 2020 IEEE 3RD INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND SIGNAL PROCESSING (ICICSP) *
ZHANG XIONGWEI ET AL.: "Research status and prospects of blind enhancement technology for bone-conduction microphone speech", JOURNAL OF DATA ACQUISITION AND PROCESSING *
FAN LIANGHUI ET AL.: "Speech enhancement based on conditional generative adversarial networks", COMPUTER AND DIGITAL ENGINEERING *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420870A (en) * 2021-07-04 2021-09-21 西北工业大学 U-Net structure generation countermeasure network and method for underwater acoustic target recognition
CN113420870B (en) * 2021-07-04 2023-12-22 西北工业大学 U-Net structure generation countermeasure network and method for underwater sound target recognition
CN113314109A (en) * 2021-07-29 2021-08-27 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
CN113314109B (en) * 2021-07-29 2021-11-02 南京烽火星空通信发展有限公司 Voice generation method based on cycle generation network
CN114495958A (en) * 2022-04-14 2022-05-13 齐鲁工业大学 Voice enhancement system for generating confrontation network based on time modeling
CN114495958B (en) * 2022-04-14 2022-07-05 齐鲁工业大学 Speech enhancement system for generating confrontation network based on time modeling
WO2024050802A1 (en) * 2022-09-09 2024-03-14 华为技术有限公司 Speech signal processing method, neural network training method and device
CN115497496A (en) * 2022-09-22 2022-12-20 东南大学 FirePS convolutional neural network-based voice enhancement method
CN115497496B (en) * 2022-09-22 2023-11-14 东南大学 Voice enhancement method based on FirePS convolutional neural network
CN116416963A (en) * 2023-06-12 2023-07-11 深圳市遐拓科技有限公司 Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet
CN116416963B (en) * 2023-06-12 2024-02-06 深圳市遐拓科技有限公司 Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet
CN117633528A (en) * 2023-11-21 2024-03-01 元始智能科技(南通)有限公司 Manufacturing workshop energy consumption prediction technology based on small sample data restoration and enhancement

Similar Documents

Publication Publication Date Title
CN112599145A (en) Bone conduction speech enhancement method based on a generative adversarial network
Yin et al. Phasen: A phase-and-harmonics-aware speech enhancement network
CN109671433B (en) Keyword detection method and related device
CN107452389B (en) Universal single-track real-time noise reduction method
Wang et al. On training targets for supervised speech separation
US8880396B1 (en) Spectrum reconstruction for automatic speech recognition
CN103489446B (en) Based on the twitter identification method that adaptive energy detects under complex environment
CN106504763A (en) Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
TW201248613A (en) System and method for monaural audio processing based preserving speech information
Shahnaz et al. Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme
CN111312275B (en) On-line sound source separation enhancement system based on sub-band decomposition
Wang et al. A structure-preserving training target for supervised speech separation
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN103021405A (en) Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
Ince et al. Ego noise suppression of a robot using template subtraction
CN114041185A (en) Method and apparatus for determining a depth filter
Paikrao et al. Consumer Personalized Gesture Recognition in UAV Based Industry 5.0 Applications
Li et al. A si-sdr loss function based monaural source separation
CN118212929A (en) Personalized Ambiosonic voice enhancement method
CN116994600B (en) Method and system for driving character mouth shape based on audio frequency
Selvi et al. Hybridization of spectral filtering with particle swarm optimization for speech signal enhancement
CN114189781A (en) Noise reduction method and system for double-microphone neural network noise reduction earphone
Jamal et al. A comparative study of IBM and IRM target mask for supervised malay speech separation from noisy background

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210402