CN112599145A - Bone conduction speech enhancement method based on generative adversarial network
- Publication number: CN112599145A
- Application number: CN202011427512.8A
- Authority: China (CN)
- Prior art keywords: network, bone conduction, voice, data, speech
- Prior art date: 2020-12-07
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0208: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
- G10L21/0216: Noise filtering characterised by the method used for estimating noise
- G10L25/30: Speech or voice analysis techniques characterised by the use of neural networks
- H04R3/04: Circuits for transducers, loudspeakers or microphones for correcting frequency response
Abstract
The invention relates to the fields of speech signal processing and deep learning, and aims to enable bone conduction equipment to perform well in extremely noisy occupations such as firefighting, special operations, mining, and emergency rescue, and to obtain good speech communication quality against a strong noise background. To this end, the technical scheme adopted by the invention is a bone conduction speech enhancement method based on a generative adversarial network: first, the collected bone-conduction and air-conduction speech is preprocessed, including short-time Fourier transform and cropping; second, the preprocessed speech data are fed into the constructed generative adversarial network for training; finally, the bone-conduction speech to be enhanced is input into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone-conduction speech. The method is mainly applied to bone-conduction speech signal processing.
Description
Technical Field
The invention relates to the fields of speech signal processing and deep learning, and in particular to a speech enhancement method based on a generative adversarial network, which is used to enhance bone-conduction speech and thereby facilitate communication with bone conduction equipment.
Background
Speech is a primary and important carrier for communication between people and between people and machines; it conveys all kinds of information and is widely used in scenarios such as voice communication and command issuing. However, the environments we live in are often full of noise, and the reception of speech signals is inevitably disturbed by it. Speech enhancement is the process of extracting, as far as possible, the clean useful signal from a speech signal polluted by noise by means of prediction and estimation, and research on speech enhancement technology has important value in daily life and production. Although modern speech enhancement techniques have made significant progress, existing speech enhancement algorithms still degrade markedly in complex, strongly noisy environments.
Unlike a conventional air-conduction microphone, which picks up sound transmitted through the air, a bone-conduction microphone picks up the vibration of bone through a vibration sensor and then converts that vibration into an audio signal. This pickup path shields most of the ambient noise, so the useful communication signal can be transmitted well even in a strong noise environment. Although bone-conduction speech effectively resists environmental noise, the changed sound transmission path and the limitations of current sensor technology mean that noise such as skin friction, sensor friction, and strong wind friction is mixed into the bone-conduction speech, so its quality is noticeably lower than that of air-conduction speech. Research on bone-conduction speech enhancement algorithms therefore has important theoretical significance and practical value for further improving speech communication quality in strong noise environments and for widening the range of applications of bone-conduction microphones.
Disclosure of Invention
To overcome the shortcomings of the prior art, namely the poor communication quality in today's complex, strongly noisy environments and the unsatisfactory performance of current speech enhancement techniques, the invention aims to provide a bone conduction speech enhancement method based on a generative adversarial network, enabling bone conduction equipment to perform well in extremely noisy occupations such as firefighting, special operations, mining, and emergency rescue, and to obtain good speech communication quality against a strong noise background. To this end, the technical scheme adopted by the invention is as follows: first, the collected bone-conduction and air-conduction speech is preprocessed, including short-time Fourier transform and cropping; second, the preprocessed speech data are fed into the constructed generative adversarial network for training; finally, the bone-conduction speech to be enhanced is input into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone-conduction speech.
Step one, voice data preprocessing:
First, bone-conduction and air-conduction data are recorded with a bone-conduction microphone and an air-conduction microphone. The data are then windowed and divided into frames, with 10-30 ms intercepted as one frame and an appropriate frame shift set when intercepting speech frames, i.e. consecutive frames overlap by no more than half a frame length. A window is selected, and a Hamming window is adopted to weight the speech signal; the window function of the Hamming window is given by equation (1):
w(n) = 0.54 - 0.46cos(2πn/(N-1)), 0 ≤ n ≤ N-1 (1)
A short-time Fourier transform is used: the non-stationary signal f(t) is regarded as stationary within the short interval of the analysis window w(t); if the analysis window is shifted so that f(t)w(t-τ) is also stationary over the different finite time periods, the power spectrum of the non-stationary signal at each moment can be calculated, the short-time Fourier transform of the non-stationary signal f(t) being expressed as:
STFT_f(τ,ω) = ∫f(t)w(t-τ)e^(-jωt)dt (2)
The original speech data are then transformed by STFT;
Step two: constructing and training the generative adversarial network:
Probabilistic generative model: in a continuous or discrete high-dimensional space χ there exists a random vector X obeying an unknown data distribution p_r(x), x ∈ χ. The generative model is a parameterized model p_θ(x), learned from a number of observable samples x(1), x(2), ..., x(N), that approximates the unknown distribution p_r(x) and is used to generate samples such that the "generated" samples are as similar as possible to the "real" samples;
A deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x). Assume a random vector Z obeys a simple distribution p(z), z ∈ Z; a deep neural network g: Z → χ is used so that g(Z) obeys p_r(x). In the low-dimensional space Z a simple, easily sampled distribution p(z) is provided, usually the standard multivariate normal distribution N(0, I); a mapping function G: Z → χ, called the generator network, is constructed as a neural network, and the fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x);
The goal of the Discriminator is to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x). The label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is
p(y=1|x) = D(x;φ) (3)
and the probability that the sample was generated by the model is then:
p(y=0|x) = 1 - D(x;φ) (4)
Given a sample (x, y), where y ∈ {1, 0} denotes whether it comes from p_r(x) or p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log-likelihood:
max_φ E_x~p_r(x)[log D(x;φ)] + E_z~p(z)[log(1 - D(G(z;θ);φ))] (5)
where θ and φ are the parameters of the generator network and the discriminator network, respectively;
The goal of the Generator is exactly the opposite of that of the discriminator network, namely to make the discriminator classify the samples it generates as real:
max_θ E_z~p(z)[log D(G(z;θ);φ)] (6)
Combining the discriminator network and the generator network, the overall objective function of the generative adversarial network is regarded as a minimax game:
min_G max_D V(D,G) = E_x~p_data(x)[log D(x)] + E_z~p_z(z)[log(1 - D(G(z)))] (7)
The short-time Fourier transform spectrum data of the air-conduction speech and of the bone-conduction speech are used as inputs to the generative adversarial network: the spectrum data of the air-conduction speech are regarded as samples from the real distribution, while the enhanced bone-conduction spectrum data produced by the generator G are regarded as model-generated data; through adversarial training of the generator G and the discriminator D, enhanced bone-conduction spectrum data similar to air-conduction speech are obtained, thereby achieving the goal of enhancing the bone-conduction speech;
The generator G consists of a fully convolutional neural network with skip connections; it has 8 layers in total, the number of convolution kernels is set to 64, and it is divided into an Encoder part and a Decoder part;
The discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total. The air-conduction spectrum data and the bone-conduction spectrum data are input into the discriminator D for training; the discriminator is trained to identify the air-conduction speech data as real data and give it a high score close to 1, and to identify the enhanced bone-conduction speech data generated by the generator G and give it a low score.
The generative adversarial network is trained with a stochastic gradient descent algorithm and optimized with an Adam optimizer; the network weights are initialized from a normal distribution with zero mean and a standard deviation of 0.02, and an L1 loss term with a coefficient of 100 is added to the loss function of the generative adversarial network.
The invention has the characteristics and beneficial effects that:
the invention uses the antagonism generation network to enhance the bone conduction voice with poor tone quality, and the enhanced bone conduction voice has obvious improvement in tone quality and intelligibility. The invention has important guiding significance and practical value for improving the voice communication quality in the strong noise environment and further expanding the application range of the bone conduction microphone.
Description of the drawings:
Fig. 1 is a flow chart of the bone conduction speech enhancement method based on a generative adversarial network.
Fig. 2 is a flow diagram of the generative adversarial network.
Fig. 3 is a flow diagram of the enhanced speech reconstruction.
Fig. 4 compares the spectrograms of bone-conduction speech and enhanced bone-conduction speech.
Detailed Description
The method enhances bone-conduction speech using a generative adversarial network. First, the collected bone-conduction and air-conduction speech is preprocessed by short-time Fourier transform, cropping, and the like. Second, the preprocessed speech data are fed into the constructed generative adversarial network for training. Finally, the bone-conduction speech to be enhanced is input into the trained generator G, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone-conduction speech.
Overall, the invention comprises three functional modules: a bone-conduction and air-conduction speech preprocessing module, an adversarial network training module, and a bone-conduction speech enhancement module. The preprocessing module performs short-time Fourier transform, size cropping, and similar preprocessing on the collected bone-conduction and air-conduction speech; the training module carries out adversarial training of the generator G and the discriminator D on the input bone-conduction and air-conduction data; the enhancement module enhances the bone-conduction speech with the trained generator G and then applies the inverse short-time Fourier transform to the enhanced data to obtain the final enhanced bone-conduction speech.
The specific implementation steps of the bone conduction speech enhancement method based on a generative adversarial network are as follows:
step one, voice data preprocessing:
First, the bone-conduction and air-conduction data recorded by a bone-conduction microphone and an air-conduction microphone are saved as wav files at a sampling rate of 16 kHz and given corresponding file names. The speech signal then needs to be windowed and divided into frames. Typically 10-30 ms is intercepted as one frame, because the speech signal can be regarded as stationary over such a period, and an appropriate frame shift must be set when intercepting speech frames, i.e. consecutive frames overlap by no more than half a frame length, mainly to ensure a smooth transition between speech frames. The window (both shape and length) must be chosen to minimize the truncation effect on the speech frame, i.e. the slopes at both ends of the time window are reduced so that the edges of the window change gradually rather than abruptly. The method weights the speech signal with a finite-length window function such as the Hamming window, whose window function is given by equation (1):
w(n) = 0.54 - 0.46cos(2πn/(N-1)), 0 ≤ n ≤ N-1 (1)
After windowing, the original speech signal is cut into a number of short-time speech frames with stable characteristics, and further speech analysis can then proceed by extracting speech feature parameters.
Traditional signal analysis is based on the Fourier transform, which is a global transform carried out entirely in the time domain or entirely in the frequency domain; it therefore cannot express the joint time-frequency behavior of a signal, which is the most fundamental and critical property of a non-stationary signal. To analyze and process non-stationary signals, researchers have generalized and even fundamentally revised Fourier analysis, proposing and developing a series of new signal analysis theories. The method uses the short-time Fourier transform, whose basic idea is as follows: assume the non-stationary signal f(t) is stationary within the short interval of the analysis window w(t); if the analysis window is shifted so that f(t)w(t-τ) is also stationary over the different finite time periods, the power spectrum of the non-stationary signal at each moment can be calculated. The short-time Fourier transform of the non-stationary signal f(t) can be expressed as:
STFT_f(τ,ω) = ∫f(t)w(t-τ)e^(-jωt)dt (2)
The original speech data are then processed by STFT with the following parameter settings: sampling rate 16 kHz, 512 FFT points, Hamming window length 32 ms, frame overlap 16 ms. With these settings the frequency resolution is 16 kHz / 512 = 31.25 Hz. Owing to symmetry, only the STFT magnitude vector covering the 257 positive-frequency points needs to be considered, and the last row of the STFT spectrum, i.e. one frequency bin 31.25 Hz wide, is ignored. The effect of this data loss is essentially negligible, but it allows the generator and discriminator inputs to be designed with power-of-two sizes, which makes the subsequent training of the generative adversarial network more efficient.
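For concreteness, this preprocessing step can be sketched in a few lines of Python. This is an illustrative reading of the stated parameters, not code from the patent; librosa is assumed as the STFT implementation, and the function name is hypothetical:

```python
import numpy as np
import librosa

SR = 16000     # sampling rate: 16 kHz
N_FFT = 512    # 512 FFT points -> 257 positive-frequency bins
WIN = 512      # 32 ms Hamming window at 16 kHz
HOP = 256      # 16 ms frame shift (half-window overlap)

def stft_magnitude_and_phase(wav_path):
    """Load a 16 kHz wav file and return the 256-bin STFT magnitude
    (the last of the 257 bins is dropped for power-of-two sizing)
    together with the full 257-bin phase for later reconstruction."""
    y, _ = librosa.load(wav_path, sr=SR)
    spec = librosa.stft(y, n_fft=N_FFT, hop_length=HOP,
                        win_length=WIN, window="hamming")
    return np.abs(spec)[:-1, :], np.angle(spec)
```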
Step two: constructing and generating a countermeasure network and training:
probabilistic generative model, generative model for short(Generation Model), which is an important Model in probability statistics and machine learning, refers to a series of models for randomly generating observable data. Assuming that in a continuous or discrete high-dimensional space χ, there is a random vector X obeying an unknown data distribution pr(x) And x ∈ χ. The generative model is a parameterized model p learned from a number of observable samples x (1), x (2), … …, x (N)θ(x) To approximate the unknown distribution pr(x) And some samples can be generated with this model so that the "generated" samples and the "real" samples are as similar as possible.
A deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x). Assume a random vector Z obeys a simple distribution p(z), z ∈ Z (such as the standard normal distribution); a deep neural network g: Z → χ can then be used so that g(Z) obeys p_r(x). Assume there is a simple, easily sampled distribution p(z) in the low-dimensional space Z, usually the standard multivariate normal distribution N(0, I). A mapping function G: Z → χ, called the generator network, is constructed as a neural network, and the powerful fitting capacity of neural networks is used to make G(z) obey the data distribution p_r(x).
A generative adversarial network (GAN) makes the samples produced by the generator network obey the real data distribution by means of adversarial training. In a GAN there are two networks undergoing adversarial training. One is the discriminator network, whose goal is to judge as accurately as possible whether a sample comes from the real data or was produced by the generator network; the other is the generator network, whose goal is to generate samples whose origin the discriminator network cannot determine. These two networks, with opposite goals, are trained alternately. At convergence, if the discriminator network can no longer determine the source of a sample, this is equivalent to the generator network being able to generate samples that conform to the real data distribution.
The goal of the Discriminator is to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x); the discriminator network is therefore in fact a binary classifier. The label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is
p(y=1|x) = D(x;φ) (3)
The probability that the sample was generated by the model is then:
p(y=0|x) = 1 - D(x;φ) (4)
Given a sample (x, y), where y ∈ {1, 0} denotes whether it comes from p_r(x) or p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log-likelihood:
max_φ E_x~p_r(x)[log D(x;φ)] + E_z~p(z)[log(1 - D(G(z;θ);φ))] (5)
where θ and φ are the parameters of the generator network and the discriminator network, respectively.
The goal of the Generator Network is exactly the opposite of that of the discriminator network, namely to make the discriminator classify the samples it generates as real:
max_θ E_z~p(z)[log D(G(z;θ);φ)] (6)
Combining the discriminator network and the generator network, the overall objective function of the generative adversarial network is regarded as a Minimax Game:
min_G max_D V(D,G) = E_x~p_data(x)[log D(x)] + E_z~p_z(z)[log(1 - D(G(z)))] (7)
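As a concrete reading of equations (6) and (7), the minimax game is usually implemented as two alternating losses, one per network. The following PyTorch sketch is illustrative rather than code from the patent; the discriminator D is assumed to output a probability in (0, 1):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_spec, fake_spec):
    """Discriminator step of eq. (7): push D(real) toward 1 and
    D(fake) toward 0 via binary cross entropy."""
    d_real = D(real_spec)
    d_fake = D(fake_spec.detach())  # detach: this step updates D only
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake)))

def generator_adv_loss(D, fake_spec):
    """Generator step, eq. (6): make D label generated samples as real."""
    d_fake = D(fake_spec)
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```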
In the present invention, the short-time Fourier transform spectrum data of the air-conduction speech and of the bone-conduction speech are used as inputs to the generative adversarial network. The spectrum data of the air-conduction speech are regarded as samples from the true distribution, and the enhanced bone-conduction spectrum data produced by the generator G are regarded as model-generated data. Through adversarial training of the generator G and the discriminator D, enhanced bone-conduction spectrum data similar to air-conduction speech are obtained, thereby achieving the goal of enhancing the bone-conduction speech.
The generator G consists of a fully convolutional neural network with skip connections. It has 8 layers in total, the number of convolution kernels is set to 64, and it can be divided into an Encoder part and a Decoder part. The input size of the generator G is 256×256, and downsampling convolutions are performed first in the Encoder part: the first convolution layer has stride 2 and 64 kernels, giving an output of 128×128×64; the second has stride 2 and 128 kernels, giving 64×64×128; the third has stride 2 and 256 kernels, giving 32×32×256; the fourth has stride 2 and 512 kernels, giving 16×16×512; the fifth has stride 2 and 512 kernels, giving 8×8×512; the sixth has stride 2 and 512 kernels, giving 4×4×512; the seventh has stride 2 and 512 kernels, giving 2×2×512. After these seven convolution layers the input data reach the bottleneck layer. The data at the bottleneck are then upsampled by the Decoder, which applies deconvolutions with the same parameter settings to restore the 256×256 output size. Meanwhile, skip connections are established between corresponding downsampling and upsampling layers, combining the training of deeper layers (richer global information) with that of shallower layers (more local detail), so that local structure is learned while the global structure is observed, further improving the performance of the generator.
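A hedged PyTorch sketch of this encoder-decoder generator follows. The stride-2 convolutions, channel widths (64, 128, 256, then 512), 256×256 input, and skip connections follow the description above; the 4×4 kernels, batch normalization, activations, class name, and the exact mapping of the patent's "8 layers" onto encoder and decoder stages are assumptions the patent does not specify:

```python
import torch
import torch.nn as nn

class GeneratorG(nn.Module):
    """Encoder-decoder generator with skip connections (U-Net style)."""
    def __init__(self):
        super().__init__()
        # Channel widths per the text: 64, 128, 256, then 512s.
        chans = [1, 64, 128, 256, 512, 512, 512, 512]
        # Encoder: seven stride-2 convolutions, 256x256 -> 2x2 bottleneck.
        self.encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 4, stride=2, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.LeakyReLU(0.2),
            )
            for i in range(7)
        )
        # Decoder: mirrored stride-2 deconvolutions; all but the first
        # take the matching encoder output concatenated on channels.
        self.decoders = nn.ModuleList()
        for i in reversed(range(7)):
            in_ch = chans[i + 1] * (1 if i == 6 else 2)
            out_ch = chans[i] if i > 0 else 64
            self.decoders.append(
                nn.Sequential(
                    nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                    nn.BatchNorm2d(out_ch),
                    nn.ReLU(),
                )
            )
        self.out = nn.Conv2d(64, 1, 3, padding=1)  # back to one channel

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        for i, dec in enumerate(self.decoders):
            if i > 0:  # skip connection to the matching encoder layer
                x = torch.cat([x, skips[6 - i]], dim=1)
            x = dec(x)
        return self.out(x)
```

For a 1×1×256×256 input this sketch returns a 1×1×256×256 enhanced magnitude map, each decoder stage concatenating the matching encoder output as described.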
The discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total. During training, the air-conduction spectrum data and the bone-conduction spectrum data are input into the discriminator D; the discriminator is trained to identify the air-conduction speech data as real data and give them a high score (close to 1), and to identify the enhanced bone-conduction speech data produced by the generator G and give them a low score (close to 0).
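A corresponding sketch of the 3-layer binary-classification discriminator might look as follows; the kernel sizes, the global average pooling that reduces the final feature map to a single score, and the sigmoid output are assumptions:

```python
import torch.nn as nn

class DiscriminatorD(nn.Module):
    """Three-convolution binary classifier scoring a spectrogram in (0, 1)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, stride=2, padding=1),    # 256 -> 128
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 128 -> 64
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),   # 64 -> 32
            nn.AdaptiveAvgPool2d(1),  # average the 32x32 map to one score
            nn.Flatten(),
            nn.Sigmoid(),             # probability of "real" (air-conducted)
        )

    def forward(self, x):
        return self.net(x)
```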
The method trains the generative adversarial network with a stochastic gradient descent algorithm and optimizes it with an Adam optimizer. The training period is set to 400 epochs, the learning rate to 0.0002, and the learning rate starts to decay linearly at the halfway point of training. The network weights are initialized from a normal distribution with zero mean and a standard deviation of 0.02, and an L1 loss term with a coefficient of 100 is added to the generator's loss function.
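Wiring the sketches above together under the stated hyperparameters (400 epochs, learning rate 0.0002 decaying linearly from the halfway point, N(0, 0.02) weight initialization, L1 coefficient 100) could look like the following; the paired data loader, Adam defaults, and the alternating update order are assumed details:

```python
import torch
import torch.nn.functional as F

EPOCHS, LR, L1_W = 400, 2e-4, 100.0

def init_weights(m):
    """Initialize convolution weights from N(0, 0.02) as stated above."""
    if isinstance(m, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)

G, D = GeneratorG(), DiscriminatorD()      # sketches defined above
G.apply(init_weights); D.apply(init_weights)
opt_g = torch.optim.Adam(G.parameters(), lr=LR)
opt_d = torch.optim.Adam(D.parameters(), lr=LR)
# Constant learning rate for the first half, then linear decay to zero.
decay = lambda e: min(1.0, 2.0 * (1.0 - e / EPOCHS))
sch_g = torch.optim.lr_scheduler.LambdaLR(opt_g, decay)
sch_d = torch.optim.lr_scheduler.LambdaLR(opt_d, decay)

# Stand-in for a real paired (bone, air) spectrogram DataLoader.
loader = [(torch.rand(4, 1, 256, 256), torch.rand(4, 1, 256, 256))]

for epoch in range(EPOCHS):
    for bone_spec, air_spec in loader:
        fake = G(bone_spec)
        opt_d.zero_grad()
        discriminator_loss(D, air_spec, fake).backward()
        opt_d.step()
        opt_g.zero_grad()
        # Adversarial term plus the L1 term with coefficient 100.
        (generator_adv_loss(D, fake) +
         L1_W * F.l1_loss(fake, air_spec)).backward()
        opt_g.step()
    sch_g.step(); sch_d.step()
```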
Step three: enhancing the bone-conduction speech with the trained network:
First, the configuration file of the generator G trained in step two is read. The bone-conduction speech data to be enhanced are processed by short-time Fourier transform, and the resulting short-time Fourier spectrum is split into its magnitude and phase; the magnitude data are input into the generator G for enhancement. The enhanced data produced by the generator are then combined with the phase part of the original short-time Fourier spectrum, and the enhanced bone-conduction speech is obtained by inverse short-time Fourier transform reconstruction.
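Step three can be read as the sketch below, reusing the preprocessing constants and the GeneratorG sketch above. Zero-padding the dropped frequency bin back to 257 rows before reconstruction, and cropping the frame count to suit the fully convolutional generator, are assumed details the patent does not spell out:

```python
import numpy as np
import librosa
import torch

def enhance(wav_path, generator):
    """Enhance one bone-conduction recording with the trained generator."""
    mag, phase = stft_magnitude_and_phase(wav_path)  # 256 x T and 257 x T
    with torch.no_grad():
        x = torch.from_numpy(mag).float()[None, None]  # 1 x 1 x 256 x T
        enhanced = generator(x)[0, 0].numpy()
    # Restore the dropped 257th bin with zeros, then reattach the phase.
    enhanced = np.vstack([enhanced, np.zeros((1, enhanced.shape[1]))])
    spec = enhanced * np.exp(1j * phase)
    return librosa.istft(spec, hop_length=HOP, win_length=WIN,
                         window="hamming")
```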
Claims (3)
1. A bone conduction speech enhancement method based on a generative adversarial network, characterized in that the collected bone-conduction speech and air-conduction speech are first preprocessed by short-time Fourier transform and cropping; second, the preprocessed speech data are fed into the constructed generative adversarial network for training; and finally, the bone-conduction speech to be enhanced is input into the trained generator G of the generative adversarial network, and the resulting output is reconstructed by inverse short-time Fourier transform to produce the enhanced bone-conduction speech.
2. The bone conduction speech enhancement method based on a generative adversarial network as claimed in claim 1, characterized by step one, speech data preprocessing:
First, bone-conduction and air-conduction data are recorded with a bone-conduction microphone and an air-conduction microphone. The data are then windowed and divided into frames, with 10-30 ms intercepted as one frame and an appropriate frame shift set when intercepting speech frames, i.e. consecutive frames overlap by no more than half a frame length. A window is selected, and a Hamming window is adopted to weight the speech signal; the window function of the Hamming window is given by equation (1):
w(n) = 0.54 - 0.46cos(2πn/(N-1)), 0 ≤ n ≤ N-1 (1)
A short-time Fourier transform is used: the non-stationary signal f(t) is regarded as stationary within the short interval of the analysis window w(t); if the analysis window is shifted so that f(t)w(t-τ) is also stationary over the different finite time periods, the power spectrum of the non-stationary signal at each moment can be calculated, the short-time Fourier transform of the non-stationary signal f(t) being expressed as:
STFT_f(τ,ω) = ∫f(t)w(t-τ)e^(-jωt)dt (2)
The original speech data are then transformed by STFT;
Step two: constructing and training the generative adversarial network:
Probabilistic generative model: in a continuous or discrete high-dimensional space χ there exists a random vector X obeying an unknown data distribution p_r(x), x ∈ χ. The generative model is a parameterized model p_θ(x), learned from a number of observable samples x(1), x(2), ..., x(N), that approximates the unknown distribution p_r(x) and is used to generate samples such that the "generated" samples are as similar as possible to the "real" samples;
A deep generative model uses the ability of a deep neural network to approximate an arbitrary function in order to model the complex distribution p_r(x). Assume a random vector Z obeys a simple distribution p(z), z ∈ Z; a deep neural network g: Z → χ is used so that g(Z) obeys p_r(x). In the low-dimensional space Z a simple, easily sampled distribution p(z) is provided, usually the standard multivariate normal distribution N(0, I); a mapping function G: Z → χ, called the generator network, is constructed as a neural network, and the fitting capacity of the neural network is used to make G(z) obey the data distribution p_r(x);
The goal of the Discriminator is to distinguish whether a sample x comes from the true distribution p_r(x) or from the generative model p_θ(x). The label y = 1 indicates that the sample comes from the true distribution and y = 0 that it comes from the model; the output of the discriminator network D(x; φ) is the probability that x belongs to the true data distribution, that is
p(y=1|x) = D(x;φ) (3)
The probability that the sample was generated by the model is then:
p(y=0|x) = 1 - D(x;φ) (4)
Given a sample (x, y), where y ∈ {1, 0} denotes whether it comes from p_r(x) or p_θ(x), the objective function of the discriminator network is the minimum cross entropy, i.e. the maximum log-likelihood:
max_φ E_x~p_r(x)[log D(x;φ)] + E_z~p(z)[log(1 - D(G(z;θ);φ))] (5)
where θ and φ are the parameters of the generator network and the discriminator network, respectively;
The goal of the Generator is exactly the opposite of that of the discriminator network, namely to make the discriminator classify the samples it generates as real:
max_θ E_z~p(z)[log D(G(z;θ);φ)] (6)
Combining the discriminator network and the generator network, the overall objective function of the generative adversarial network is regarded as a minimax game:
min_G max_D V(D,G) = E_x~p_data(x)[log D(x)] + E_z~p_z(z)[log(1 - D(G(z)))] (7)
The short-time Fourier transform spectrum data of the air-conduction speech and of the bone-conduction speech are used as inputs to the generative adversarial network: the spectrum data of the air-conduction speech are regarded as samples from the real distribution, while the enhanced bone-conduction spectrum data produced by the generator G are regarded as model-generated data; through adversarial training of the generator G and the discriminator D, enhanced bone-conduction spectrum data similar to air-conduction speech are obtained, thereby achieving the goal of enhancing the bone-conduction speech;
The generator G consists of a fully convolutional neural network with skip connections; it has 8 layers in total, the number of convolution kernels is set to 64, and it is divided into an Encoder part and a Decoder part;
The discriminator D is a binary-classification convolutional neural network with 3 convolutional layers in total; the air-conduction spectrum data and the bone-conduction spectrum data are input into the discriminator D for training; the discriminator is trained to identify the air-conduction speech data as real data and give it a high score close to 1, and to identify the enhanced bone-conduction speech data generated by the generator G and give it a low score.
3. The bone conduction speech enhancement method based on a generative adversarial network as claimed in claim 1, characterized in that the generative adversarial network is trained with a stochastic gradient descent algorithm and optimized with an Adam optimizer; the network weights are initialized from a normal distribution with zero mean and a standard deviation of 0.02, and an L1 loss term with a coefficient of 100 is added to the loss function of the generative adversarial network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011427512.8A CN112599145A (en) | 2020-12-07 | 2020-12-07 | Bone conduction speech enhancement method based on generative adversarial network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011427512.8A CN112599145A (en) | 2020-12-07 | 2020-12-07 | Bone conduction speech enhancement method based on generative adversarial network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112599145A (en) | 2021-04-02 |
Family
ID=75191383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011427512.8A Pending CN112599145A (en) | 2020-12-07 | 2020-12-07 | Bone conduction speech enhancement method based on generative adversarial network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112599145A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113314109A (en) * | 2021-07-29 | 2021-08-27 | 南京烽火星空通信发展有限公司 | Voice generation method based on cycle generation network |
CN113420870A (en) * | 2021-07-04 | 2021-09-21 | 西北工业大学 | U-Net structure generation countermeasure network and method for underwater acoustic target recognition |
CN114495958A (en) * | 2022-04-14 | 2022-05-13 | 齐鲁工业大学 | Voice enhancement system for generating confrontation network based on time modeling |
CN115497496A (en) * | 2022-09-22 | 2022-12-20 | 东南大学 | FirePS convolutional neural network-based voice enhancement method |
CN116416963A (en) * | 2023-06-12 | 2023-07-11 | 深圳市遐拓科技有限公司 | Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet |
CN117633528A (en) * | 2023-11-21 | 2024-03-01 | 元始智能科技(南通)有限公司 | Manufacturing workshop energy consumption prediction technology based on small sample data restoration and enhancement |
WO2024050802A1 (en) * | 2022-09-09 | 2024-03-14 | 华为技术有限公司 | Speech signal processing method, neural network training method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886967A (en) * | 2017-11-18 | 2018-04-06 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
CN110718232A (en) * | 2019-09-23 | 2020-01-21 | 东南大学 | Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition |
US20200265857A1 (en) * | 2019-02-15 | 2020-08-20 | Shenzhen GOODIX Technology Co., Ltd. | Speech enhancement method and apparatus, device and storage mediem |
CN111968627A (en) * | 2020-08-13 | 2020-11-20 | 中国科学技术大学 | Bone conduction speech enhancement method based on joint dictionary learning and sparse representation |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886967A (en) * | 2017-11-18 | 2018-04-06 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement method of deep bidirectional gate recurrent neural network |
US20200265857A1 (en) * | 2019-02-15 | 2020-08-20 | Shenzhen GOODIX Technology Co., Ltd. | Speech enhancement method and apparatus, device and storage mediem |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110648684A (en) * | 2019-07-02 | 2020-01-03 | 中国人民解放军陆军工程大学 | Bone conduction voice enhancement waveform generation method based on WaveNet |
CN110718232A (en) * | 2019-09-23 | 2020-01-21 | 东南大学 | Speech enhancement method for generating countermeasure network based on two-dimensional spectrogram and condition |
CN111968627A (en) * | 2020-08-13 | 2020-11-20 | 中国科学技术大学 | Bone conduction speech enhancement method based on joint dictionary learning and sparse representation |
Non-Patent Citations (4)
Title |
---|
DAIKI WATANABE ET AL.: "Speech enhancement for bone-conducted speech based on low-order cepstrum restoration", 2017 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATION SYSTEMS (ISPACS) * |
QING PAN ET AL.: "Bone-Conducted Speech to Air-Conducted Speech Conversion Based on Cycle-Consistent Adversarial Networks", 2020 IEEE 3RD INTERNATIONAL CONFERENCE ON INFORMATION COMMUNICATION AND SIGNAL PROCESSING (ICICSP) * |
ZHANG XIONGWEI ET AL.: "Research status and prospects of blind enhancement technology for bone-conducted microphone speech", JOURNAL OF DATA ACQUISITION AND PROCESSING * |
FAN LIANGHUI ET AL.: "Speech enhancement based on conditional generative adversarial networks", COMPUTER & DIGITAL ENGINEERING * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113420870A (en) * | 2021-07-04 | 2021-09-21 | 西北工业大学 | U-Net structure generation countermeasure network and method for underwater acoustic target recognition |
CN113420870B (en) * | 2021-07-04 | 2023-12-22 | 西北工业大学 | U-Net structure generation countermeasure network and method for underwater sound target recognition |
CN113314109A (en) * | 2021-07-29 | 2021-08-27 | 南京烽火星空通信发展有限公司 | Voice generation method based on cycle generation network |
CN113314109B (en) * | 2021-07-29 | 2021-11-02 | 南京烽火星空通信发展有限公司 | Voice generation method based on cycle generation network |
CN114495958A (en) * | 2022-04-14 | 2022-05-13 | 齐鲁工业大学 | Voice enhancement system for generating confrontation network based on time modeling |
CN114495958B (en) * | 2022-04-14 | 2022-07-05 | 齐鲁工业大学 | Speech enhancement system for generating confrontation network based on time modeling |
WO2024050802A1 (en) * | 2022-09-09 | 2024-03-14 | 华为技术有限公司 | Speech signal processing method, neural network training method and device |
CN115497496A (en) * | 2022-09-22 | 2022-12-20 | 东南大学 | FirePS convolutional neural network-based voice enhancement method |
CN115497496B (en) * | 2022-09-22 | 2023-11-14 | 东南大学 | Voice enhancement method based on FirePS convolutional neural network |
CN116416963A (en) * | 2023-06-12 | 2023-07-11 | 深圳市遐拓科技有限公司 | Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet |
CN116416963B (en) * | 2023-06-12 | 2024-02-06 | 深圳市遐拓科技有限公司 | Speech synthesis method suitable for bone conduction clear processing model in fire-fighting helmet |
CN117633528A (en) * | 2023-11-21 | 2024-03-01 | 元始智能科技(南通)有限公司 | Manufacturing workshop energy consumption prediction technology based on small sample data restoration and enhancement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112599145A (en) | | Bone conduction speech enhancement method based on generative adversarial network | |
Yin et al. | Phasen: A phase-and-harmonics-aware speech enhancement network | |
CN109671433B (en) | Keyword detection method and related device | |
CN107452389B (en) | Universal single-track real-time noise reduction method | |
Wang et al. | On training targets for supervised speech separation | |
US8880396B1 (en) | Spectrum reconstruction for automatic speech recognition | |
CN103489446B (en) | Based on the twitter identification method that adaptive energy detects under complex environment | |
CN106504763A (en) | Based on blind source separating and the microphone array multiple target sound enhancement method of spectrum-subtraction | |
Xiao et al. | Normalization of the speech modulation spectra for robust speech recognition | |
TW201248613A (en) | System and method for monaural audio processing based preserving speech information | |
Shahnaz et al. | Pitch estimation based on a harmonic sinusoidal autocorrelation model and a time-domain matching scheme | |
CN111312275B (en) | On-line sound source separation enhancement system based on sub-band decomposition | |
Wang et al. | A structure-preserving training target for supervised speech separation | |
CN111192598A (en) | Voice enhancement method for jump connection deep neural network | |
CN103021405A (en) | Voice signal dynamic feature extraction method based on MUSIC and modulation spectrum filter | |
CN111899750B (en) | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network | |
Ince et al. | Ego noise suppression of a robot using template subtraction | |
CN114041185A (en) | Method and apparatus for determining a depth filter | |
Paikrao et al. | Consumer Personalized Gesture Recognition in UAV Based Industry 5.0 Applications | |
Li et al. | A si-sdr loss function based monaural source separation | |
CN118212929A (en) | Personalized Ambiosonic voice enhancement method | |
CN116994600B (en) | Method and system for driving character mouth shape based on audio frequency | |
Selvi et al. | Hybridization of spectral filtering with particle swarm optimization for speech signal enhancement | |
CN114189781A (en) | Noise reduction method and system for double-microphone neural network noise reduction earphone | |
Jamal et al. | A comparative study of IBM and IRM target mask for supervised malay speech separation from noisy background |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210402 |