CN110648684A - Bone conduction voice enhancement waveform generation method based on WaveNet - Google Patents

Bone conduction voice enhancement waveform generation method based on WaveNet

Info

Publication number
CN110648684A
CN110648684A
Authority
CN
China
Prior art keywords
bone conduction
amplitude spectrum
voice
conduction voice
wavenet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910590941.8A
Other languages
Chinese (zh)
Other versions
CN110648684B (en)
Inventor
Zhang Xiongwei
Zheng Changyan
Yang Jibin
Cao Tieyong
Li Li
Sun Meng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201910590941.8A priority Critical patent/CN110648684B/en
Publication of CN110648684A publication Critical patent/CN110648684A/en
Application granted granted Critical
Publication of CN110648684B publication Critical patent/CN110648684B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain
    • G10L 21/0316: Speech enhancement by changing the amplitude
    • G10L 21/0324: Details of processing therefor
    • G10L 21/0332: Details of processing therefor involving modification of waveforms
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique

Abstract

The invention discloses a bone conduction speech enhancement waveform generation method based on WaveNet. The method uses a WaveNet model to generate high-quality speech on top of bone conduction speech magnitude spectrum enhancement performed by a BLSTM model. First, a BLSTM model and a WaveNet model are constructed, a cross-sampling-rate up-sampling module is introduced into the WaveNet model, and the two models are trained separately. Then the magnitude spectrum of the low-sampling-rate bone conduction speech to be enhanced is fed into the trained BLSTM to obtain an enhanced magnitude spectrum, and the enhanced magnitude spectrum together with the bone conduction speech phase information is fed into the trained WaveNet model to obtain an enhanced speech waveform at a high sampling rate. The invention effectively exploits the bone conduction speech phase information, provides spectral expansion, and can generate the enhanced high-sampling-rate speech waveform directly from the enhanced bone conduction speech magnitude spectrum and the bone conduction speech phase information, thereby significantly improving the quality of bone conduction speech.

Description

Bone conduction voice enhancement waveform generation method based on WaveNet
Technical Field
The invention relates to the technical field of bone conduction, in particular to a bone conduction voice enhancement waveform generation method based on WaveNet.
Background
A bone conduction microphone picks up speech from the vibrations produced by the skull, larynx and other body parts when a person speaks. Because its signal transmission channel shields the influence of ambient noise, bone conduction speech has strong noise immunity compared with speech captured by a conventional air conduction microphone, and it has broad application prospects in both military and civilian fields. However, due to the low-pass nature of transmission through the human body, the high-frequency components of bone conduction speech are severely attenuated, with frequency components usually below 2.5 kHz; moreover, the vibration-generated signal does not pass through articulation regions such as the oral cavity, nasal cavity and lips, so consonants related to these regions, such as fricatives, plosives and unvoiced sounds, are severely lost. As a result, bone conduction speech sounds dull, unclear and unnatural, and its intelligibility is low.
Bone conduction speech enhancement refers to processing bone conduction speech to improve its quality, and can be modeled as a conversion problem from bone conduction speech to the corresponding clean air conduction speech. Early algorithms generally decomposed speech into spectral envelope features and excitation features based on the source-filter model of the speech signal. Because the human ear is more sensitive to spectral envelope features than to excitation features, research focused on converting the low-dimensional spectral envelope of bone conduction speech to that of air conduction speech, learning the conversion relationship between the features with a Gaussian mixture model or a shallow neural network. In recent years, with the rapid development of deep learning, deep neural networks have been used to learn the nonlinear conversion relationship between the spectral envelope features of the two kinds of speech more accurately. Because deep neural networks characterize high-dimensional data well, researchers have also begun to use high-dimensional short-time Fourier transform (STFT) spectra to better characterize the gap between the two kinds of speech; for example, document 1 (Liu H P, Tsao Y, Fuh C S, "Bone-conducted speech enhancement using deep denoising autoencoder," Speech Communication, vol. 104, pp. 106-112, 2018) uses deep networks to learn the conversion between the high-dimensional Mel spectral features of the two kinds of speech. In earlier work, document 2 (Changyan Zheng, Xiongwei Zhang, Meng Sun, Jibin Yang, Yibo Xing, "A Novel Throat Microphone Speech Enhancement Framework based on Deep BLSTM Recurrent Neural Networks," in Proc. IEEE International Conference on Computer and Communications (ICCC), 2018) proposed a bone conduction speech enhancement method based on a bidirectional long short-term memory (BLSTM) recurrent neural network, which uses the context of speech frames to model the conversion from bone conduction speech to the corresponding high-dimensional air conduction speech magnitude spectrum and effectively improves the quality of enhanced bone conduction speech.
After obtaining the converted spectral envelope or magnitude spectrum, the above methods generally synthesize the speech waveform directly using the excitation signal or phase spectrum of the original bone conduction speech. In practice, however, the excitation signal and phase spectrum of bone conduction speech also differ from those of the corresponding air conduction speech. Under the source-filter model the waveform is synthesized from excitation and envelope; when the excitation signal is mismatched the synthesized speech is discontinuous, and because the feature dimension is low, even small distortions can greatly degrade auditory perception. A few studies have explored converting the excitation features of bone conduction speech to those of air conduction speech: document 3 (Mallidi S H, Yegnanarayana B, et al., "Speaker-dependent mapping of source and system features for enhancement of throat microphone speech," in Proc. Annual Conference of the International Speech Communication Association, 2010) replaces the excitation signal with glottal closure instant (GCI) features and converts the GCI features with a shallow neural network, and document 4 (Turan M A T, Erzin E, "A new statistical excitation mapping for enhancement of throat microphone recordings") converts the excitation features with a statistical mapping model. However, because the excitation signal is closely related to vocal-fold motion and frication noise, its regularity is weak, so modeling it is difficult and the conversion effect is not ideal. For waveform synthesis based on the short-time inverse Fourier transform, the short-time Fourier transform analyzes the signal with overlapping frames, so part of the signal information is shared between frames and a specific correlation is introduced among the STFT coefficients. If the speech waveform is synthesized from the enhanced magnitude spectrum and the original bone conduction speech phase, the original constraint between the magnitude spectrum and the phase spectrum of the STFT coefficients is lost, and the synthesized speech is degraded by this mismatch even if the magnitude spectrum is optimal. In the field of bone conduction speech enhancement, bone conduction speech and air conduction speech are quite similar in the low-frequency part, so the quality of speech synthesized with the original bone conduction speech phase is acceptable, but the loss of sound quality caused by the mismatched phase spectrum during waveform synthesis is still clearly present.
Disclosure of Invention
The purpose of the invention is to provide a WaveNet-based bone conduction speech enhancement waveform generation method that markedly improves the quality of bone conduction speech and can obtain an enhanced high-sampling-rate speech waveform directly from the enhanced bone conduction speech magnitude spectrum and the bone conduction speech phase information.
The technical solution for realizing the purpose of the invention is as follows: a method for WaveNet-based generation of a bone conduction speech enhancement waveform, comprising the steps of:
step 1, constructing a BLSTM-based amplitude spectrum enhancement model and a WaveNet-based waveform generation model, and introducing an up-sampling module with a cross-sampling rate into the WaveNet-based waveform generation model;
step 2, train the BLSTM-based magnitude spectrum enhancement model and the WaveNet-based waveform generation model separately; the input of the BLSTM-based magnitude spectrum enhancement model is the bone conduction speech magnitude spectrum at the sampling rate s_low, and its output target is the air conduction speech magnitude spectrum at the sampling rate s_low; the input of the WaveNet-based waveform generation model is the phase information of the bone conduction speech and the magnitude spectrum of the air conduction speech at the sampling rate s_low, and its output target is the air conduction speech waveform at the sampling rate s_high; wherein s_low < s_high;
step 3, feed the bone conduction speech magnitude spectrum at the sampling rate s_low to be enhanced into the trained BLSTM-based magnitude spectrum enhancement model to obtain an enhanced magnitude spectrum, then feed the enhanced magnitude spectrum and the bone conduction speech phase information jointly into the trained WaveNet-based waveform generation model to obtain the enhanced speech waveform at the sampling rate s_high.
Further, the cross-sampling-rate upsampling module in step 1 specifically includes:
under the two sampling rates s_low and s_high, the framing window length and the window shift time of the speech features are kept consistent, so that the resolution of the frame-level features is 1/t_hop, where t_hop denotes the framing window shift time; meanwhile, linear interpolation is adopted as the up-sampling method.
Further, in the step 2, the bone conduction voice amplitude spectrum, the air conduction voice amplitude spectrum and the bone conduction voice phase information are obtained in the following manner:
step 2.1, respectively carrying out waveform amplitude normalization on the bone conduction voice x and the air conduction voice y to between [ -1,1] to obtain normalized bone conduction voice x 'and normalized air conduction voice y';
step 2.2, extracting the acoustic characteristics of the bone conduction voice and the air conduction voice to obtain a bone conduction voice amplitude spectrum MxAudio amplitude spectrum M of qi-conduction speechyBone conduction voice phase information, wherein the bone conduction voice phase information is bone conduction voice group delay characteristic GDx
Step 2.3, conducting voice amplitude spectrum M to the bonexAnd air conduction voice amplitude spectrum MyRespectively carrying out acoustic characteristic log extraction and MVN normalization processing to obtain normalized bone conduction voice amplitude spectrum M'xAnd a normalized air guide language magnitude spectrum M'y
Further, training the BLSTM-based magnitude spectrum enhancement model in step 2 specifically includes the following steps:
step 3.1, set the learning rate to η_B and the number of training iterations to N_B;
step 3.2, feed the normalized bone conduction speech magnitude spectrum M'_x into the BLSTM-based magnitude spectrum enhancement model to obtain the estimated magnitude spectrum M̂'_y;
step 3.3, update the BLSTM parameters θ_B according to the mean square error (MSE) loss:
θ_B ← θ_B − η_B · ∇_{θ_B} L_MSE(M̂'_y, M'_y)
where L_MSE(M̂'_y, M'_y) denotes the MSE loss between M̂'_y and M'_y, and θ_B denotes the parameters of the BLSTM-based magnitude spectrum enhancement model;
step 3.4, iterate steps 3.2-3.3 until the maximum number of iterations N_B is reached.
Further, in step 2, training the WaveNet-based waveform generation model specifically includes:
step 4.1, apply [0,1] normalization to the air conduction speech magnitude spectrum M_y and the bone conduction speech group delay feature GD_x respectively, obtaining the normalized air conduction speech magnitude spectrum M″_y and the normalized bone conduction speech group delay feature GD″_x;
step 4.2, form the joint condition feature H = concat[M″_y; GD″_x], i.e., the combination of the air conduction speech magnitude spectrum and the bone conduction speech group delay feature;
step 4.3, set the learning rate to η_W and the number of training iterations to N_W;
step 4.4, feed the joint condition feature H and the air conduction speech y into the WaveNet-based waveform generation model to obtain the estimated air conduction speech ŷ;
step 4.5, update the WaveNet parameters θ_W according to the cross-entropy loss:
θ_W ← θ_W − η_W · ∇_{θ_W} L_CE(ŷ, y)
where L_CE(ŷ, y) denotes the cross-entropy loss between ŷ and y, and θ_W denotes the parameters of the WaveNet-based waveform generation model;
step 4.6, iterate steps 4.4-4.5 until the result converges or the maximum number of iterations N_W is reached.
Further, in step 3, the bone conduction speech magnitude spectrum at the sampling rate s_low to be enhanced is fed into the trained BLSTM-based magnitude spectrum enhancement model to obtain the enhanced magnitude spectrum, and the enhanced magnitude spectrum and the bone conduction speech phase information are then fed jointly into the trained WaveNet-based waveform generation model to obtain the enhanced speech waveform at the sampling rate s_high, specifically:
step 5.1, normalize the waveform of the bone conduction speech x to be enhanced to [-1, 1], and extract the bone conduction speech magnitude spectrum M_x and the bone conduction speech group delay feature GD_x;
step 5.2, apply the logarithm and MVN processing to the bone conduction speech magnitude spectrum M_x to obtain the normalized bone conduction speech magnitude spectrum M'_x, and apply [0,1] normalization to the bone conduction speech group delay feature GD_x to obtain the normalized bone conduction speech group delay feature GD″_x;
step 5.3, feed the normalized bone conduction speech magnitude spectrum M'_x into the trained model Model_B and compute M̂'_y = Model_B(M'_x), obtaining the enhanced normalized magnitude spectrum M̂'_y, where Model_B denotes the BLSTM model with parameters θ_B;
step 5.4, apply inverse MVN normalization and the inverse logarithm operation to the enhanced normalized magnitude spectrum M̂'_y to obtain the estimated magnitude spectrum M̂_y;
step 5.5, apply [0,1] normalization to the estimated magnitude spectrum M̂_y to obtain M̂″_y;
step 5.6, form the joint condition feature H' = concat[M̂″_y; GD″_x];
step 5.7, feed H' into the trained model Model_W and compute ŷ = Model_W(H'), obtaining the enhanced waveform ŷ of the bone conduction speech, where Model_W denotes the WaveNet model with parameters θ_W.
Further, the bone conduction speech group delay feature GD_x in step 2.2 is extracted with the group delay function as follows:
the group delay function γ(ω) is defined as the negative gradient of the phase spectrum with respect to frequency:
γ(ω) = −dθ(ω)/dω
where ω denotes frequency and the phase spectrum θ(ω) is a continuous function of ω;
owing to the correlation between the short-time Fourier coefficients, the group delay function can also be calculated directly from the signal by the following formula:
γ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²
where the subscripts R and I denote the real and imaginary parts of the short-time Fourier transform, respectively, and X(ω) and Y(ω) denote the Fourier transforms of the signals x(n) and n·x(n).
Further, the output layer of the WaveNet-based waveform generation model is changed from a Softmax classifier to prediction of discretized logistic mixture distribution parameters:
the distribution v of speech waveform sample points is characterized by a mixture of K continuous logistic distributions:
v ~ Σ_{i=1}^{K} α_i · Logistic(μ_i, s_i)
where K denotes the total number of logistic distributions, α_i denotes the weight of the i-th logistic distribution, and (μ_i, s_i) denote the function parameters of the i-th logistic distribution; once the parameters of the logistic mixture are predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function.
Compared with the prior art, the invention has the following notable advantages: (1) it adopts a spectrum-expanding WaveNet model that fuses phase information, using the enhanced bone conduction speech magnitude spectrum and the bone conduction speech group delay feature at the low sampling rate jointly as the condition features of WaveNet, and directly obtains the enhanced high-sampling-rate time-domain speech waveform; (2) it effectively exploits the original bone conduction speech information, has a good spectral expansion capability, and markedly improves the quality of bone conduction speech.
Drawings
Fig. 1 is a flow chart of the method for generating the waveform of the bone conduction speech enhancement based on WaveNet according to the present invention.
Fig. 2 is a schematic structural diagram of a waveform generation model based on WaveNet in the present invention.
Fig. 3 is a sentence acoustic feature diagram of air conduction speech in the present invention, where (a) is an amplitude spectrogram, (b) is a phase spectrogram, and (c) is a group delay feature diagram of air conduction speech.
Fig. 4 is a schematic diagram of an upsampling module across sampling rates in the present invention.
Fig. 5 is a schematic diagram of the process of expanding convolution in the present invention.
Fig. 6 is a MOS score chart under different waveform synthesis methods in the embodiment of the present invention.
FIG. 7 is a speech amplitude spectrogram obtained by different waveform synthesis methods in the embodiment of the present invention, where (a) - (i) are speech amplitude spectrograms obtained by a waveform synthesis method of bone conduction speech, air conduction speech, IFFT, Griffin-Lim, Lws, E-WN-M + GD, E-WN-M, WN-M, WN-M + GD in this order.
Fig. 8 is a voice group delay characteristic diagram under different waveform synthesis methods in the embodiment of the present invention, where (a) - (f) are voice group delay characteristic diagrams under the waveform synthesis methods of bone conduction voice, air conduction voice, IFFT, Griffin-Lim, Lws, and E-WN-M + GD in sequence.
Detailed Description
In research fields such as speech denoising and speech synthesis, the waveform synthesis stage also faces problems such as mismatched phase information, missing phase spectrum information, and over-smoothed synthesized speech parameters. Researchers have generally adopted phase spectrum estimation algorithms such as document 5 (Griffin D, Lim J, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243). However, current waveform-modeling-based synthesis methods, such as those based on the WaveNet and SampleRNN models, can generate the speech waveform directly from given acoustic features by constructing special deep neural network structures and directly modeling the joint probability density of speech waveform sample points. Compared with estimating a phase spectrum, this direct waveform generation avoids the problems caused by the overlap-add effect or over-smoothed synthesis parameters in STFT-based methods, which has led to breakthrough progress in speech synthesis, with synthesized speech approaching human pronunciation in naturalness.
In order to obtain enhanced bone conduction speech of higher quality, the invention provides a high-quality bone conduction speech enhancement method based on WaveNet waveform modeling, which, building on document 7 (Changyan Zheng, Xiongwei Zhang, Meng Sun, Jibin Yang, Yibo Xing, "A Novel Throat Microphone Speech Enhancement Framework based on Deep BLSTM Recurrent Neural Networks," in Proc. IEEE International Conference on Computer and Communications (ICCC), 2018), further alleviates the loss of sound quality caused by the mismatch between the enhanced magnitude spectrum and the phase spectrum in the waveform synthesis stage.
The invention relates to a method for generating a bone conduction voice enhancement waveform based on WaveNet, which is generally shown in figure 1, wherein 8kHz and 16kHz in the figure represent voice sampling rates, and the method comprises a training stage and a testing stage:
Training stage: first, the acoustic features of the bone conduction speech and the air conduction speech are extracted, and the BLSTM-based magnitude spectrum enhancement model and the WaveNet-based waveform generation model are trained. When training the BLSTM-based magnitude spectrum enhancement model, the bone conduction speech magnitude spectrum at the 8 kHz sampling rate is used as input, and the corresponding air conduction speech magnitude spectrum at the 8 kHz sampling rate is used as the output target; when training the WaveNet-based waveform generation model, the air conduction speech magnitude spectrum at the 8 kHz sampling rate and the bone conduction speech group delay feature at the 8 kHz sampling rate are the inputs, and the air conduction speech waveform at the 16 kHz sampling rate is the output.
Enhancement stage: the features of the bone conduction speech at the 8 kHz sampling rate to be enhanced are extracted, the magnitude spectrum is fed into the trained BLSTM to obtain an enhanced magnitude spectrum, and the enhanced magnitude spectrum and the bone conduction speech group delay feature are then fed jointly into the trained WaveNet model to obtain enhanced speech at the 16 kHz sampling rate. The "communication transmission" marked in the figure means that, when speech is encoded and decoded for transmission, the magnitude spectrum enhancement model can be placed at the encoding end, the encoded enhanced magnitude spectrum information is transmitted over the communication channel, the WaveNet model is deployed at the decoding end, and the 16 kHz enhanced speech is obtained after decoding. Therefore, during communication transmission only low-sampling-rate speech features need to be transmitted, while high-sampling-rate enhanced speech is finally obtained; that is, the spectral expansion function is achieved without increasing the communication cost.
With reference to fig. 1, the method for generating a waveform for enhancing bone conduction speech based on WaveNet according to the present invention includes the following steps:
step 1, constructing a BLSTM-based amplitude spectrum enhancement model and a WaveNet-based waveform generation model, and introducing an up-sampling module with a cross-sampling rate into the WaveNet-based waveform generation model;
The cross-sampling-rate up-sampling module is specifically:
under the two sampling rates s_low and s_high, the framing window length and the window shift time of the speech features are kept consistent, so that the resolution of the frame-level features is 1/t_hop, where t_hop denotes the framing window shift time; meanwhile, linear interpolation is adopted as the up-sampling method.
The output layer of the WaveNet-based waveform generation model is changed from a Softmax classifier to prediction of discretized logistic mixture distribution parameters, and the distribution v of speech waveform sample points is characterized by a mixture of K continuous logistic distributions:
v ~ Σ_{i=1}^{K} α_i · Logistic(μ_i, s_i)
where K denotes the total number of logistic distributions, α_i denotes the weight of the i-th logistic distribution, and (μ_i, s_i) denote the function parameters of the i-th logistic distribution; once the parameters of the logistic mixture are predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function.
Step 2, train the BLSTM-based magnitude spectrum enhancement model and the WaveNet-based waveform generation model separately. The input of the BLSTM-based magnitude spectrum enhancement model is the bone conduction speech magnitude spectrum at the sampling rate s_low, and its output target is the air conduction speech magnitude spectrum at the sampling rate s_low; the input of the WaveNet-based waveform generation model is the phase information of the bone conduction speech and the magnitude spectrum of the air conduction speech at the sampling rate s_low, and its output target is the air conduction speech waveform at the sampling rate s_high, where s_low < s_high.
Further, in the step 2, the bone conduction voice amplitude spectrum, the air conduction voice amplitude spectrum and the bone conduction voice phase information are obtained in the following manner:
step 2.1, respectively carrying out waveform amplitude normalization on the bone conduction voice x and the air conduction voice y to between [ -1,1] to obtain normalized bone conduction voice x 'and normalized air conduction voice y';
step 2.2, extracting the acoustic characteristics of the bone conduction voice and the air conduction voice to obtain a bone conduction voice amplitude spectrum MxAudio amplitude spectrum M of qi-conduction speechyBone conduction voice phase information, wherein the bone conduction voice phase information is bone conduction voice group delay characteristic GDx(ii) a The bone conduction voice group delay feature GDxThe group delay function extraction is as follows:
the group delay function γ (ω) is defined as the negative gradient of the phase spectrum over frequency:
Figure BDA0002116106910000082
wherein ω represents frequency and the phase spectrum θ (ω) is a continuous function of ω;
due to the correlation between the short-time fourier coefficients, the group delay function is calculated from the signal by the following formula:
in the formula, subscripts R and I denote real and imaginary parts of the short-time fourier transform, respectively, and X (ω) and Y (ω) denote fourier transforms of signals X (n) and nx (n).
Step 2.3, conducting voice amplitude spectrum M to the bonexAnd air conduction voice amplitude spectrum MyRespectively carrying out acoustic characteristic log extraction and MVN (mean and Variance normalization) normalization processing to obtain normalized bone conduction voice amplitude spectrum M'xAnd a normalized air guide language magnitude spectrum M'y
Further, training the BLSTM-based magnitude spectrum enhancement model in step 2 specifically includes the following steps:
step 3.1, set the learning rate to η_B and the number of training iterations to N_B;
step 3.2, feed the normalized bone conduction speech magnitude spectrum M'_x into the BLSTM-based magnitude spectrum enhancement model to obtain the estimated magnitude spectrum M̂'_y;
step 3.3, update the BLSTM parameters θ_B according to the mean square error (MSE) loss:
θ_B ← θ_B − η_B · ∇_{θ_B} L_MSE(M̂'_y, M'_y)
where L_MSE(M̂'_y, M'_y) denotes the MSE loss between M̂'_y and M'_y, and θ_B denotes the parameters of the BLSTM-based magnitude spectrum enhancement model;
step 3.4, iterate steps 3.2-3.3 until the maximum number of iterations N_B is reached.
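For illustration, a minimal Python (PyTorch) sketch of the training loop of steps 3.1-3.4 is given below; the model class, the data loader and the tensor shapes are hypothetical placeholders, and the learning rate and iteration count reuse the embodiment values (0.001, 20 iterations) only as assumed defaults.

```python
# Sketch (not the patent's reference code): BLSTM magnitude-spectrum training, steps 3.1-3.4.
# `model` and `loader` are hypothetical placeholders.
import torch
import torch.nn as nn

def train_blstm(model: nn.Module, loader, eta_B: float = 1e-3, N_B: int = 20):
    opt = torch.optim.Adam(model.parameters(), lr=eta_B)      # step 3.1
    mse = nn.MSELoss()
    for epoch in range(N_B):                                  # step 3.4: iterate to N_B
        for M_x, M_y in loader:     # normalized bone/air magnitude spectra, e.g. (batch, frames, 129)
            M_y_hat = model(M_x)    # step 3.2: estimated magnitude spectrum
            loss = mse(M_y_hat, M_y)        # step 3.3: MSE loss
            opt.zero_grad()
            loss.backward()
            opt.step()              # gradient update of the parameters theta_B
    return model
```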
Further, in step 2, training the WaveNet-based waveform generation model specifically includes:
step 4.1, apply [0,1] normalization to the air conduction speech magnitude spectrum M_y and the bone conduction speech group delay feature GD_x respectively, obtaining the normalized air conduction speech magnitude spectrum M″_y and the normalized bone conduction speech group delay feature GD″_x;
step 4.2, form the joint condition feature H = concat[M″_y; GD″_x], i.e., the combination of the air conduction speech magnitude spectrum and the bone conduction speech group delay feature;
step 4.3, set the learning rate to η_W and the number of training iterations to N_W;
step 4.4, feed the joint condition feature H and the air conduction speech y into the WaveNet-based waveform generation model to obtain the estimated air conduction speech ŷ;
step 4.5, update the WaveNet parameters θ_W according to the cross-entropy loss:
θ_W ← θ_W − η_W · ∇_{θ_W} L_CE(ŷ, y)
where L_CE(ŷ, y) denotes the cross-entropy loss between ŷ and y, and θ_W denotes the parameters of the WaveNet-based waveform generation model;
step 4.6, iterate steps 4.4-4.5 until the result converges or the maximum number of iterations N_W is reached.
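A corresponding sketch of steps 4.1-4.6 follows; the `wavenet` model, its conditioning interface, the loss helper and the data shapes are hypothetical assumptions, and with the discretized logistic output layer described later the cross-entropy term would be replaced by the mixture negative log-likelihood.

```python
# Sketch under assumed shapes: WaveNet training with joint condition H = concat[M''_y; GD''_x].
import torch

def minmax01(f: torch.Tensor) -> torch.Tensor:
    """[0,1] normalization of a feature matrix (step 4.1)."""
    return (f - f.min()) / (f.max() - f.min() + 1e-8)

def train_wavenet(wavenet, loader, waveform_loss, eta_W=1e-4, N_W=50):
    opt = torch.optim.Adam(wavenet.parameters(), lr=eta_W)    # step 4.3 (warm-up schedule omitted)
    for epoch in range(N_W):                                  # step 4.6
        for M_y, GD_x, y in loader:   # magnitude (B,T,F), group delay (B,T,F), waveform (B,L)
            H = torch.cat([minmax01(M_y), minmax01(GD_x)], dim=-1)   # step 4.2: joint condition
            y_hat = wavenet(y, cond=H)                        # step 4.4 (hypothetical call signature)
            loss = waveform_loss(y_hat, y)                    # step 4.5: cross-entropy / NLL on samples
            opt.zero_grad(); loss.backward(); opt.step()
    return wavenet
```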
Step 3, the bone conduction speech magnitude spectrum at the sampling rate s_low to be enhanced is fed into the trained BLSTM-based magnitude spectrum enhancement model to obtain the enhanced magnitude spectrum, and the enhanced magnitude spectrum and the bone conduction speech phase information are then fed jointly into the trained WaveNet-based waveform generation model to obtain the enhanced speech waveform at the sampling rate s_high, specifically:
step 5.1, normalize the waveform of the bone conduction speech x to be enhanced to [-1, 1], and extract the bone conduction speech magnitude spectrum M_x and the bone conduction speech group delay feature GD_x;
step 5.2, apply the logarithm and MVN processing to the bone conduction speech magnitude spectrum M_x to obtain the normalized bone conduction speech magnitude spectrum M'_x, and apply [0,1] normalization to the bone conduction speech group delay feature GD_x to obtain the normalized bone conduction speech group delay feature GD″_x;
step 5.3, feed the normalized bone conduction speech magnitude spectrum M'_x into the trained model Model_B and compute M̂'_y = Model_B(M'_x), obtaining the enhanced normalized magnitude spectrum M̂'_y, where Model_B denotes the BLSTM model with parameters θ_B;
step 5.4, apply inverse MVN normalization and the inverse logarithm operation to the enhanced normalized magnitude spectrum M̂'_y to obtain the estimated magnitude spectrum M̂_y;
step 5.5, apply [0,1] normalization to the estimated magnitude spectrum M̂_y to obtain M̂″_y;
step 5.6, form the joint condition feature H' = concat[M̂″_y; GD″_x];
step 5.7, feed H' into the trained model Model_W and compute ŷ = Model_W(H'), obtaining the enhanced waveform ŷ of the bone conduction speech, where Model_W denotes the WaveNet model with parameters θ_W.
Examples
WaveNet based waveform generation is described in detail below:
WaveNet is a fully probabilistic autoregressive generative model. By constructing a special deep convolutional neural network structure, it models the speech waveform directly at the sample level; additional input conditions are usually provided to guide it to generate speech waveforms with specific properties.
Let the speech waveform sequence be x = {x_1, ···, x_T}. Its joint probability density under the condition feature λ can then be expressed as the product of the following conditional probabilities:
p(x|λ) = ∏_{t=1}^{T} p(x_t | x_1, ···, x_{t−1}, λ)   (1)
Like PixelCNN, WaveNet computes the probability distribution of formula (1) by stacking carefully designed convolutional layers, and it uses a deep residual network and parameterized skip connections to build a deeper network structure and achieve fast convergence of the model.
The method realizes generation of the bone conduction voice enhancement waveform based on a WaveNet waveform modeling method, and the constructed WaveNet is specifically shown in figure 2.
Firstly, group delay characteristics:
the phase spectrum of speech is often difficult to use efficiently because during the short-time fourier coefficient computation the phase spectrum is warped (warping) to within-pi values, and thus it is difficult to perceive signal features as intuitive as the amplitude spectrum. As shown in fig. 3(a) and 3(b), the amplitude spectrum and the phase spectrum of the air conduction speech are respectively, the amplitude spectrum can show a clear harmonic structure and a formant structure of a deep red part, and the phase spectrum has no obvious structural features.
Therefore, in order to effectively use the phase spectrum information of the bone conduction speech, it is necessary to process the phase spectrum. The need for meaningful utilization of the phase spectrum by existing methods involves the phase non-unique unwrapping process or design function to extract useful information from the phase signal, which i choose to extract the phase information in the manner of the design function since the non-unique unwrapping process is usually accompanied by a loss of information. The existing related functions comprise Instantaneous Frequency (IF) and group delay function extraction, and the like, but the group delay function is effectively applied to extracting various source and system parameters at present and is well utilized in speech signal processing, for example, the group delay function is used as a supplementary feature of the traditional MFCC feature to effectively improve the accuracy of speech recognition.
The group delay function is defined as the negative gradient of the phase spectrum with respect to frequency:
γ(ω) = −dθ(ω)/dω   (2)
where ω denotes frequency and the phase spectrum θ(ω) is defined as a continuous function of ω. Owing to the correlation between the short-time Fourier coefficients, the group delay function can also be calculated directly from the signal by the following formula:
γ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²   (3)
where the subscripts R and I denote the real and imaginary parts of the short-time Fourier transform, respectively, and X(ω) and Y(ω) denote the Fourier transforms of the signals x(n) and n·x(n).
Fig. 3(c) shows the group delay characteristics of air conduction speech, which can be seen to have very similar structural features to the amplitude spectrum. Since the phase spectrum is invertible by the conversion of the group delay function, the group delay profile contains the complete phase information.
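Formula (3) can be evaluated frame by frame from the STFTs of x(n) and n·x(n); the NumPy sketch below follows that definition, using the embodiment's 32 ms window and 8 ms hop at 8 kHz only as assumed defaults.

```python
# Sketch: group delay feature from formula (3), gamma = (X_R*Y_R + X_I*Y_I) / |X|^2.
import numpy as np

def group_delay(x, n_fft=256, win=256, hop=64):
    window = np.hanning(win)
    gd = []
    for start in range(0, len(x) - win, hop):
        seg = x[start:start + win] * window
        ramped = np.arange(win) * seg                 # n * x(n), n is the within-frame index
        X = np.fft.rfft(seg, n_fft)
        Y = np.fft.rfft(ramped, n_fft)
        gd.append((X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-10))
    return np.asarray(gd)                              # shape (num_frames, n_fft // 2 + 1)
```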
Secondly, the cross-sampling-rate up-sampling module:
The temporal resolution of the local condition features is much lower than that of the speech waveform signal. The former are frame-level features obtained by overlapped framing of the speech, so their resolution is 1/t_hop, where t_hop is the framing window shift (hop) time; the speech waveform, on the other hand, is a sequence of sampled waveform points whose resolution equals the sampling rate s, i.e., t_hop·s times the resolution of the condition features. The condition features therefore need to be up-sampled in the conditional WaveNet so that the resolutions of the two signals are made consistent.
Since the method of the invention generates the waveform signal of speech at the high sampling rate s_high from condition features at the low sampling rate s_low, the up-sampling process must handle not only the resolution mismatch between the condition features and the waveform points but also the sampling-rate mismatch between them. For this purpose, the framing window length and window shift time of the speech are kept consistent under the two sampling rates, so that the resolution of the frame-level features is 1/t_hop in both cases. Meanwhile, to keep the condition features given to adjacent waveform points distinguishable, linear interpolation is used for up-sampling, as shown in Fig. 4.
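A minimal sketch of the linear-interpolation up-sampling across sampling rates follows: each frame-level condition vector is interpolated to one vector per high-rate waveform sample, so that a feature stream computed at s_low (resolution 1/t_hop) can condition a waveform generated at s_high. Function name, shapes and default values are assumptions.

```python
# Sketch: linear interpolation of frame-level condition features to waveform resolution.
import numpy as np

def upsample_condition(cond, t_hop=0.008, s_high=16000):
    """cond: (num_frames, dim) frame-level features; returns (num_samples, dim)."""
    num_frames, dim = cond.shape
    samples_per_frame = int(round(t_hop * s_high))        # e.g. 0.008 s * 16 kHz = 128
    num_samples = num_frames * samples_per_frame
    frame_times = np.arange(num_frames)                   # frame-level time axis
    sample_times = np.arange(num_samples) / samples_per_frame
    out = np.empty((num_samples, dim), dtype=cond.dtype)
    for d in range(dim):                                  # linear interpolation per dimension
        out[:, d] = np.interp(sample_times, frame_times, cond[:, d])
    return out
```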
Thirdly, the dilated convolution block:
Because of the huge data dimensionality of speech waveform points, even modeling short-range context requires the network to have a large receptive field. To solve this problem, WaveNet uses dilated ("atrous") convolution to enlarge the receptive field. A dilated convolution is similar to a large filter padded with zeros; by increasing the dilation factor, the receptive field can grow exponentially with the number of network layers. As shown in Fig. 5, the dilation factors are 1, 2, 4, ..., respectively. WaveNet doubles the dilation factor layer by layer; once it reaches a certain value, the layers form a convolution block, i.e., the dilated convolution block shown in Fig. 2, and the receptive field is then increased further by stacking such blocks.
To alleviate the degradation problem of deep networks, the network uses residual connections: the input of a convolutional layer is added to its output, so that the layer only has to fit the residual information, which is easier to learn, and a deeper network can thus be constructed. Skip connections combine the information of all convolutional layers to predict the final distribution, i.e., different information from each layer of the deep network is fused to support a more accurate prediction. Both kinds of connections can be seen in the dilated convolution block of Fig. 2.
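The exponential growth of the receptive field described above can be checked with a small calculation; under the configuration assumed from the embodiment (2 dilated convolution blocks of 8 layers each, dilation doubling from 1 to 128, kernel size 3), the stack already covers on the order of a thousand waveform samples.

```python
# Sketch: receptive-field growth of stacked dilated convolutions (assumed kernel size 3).
def receptive_field(num_blocks=2, layers_per_block=8, kernel=3):
    field = 1
    for _ in range(num_blocks):
        for layer in range(layers_per_block):
            dilation = 2 ** layer                 # 1, 2, 4, ..., 128 within each block
            field += (kernel - 1) * dilation      # each layer widens the field by (k - 1) * d
    return field

print(receptive_field())   # -> 1021 samples, before counting the initial causal convolution
```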
Fourthly, the gated activation unit:
WaveNet uses a gated activation unit (GAU) similar to that of PixelCNN; as shown in the dilated convolution block of Fig. 2, the condition feature is incorporated through the gated activation unit:
z = tanh(W_{f,k} * x + V_{f,k} * λ) ⊙ σ(W_{g,k} * x + V_{g,k} * λ)   (4)
where * denotes the convolution operation, ⊙ denotes the element-wise product, σ(·) denotes the sigmoid function, k is the layer index, f and g denote the filter and the gate, respectively, W and V are learnable convolution filters, x denotes the speech waveform input, λ denotes the condition feature, and z is the output obtained through the GAU. This non-linear gated activation unit is similar to the various gates in a BLSTM; its performance is far superior to non-linear functions such as the rectified linear unit, and it is an important reason that WaveNet can model speech signals.
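The following PyTorch sketch reads formula (4) as one conditional dilated convolution layer with residual and skip outputs; the channel sizes (256 residual channels, a 258-dimensional condition obtained by concatenating the 129-dimensional magnitude and group delay features) are assumptions taken loosely from the embodiment, not a reference implementation.

```python
# Sketch of the gated activation unit z = tanh(W*x + V*lam) (*) sigmoid(W_g*x + V_g*lam).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualLayer(nn.Module):
    def __init__(self, channels=256, cond_dim=258, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation                   # left-only padding keeps causality
        self.filt = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.cond_filt = nn.Conv1d(cond_dim, channels, 1)    # V_{f,k}
        self.cond_gate = nn.Conv1d(cond_dim, channels, 1)    # V_{g,k}
        self.res = nn.Conv1d(channels, channels, 1)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x, lam):
        """x: (B, channels, L) waveform features; lam: (B, cond_dim, L) up-sampled conditions."""
        xp = F.pad(x, (self.pad, 0))                         # causal padding on the time axis
        z = torch.tanh(self.filt(xp) + self.cond_filt(lam)) * \
            torch.sigmoid(self.gate(xp) + self.cond_gate(lam))   # formula (4)
        return x + self.res(z), self.skip(z)                 # residual output, skip output
```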
The condition λ can be fed into WaveNet as either a global or a local feature. Global condition features affect the output distribution over the whole time axis, e.g., the speaker's gender in a TTS model, characterizing inherent properties of the speaker; local condition features only affect the output distribution over a local portion of the time axis, e.g., the magnitude spectrum, the fundamental frequency and text features. The model of the invention uses local condition features.
Fifthly, the discretized logistic mixture distribution:
The original WaveNet output layer uses a Softmax-based classifier and normally quantizes the audio signal to 8 bits with the μ-law, so that only 256 classes need to be predicted; this makes the modeling feasible but sacrifices the quality of the original 16-bit audio. If 16-bit quantization were used to characterize the amplitude of each sample point, the Softmax classifier would have to predict 65536 values, which is difficult to model. Moreover, Softmax treats the quantization levels as independent categories and cannot reflect the correlation between them: for example, the value 128 is actually close to the values 127 and 129 and far from the value 1, so the intrinsic correlation of the data is destroyed, which affects the accuracy of waveform prediction.
Following PixelCNN++, the output of the WaveNet model is changed from a Softmax classifier to prediction of discretized logistic mixture distribution parameters. The core principle is that the distribution v of speech waveform sample points is characterized by a mixture of K continuous logistic distributions:
v ~ Σ_{i=1}^{K} α_i · Logistic(μ_i, s_i)   (5)
where K denotes the total number of logistic distributions, α_i denotes the weight of the i-th logistic distribution, and (μ_i, s_i) denote the function parameters of the i-th logistic distribution. Once the parameters of the logistic mixture are predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function. The discretized logistic mixture can approximate a continuous distribution, agrees better with the actual distribution of the original audio data, and offers high data precision, so the waveform generation accuracy is greatly improved. Under this distributional assumption, the predicted outputs of the neural network are the parameters α_i, μ_i and s_i. Experiments show that 10 logistic mixture components can characterize the waveform-point distribution well, so the neural network only needs to predict 30 values; compared with predicting 256 categories, the memory overhead is reduced.
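The sampling step of formula (5) can be implemented by first drawing a mixture component according to the weights α_i and then drawing from the chosen logistic distribution via inverse-transform sampling; the sketch below assumes the network outputs 3K values per sample point (logit weights, means, log scales) with K = 10.

```python
# Sketch: sampling one waveform value from a predicted logistic mixture of K components.
import numpy as np

def sample_logistic_mixture(logit_w, mu, log_s, rng=np.random.default_rng()):
    """logit_w, mu, log_s: arrays of shape (K,) predicted by the network for one sample point."""
    w = np.exp(logit_w - logit_w.max())
    w = w / w.sum()                              # mixture weights alpha_i
    i = rng.choice(len(w), p=w)                  # pick a component
    u = rng.uniform(1e-5, 1.0 - 1e-5)
    x = mu[i] + np.exp(log_s[i]) * (np.log(u) - np.log(1.0 - u))   # inverse logistic CDF
    return np.clip(x, -1.0, 1.0)                 # keep within the waveform amplitude range
```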
Sixthly, experimental data and evaluation indexes are as follows:
at present, no bone conduction speech database is publicly available at home and abroad, and for this purpose, a parallel speech database of a certain type of bone conduction microphone and a traditional air conduction microphone is manufactured, and the corpus is derived from newspapers, networks and some artificially constructed phoneme balance sentences. When recording in a sound darkroom, a speaker wears two microphones at the same time, the two microphones are recorded by Cooledit software, the sampling rate is set to be 32kHz, and the precision is set to be 16bit quantization. Each speaker records 200 sentences, and the average time length of each sentence is about 3-4 s. And randomly selecting 160 sentences of each speaker as a training set, and the remaining 40 sentences as a test set, wherein the training set and the test set do not contain repeated corpora.
In the embodiment, voice data of 3 boys and 3 girls are selected to perform speaker-dependent bone conduction voice enhancement experiments, that is, data of each speaker is trained and tested respectively.
To evaluate speech quality, this embodiment selects the log-spectral distance (LSD) and the short-time objective intelligibility (STOI) as objective evaluation indexes, and adopts the mean opinion score (MOS) as the subjective evaluation index. The LSD index measures the short-time power spectrum difference between the enhanced speech obtained by different methods and the corresponding clean air conduction speech; a smaller value indicates smaller spectral distortion of the enhanced speech. STOI is an index of speech intelligibility with a score in [0, 1]; a higher value indicates higher intelligibility of the enhanced speech. A total of 10 listeners participated in the subjective MOS scoring of speech quality; 20 test sentences were randomly selected for each enhancement method, the listeners rated the quality of the test signals on a 5-point scale with a scoring interval of 0.5, and the final MOS result is the average score over all listeners.
Setting parameters:
(1) feature extraction
All speech data are first down-sampled to 8 kHz and 16 kHz and stored separately. During feature extraction, the framing window length is set to 32 ms and the frame shift to 8 ms under both sampling rates, a Hanning window is selected as the window function, and short-time Fourier analysis is performed on the 8 kHz and 16 kHz speech with 256-point and 512-point STFTs respectively, extracting the magnitude spectrum and group delay feature of the speech; this yields 129-dimensional and 257-dimensional speech features at the 8 kHz and 16 kHz sampling rates, respectively. Experiments show that applying μ-law quantization to the speech waveform data makes no difference compared with directly using the 16-bit raw data, so the raw waveform data are used directly in this embodiment.
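The feature extraction described above can be sketched as follows, with librosa-style STFT parameters (32 ms Hann window, 8 ms hop, 256-point FFT at 8 kHz or 512-point FFT at 16 kHz); the MVN statistics are assumed to be computed over the training set, and the group delay helper refers to the earlier sketch.

```python
# Sketch: magnitude-spectrum extraction and normalization for one utterance (assumed helpers).
import numpy as np
import librosa

def extract_magnitude(x, sr=8000):
    n_fft = 256 if sr == 8000 else 512           # 256-point STFT at 8 kHz, 512-point at 16 kHz
    hop = int(0.008 * sr)                        # 8 ms frame shift
    win = int(0.032 * sr)                        # 32 ms frame length, Hanning window
    S = librosa.stft(x, n_fft=n_fft, hop_length=hop, win_length=win, window='hann')
    return np.abs(S).T                           # (num_frames, n_fft // 2 + 1): 129- or 257-dim

def log_mvn(M, mean=None, std=None):
    """Log magnitude + mean-variance normalization; statistics come from the training set."""
    logM = np.log(M + 1e-8)
    if mean is None:
        mean, std = logM.mean(axis=0), logM.std(axis=0) + 1e-8
    return (logM - mean) / std, mean, std
```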
(2) BLSTM network setup
The magnitude spectrum estimation module of the bone conduction speech enhancement system comprises 3 BLSTM hidden layers in total, each with 512 hidden nodes and a ReLU activation function, and batch normalization is applied to the data after each BLSTM layer. To stabilize the training process, the actual inputs and outputs of the model are log-magnitude spectra processed by mean and variance normalization (MVN). During training, the mini-batch size is set to 8, the Adam optimizer is used, the initial learning rate is set to 0.001, and the number of iterations is 20.
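Under the settings above, the magnitude spectrum enhancement network can be sketched in PyTorch as three bidirectional LSTM layers of 512 units with batch normalization between layers and a linear projection back to the 129-dimensional spectrum; the exact layer ordering and activation placement are assumptions where the text leaves them open.

```python
# Sketch (assumed details): 3-layer BLSTM magnitude-spectrum enhancement model.
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    def __init__(self, feat_dim=129, hidden=512, num_layers=3):
        super().__init__()
        self.lstms = nn.ModuleList()
        self.norms = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            self.lstms.append(nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True))
            self.norms.append(nn.BatchNorm1d(2 * hidden))    # batch norm after each BLSTM layer
            in_dim = 2 * hidden
        self.proj = nn.Linear(2 * hidden, feat_dim)          # back to the magnitude-spectrum dimension

    def forward(self, M_x):                                  # M_x: (batch, frames, feat_dim)
        h = M_x
        for lstm, bn in zip(self.lstms, self.norms):
            h, _ = lstm(h)
            h = torch.relu(h)
            h = bn(h.transpose(1, 2)).transpose(1, 2)        # BatchNorm1d expects (B, C, T)
        return self.proj(h)
```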
(3) WaveNet network setup
In the waveform generation module of the bone conduction speech enhancement system, the overall structure of the WaveNet model contains 2 dilated convolution blocks, each comprising 8 network layers; together with one causal convolution layer and 2 feature-mapping convolution layers, the whole network comprises 19 convolution layers. The initial convolution kernel size of the causal convolution layer and the dilated convolution blocks is set to 3 x 1, the number of convolution kernels of the gate channel and the residual channel is set to 256, the mini-batch size is set to 8, and the Adam optimizer is used. Because the WaveNet model structure is deep, a warm-up learning-rate schedule is adopted to ensure training stability, with the initial learning rate set to 0.0001 and the maximum learning rate set to 0.002. To keep the magnitude spectrum close to the range of the group delay condition feature and thus ease training, the condition features fed to the network are normalized to [0, 1].
Analysis of the experimental results:
waveform synthesis quality comparison:
since WaveNet of this embodiment incorporates phase information and has a spectrum spreading function, WaveNet, which can be called as blended phase information, is denoted as E-WN-M + gd (extension WaveNet condition on magnetic connected with Group Delay information). To verify the effectiveness of the method provided by the present invention, the method is compared with a waveform synthesis method (denoted as Lws) using an inverse fourier transform method (denoted as IFFT) of a bone conduction speech phase spectrum, a waveform synthesis method (denoted as Griffin-Lim) based on Griffin-Lim phase spectrum estimation, and a waveform synthesis method (denoted as Lws) based on local weight and (localwight summs) initialized fast phase spectrum estimation, wherein speech characteristics at a sampling rate of 16kHz are input in the IFFT, Griffin-Lim, and Lws methods.
The Griffin-Lim algorithm is based on the consistency of the short-time Fourier spectrum; its core idea is to estimate a real-valued signal whose STFT magnitude is closest, in the least-squares sense, to the given magnitude spectrum, and the waveform estimation is realized through an iterative algorithm. Lws is an improved version of Griffin-Lim; its core idea is to make the short-time Fourier spectrum, i.e., the complex-valued coefficients, consistent through phase estimation, so that the continuity between time-frequency points can be taken into account directly, and to initialize the phase according to the correlation between time-frequency points, making the waveform estimation faster and more accurate.
Fig. 6 shows MOS scores of different waveform synthesis methods, table 1 shows objective evaluation index scores of different waveform synthesis methods, fig. 7 shows speech amplitude spectra obtained by different waveform synthesis methods, and fig. 8 shows group delay characteristics obtained by different waveform synthesis methods.
TABLE 1 Objective index Performance under different waveform Synthesis methods
As can be seen from Table 1, among the four synthesis methods, the STOI and LSD scores of IFFT are closest to those of the method provided by the invention. In the MOS scores of Fig. 6, however, the method provided by the invention is clearly superior to the other three synthesis methods and exceeds IFFT, the second best, by about 0.25. This is consistent with most current research results on WaveNet-based waveform generation: conventional waveform synthesis methods usually take a short-time-spectrum-like distance as the algorithm optimization index, and the existing objective speech quality indexes likewise measure differences related to the short-time spectrum, whereas the optimization target of WaveNet is the speech waveform itself, so WaveNet does not show a clear advantage on the objective indexes. Overall, compared with the MOS score of bone conduction speech, the method provided by the invention gains about 1.2 points, an improvement of about 54.5 percent, which fully illustrates its effectiveness.
FIG. 7 shows a speech amplitude spectrogram under different waveform synthesis methods in the embodiment of the present invention, where (a) - (i) are speech amplitude spectrograms under the waveform synthesis methods of bone conduction speech, air conduction speech, IFFT, Griffin-Lim, Lws, E-WN-M + GD, E-WN-M, WN-M, WN-M + GD in sequence. Fig. 8 shows a speech group delay characteristic diagram under different waveform synthesis methods in an embodiment of the present invention, where (a) - (f) are speech group delay characteristic diagrams under the waveform synthesis methods of bone conduction speech, air conduction speech, IFFT, Griffin-Lim, Lws, E-WN-M + GD in sequence.
Comparing Fig. 7(a) with Fig. 7(b) and Fig. 8(a) with Fig. 8(b), it can be seen that bone conduction speech loses a large amount of high-frequency components in both the magnitude spectrum and the group delay feature, and its low-frequency harmonic structure is noticeably stronger than that of air conduction speech, which is why bone conduction speech sounds dull and unclear.
As can be seen from Fig. 7(d) and Fig. 8(d), the short-time spectrum obtained by the Griffin-Lim method is very clean, but its recovery of the harmonic structure and high-frequency components is poorer than that of the other methods. This is because the Griffin-Lim algorithm depends heavily on the quality of the given magnitude spectrum; since bone conduction speech loses a lot of information, the magnitude spectrum estimated by the existing BLSTM model still differs somewhat from the target spectrum, so the Griffin-Lim algorithm has difficulty achieving good speech quality.
As can be seen from Fig. 7(e) and Fig. 8(e), the Lws method yields a very clear spectral structure, which may be why it performs best on the LSD index in Table 1: the Lws algorithm focuses on the consistency of the complex-valued coefficients, which corresponds exactly to the structure of the short-time spectrum. However, as the oval and circular boxes in Fig. 7(b) and Fig. 7(e) show, the consonants generated by Lws suffer from serious over-smoothing, and the inferred harmonic structure does not correspond to the true harmonic structure, so the speech contains very noticeable echo and trailing artificial mechanical sounds; the MOS score of Lws is therefore clearly lower than those of the other three methods.
As can be seen from Table 1, the IFFT method that directly uses the bone conduction speech phase is superior, in terms of the STOI objective index and the MOS score, to Griffin-Lim and Lws, which are phase-estimation-based waveform synthesis methods. This is because the magnitude spectrum estimated by the existing BLSTM is not ideal, so phase estimation algorithms that depend heavily on the quality of the given magnitude spectrum perform poorly, whereas the phase spectra of bone conduction speech and the corresponding air conduction speech, as shown in Fig. 8(a) and Fig. 8(b), are highly similar in the low-frequency part, so directly using the phase spectrum of bone conduction speech can produce better speech quality. However, as the rectangular box in Fig. 7(c) shows, IFFT has difficulty recovering the high-frequency harmonic structure compared with the other methods, and this is caused by the mismatch of the original bone conduction speech phase spectrum: the magnitude spectrum obtained by the Griffin-Lim method in Fig. 7(d) can roughly be regarded as the magnitude spectrum produced by the BLSTM, in which the harmonic structure of the high-frequency part is still clearly visible, whereas after the IFFT the harmonic structure of the magnitude spectrum is destroyed, demonstrating the damage caused by the mismatched phase spectrum.
As shown by the rectangular and oval boxes in FIG. 7(f) and the oval box in FIG. 8(f), the spectral components obtained by the method of the present invention are closer to those of the corresponding air conduction speech than those of the other waveform synthesis methods. Moreover, the conditioning input of WaveNet in the method of the present invention consists of speech features at 8 kHz, whose spectral components cover 0 to 4 kHz, while FIG. 7(f) and FIG. 8(f) show that the finally generated speech spectrum contains frequency components above 4 kHz, which demonstrates that the WaveNet of the present invention has a good spectrum extension capability.
In summary, the invention provides a WaveNet-based waveform generation method for bone conduction speech enhancement, namely a spectrum extension WaveNet model that fuses phase information: the enhanced bone conduction speech amplitude spectrum and the bone conduction speech group delay features at the low sampling rate are both used as conditioning features of WaveNet, so that the enhanced time-domain speech waveform at the high sampling rate is obtained directly. The model can effectively exploit the original bone conduction speech information and has a good spectrum extension capability. The experimental results show that, compared with methods that use the original bone conduction speech phase or estimate the phase spectrum based on Griffin-Lim, the method of the invention markedly improves the speech quality.

Claims (8)

1. A WaveNet-based bone conduction voice enhancement waveform generation method, characterized by comprising the following steps:
step 1, constructing a BLSTM-based amplitude spectrum enhancement model and a WaveNet-based waveform generation model, and introducing a cross-sampling-rate up-sampling module into the WaveNet-based waveform generation model;
step 2, respectively training the BLSTM-based amplitude spectrum enhancement model and the WaveNet-based waveform generation model, wherein the input of the BLSTM-based amplitude spectrum enhancement model is the bone conduction voice amplitude spectrum at the sampling rate s_low and its output target is the air conduction voice amplitude spectrum at the sampling rate s_low; the input of the WaveNet-based waveform generation model is the phase information of the bone conduction voice and the amplitude spectrum of the air conduction voice at the sampling rate s_low, and its output target is the air conduction voice waveform at the sampling rate s_high; wherein s_low < s_high;
step 3, sending the bone conduction voice amplitude spectrum at the sampling rate s_low to be enhanced into the trained BLSTM-based amplitude spectrum enhancement model to obtain an enhanced amplitude spectrum, and then sending the enhanced amplitude spectrum and the bone conduction voice phase information into the trained WaveNet-based waveform generation model to obtain the enhanced voice waveform at the sampling rate s_high.
2. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 1, wherein the cross-sampling-rate up-sampling module introduced in step 1 is specifically as follows:
the framing window length and the framing window shift time of the voice features are kept consistent under the two sampling rates s_low and s_high, so that the resolution of the frame-level features is 1/t_hop, where t_hop denotes the framing window shift time; meanwhile, linear interpolation is adopted as the up-sampling method.
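By way of illustration only, the following Python sketch shows one way to realize the linear-interpolation up-sampling of claim 2, stretching frame-level condition features (resolution 1/t_hop) to the sample level of the target rate s_high. The function name, array shapes and example values are illustrative assumptions, not part of the claim.

import numpy as np

def upsample_conditions(frame_feats, t_hop, s_high):
    # frame_feats: (n_frames, dim) features with frame shift t_hop seconds,
    # so their time resolution is 1 / t_hop; s_high is the target rate in Hz.
    n_frames, dim = frame_feats.shape
    frame_times = np.arange(n_frames) * t_hop            # time stamps of the frames
    n_samples = int(round(n_frames * t_hop * s_high))    # number of waveform samples covered
    sample_times = np.arange(n_samples) / s_high
    # Linearly interpolate each feature dimension independently.
    upsampled = np.stack(
        [np.interp(sample_times, frame_times, frame_feats[:, d]) for d in range(dim)],
        axis=1,
    )
    return upsampled                                      # shape (n_samples, dim)

# Example: 100 frames of 129-dim features, 10 ms hop, upsampled to 16 kHz.
feats = np.random.randn(100, 129).astype(np.float32)
cond = upsample_conditions(feats, t_hop=0.010, s_high=16000)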
3. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 1 or 2, wherein the bone conduction voice amplitude spectrum, the air conduction voice amplitude spectrum and the bone conduction voice phase information in step 2 are obtained by:
step 2.1, respectively normalizing the waveform amplitudes of the bone conduction voice x and the air conduction voice y to the range [-1, 1] to obtain the normalized bone conduction voice x' and the normalized air conduction voice y';
step 2.2, extracting the acoustic features of the bone conduction voice and the air conduction voice to obtain the bone conduction voice amplitude spectrum M_x, the air conduction voice amplitude spectrum M_y and the bone conduction voice phase information, wherein the bone conduction voice phase information is the bone conduction voice group delay feature GD_x;
step 2.3, performing log extraction and MVN normalization on the bone conduction voice amplitude spectrum M_x and the air conduction voice amplitude spectrum M_y respectively to obtain the normalized bone conduction voice amplitude spectrum M'_x and the normalized air conduction voice amplitude spectrum M'_y.
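The following Python sketch illustrates steps 2.1-2.3 of claim 3 under assumed analysis settings (512-sample frames, 256-sample hop, Hann window); the claim itself does not fix these parameters, and the random signals merely stand in for the bone conduction and air conduction recordings.

import numpy as np

def magnitude_spectrum(x, frame_len=512, hop=256):
    # Framewise magnitude spectrum |STFT| of a waveform already scaled to [-1, 1].
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))            # (n_frames, frame_len // 2 + 1)

def log_mvn(mag, eps=1e-8):
    # Log compression followed by mean-variance normalization per frequency bin.
    log_mag = np.log(mag + eps)
    mean, std = log_mag.mean(axis=0), log_mag.std(axis=0) + eps
    return (log_mag - mean) / std, mean, std              # stats kept for later de-normalization

# Step 2.1: amplitude-normalize both signals to [-1, 1] (random stand-ins here).
x = np.random.randn(16000); y = np.random.randn(16000)
x = x / np.max(np.abs(x)); y = y / np.max(np.abs(y))

# Steps 2.2-2.3: magnitude spectra M_x, M_y and their log + MVN normalized versions.
M_x, M_y = magnitude_spectrum(x), magnitude_spectrum(y)
M_x_norm, mu_x, sd_x = log_mvn(M_x)
M_y_norm, mu_y, sd_y = log_mvn(M_y)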
4. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the training of the BLSTM-based amplitude spectrum enhancement model in step 2 is specifically as follows:
step 3.1, setting the learning rate to eta_B and the number of training iterations to N_B;
step 3.2, sending the normalized bone conduction voice amplitude spectrum M'_x into the BLSTM-based amplitude spectrum enhancement model to obtain the estimated amplitude spectrum M̂'_y;
step 3.3, updating the BLSTM parameters theta_B according to the mean square error function MSE as
theta_B <- theta_B - eta_B * dL_MSE(M̂'_y, M'_y)/d(theta_B)
where L_MSE(M̂'_y, M'_y) denotes the MSE loss function error between M̂'_y and M'_y, and theta_B denotes the parameters of the BLSTM-based amplitude spectrum enhancement model;
step 3.4, iterating steps 3.2-3.3 until the maximum number of iterations N_B is reached.
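A minimal PyTorch sketch of the training loop of claim 4, assuming spectra already normalized as in claim 3; the network sizes, batch shapes and data are illustrative stand-ins, and plain SGD is used so that the parameter update matches the form theta_B <- theta_B - eta_B * dL_MSE/d(theta_B).

import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    # Bidirectional LSTM mapping bone conduction spectra to air conduction spectra.
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_bins)

    def forward(self, m_x):                   # m_x: (batch, frames, n_bins)
        h, _ = self.blstm(m_x)
        return self.proj(h)                   # estimated normalized air conduction spectrum

eta_B, N_B = 1e-3, 100                        # step 3.1: learning rate and iteration count (assumed values)
model = BLSTMEnhancer()
opt = torch.optim.SGD(model.parameters(), lr=eta_B)
criterion = nn.MSELoss()

M_x = torch.randn(4, 61, 257)                 # normalized bone conduction spectra (stand-in)
M_y = torch.randn(4, 61, 257)                 # normalized air conduction targets (stand-in)

for _ in range(N_B):                          # steps 3.2-3.4
    M_hat = model(M_x)                        # step 3.2: forward pass
    loss = criterion(M_hat, M_y)              # step 3.3: MSE between estimate and target
    opt.zero_grad(); loss.backward(); opt.step()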
5. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the training of the WaveNet-based waveform generation model in step 2 is specifically as follows:
step 4.1, performing [0, 1] normalization on the air conduction voice amplitude spectrum M_y and the bone conduction voice group delay feature GD_x respectively to obtain the normalized air conduction voice amplitude spectrum M''_y and the normalized bone conduction voice group delay feature GD''_x;
step 4.2, setting the joint condition feature H = concat[M''_y; GD''_x], i.e. the concatenation of the normalized air conduction voice amplitude spectrum and the normalized bone conduction voice group delay feature;
step 4.3, setting the learning rate to eta_W and the number of training iterations to N_W;
step 4.4, sending the joint condition feature H and the air conduction voice y into the WaveNet-based waveform generation model to obtain the estimated air conduction voice ŷ;
step 4.5, updating the WaveNet parameters theta_W according to the cross entropy as
theta_W <- theta_W - eta_W * dL_CE(ŷ, y)/d(theta_W)
where L_CE(ŷ, y) denotes the cross entropy loss function error between ŷ and y, and theta_W denotes the parameters of the WaveNet-based waveform generation model;
step 4.6, iterating steps 4.4-4.5 until convergence or until the maximum number of iterations N_W is reached.
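A compressed PyTorch sketch of the conditional training of claim 5: the joint condition H drives a small stack of dilated causal convolutions that predicts mu-law quantized waveform classes under a cross entropy loss. The tiny network, the mu-law depth of 256 levels, the optimiser and all sizes below are stand-ins for a full WaveNet, not the model of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

Q = 256                                                # mu-law quantization levels (assumed)

def mu_law_encode(y, q=Q):
    # Map a waveform in [-1, 1] to integer classes 0..q-1.
    mu = q - 1
    y = torch.clamp(y, -1.0, 1.0)
    f = torch.sign(y) * torch.log1p(mu * torch.abs(y)) / torch.log1p(torch.tensor(float(mu)))
    return ((f + 1) / 2 * mu + 0.5).long()

class TinyCondWaveNet(nn.Module):
    # Dilated causal convolutions with local conditioning (a stand-in for WaveNet).
    def __init__(self, cond_dim, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=d, padding=d) for d in dilations])
        self.cond = nn.ModuleList([nn.Conv1d(cond_dim, channels, kernel_size=1) for _ in dilations])
        self.out = nn.Conv1d(channels, Q, kernel_size=1)

    def forward(self, y_prev, h):                      # y_prev: (B, 1, T), h: (B, cond_dim, T)
        z = self.embed(y_prev)
        for conv, cond in zip(self.layers, self.cond):
            u = conv(z)[:, :, :z.size(2)]              # crop so each output only sees past inputs
            z = torch.tanh(u + cond(h)) + z            # conditioned residual layer
        return self.out(z)                             # (B, Q, T) class logits

eta_W, N_W = 1e-3, 10                                  # step 4.3 (assumed values)
T, cond_dim = 4000, 258                                # samples per clip, dim of concat[M''_y; GD''_x]
y = torch.rand(2, T) * 2 - 1                           # air conduction waveform targets (stand-in)
H = torch.randn(2, cond_dim, T)                        # joint conditions, already upsampled to T samples

net = TinyCondWaveNet(cond_dim)
opt = torch.optim.SGD(net.parameters(), lr=eta_W)
targets = mu_law_encode(y)

for _ in range(N_W):                                   # steps 4.4-4.6
    y_in = F.pad(y[:, :-1], (1, 0)).unsqueeze(1)       # teacher forcing: previous samples as input
    logits = net(y_in, H)
    loss = F.cross_entropy(logits, targets)            # step 4.5: cross entropy between estimate and y
    opt.zero_grad(); loss.backward(); opt.step()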
6. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein in step 3, sending the bone conduction voice amplitude spectrum at the sampling rate s_low to be enhanced into the trained BLSTM-based amplitude spectrum enhancement model to obtain the enhanced amplitude spectrum, and then sending the enhanced amplitude spectrum and the bone conduction voice phase information into the trained WaveNet-based waveform generation model to obtain the enhanced voice waveform at the sampling rate s_high, is specifically as follows:
step 5.1, performing waveform normalization on the bone conduction voice x to be enhanced, normalizing it to [-1, 1], and extracting the bone conduction voice amplitude spectrum M_x and the bone conduction voice group delay feature GD_x;
step 5.2, performing log extraction and MVN processing on the bone conduction voice amplitude spectrum M_x to obtain the normalized bone conduction voice amplitude spectrum M'_x, and performing [0, 1] normalization on the bone conduction voice group delay feature GD_x to obtain the normalized bone conduction voice group delay feature GD''_x;
step 5.3, sending the normalized bone conduction voice amplitude spectrum M'_x into the trained BLSTM model f_theta_B, where f_theta_B denotes the BLSTM model with parameters theta_B, and computing M̂' = f_theta_B(M'_x) to obtain the enhanced normalized bone conduction voice amplitude spectrum M̂';
step 5.4, performing MVN de-normalization and inverse log operation on the enhanced normalized bone conduction voice amplitude spectrum M̂' to obtain the estimated bone conduction voice amplitude spectrum M̂;
step 5.5, performing [0, 1] normalization on the estimated bone conduction voice amplitude spectrum M̂ to obtain M̂'';
step 5.6, setting the joint condition feature H' = concat[M̂''; GD''_x];
step 5.7, sending H' into the trained WaveNet model g_theta_W, where g_theta_W denotes the WaveNet model with parameters theta_W, and computing ŷ = g_theta_W(H') to obtain the enhanced bone conduction voice waveform ŷ.
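The data flow of steps 5.2-5.7 can be sketched with plain array operations as below; the two lambdas merely stand in for the trained BLSTM model f_theta_B and WaveNet model g_theta_W of claims 4 and 5, and all shapes and values are illustrative assumptions.

import numpy as np

def minmax_01(a, eps=1e-8):
    # [0, 1] normalization used for the condition features.
    return (a - a.min()) / (a.max() - a.min() + eps)

# Stand-ins for the trained models (identity / zeros here, only so the data flow runs end to end).
blstm_enhance = lambda m: m                        # f_theta_B: normalized spectrum -> enhanced spectrum
wavenet_generate = lambda h: np.zeros(h.shape[0])  # g_theta_W: conditions -> waveform samples

# Steps 5.1-5.2: features of the bone conduction voice to be enhanced (random stand-ins).
M_x = np.abs(np.random.randn(61, 257))             # amplitude spectrum
GD_x = np.random.randn(61, 257)                    # group delay feature
mu, sd = np.log(M_x + 1e-8).mean(0), np.log(M_x + 1e-8).std(0) + 1e-8
M_x_norm = (np.log(M_x + 1e-8) - mu) / sd          # log + MVN
GD_norm = minmax_01(GD_x)                          # [0, 1] normalization

M_hat_norm = blstm_enhance(M_x_norm)               # step 5.3: enhanced normalized spectrum
M_hat = np.exp(M_hat_norm * sd + mu)               # step 5.4: MVN de-normalization, inverse log
H = np.concatenate([minmax_01(M_hat), GD_norm], axis=1)   # steps 5.5-5.6: H' = concat[M̂''; GD''_x]
y_hat = wavenet_generate(H)                        # step 5.7: conditions (after up-sampling, claim 2) drive WaveNet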
7. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the bone conduction voice group delay feature GD_x in step 2.2 is extracted by means of the group delay function as follows:
the group delay function gamma(omega) is defined as the negative derivative of the phase spectrum with respect to frequency:
gamma(omega) = -d theta(omega)/d omega
wherein omega denotes frequency and the phase spectrum theta(omega) is a continuous function of omega;
using the relationship between the short-time Fourier coefficients, the group delay function is calculated directly from the signal by the following formula:
gamma(omega) = (X_R(omega)·Y_R(omega) + X_I(omega)·Y_I(omega)) / |X(omega)|^2
in the formula, the subscripts R and I denote the real and imaginary parts of the short-time Fourier transform, respectively, and X(omega) and Y(omega) denote the Fourier transforms of the signals x(n) and n·x(n), respectively.
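The formula of claim 7 can be evaluated per analysis frame with two FFTs, one over x(n) and one over n·x(n); a short numpy sketch follows, in which the frame length and window are assumed example values.

import numpy as np

def group_delay(frame, eps=1e-8):
    # Group delay via gamma(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2,
    # where X is the FFT of x(n) and Y is the FFT of n*x(n).
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

frame = np.hanning(512) * np.random.randn(512)     # an example windowed speech frame
gd = group_delay(frame)                            # one group delay value per frequency bin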
8. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the output layer of the WaveNet-based waveform generation model is changed from the Softmax classifier to parameter prediction based on a discretized logistic mixture distribution:
the distribution v of the voice waveform sample points is characterized by a mixture of several continuous logistic distribution functions:
v ~ sum_{i=1..K} alpha_i · logistic(mu_i, s_i)
wherein K denotes the total number of logistic distributions, alpha_i denotes the weight of the i-th logistic distribution, and (mu_i, s_i) denote the function parameters of the i-th logistic distribution; once the parameters of the logistic mixture distribution have been predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function.
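Sampling the predicted waveform value from the logistic mixture of claim 8 amounts to picking a component i with probability alpha_i and then drawing from logistic(mu_i, s_i) by inverting its CDF; a minimal numpy sketch with made-up parameter values:

import numpy as np

def sample_logistic_mixture(alpha, mu, s, rng=np.random.default_rng()):
    # Draw one sample from sum_i alpha_i * Logistic(mu_i, s_i).
    i = rng.choice(len(alpha), p=alpha)            # choose a mixture component
    u = rng.uniform(1e-5, 1 - 1e-5)                # uniform variate, kept away from 0 and 1
    return mu[i] + s[i] * (np.log(u) - np.log(1 - u))   # inverse logistic CDF

# Example predicted parameters for K = 3 components (illustrative values only).
alpha = np.array([0.6, 0.3, 0.1])
mu = np.array([0.05, -0.2, 0.4])
s = np.array([0.02, 0.05, 0.03])
v = np.clip(sample_logistic_mixture(alpha, mu, s), -1.0, 1.0)   # predicted waveform sample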
CN201910590941.8A 2019-07-02 2019-07-02 Bone conduction voice enhancement waveform generation method based on WaveNet Active CN110648684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910590941.8A CN110648684B (en) 2019-07-02 2019-07-02 Bone conduction voice enhancement waveform generation method based on WaveNet

Publications (2)

Publication Number Publication Date
CN110648684A true CN110648684A (en) 2020-01-03
CN110648684B CN110648684B (en) 2022-02-18

Family

ID=69009424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910590941.8A Active CN110648684B (en) 2019-07-02 2019-07-02 Bone conduction voice enhancement waveform generation method based on WaveNet

Country Status (1)

Country Link
CN (1) CN110648684B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105023580A (en) * 2015-06-25 2015-11-04 中国人民解放军理工大学 Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology
US20180336880A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN109817239A (en) * 2018-12-24 2019-05-28 龙马智芯(珠海横琴)科技有限公司 The noise-reduction method and device of voice

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DARIO RETHAGE ET AL: "A WaveNet for Speech Denoising", ICASSP 2018 *
ZHANG Xiongwei et al.: "Research status and prospects of blind enhancement technology for bone-conducted microphone speech", Journal of Data Acquisition and Processing *
FAN Cunhang et al.: "An end-to-end speech separation method based on convolutional neural networks", Journal of Signal Processing *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583904A (en) * 2020-05-13 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111583904B (en) * 2020-05-13 2021-11-19 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112562710A (en) * 2020-11-27 2021-03-26 天津大学 Stepped voice enhancement method based on deep learning
CN112562710B (en) * 2020-11-27 2022-09-30 天津大学 Stepped voice enhancement method based on deep learning
CN112599145A (en) * 2020-12-07 2021-04-02 天津大学 Bone conduction voice enhancement method based on generation of countermeasure network
CN113823314A (en) * 2021-08-12 2021-12-21 荣耀终端有限公司 Voice processing method and electronic equipment
CN114548221A (en) * 2022-01-17 2022-05-27 苏州大学 Generation type data enhancement method and system for small sample unbalanced voice database

Also Published As

Publication number Publication date
CN110648684B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN109767778B (en) Bi-LSTM and WaveNet fused voice conversion method
US7792672B2 (en) Method and system for the quick conversion of a voice signal
Wali et al. Generative adversarial networks for speech processing: A review
Tachibana et al. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation
CN1815552B (en) Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
Huang et al. Refined wavenet vocoder for variational autoencoder based voice conversion
JP4382808B2 (en) Method for analyzing fundamental frequency information, and voice conversion method and system implementing this analysis method
Matsubara et al. Full-band LPCNet: A real-time neural vocoder for 48 kHz audio with a CPU
Ben Othmane et al. Enhancement of esophageal speech obtained by a voice conversion technique using time dilated fourier cepstra
Katsir et al. Speech bandwidth extension based on speech phonetic content and speaker vocal tract shape estimation
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Katsir et al. Evaluation of a speech bandwidth extension algorithm based on vocal tract shape estimation
Al-Radhi et al. Continuous wavelet vocoder-based decomposition of parametric speech waveform synthesis
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Lian et al. Whisper to normal speech based on deep neural networks with MCC and F0 features
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
Ou et al. Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis
Narendra et al. Parameterization of excitation signal for improving the quality of HMM-based speech synthesis system
Akhter et al. An analysis of performance evaluation metrics for voice conversion models
Xie et al. Pitch transformation in neural network based voice conversion
CN114913844A (en) Broadcast language identification method for pitch normalization reconstruction
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Lv et al. Objective evaluation method of broadcasting vocal timbre based on feature selection
Wang et al. Beijing opera synthesis based on straight algorithm and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant