CN110648684A - Bone conduction voice enhancement waveform generation method based on WaveNet - Google Patents

Bone conduction voice enhancement waveform generation method based on WaveNet

Info

Publication number
CN110648684A
CN110648684A
Authority
CN
China
Prior art keywords
bone conduction
amplitude spectrum
voice
conduction voice
wavenet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910590941.8A
Other languages
Chinese (zh)
Other versions
CN110648684B (en)
Inventor
Zhang Xiongwei
Zheng Changyan
Yang Jibin
Cao Tieyong
Li Li
Sun Meng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN201910590941.8A priority Critical patent/CN110648684B/en
Publication of CN110648684A publication Critical patent/CN110648684A/en
Application granted granted Critical
Publication of CN110648684B publication Critical patent/CN110648684B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain
    • G10L 21/0316: Speech enhancement by changing the amplitude
    • G10L 21/0324: Details of processing therefor
    • G10L 21/0332: Details of processing therefor involving modification of waveforms
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique

Abstract

The invention discloses a bone conduction speech enhancement waveform generation method based on WaveNet. The method uses a WaveNet model to generate high-quality speech on top of bone conduction speech magnitude spectrum enhancement performed by a BLSTM model. First, a BLSTM model and a WaveNet model are constructed, a cross-sampling-rate up-sampling module is introduced into the WaveNet model, and the two models are trained separately. Then the magnitude spectrum of the low-sampling-rate bone conduction speech to be enhanced is fed into the trained BLSTM to obtain an enhanced magnitude spectrum, and the enhanced magnitude spectrum together with the bone conduction speech phase information is fed into the trained WaveNet model to obtain an enhanced speech waveform at a high sampling rate. The invention effectively exploits the bone conduction speech phase information, provides spectral expansion, and can generate the enhanced high-sampling-rate speech waveform directly from the enhanced bone conduction speech magnitude spectrum and the bone conduction speech phase information, thereby significantly improving the quality of bone conduction speech.

Description

Bone conduction voice enhancement waveform generation method based on WaveNet
Technical Field
The invention relates to the technical field of bone conduction, in particular to a bone conduction voice enhancement waveform generation method based on WaveNet.
Background
A bone conduction microphone picks up speech from the vibrations produced by the skull, larynx and other body parts when a person speaks. Because its signal transmission channel shields the influence of ambient noise, bone conduction speech has strong noise immunity compared with speech captured by a conventional air conduction microphone, and it has broad application prospects in both military and civilian fields. However, due to the low-pass nature of transmission through the human body, the high-frequency components of bone conduction speech are severely attenuated, with frequency components usually below 2.5 kHz; moreover, the vibration-generated signal does not pass through articulation regions such as the oral cavity, nasal cavity and lips, so consonants related to these regions, such as fricatives, plosives and unvoiced sounds, are severely lost. As a result, bone conduction speech sounds dull, unclear and unnatural, and its intelligibility is low.
Bone conduction speech enhancement refers to processing bone conduction speech to improve its quality, and can be modeled as a conversion problem from bone conduction speech to the corresponding clean air conduction speech. Early algorithms generally decomposed speech into spectral envelope features and excitation features based on the source-filter model of the speech signal. Because the human ear is more sensitive to spectral envelope features than to excitation features, research focused on converting the low-dimensional spectral envelope of bone conduction speech to that of air conduction speech, learning the conversion relationship between the features with a Gaussian mixture model or a shallow neural network. In recent years, with the rapid development of deep learning, deep neural networks have been used to learn the nonlinear conversion relationship between the spectral envelope features of the two kinds of speech more accurately. Because deep neural networks characterize high-dimensional data well, researchers have also begun to use high-dimensional short-time Fourier transform (STFT) spectra to better characterize the gap between the two kinds of speech; for example, document 1 (Liu H P, Tsao Y, Fuh C S, "Bone-conducted speech enhancement using deep denoising autoencoder," Speech Communication, vol. 104, pp. 106-112, 2018) uses deep networks to learn the conversion between the high-dimensional Mel spectral features of the two kinds of speech. In earlier work, document 2 (Changyan Zheng, Xiongwei Zhang, Meng Sun, Jibin Yang, Yibo Xing, "A Novel Throat Microphone Speech Enhancement Framework based on Deep BLSTM Recurrent Neural Networks," in Proc. IEEE International Conference on Computer and Communications (ICCC), 2018) proposed a bone conduction speech enhancement method based on a bidirectional long short-term memory (BLSTM) recurrent neural network, which uses the context of speech frames to model the conversion from bone conduction speech to the corresponding high-dimensional air conduction speech magnitude spectrum and effectively improves the quality of enhanced bone conduction speech.
After obtaining the converted spectral envelope or magnitude spectrum, the above methods generally synthesize the speech waveform directly using the excitation signal or phase spectrum of the original bone conduction speech. In practice, however, the excitation signal and phase spectrum of bone conduction speech also differ from those of the corresponding air conduction speech. Under the source-filter model the waveform is synthesized from excitation and envelope; when the excitation signal is mismatched the synthesized speech is discontinuous, and because the feature dimension is low, even small distortions can greatly degrade auditory perception. A few studies have explored converting the excitation features of bone conduction speech to those of air conduction speech: document 3 (Mallidi S H, Yegnanarayana B, et al., "Speaker-dependent mapping of source and system features for enhancement of throat microphone speech," in Proc. Annual Conference of the International Speech Communication Association, 2010) replaces the excitation signal with glottal closure instant (GCI) features and converts the GCI features with a shallow neural network, and document 4 (Turan M A T, Erzin E, "A new statistical excitation mapping for enhancement of throat microphone recordings") converts the excitation features with a statistical mapping model. However, because the excitation signal is closely related to vocal-fold motion and frication noise, its regularity is weak, so modeling it is difficult and the conversion effect is not ideal. For waveform synthesis based on the short-time inverse Fourier transform, the short-time Fourier transform analyzes the signal with overlapping frames, so part of the signal information is shared between frames and a specific correlation is introduced among the STFT coefficients. If the speech waveform is synthesized from the enhanced magnitude spectrum and the original bone conduction speech phase, the original constraint between the magnitude spectrum and the phase spectrum of the STFT coefficients is lost, and the synthesized speech is degraded by this mismatch even if the magnitude spectrum is optimal. In the field of bone conduction speech enhancement, bone conduction speech and air conduction speech are quite similar in the low-frequency part, so the quality of speech synthesized with the original bone conduction speech phase is acceptable, but the loss of sound quality caused by the mismatched phase spectrum during waveform synthesis is still clearly present.
Disclosure of Invention
The purpose of the invention is to provide a WaveNet-based bone conduction speech enhancement waveform generation method that markedly improves the quality of bone conduction speech and can obtain an enhanced high-sampling-rate speech waveform directly from the enhanced bone conduction speech magnitude spectrum and the bone conduction speech phase information.
The technical solution for realizing the purpose of the invention is as follows: a method for WaveNet-based generation of a bone conduction speech enhancement waveform, comprising the steps of:
step 1, constructing a BLSTM-based amplitude spectrum enhancement model and a WaveNet-based waveform generation model, and introducing an up-sampling module with a cross-sampling rate into the WaveNet-based waveform generation model;
step 2, train the BLSTM-based magnitude spectrum enhancement model and the WaveNet-based waveform generation model separately; the input of the BLSTM-based magnitude spectrum enhancement model is the bone conduction speech magnitude spectrum at the sampling rate s_low, and its output target is the air conduction speech magnitude spectrum at the sampling rate s_low; the input of the WaveNet-based waveform generation model is the phase information of the bone conduction speech and the magnitude spectrum of the air conduction speech at the sampling rate s_low, and its output target is the air conduction speech waveform at the sampling rate s_high; wherein s_low < s_high;
step 3, feed the bone conduction speech magnitude spectrum at the sampling rate s_low to be enhanced into the trained BLSTM-based magnitude spectrum enhancement model to obtain an enhanced magnitude spectrum, then feed the enhanced magnitude spectrum and the bone conduction speech phase information jointly into the trained WaveNet-based waveform generation model to obtain the enhanced speech waveform at the sampling rate s_high.
Further, the cross-sampling-rate upsampling module in step 1 specifically includes:
under the two sampling rates s_low and s_high, the framing window length and the window shift time of the speech features are kept consistent, so that the resolution of the frame-level features is 1/t_hop, where t_hop denotes the framing window shift time; meanwhile, linear interpolation is adopted as the up-sampling method.
Further, in the step 2, the bone conduction voice amplitude spectrum, the air conduction voice amplitude spectrum and the bone conduction voice phase information are obtained in the following manner:
step 2.1, respectively carrying out waveform amplitude normalization on the bone conduction voice x and the air conduction voice y to between [ -1,1] to obtain normalized bone conduction voice x 'and normalized air conduction voice y';
step 2.2, extracting the acoustic characteristics of the bone conduction voice and the air conduction voice to obtain a bone conduction voice amplitude spectrum MxAudio amplitude spectrum M of qi-conduction speechyBone conduction voice phase information, wherein the bone conduction voice phase information is bone conduction voice group delay characteristic GDx
Step 2.3, conducting voice amplitude spectrum M to the bonexAnd air conduction voice amplitude spectrum MyRespectively carrying out acoustic characteristic log extraction and MVN normalization processing to obtain normalized bone conduction voice amplitude spectrum M'xAnd a normalized air guide language magnitude spectrum M'y
Further, training the BLSTM-based magnitude spectrum enhancement model in step 2 specifically includes the following steps:
step 3.1, set the learning rate to η_B and the number of training iterations to N_B;
step 3.2, feed the normalized bone conduction speech magnitude spectrum M'_x into the BLSTM-based magnitude spectrum enhancement model to obtain the estimated magnitude spectrum M̂'_y;
step 3.3, update the BLSTM parameters θ_B according to the mean square error (MSE) loss:
θ_B ← θ_B − η_B · ∇_{θ_B} L_MSE(M̂'_y, M'_y)
where L_MSE(M̂'_y, M'_y) denotes the MSE loss between M̂'_y and M'_y, and θ_B denotes the parameters of the BLSTM-based magnitude spectrum enhancement model;
step 3.4, iterate steps 3.2-3.3 until the maximum number of iterations N_B is reached.
Further, in step 2, training the WaveNet-based waveform generation model specifically includes:
step 4.1, apply [0,1] normalization to the air conduction speech magnitude spectrum M_y and the bone conduction speech group delay feature GD_x respectively, obtaining the normalized air conduction speech magnitude spectrum M″_y and the normalized bone conduction speech group delay feature GD″_x;
step 4.2, form the joint condition feature H = concat[M″_y; GD″_x], i.e., the combination of the air conduction speech magnitude spectrum and the bone conduction speech group delay feature;
step 4.3, set the learning rate to η_W and the number of training iterations to N_W;
step 4.4, feed the joint condition feature H and the air conduction speech y into the WaveNet-based waveform generation model to obtain the estimated air conduction speech ŷ;
step 4.5, update the WaveNet parameters θ_W according to the cross-entropy loss:
θ_W ← θ_W − η_W · ∇_{θ_W} L_CE(ŷ, y)
where L_CE(ŷ, y) denotes the cross-entropy loss between ŷ and y, and θ_W denotes the parameters of the WaveNet-based waveform generation model;
step 4.6, iterate steps 4.4-4.5 until the result converges or the maximum number of iterations N_W is reached.
Further, in step 3, the bone conduction speech magnitude spectrum at the sampling rate s_low to be enhanced is fed into the trained BLSTM-based magnitude spectrum enhancement model to obtain the enhanced magnitude spectrum, and the enhanced magnitude spectrum and the bone conduction speech phase information are then fed jointly into the trained WaveNet-based waveform generation model to obtain the enhanced speech waveform at the sampling rate s_high, specifically:
step 5.1, normalize the waveform of the bone conduction speech x to be enhanced to [-1, 1], and extract the bone conduction speech magnitude spectrum M_x and the bone conduction speech group delay feature GD_x;
step 5.2, apply the logarithm and MVN processing to the bone conduction speech magnitude spectrum M_x to obtain the normalized bone conduction speech magnitude spectrum M'_x, and apply [0,1] normalization to the bone conduction speech group delay feature GD_x to obtain the normalized bone conduction speech group delay feature GD″_x;
step 5.3, feed the normalized bone conduction speech magnitude spectrum M'_x into the trained model Model_B and compute M̂'_y = Model_B(M'_x), obtaining the enhanced normalized magnitude spectrum M̂'_y, where Model_B denotes the BLSTM model with parameters θ_B;
step 5.4, apply inverse MVN normalization and the inverse logarithm operation to the enhanced normalized magnitude spectrum M̂'_y to obtain the estimated magnitude spectrum M̂_y;
step 5.5, apply [0,1] normalization to the estimated magnitude spectrum M̂_y to obtain M̂″_y;
step 5.6, form the joint condition feature H' = concat[M̂″_y; GD″_x];
step 5.7, feed H' into the trained model Model_W and compute ŷ = Model_W(H'), obtaining the enhanced waveform ŷ of the bone conduction speech, where Model_W denotes the WaveNet model with parameters θ_W.
Further, the bone conduction speech group delay feature GD_x in step 2.2 is extracted with the group delay function as follows:
the group delay function γ(ω) is defined as the negative gradient of the phase spectrum with respect to frequency:
γ(ω) = −dθ(ω)/dω
where ω denotes frequency and the phase spectrum θ(ω) is a continuous function of ω;
owing to the correlation between the short-time Fourier coefficients, the group delay function can also be calculated directly from the signal by the following formula:
γ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²
where the subscripts R and I denote the real and imaginary parts of the short-time Fourier transform, respectively, and X(ω) and Y(ω) denote the Fourier transforms of the signals x(n) and n·x(n).
Further, the output layer of the WaveNet-based waveform generation model is changed from a Softmax classifier to prediction of discretized logistic mixture distribution parameters:
the distribution v of speech waveform sample points is characterized by a mixture of K continuous logistic distributions:
v ~ Σ_{i=1}^{K} α_i · Logistic(μ_i, s_i)
where K denotes the total number of logistic distributions, α_i denotes the weight of the i-th logistic distribution, and (μ_i, s_i) denote the function parameters of the i-th logistic distribution; once the parameters of the logistic mixture are predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function.
Compared with the prior art, the invention has the following notable advantages: (1) it adopts a spectrum-expanding WaveNet model that fuses phase information, using the enhanced bone conduction speech magnitude spectrum and the bone conduction speech group delay feature at the low sampling rate jointly as the condition features of WaveNet, and directly obtains the enhanced high-sampling-rate time-domain speech waveform; (2) it effectively exploits the original bone conduction speech information, has a good spectral expansion capability, and markedly improves the quality of bone conduction speech.
Drawings
Fig. 1 is a flow chart of the method for generating the waveform of the bone conduction speech enhancement based on WaveNet according to the present invention.
Fig. 2 is a schematic structural diagram of a waveform generation model based on WaveNet in the present invention.
Fig. 3 is a sentence acoustic feature diagram of air conduction speech in the present invention, where (a) is an amplitude spectrogram, (b) is a phase spectrogram, and (c) is a group delay feature diagram of air conduction speech.
Fig. 4 is a schematic diagram of an upsampling module across sampling rates in the present invention.
Fig. 5 is a schematic diagram of the process of expanding convolution in the present invention.
Fig. 6 is a MOS score chart under different waveform synthesis methods in the embodiment of the present invention.
FIG. 7 is a speech amplitude spectrogram obtained by different waveform synthesis methods in the embodiment of the present invention, where (a) - (i) are speech amplitude spectrograms obtained by a waveform synthesis method of bone conduction speech, air conduction speech, IFFT, Griffin-Lim, Lws, E-WN-M + GD, E-WN-M, WN-M, WN-M + GD in this order.
Fig. 8 is a voice group delay characteristic diagram under different waveform synthesis methods in the embodiment of the present invention, where (a) - (f) are voice group delay characteristic diagrams under the waveform synthesis methods of bone conduction voice, air conduction voice, IFFT, Griffin-Lim, Lws, and E-WN-M + GD in sequence.
Detailed Description
In research fields such as speech denoising and speech synthesis, the waveform synthesis stage also faces problems such as mismatched phase information, missing phase spectrum information, and over-smoothed synthesized speech parameters. Researchers have generally adopted phase spectrum estimation algorithms such as document 5 (Griffin D, Lim J, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243). However, current waveform-modeling-based synthesis methods, such as those based on the WaveNet and SampleRNN models, can generate the speech waveform directly from given acoustic features by constructing special deep neural network structures and directly modeling the joint probability density of speech waveform sample points. Compared with estimating a phase spectrum, this direct waveform generation avoids the problems caused by the overlap-add effect or over-smoothed synthesis parameters in STFT-based methods, which has led to breakthrough progress in speech synthesis, with synthesized speech approaching human pronunciation in naturalness.
In order to obtain enhanced bone conduction speech of higher quality, the invention provides a high-quality bone conduction speech enhancement method based on WaveNet waveform modeling, which, building on document 7 (Changyan Zheng, Xiongwei Zhang, Meng Sun, Jibin Yang, Yibo Xing, "A Novel Throat Microphone Speech Enhancement Framework based on Deep BLSTM Recurrent Neural Networks," in Proc. IEEE International Conference on Computer and Communications (ICCC), 2018), further alleviates the loss of sound quality caused by the mismatch between the enhanced magnitude spectrum and the phase spectrum in the waveform synthesis stage.
The invention relates to a method for generating a bone conduction voice enhancement waveform based on WaveNet, which is generally shown in figure 1, wherein 8kHz and 16kHz in the figure represent voice sampling rates, and the method comprises a training stage and a testing stage:
Training stage: first, the acoustic features of the bone conduction speech and the air conduction speech are extracted, and the BLSTM-based magnitude spectrum enhancement model and the WaveNet-based waveform generation model are trained. When training the BLSTM-based magnitude spectrum enhancement model, the bone conduction speech magnitude spectrum at the 8 kHz sampling rate is used as input, and the corresponding air conduction speech magnitude spectrum at the 8 kHz sampling rate is used as the output target; when training the WaveNet-based waveform generation model, the air conduction speech magnitude spectrum at the 8 kHz sampling rate and the bone conduction speech group delay feature at the 8 kHz sampling rate are the inputs, and the air conduction speech waveform at the 16 kHz sampling rate is the output.
Enhancement stage: the features of the bone conduction speech at the 8 kHz sampling rate to be enhanced are extracted, the magnitude spectrum is fed into the trained BLSTM to obtain an enhanced magnitude spectrum, and the enhanced magnitude spectrum and the bone conduction speech group delay feature are then fed jointly into the trained WaveNet model to obtain enhanced speech at the 16 kHz sampling rate. The "communication transmission" marked in the figure means that, when speech is encoded and decoded for transmission, the magnitude spectrum enhancement model can be placed at the encoding end, the encoded enhanced magnitude spectrum information is transmitted over the communication channel, the WaveNet model is deployed at the decoding end, and the 16 kHz enhanced speech is obtained after decoding. Therefore, during communication transmission only low-sampling-rate speech features need to be transmitted, while high-sampling-rate enhanced speech is finally obtained; that is, the spectral expansion function is achieved without increasing the communication cost.
With reference to fig. 1, the method for generating a waveform for enhancing bone conduction speech based on WaveNet according to the present invention includes the following steps:
step 1, constructing a BLSTM-based amplitude spectrum enhancement model and a WaveNet-based waveform generation model, and introducing an up-sampling module with a cross-sampling rate into the WaveNet-based waveform generation model;
The cross-sampling-rate up-sampling module is specifically:
under the two sampling rates s_low and s_high, the framing window length and the window shift time of the speech features are kept consistent, so that the resolution of the frame-level features is 1/t_hop, where t_hop denotes the framing window shift time; meanwhile, linear interpolation is adopted as the up-sampling method.
The output layer of the WaveNet-based waveform generation model is changed from a Softmax classifier to prediction of discretized logistic mixture distribution parameters, and the distribution v of speech waveform sample points is characterized by a mixture of K continuous logistic distributions:
v ~ Σ_{i=1}^{K} α_i · Logistic(μ_i, s_i)
where K denotes the total number of logistic distributions, α_i denotes the weight of the i-th logistic distribution, and (μ_i, s_i) denote the function parameters of the i-th logistic distribution; once the parameters of the logistic mixture are predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function.
Step 2, train the BLSTM-based magnitude spectrum enhancement model and the WaveNet-based waveform generation model separately. The input of the BLSTM-based magnitude spectrum enhancement model is the bone conduction speech magnitude spectrum at the sampling rate s_low, and its output target is the air conduction speech magnitude spectrum at the sampling rate s_low; the input of the WaveNet-based waveform generation model is the phase information of the bone conduction speech and the magnitude spectrum of the air conduction speech at the sampling rate s_low, and its output target is the air conduction speech waveform at the sampling rate s_high, where s_low < s_high.
Further, in the step 2, the bone conduction voice amplitude spectrum, the air conduction voice amplitude spectrum and the bone conduction voice phase information are obtained in the following manner:
step 2.1, respectively carrying out waveform amplitude normalization on the bone conduction voice x and the air conduction voice y to between [ -1,1] to obtain normalized bone conduction voice x 'and normalized air conduction voice y';
step 2.2, extracting the acoustic characteristics of the bone conduction voice and the air conduction voice to obtain a bone conduction voice amplitude spectrum MxAudio amplitude spectrum M of qi-conduction speechyBone conduction voice phase information, wherein the bone conduction voice phase information is bone conduction voice group delay characteristic GDx(ii) a The bone conduction voice group delay feature GDxThe group delay function extraction is as follows:
the group delay function γ (ω) is defined as the negative gradient of the phase spectrum over frequency:
Figure BDA0002116106910000082
wherein ω represents frequency and the phase spectrum θ (ω) is a continuous function of ω;
due to the correlation between the short-time fourier coefficients, the group delay function is calculated from the signal by the following formula:
in the formula, subscripts R and I denote real and imaginary parts of the short-time fourier transform, respectively, and X (ω) and Y (ω) denote fourier transforms of signals X (n) and nx (n).
Step 2.3, conducting voice amplitude spectrum M to the bonexAnd air conduction voice amplitude spectrum MyRespectively carrying out acoustic characteristic log extraction and MVN (mean and Variance normalization) normalization processing to obtain normalized bone conduction voice amplitude spectrum M'xAnd a normalized air guide language magnitude spectrum M'y
Further, training the BLSTM-based magnitude spectrum enhancement model in step 2 specifically includes the following steps:
step 3.1, set the learning rate to η_B and the number of training iterations to N_B;
step 3.2, feed the normalized bone conduction speech magnitude spectrum M'_x into the BLSTM-based magnitude spectrum enhancement model to obtain the estimated magnitude spectrum M̂'_y;
step 3.3, update the BLSTM parameters θ_B according to the mean square error (MSE) loss:
θ_B ← θ_B − η_B · ∇_{θ_B} L_MSE(M̂'_y, M'_y)
where L_MSE(M̂'_y, M'_y) denotes the MSE loss between M̂'_y and M'_y, and θ_B denotes the parameters of the BLSTM-based magnitude spectrum enhancement model;
step 3.4, iterate steps 3.2-3.3 until the maximum number of iterations N_B is reached.
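For illustration, a minimal Python (PyTorch) sketch of the training loop of steps 3.1-3.4 is given below; the model class, the data loader and the tensor shapes are hypothetical placeholders, and the learning rate and iteration count reuse the embodiment values (0.001, 20 iterations) only as assumed defaults.

```python
# Sketch (not the patent's reference code): BLSTM magnitude-spectrum training, steps 3.1-3.4.
# `model` and `loader` are hypothetical placeholders.
import torch
import torch.nn as nn

def train_blstm(model: nn.Module, loader, eta_B: float = 1e-3, N_B: int = 20):
    opt = torch.optim.Adam(model.parameters(), lr=eta_B)      # step 3.1
    mse = nn.MSELoss()
    for epoch in range(N_B):                                  # step 3.4: iterate to N_B
        for M_x, M_y in loader:     # normalized bone/air magnitude spectra, e.g. (batch, frames, 129)
            M_y_hat = model(M_x)    # step 3.2: estimated magnitude spectrum
            loss = mse(M_y_hat, M_y)        # step 3.3: MSE loss
            opt.zero_grad()
            loss.backward()
            opt.step()              # gradient update of the parameters theta_B
    return model
```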
Further, in step 2, training the WaveNet-based waveform generation model specifically includes:
step 4.1, apply [0,1] normalization to the air conduction speech magnitude spectrum M_y and the bone conduction speech group delay feature GD_x respectively, obtaining the normalized air conduction speech magnitude spectrum M″_y and the normalized bone conduction speech group delay feature GD″_x;
step 4.2, form the joint condition feature H = concat[M″_y; GD″_x], i.e., the combination of the air conduction speech magnitude spectrum and the bone conduction speech group delay feature;
step 4.3, set the learning rate to η_W and the number of training iterations to N_W;
step 4.4, feed the joint condition feature H and the air conduction speech y into the WaveNet-based waveform generation model to obtain the estimated air conduction speech ŷ;
step 4.5, update the WaveNet parameters θ_W according to the cross-entropy loss:
θ_W ← θ_W − η_W · ∇_{θ_W} L_CE(ŷ, y)
where L_CE(ŷ, y) denotes the cross-entropy loss between ŷ and y, and θ_W denotes the parameters of the WaveNet-based waveform generation model;
step 4.6, iterate steps 4.4-4.5 until the result converges or the maximum number of iterations N_W is reached.
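A corresponding sketch of steps 4.1-4.6 follows; the `wavenet` model, its conditioning interface, the loss helper and the data shapes are hypothetical assumptions, and with the discretized logistic output layer described later the cross-entropy term would be replaced by the mixture negative log-likelihood.

```python
# Sketch under assumed shapes: WaveNet training with joint condition H = concat[M''_y; GD''_x].
import torch

def minmax01(f: torch.Tensor) -> torch.Tensor:
    """[0,1] normalization of a feature matrix (step 4.1)."""
    return (f - f.min()) / (f.max() - f.min() + 1e-8)

def train_wavenet(wavenet, loader, waveform_loss, eta_W=1e-4, N_W=50):
    opt = torch.optim.Adam(wavenet.parameters(), lr=eta_W)    # step 4.3 (warm-up schedule omitted)
    for epoch in range(N_W):                                  # step 4.6
        for M_y, GD_x, y in loader:   # magnitude (B,T,F), group delay (B,T,F), waveform (B,L)
            H = torch.cat([minmax01(M_y), minmax01(GD_x)], dim=-1)   # step 4.2: joint condition
            y_hat = wavenet(y, cond=H)                        # step 4.4 (hypothetical call signature)
            loss = waveform_loss(y_hat, y)                    # step 4.5: cross-entropy / NLL on samples
            opt.zero_grad(); loss.backward(); opt.step()
    return wavenet
```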
Step 3, the bone conduction speech magnitude spectrum at the sampling rate s_low to be enhanced is fed into the trained BLSTM-based magnitude spectrum enhancement model to obtain the enhanced magnitude spectrum, and the enhanced magnitude spectrum and the bone conduction speech phase information are then fed jointly into the trained WaveNet-based waveform generation model to obtain the enhanced speech waveform at the sampling rate s_high, specifically:
step 5.1, normalize the waveform of the bone conduction speech x to be enhanced to [-1, 1], and extract the bone conduction speech magnitude spectrum M_x and the bone conduction speech group delay feature GD_x;
step 5.2, apply the logarithm and MVN processing to the bone conduction speech magnitude spectrum M_x to obtain the normalized bone conduction speech magnitude spectrum M'_x, and apply [0,1] normalization to the bone conduction speech group delay feature GD_x to obtain the normalized bone conduction speech group delay feature GD″_x;
step 5.3, feed the normalized bone conduction speech magnitude spectrum M'_x into the trained model Model_B and compute M̂'_y = Model_B(M'_x), obtaining the enhanced normalized magnitude spectrum M̂'_y, where Model_B denotes the BLSTM model with parameters θ_B;
step 5.4, apply inverse MVN normalization and the inverse logarithm operation to the enhanced normalized magnitude spectrum M̂'_y to obtain the estimated magnitude spectrum M̂_y;
step 5.5, apply [0,1] normalization to the estimated magnitude spectrum M̂_y to obtain M̂″_y;
step 5.6, form the joint condition feature H' = concat[M̂″_y; GD″_x];
step 5.7, feed H' into the trained model Model_W and compute ŷ = Model_W(H'), obtaining the enhanced waveform ŷ of the bone conduction speech, where Model_W denotes the WaveNet model with parameters θ_W.
Examples
WaveNet based waveform generation is described in detail below:
WaveNet is a fully probabilistic autoregressive generative model. By constructing a special deep convolutional neural network structure, it models the speech waveform directly at the sample level; additional input conditions are usually provided to guide it to generate speech waveforms with specific properties.
Let the speech waveform sequence be x = {x_1, ···, x_T}. Its joint probability density under the condition feature λ can then be expressed as the product of the following conditional probabilities:
p(x|λ) = ∏_{t=1}^{T} p(x_t | x_1, ···, x_{t−1}, λ)   (1)
Like PixelCNN, WaveNet computes the probability distribution of formula (1) by stacking carefully designed convolutional layers, and it uses a deep residual network and parameterized skip connections to build a deeper network structure and achieve fast convergence of the model.
The method realizes generation of the bone conduction voice enhancement waveform based on a WaveNet waveform modeling method, and the constructed WaveNet is specifically shown in figure 2.
Firstly, group delay characteristics:
the phase spectrum of speech is often difficult to use efficiently because during the short-time fourier coefficient computation the phase spectrum is warped (warping) to within-pi values, and thus it is difficult to perceive signal features as intuitive as the amplitude spectrum. As shown in fig. 3(a) and 3(b), the amplitude spectrum and the phase spectrum of the air conduction speech are respectively, the amplitude spectrum can show a clear harmonic structure and a formant structure of a deep red part, and the phase spectrum has no obvious structural features.
Therefore, in order to effectively use the phase spectrum information of the bone conduction speech, it is necessary to process the phase spectrum. The need for meaningful utilization of the phase spectrum by existing methods involves the phase non-unique unwrapping process or design function to extract useful information from the phase signal, which i choose to extract the phase information in the manner of the design function since the non-unique unwrapping process is usually accompanied by a loss of information. The existing related functions comprise Instantaneous Frequency (IF) and group delay function extraction, and the like, but the group delay function is effectively applied to extracting various source and system parameters at present and is well utilized in speech signal processing, for example, the group delay function is used as a supplementary feature of the traditional MFCC feature to effectively improve the accuracy of speech recognition.
The group delay function is defined as the negative gradient of the phase spectrum with respect to frequency:
γ(ω) = −dθ(ω)/dω   (2)
where ω denotes frequency and the phase spectrum θ(ω) is defined as a continuous function of ω. Owing to the correlation between the short-time Fourier coefficients, the group delay function can also be calculated directly from the signal by the following formula:
γ(ω) = (X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)) / |X(ω)|²   (3)
where the subscripts R and I denote the real and imaginary parts of the short-time Fourier transform, respectively, and X(ω) and Y(ω) denote the Fourier transforms of the signals x(n) and n·x(n).
Fig. 3(c) shows the group delay characteristics of air conduction speech, which can be seen to have very similar structural features to the amplitude spectrum. Since the phase spectrum is invertible by the conversion of the group delay function, the group delay profile contains the complete phase information.
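Formula (3) can be evaluated frame by frame from the STFTs of x(n) and n·x(n); the NumPy sketch below follows that definition, using the embodiment's 32 ms window and 8 ms hop at 8 kHz only as assumed defaults.

```python
# Sketch: group delay feature from formula (3), gamma = (X_R*Y_R + X_I*Y_I) / |X|^2.
import numpy as np

def group_delay(x, n_fft=256, win=256, hop=64):
    window = np.hanning(win)
    gd = []
    for start in range(0, len(x) - win, hop):
        seg = x[start:start + win] * window
        ramped = np.arange(win) * seg                 # n * x(n), n is the within-frame index
        X = np.fft.rfft(seg, n_fft)
        Y = np.fft.rfft(ramped, n_fft)
        gd.append((X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-10))
    return np.asarray(gd)                              # shape (num_frames, n_fft // 2 + 1)
```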
Secondly, the cross-sampling-rate up-sampling module:
The temporal resolution of the local condition features is much lower than that of the speech waveform signal. The former are frame-level features obtained by overlapped framing of the speech, so their resolution is 1/t_hop, where t_hop is the framing window shift (hop) time; the speech waveform, on the other hand, is a sequence of sampled waveform points whose resolution equals the sampling rate s, i.e., t_hop·s times the resolution of the condition features. The condition features therefore need to be up-sampled in the conditional WaveNet so that the resolutions of the two signals are made consistent.
Since the method of the invention generates the waveform signal of speech at the high sampling rate s_high from condition features at the low sampling rate s_low, the up-sampling process must handle not only the resolution mismatch between the condition features and the waveform points but also the sampling-rate mismatch between them. For this purpose, the framing window length and window shift time of the speech are kept consistent under the two sampling rates, so that the resolution of the frame-level features is 1/t_hop in both cases. Meanwhile, to keep the condition features given to adjacent waveform points distinguishable, linear interpolation is used for up-sampling, as shown in Fig. 4.
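A minimal sketch of the linear-interpolation up-sampling across sampling rates follows: each frame-level condition vector is interpolated to one vector per high-rate waveform sample, so that a feature stream computed at s_low (resolution 1/t_hop) can condition a waveform generated at s_high. Function name, shapes and default values are assumptions.

```python
# Sketch: linear interpolation of frame-level condition features to waveform resolution.
import numpy as np

def upsample_condition(cond, t_hop=0.008, s_high=16000):
    """cond: (num_frames, dim) frame-level features; returns (num_samples, dim)."""
    num_frames, dim = cond.shape
    samples_per_frame = int(round(t_hop * s_high))        # e.g. 0.008 s * 16 kHz = 128
    num_samples = num_frames * samples_per_frame
    frame_times = np.arange(num_frames)                   # frame-level time axis
    sample_times = np.arange(num_samples) / samples_per_frame
    out = np.empty((num_samples, dim), dtype=cond.dtype)
    for d in range(dim):                                  # linear interpolation per dimension
        out[:, d] = np.interp(sample_times, frame_times, cond[:, d])
    return out
```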
Thirdly, the dilated convolution block:
Because of the huge data dimensionality of speech waveform points, even modeling short-range context requires the network to have a large receptive field. To solve this problem, WaveNet uses dilated ("atrous") convolution to enlarge the receptive field. A dilated convolution is similar to a large filter padded with zeros; by increasing the dilation factor, the receptive field can grow exponentially with the number of network layers. As shown in Fig. 5, the dilation factors are 1, 2, 4, ..., respectively. WaveNet doubles the dilation factor layer by layer; once it reaches a certain value, the layers form a convolution block, i.e., the dilated convolution block shown in Fig. 2, and the receptive field is then increased further by stacking such blocks.
To alleviate the degradation problem of deep networks, the network uses residual connections: the input of a convolutional layer is added to its output, so that the layer only has to fit the residual information, which is easier to learn, and a deeper network can thus be constructed. Skip connections combine the information of all convolutional layers to predict the final distribution, i.e., different information from each layer of the deep network is fused to support a more accurate prediction. Both kinds of connections can be seen in the dilated convolution block of Fig. 2.
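The exponential growth of the receptive field described above can be checked with a small calculation; under the configuration assumed from the embodiment (2 dilated convolution blocks of 8 layers each, dilation doubling from 1 to 128, kernel size 3), the stack already covers on the order of a thousand waveform samples.

```python
# Sketch: receptive-field growth of stacked dilated convolutions (assumed kernel size 3).
def receptive_field(num_blocks=2, layers_per_block=8, kernel=3):
    field = 1
    for _ in range(num_blocks):
        for layer in range(layers_per_block):
            dilation = 2 ** layer                 # 1, 2, 4, ..., 128 within each block
            field += (kernel - 1) * dilation      # each layer widens the field by (k - 1) * d
    return field

print(receptive_field())   # -> 1021 samples, before counting the initial causal convolution
```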
Fourthly, the gated activation unit:
WaveNet uses a gated activation unit (GAU) similar to that of PixelCNN; as shown in the dilated convolution block of Fig. 2, the condition feature is incorporated through the gated activation unit:
z = tanh(W_{f,k} * x + V_{f,k} * λ) ⊙ σ(W_{g,k} * x + V_{g,k} * λ)   (4)
where * denotes the convolution operation, ⊙ denotes the element-wise product, σ(·) denotes the sigmoid function, k is the layer index, f and g denote the filter and the gate, respectively, W and V are learnable convolution filters, x denotes the speech waveform input, λ denotes the condition feature, and z is the output obtained through the GAU. This non-linear gated activation unit is similar to the various gates in a BLSTM; its performance is far superior to non-linear functions such as the rectified linear unit, and it is an important reason that WaveNet can model speech signals.
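The following PyTorch sketch reads formula (4) as one conditional dilated convolution layer with residual and skip outputs; the channel sizes (256 residual channels, a 258-dimensional condition obtained by concatenating the 129-dimensional magnitude and group delay features) are assumptions taken loosely from the embodiment, not a reference implementation.

```python
# Sketch of the gated activation unit z = tanh(W*x + V*lam) (*) sigmoid(W_g*x + V_g*lam).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualLayer(nn.Module):
    def __init__(self, channels=256, cond_dim=258, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation                   # left-only padding keeps causality
        self.filt = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.cond_filt = nn.Conv1d(cond_dim, channels, 1)    # V_{f,k}
        self.cond_gate = nn.Conv1d(cond_dim, channels, 1)    # V_{g,k}
        self.res = nn.Conv1d(channels, channels, 1)
        self.skip = nn.Conv1d(channels, channels, 1)

    def forward(self, x, lam):
        """x: (B, channels, L) waveform features; lam: (B, cond_dim, L) up-sampled conditions."""
        xp = F.pad(x, (self.pad, 0))                         # causal padding on the time axis
        z = torch.tanh(self.filt(xp) + self.cond_filt(lam)) * \
            torch.sigmoid(self.gate(xp) + self.cond_gate(lam))   # formula (4)
        return x + self.res(z), self.skip(z)                 # residual output, skip output
```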
The condition λ can be fed into WaveNet as either a global or a local feature. Global condition features affect the output distribution over the whole time axis, e.g., the speaker's gender in a TTS model, characterizing inherent properties of the speaker; local condition features only affect the output distribution over a local portion of the time axis, e.g., the magnitude spectrum, the fundamental frequency and text features. The model of the invention uses local condition features.
Fifthly, the discretized logistic mixture distribution:
The original WaveNet output layer uses a Softmax-based classifier and normally quantizes the audio signal to 8 bits with the μ-law, so that only 256 classes need to be predicted; this makes the modeling feasible but sacrifices the quality of the original 16-bit audio. If 16-bit quantization were used to characterize the amplitude of each sample point, the Softmax classifier would have to predict 65536 values, which is difficult to model. Moreover, Softmax treats the quantization levels as independent categories and cannot reflect the correlation between them: for example, the value 128 is actually close to the values 127 and 129 and far from the value 1, so the intrinsic correlation of the data is destroyed, which affects the accuracy of waveform prediction.
Following PixelCNN++, the output of the WaveNet model is changed from a Softmax classifier to prediction of discretized logistic mixture distribution parameters. The core principle is that the distribution v of speech waveform sample points is characterized by a mixture of K continuous logistic distributions:
v ~ Σ_{i=1}^{K} α_i · Logistic(μ_i, s_i)   (5)
where K denotes the total number of logistic distributions, α_i denotes the weight of the i-th logistic distribution, and (μ_i, s_i) denote the function parameters of the i-th logistic distribution. Once the parameters of the logistic mixture are predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function. The discretized logistic mixture can approximate a continuous distribution, agrees better with the actual distribution of the original audio data, and offers high data precision, so the waveform generation accuracy is greatly improved. Under this distributional assumption, the predicted outputs of the neural network are the parameters α_i, μ_i and s_i. Experiments show that 10 logistic mixture components can characterize the waveform-point distribution well, so the neural network only needs to predict 30 values; compared with predicting 256 categories, the memory overhead is reduced.
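The sampling step of formula (5) can be implemented by first drawing a mixture component according to the weights α_i and then drawing from the chosen logistic distribution via inverse-transform sampling; the sketch below assumes the network outputs 3K values per sample point (logit weights, means, log scales) with K = 10.

```python
# Sketch: sampling one waveform value from a predicted logistic mixture of K components.
import numpy as np

def sample_logistic_mixture(logit_w, mu, log_s, rng=np.random.default_rng()):
    """logit_w, mu, log_s: arrays of shape (K,) predicted by the network for one sample point."""
    w = np.exp(logit_w - logit_w.max())
    w = w / w.sum()                              # mixture weights alpha_i
    i = rng.choice(len(w), p=w)                  # pick a component
    u = rng.uniform(1e-5, 1.0 - 1e-5)
    x = mu[i] + np.exp(log_s[i]) * (np.log(u) - np.log(1.0 - u))   # inverse logistic CDF
    return np.clip(x, -1.0, 1.0)                 # keep within the waveform amplitude range
```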
Sixthly, experimental data and evaluation indexes are as follows:
at present, no bone conduction speech database is publicly available at home and abroad, and for this purpose, a parallel speech database of a certain type of bone conduction microphone and a traditional air conduction microphone is manufactured, and the corpus is derived from newspapers, networks and some artificially constructed phoneme balance sentences. When recording in a sound darkroom, a speaker wears two microphones at the same time, the two microphones are recorded by Cooledit software, the sampling rate is set to be 32kHz, and the precision is set to be 16bit quantization. Each speaker records 200 sentences, and the average time length of each sentence is about 3-4 s. And randomly selecting 160 sentences of each speaker as a training set, and the remaining 40 sentences as a test set, wherein the training set and the test set do not contain repeated corpora.
In the embodiment, voice data of 3 boys and 3 girls are selected to perform speaker-dependent bone conduction voice enhancement experiments, that is, data of each speaker is trained and tested respectively.
To evaluate speech quality, this embodiment selects the log-spectral distance (LSD) and the short-time objective intelligibility (STOI) as objective evaluation indexes, and adopts the mean opinion score (MOS) as the subjective evaluation index. The LSD index measures the short-time power spectrum difference between the enhanced speech obtained by different methods and the corresponding clean air conduction speech; a smaller value indicates smaller spectral distortion of the enhanced speech. STOI is an index of speech intelligibility with a score in [0, 1]; a higher value indicates higher intelligibility of the enhanced speech. A total of 10 listeners participated in the subjective MOS scoring of speech quality; 20 test sentences were randomly selected for each enhancement method, the listeners rated the quality of the test signals on a 5-point scale with a scoring interval of 0.5, and the final MOS result is the average score over all listeners.
Setting parameters:
(1) feature extraction
All speech data are first down-sampled to 8 kHz and 16 kHz and stored separately. During feature extraction, the framing window length is set to 32 ms and the frame shift to 8 ms under both sampling rates, a Hanning window is selected as the window function, and short-time Fourier analysis is performed on the 8 kHz and 16 kHz speech with 256-point and 512-point STFTs respectively, extracting the magnitude spectrum and group delay feature of the speech; this yields 129-dimensional and 257-dimensional speech features at the 8 kHz and 16 kHz sampling rates, respectively. Experiments show that applying μ-law quantization to the speech waveform data makes no difference compared with directly using the 16-bit raw data, so the raw waveform data are used directly in this embodiment.
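The feature extraction described above can be sketched as follows, with librosa-style STFT parameters (32 ms Hann window, 8 ms hop, 256-point FFT at 8 kHz or 512-point FFT at 16 kHz); the MVN statistics are assumed to be computed over the training set, and the group delay helper refers to the earlier sketch.

```python
# Sketch: magnitude-spectrum extraction and normalization for one utterance (assumed helpers).
import numpy as np
import librosa

def extract_magnitude(x, sr=8000):
    n_fft = 256 if sr == 8000 else 512           # 256-point STFT at 8 kHz, 512-point at 16 kHz
    hop = int(0.008 * sr)                        # 8 ms frame shift
    win = int(0.032 * sr)                        # 32 ms frame length, Hanning window
    S = librosa.stft(x, n_fft=n_fft, hop_length=hop, win_length=win, window='hann')
    return np.abs(S).T                           # (num_frames, n_fft // 2 + 1): 129- or 257-dim

def log_mvn(M, mean=None, std=None):
    """Log magnitude + mean-variance normalization; statistics come from the training set."""
    logM = np.log(M + 1e-8)
    if mean is None:
        mean, std = logM.mean(axis=0), logM.std(axis=0) + 1e-8
    return (logM - mean) / std, mean, std
```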
(2) BLSTM network setup
The magnitude spectrum estimation module of the bone conduction speech enhancement system comprises 3 BLSTM hidden layers in total, each with 512 hidden nodes and a ReLU activation function, and batch normalization is applied to the data after each BLSTM layer. To stabilize the training process, the actual inputs and outputs of the model are log-magnitude spectra processed by mean and variance normalization (MVN). During training, the mini-batch size is set to 8, the Adam optimizer is used, the initial learning rate is set to 0.001, and the number of iterations is 20.
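Under the settings above, the magnitude spectrum enhancement network can be sketched in PyTorch as three bidirectional LSTM layers of 512 units with batch normalization between layers and a linear projection back to the 129-dimensional spectrum; the exact layer ordering and activation placement are assumptions where the text leaves them open.

```python
# Sketch (assumed details): 3-layer BLSTM magnitude-spectrum enhancement model.
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    def __init__(self, feat_dim=129, hidden=512, num_layers=3):
        super().__init__()
        self.lstms = nn.ModuleList()
        self.norms = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            self.lstms.append(nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True))
            self.norms.append(nn.BatchNorm1d(2 * hidden))    # batch norm after each BLSTM layer
            in_dim = 2 * hidden
        self.proj = nn.Linear(2 * hidden, feat_dim)          # back to the magnitude-spectrum dimension

    def forward(self, M_x):                                  # M_x: (batch, frames, feat_dim)
        h = M_x
        for lstm, bn in zip(self.lstms, self.norms):
            h, _ = lstm(h)
            h = torch.relu(h)
            h = bn(h.transpose(1, 2)).transpose(1, 2)        # BatchNorm1d expects (B, C, T)
        return self.proj(h)
```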
(3) WaveNet network setup
In the waveform generation module of the bone conduction speech enhancement system, the overall structure of the WaveNet model contains 2 dilated convolution blocks, each comprising 8 network layers; together with one causal convolution layer and 2 feature-mapping convolution layers, the whole network comprises 19 convolution layers. The initial convolution kernel size of the causal convolution layer and the dilated convolution blocks is set to 3 x 1, the number of convolution kernels of the gate channel and the residual channel is set to 256, the mini-batch size is set to 8, and the Adam optimizer is used. Because the WaveNet model structure is deep, a warm-up learning-rate schedule is adopted to ensure training stability, with the initial learning rate set to 0.0001 and the maximum learning rate set to 0.002. To keep the magnitude spectrum close to the range of the group delay condition feature and thus ease training, the condition features fed to the network are normalized to [0, 1].
Analysis of the experimental results:
waveform synthesis quality comparison:
since WaveNet of this embodiment incorporates phase information and has a spectrum spreading function, WaveNet, which can be called as blended phase information, is denoted as E-WN-M + gd (extension WaveNet condition on magnetic connected with Group Delay information). To verify the effectiveness of the method provided by the present invention, the method is compared with a waveform synthesis method (denoted as Lws) using an inverse fourier transform method (denoted as IFFT) of a bone conduction speech phase spectrum, a waveform synthesis method (denoted as Griffin-Lim) based on Griffin-Lim phase spectrum estimation, and a waveform synthesis method (denoted as Lws) based on local weight and (localwight summs) initialized fast phase spectrum estimation, wherein speech characteristics at a sampling rate of 16kHz are input in the IFFT, Griffin-Lim, and Lws methods.
The Griffin-Lim algorithm is based on the consistency of the short-time Fourier spectrum; its core idea is to estimate a real-valued signal whose STFT magnitude is closest, in the least-squares sense, to the given magnitude spectrum, and the waveform estimation is realized through an iterative algorithm. Lws is an improved version of Griffin-Lim; its core idea is to make the short-time Fourier spectrum, i.e., the complex-valued coefficients, consistent through phase estimation, so that the continuity between time-frequency points can be taken into account directly, and to initialize the phase according to the correlation between time-frequency points, making the waveform estimation faster and more accurate.
Fig. 6 shows MOS scores of different waveform synthesis methods, table 1 shows objective evaluation index scores of different waveform synthesis methods, fig. 7 shows speech amplitude spectra obtained by different waveform synthesis methods, and fig. 8 shows group delay characteristics obtained by different waveform synthesis methods.
TABLE 1 Objective index Performance under different waveform Synthesis methods
As can be seen from Table 1, among the four synthesis methods, the STOI and LSD scores of IFFT are closest to those of the method provided by the invention. In the MOS scores of Fig. 6, however, the method provided by the invention is clearly superior to the other three synthesis methods and exceeds IFFT, the second best, by about 0.25. This is consistent with most current research results on WaveNet-based waveform generation: conventional waveform synthesis methods usually take a short-time-spectrum-like distance as the algorithm optimization index, and the existing objective speech quality indexes likewise measure differences related to the short-time spectrum, whereas the optimization target of WaveNet is the speech waveform itself, so WaveNet does not show a clear advantage on the objective indexes. Overall, compared with the MOS score of bone conduction speech, the method provided by the invention gains about 1.2 points, an improvement of about 54.5 percent, which fully illustrates its effectiveness.
FIG. 7 shows a speech amplitude spectrogram under different waveform synthesis methods in the embodiment of the present invention, where (a) - (i) are speech amplitude spectrograms under the waveform synthesis methods of bone conduction speech, air conduction speech, IFFT, Griffin-Lim, Lws, E-WN-M + GD, E-WN-M, WN-M, WN-M + GD in sequence. Fig. 8 shows a speech group delay characteristic diagram under different waveform synthesis methods in an embodiment of the present invention, where (a) - (f) are speech group delay characteristic diagrams under the waveform synthesis methods of bone conduction speech, air conduction speech, IFFT, Griffin-Lim, Lws, E-WN-M + GD in sequence.
Comparing Fig. 7(a) with Fig. 7(b) and Fig. 8(a) with Fig. 8(b), it can be seen that bone conduction speech loses a large amount of high-frequency components in both the magnitude spectrum and the group delay feature, and its low-frequency harmonic structure is noticeably stronger than that of air conduction speech, which is why bone conduction speech sounds dull and unclear.
As can be seen from Fig. 7(d) and Fig. 8(d), the short-time spectrum obtained by the Griffin-Lim method is very clean, but its recovery of the harmonic structure and high-frequency components is poorer than that of the other methods. This is because the Griffin-Lim algorithm depends heavily on the quality of the given magnitude spectrum; since bone conduction speech loses a lot of information, the magnitude spectrum estimated by the existing BLSTM model still differs somewhat from the target spectrum, so the Griffin-Lim algorithm has difficulty achieving good speech quality.
As can be seen from Fig. 7(e) and Fig. 8(e), the Lws method yields a very clear spectral structure, which may be why it performs best on the LSD index in Table 1: the Lws algorithm focuses on the consistency of the complex-valued coefficients, which corresponds exactly to the structure of the short-time spectrum. However, as the oval and circular boxes in Fig. 7(b) and Fig. 7(e) show, the consonants generated by Lws suffer from serious over-smoothing, and the inferred harmonic structure does not correspond to the true harmonic structure, so the speech contains very noticeable echo and trailing artificial mechanical sounds; the MOS score of Lws is therefore clearly lower than those of the other three methods.
As can be seen from Table 1, the IFFT method that directly uses the bone conduction speech phase is superior, in terms of the STOI objective index and the MOS score, to Griffin-Lim and Lws, which are phase-estimation-based waveform synthesis methods. This is because the magnitude spectrum estimated by the existing BLSTM is not ideal, so phase estimation algorithms that depend heavily on the quality of the given magnitude spectrum perform poorly, whereas the phase spectra of bone conduction speech and the corresponding air conduction speech, as shown in Fig. 8(a) and Fig. 8(b), are highly similar in the low-frequency part, so directly using the phase spectrum of bone conduction speech can produce better speech quality. However, as the rectangular box in Fig. 7(c) shows, IFFT has difficulty recovering the high-frequency harmonic structure compared with the other methods, and this is caused by the mismatch of the original bone conduction speech phase spectrum: the magnitude spectrum obtained by the Griffin-Lim method in Fig. 7(d) can roughly be regarded as the magnitude spectrum produced by the BLSTM, in which the harmonic structure of the high-frequency part is still clearly visible, whereas after the IFFT the harmonic structure of the magnitude spectrum is destroyed, demonstrating the damage caused by the mismatched phase spectrum.
As shown by the rectangular and oval boxes in FIG. 7(f) and the oval box in FIG. 8(f), the spectral components obtained by the method of the present invention are closer to those of the corresponding air conduction speech than those of the other waveform synthesis methods. Moreover, the conditioning input of WaveNet in the method of the present invention consists of speech features at 8 kHz, whose spectral components cover 0 to 4 kHz, while FIG. 7(f) and FIG. 8(f) show that the finally generated speech spectrum contains frequency components above 4 kHz, which demonstrates that the WaveNet of the present invention has a good spectrum extension capability.
In summary, the invention provides a WaveNet-based waveform generation method for bone conduction speech enhancement, namely a spectrum extension WaveNet model that fuses phase information: the enhanced bone conduction speech amplitude spectrum and the bone conduction speech group delay features at the low sampling rate are both used as conditioning features of WaveNet, so that the enhanced time-domain speech waveform at the high sampling rate is obtained directly. The model can effectively exploit the original bone conduction speech information and has a good spectrum extension capability. The experimental results show that, compared with methods that use the original bone conduction speech phase or estimate the phase spectrum based on Griffin-Lim, the method of the invention markedly improves the speech quality.

Claims (8)

1. A WaveNet-based bone conduction voice enhancement waveform generation method, characterized by comprising the following steps:
step 1, constructing a BLSTM-based amplitude spectrum enhancement model and a WaveNet-based waveform generation model, and introducing a cross-sampling-rate up-sampling module into the WaveNet-based waveform generation model;
step 2, respectively training the BLSTM-based amplitude spectrum enhancement model and the WaveNet-based waveform generation model, wherein the input of the BLSTM-based amplitude spectrum enhancement model is the bone conduction voice amplitude spectrum at the sampling rate s_low and its output target is the air conduction voice amplitude spectrum at the sampling rate s_low; the input of the WaveNet-based waveform generation model is the phase information of the bone conduction voice and the amplitude spectrum of the air conduction voice at the sampling rate s_low, and its output target is the air conduction voice waveform at the sampling rate s_high; wherein s_low < s_high;
step 3, sending the bone conduction voice amplitude spectrum at the sampling rate s_low to be enhanced into the trained BLSTM-based amplitude spectrum enhancement model to obtain an enhanced amplitude spectrum, and then sending the enhanced amplitude spectrum and the bone conduction voice phase information into the trained WaveNet-based waveform generation model to obtain the enhanced voice waveform at the sampling rate s_high.
2. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 1, wherein the cross-sampling-rate up-sampling module introduced in step 1 is specifically as follows:
the framing window length and the framing window shift time of the voice features are kept consistent under the two sampling rates s_low and s_high, so that the resolution of the frame-level features is 1/t_hop, where t_hop denotes the framing window shift time; meanwhile, linear interpolation is adopted as the up-sampling method.
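By way of illustration only, the following Python sketch shows one way to realize the linear-interpolation up-sampling of claim 2, stretching frame-level condition features (resolution 1/t_hop) to the sample level of the target rate s_high. The function name, array shapes and example values are illustrative assumptions, not part of the claim.

import numpy as np

def upsample_conditions(frame_feats, t_hop, s_high):
    # frame_feats: (n_frames, dim) features with frame shift t_hop seconds,
    # so their time resolution is 1 / t_hop; s_high is the target rate in Hz.
    n_frames, dim = frame_feats.shape
    frame_times = np.arange(n_frames) * t_hop            # time stamps of the frames
    n_samples = int(round(n_frames * t_hop * s_high))    # number of waveform samples covered
    sample_times = np.arange(n_samples) / s_high
    # Linearly interpolate each feature dimension independently.
    upsampled = np.stack(
        [np.interp(sample_times, frame_times, frame_feats[:, d]) for d in range(dim)],
        axis=1,
    )
    return upsampled                                      # shape (n_samples, dim)

# Example: 100 frames of 129-dim features, 10 ms hop, upsampled to 16 kHz.
feats = np.random.randn(100, 129).astype(np.float32)
cond = upsample_conditions(feats, t_hop=0.010, s_high=16000)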
3. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 1 or 2, wherein the bone conduction voice amplitude spectrum, the air conduction voice amplitude spectrum and the bone conduction voice phase information in step 2 are obtained by:
step 2.1, respectively normalizing the waveform amplitudes of the bone conduction voice x and the air conduction voice y to the range [-1, 1] to obtain the normalized bone conduction voice x' and the normalized air conduction voice y';
step 2.2, extracting the acoustic features of the bone conduction voice and the air conduction voice to obtain the bone conduction voice amplitude spectrum M_x, the air conduction voice amplitude spectrum M_y and the bone conduction voice phase information, wherein the bone conduction voice phase information is the bone conduction voice group delay feature GD_x;
step 2.3, performing log extraction and MVN normalization on the bone conduction voice amplitude spectrum M_x and the air conduction voice amplitude spectrum M_y respectively to obtain the normalized bone conduction voice amplitude spectrum M'_x and the normalized air conduction voice amplitude spectrum M'_y.
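The following Python sketch illustrates steps 2.1-2.3 of claim 3 under assumed analysis settings (512-sample frames, 256-sample hop, Hann window); the claim itself does not fix these parameters, and the random signals merely stand in for the bone conduction and air conduction recordings.

import numpy as np

def magnitude_spectrum(x, frame_len=512, hop=256):
    # Framewise magnitude spectrum |STFT| of a waveform already scaled to [-1, 1].
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))            # (n_frames, frame_len // 2 + 1)

def log_mvn(mag, eps=1e-8):
    # Log compression followed by mean-variance normalization per frequency bin.
    log_mag = np.log(mag + eps)
    mean, std = log_mag.mean(axis=0), log_mag.std(axis=0) + eps
    return (log_mag - mean) / std, mean, std              # stats kept for later de-normalization

# Step 2.1: amplitude-normalize both signals to [-1, 1] (random stand-ins here).
x = np.random.randn(16000); y = np.random.randn(16000)
x = x / np.max(np.abs(x)); y = y / np.max(np.abs(y))

# Steps 2.2-2.3: magnitude spectra M_x, M_y and their log + MVN normalized versions.
M_x, M_y = magnitude_spectrum(x), magnitude_spectrum(y)
M_x_norm, mu_x, sd_x = log_mvn(M_x)
M_y_norm, mu_y, sd_y = log_mvn(M_y)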
4. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the training of the BLSTM-based amplitude spectrum enhancement model in step 2 is specifically as follows:
step 3.1, setting the learning rate to eta_B and the number of training iterations to N_B;
step 3.2, sending the normalized bone conduction voice amplitude spectrum M'_x into the BLSTM-based amplitude spectrum enhancement model to obtain the estimated amplitude spectrum M̂'_y;
step 3.3, updating the BLSTM parameters theta_B according to the mean square error function MSE as
theta_B <- theta_B - eta_B * dL_MSE(M̂'_y, M'_y)/d(theta_B)
where L_MSE(M̂'_y, M'_y) denotes the MSE loss function error between M̂'_y and M'_y, and theta_B denotes the parameters of the BLSTM-based amplitude spectrum enhancement model;
step 3.4, iterating steps 3.2-3.3 until the maximum number of iterations N_B is reached.
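A minimal PyTorch sketch of the training loop of claim 4, assuming spectra already normalized as in claim 3; the network sizes, batch shapes and data are illustrative stand-ins, and plain SGD is used so that the parameter update matches the form theta_B <- theta_B - eta_B * dL_MSE/d(theta_B).

import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    # Bidirectional LSTM mapping bone conduction spectra to air conduction spectra.
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_bins)

    def forward(self, m_x):                   # m_x: (batch, frames, n_bins)
        h, _ = self.blstm(m_x)
        return self.proj(h)                   # estimated normalized air conduction spectrum

eta_B, N_B = 1e-3, 100                        # step 3.1: learning rate and iteration count (assumed values)
model = BLSTMEnhancer()
opt = torch.optim.SGD(model.parameters(), lr=eta_B)
criterion = nn.MSELoss()

M_x = torch.randn(4, 61, 257)                 # normalized bone conduction spectra (stand-in)
M_y = torch.randn(4, 61, 257)                 # normalized air conduction targets (stand-in)

for _ in range(N_B):                          # steps 3.2-3.4
    M_hat = model(M_x)                        # step 3.2: forward pass
    loss = criterion(M_hat, M_y)              # step 3.3: MSE between estimate and target
    opt.zero_grad(); loss.backward(); opt.step()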
5. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the training of the WaveNet-based waveform generation model in step 2 is specifically as follows:
step 4.1, performing [0, 1] normalization on the air conduction voice amplitude spectrum M_y and the bone conduction voice group delay feature GD_x respectively to obtain the normalized air conduction voice amplitude spectrum M''_y and the normalized bone conduction voice group delay feature GD''_x;
step 4.2, setting the joint condition feature H = concat[M''_y; GD''_x], i.e. the concatenation of the normalized air conduction voice amplitude spectrum and the normalized bone conduction voice group delay feature;
step 4.3, setting the learning rate to eta_W and the number of training iterations to N_W;
step 4.4, sending the joint condition feature H and the air conduction voice y into the WaveNet-based waveform generation model to obtain the estimated air conduction voice ŷ;
step 4.5, updating the WaveNet parameters theta_W according to the cross entropy as
theta_W <- theta_W - eta_W * dL_CE(ŷ, y)/d(theta_W)
where L_CE(ŷ, y) denotes the cross entropy loss function error between ŷ and y, and theta_W denotes the parameters of the WaveNet-based waveform generation model;
step 4.6, iterating steps 4.4-4.5 until convergence or until the maximum number of iterations N_W is reached.
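A compressed PyTorch sketch of the conditional training of claim 5: the joint condition H drives a small stack of dilated causal convolutions that predicts mu-law quantized waveform classes under a cross entropy loss. The tiny network, the mu-law depth of 256 levels, the optimiser and all sizes below are stand-ins for a full WaveNet, not the model of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

Q = 256                                                # mu-law quantization levels (assumed)

def mu_law_encode(y, q=Q):
    # Map a waveform in [-1, 1] to integer classes 0..q-1.
    mu = q - 1
    y = torch.clamp(y, -1.0, 1.0)
    f = torch.sign(y) * torch.log1p(mu * torch.abs(y)) / torch.log1p(torch.tensor(float(mu)))
    return ((f + 1) / 2 * mu + 0.5).long()

class TinyCondWaveNet(nn.Module):
    # Dilated causal convolutions with local conditioning (a stand-in for WaveNet).
    def __init__(self, cond_dim, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=d, padding=d) for d in dilations])
        self.cond = nn.ModuleList([nn.Conv1d(cond_dim, channels, kernel_size=1) for _ in dilations])
        self.out = nn.Conv1d(channels, Q, kernel_size=1)

    def forward(self, y_prev, h):                      # y_prev: (B, 1, T), h: (B, cond_dim, T)
        z = self.embed(y_prev)
        for conv, cond in zip(self.layers, self.cond):
            u = conv(z)[:, :, :z.size(2)]              # crop so each output only sees past inputs
            z = torch.tanh(u + cond(h)) + z            # conditioned residual layer
        return self.out(z)                             # (B, Q, T) class logits

eta_W, N_W = 1e-3, 10                                  # step 4.3 (assumed values)
T, cond_dim = 4000, 258                                # samples per clip, dim of concat[M''_y; GD''_x]
y = torch.rand(2, T) * 2 - 1                           # air conduction waveform targets (stand-in)
H = torch.randn(2, cond_dim, T)                        # joint conditions, already upsampled to T samples

net = TinyCondWaveNet(cond_dim)
opt = torch.optim.SGD(net.parameters(), lr=eta_W)
targets = mu_law_encode(y)

for _ in range(N_W):                                   # steps 4.4-4.6
    y_in = F.pad(y[:, :-1], (1, 0)).unsqueeze(1)       # teacher forcing: previous samples as input
    logits = net(y_in, H)
    loss = F.cross_entropy(logits, targets)            # step 4.5: cross entropy between estimate and y
    opt.zero_grad(); loss.backward(); opt.step()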
6. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein in step 3, sending the bone conduction voice amplitude spectrum at the sampling rate s_low to be enhanced into the trained BLSTM-based amplitude spectrum enhancement model to obtain the enhanced amplitude spectrum, and then sending the enhanced amplitude spectrum and the bone conduction voice phase information into the trained WaveNet-based waveform generation model to obtain the enhanced voice waveform at the sampling rate s_high, is specifically as follows:
step 5.1, performing waveform normalization on the bone conduction voice x to be enhanced, normalizing it to [-1, 1], and extracting the bone conduction voice amplitude spectrum M_x and the bone conduction voice group delay feature GD_x;
step 5.2, performing log extraction and MVN processing on the bone conduction voice amplitude spectrum M_x to obtain the normalized bone conduction voice amplitude spectrum M'_x, and performing [0, 1] normalization on the bone conduction voice group delay feature GD_x to obtain the normalized bone conduction voice group delay feature GD''_x;
step 5.3, sending the normalized bone conduction voice amplitude spectrum M'_x into the trained BLSTM model f_theta_B, where f_theta_B denotes the BLSTM model with parameters theta_B, and computing M̂' = f_theta_B(M'_x) to obtain the enhanced normalized bone conduction voice amplitude spectrum M̂';
step 5.4, performing MVN de-normalization and inverse log operation on the enhanced normalized bone conduction voice amplitude spectrum M̂' to obtain the estimated bone conduction voice amplitude spectrum M̂;
step 5.5, performing [0, 1] normalization on the estimated bone conduction voice amplitude spectrum M̂ to obtain M̂'';
step 5.6, setting the joint condition feature H' = concat[M̂''; GD''_x];
step 5.7, sending H' into the trained WaveNet model g_theta_W, where g_theta_W denotes the WaveNet model with parameters theta_W, and computing ŷ = g_theta_W(H') to obtain the enhanced bone conduction voice waveform ŷ.
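The data flow of steps 5.2-5.7 can be sketched with plain array operations as below; the two lambdas merely stand in for the trained BLSTM model f_theta_B and WaveNet model g_theta_W of claims 4 and 5, and all shapes and values are illustrative assumptions.

import numpy as np

def minmax_01(a, eps=1e-8):
    # [0, 1] normalization used for the condition features.
    return (a - a.min()) / (a.max() - a.min() + eps)

# Stand-ins for the trained models (identity / zeros here, only so the data flow runs end to end).
blstm_enhance = lambda m: m                        # f_theta_B: normalized spectrum -> enhanced spectrum
wavenet_generate = lambda h: np.zeros(h.shape[0])  # g_theta_W: conditions -> waveform samples

# Steps 5.1-5.2: features of the bone conduction voice to be enhanced (random stand-ins).
M_x = np.abs(np.random.randn(61, 257))             # amplitude spectrum
GD_x = np.random.randn(61, 257)                    # group delay feature
mu, sd = np.log(M_x + 1e-8).mean(0), np.log(M_x + 1e-8).std(0) + 1e-8
M_x_norm = (np.log(M_x + 1e-8) - mu) / sd          # log + MVN
GD_norm = minmax_01(GD_x)                          # [0, 1] normalization

M_hat_norm = blstm_enhance(M_x_norm)               # step 5.3: enhanced normalized spectrum
M_hat = np.exp(M_hat_norm * sd + mu)               # step 5.4: MVN de-normalization, inverse log
H = np.concatenate([minmax_01(M_hat), GD_norm], axis=1)   # steps 5.5-5.6: H' = concat[M̂''; GD''_x]
y_hat = wavenet_generate(H)                        # step 5.7: conditions (after up-sampling, claim 2) drive WaveNet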
7. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the bone conduction voice group delay feature GD_x in step 2.2 is extracted by means of the group delay function as follows:
the group delay function gamma(omega) is defined as the negative derivative of the phase spectrum with respect to frequency:
gamma(omega) = -d theta(omega)/d omega
wherein omega denotes frequency and the phase spectrum theta(omega) is a continuous function of omega;
using the relationship between the short-time Fourier coefficients, the group delay function is calculated directly from the signal by the following formula:
gamma(omega) = (X_R(omega)·Y_R(omega) + X_I(omega)·Y_I(omega)) / |X(omega)|^2
in the formula, the subscripts R and I denote the real and imaginary parts of the short-time Fourier transform, respectively, and X(omega) and Y(omega) denote the Fourier transforms of the signals x(n) and n·x(n), respectively.
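The formula of claim 7 can be evaluated per analysis frame with two FFTs, one over x(n) and one over n·x(n); a short numpy sketch follows, in which the frame length and window are assumed example values.

import numpy as np

def group_delay(frame, eps=1e-8):
    # Group delay via gamma(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2,
    # where X is the FFT of x(n) and Y is the FFT of n*x(n).
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

frame = np.hanning(512) * np.random.randn(512)     # an example windowed speech frame
gd = group_delay(frame)                            # one group delay value per frequency bin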
8. The WaveNet-based bone conduction voice enhancement waveform generation method according to claim 3, wherein the output layer of the WaveNet-based waveform generation model is changed from the Softmax classifier to parameter prediction based on a discretized logistic mixture distribution:
the distribution v of the voice waveform sample points is characterized by a mixture of several continuous logistic distribution functions:
v ~ sum_{i=1..K} alpha_i · logistic(mu_i, s_i)
wherein K denotes the total number of logistic distributions, alpha_i denotes the weight of the i-th logistic distribution, and (mu_i, s_i) denote the function parameters of the i-th logistic distribution; once the parameters of the logistic mixture distribution have been predicted, the logistic mixture distribution function is obtained, and the predicted waveform sample value is obtained by sampling from this function.
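Sampling the predicted waveform value from the logistic mixture of claim 8 amounts to picking a component i with probability alpha_i and then drawing from logistic(mu_i, s_i) by inverting its CDF; a minimal numpy sketch with made-up parameter values:

import numpy as np

def sample_logistic_mixture(alpha, mu, s, rng=np.random.default_rng()):
    # Draw one sample from sum_i alpha_i * Logistic(mu_i, s_i).
    i = rng.choice(len(alpha), p=alpha)            # choose a mixture component
    u = rng.uniform(1e-5, 1 - 1e-5)                # uniform variate, kept away from 0 and 1
    return mu[i] + s[i] * (np.log(u) - np.log(1 - u))   # inverse logistic CDF

# Example predicted parameters for K = 3 components (illustrative values only).
alpha = np.array([0.6, 0.3, 0.1])
mu = np.array([0.05, -0.2, 0.4])
s = np.array([0.02, 0.05, 0.03])
v = np.clip(sample_logistic_mixture(alpha, mu, s), -1.0, 1.0)   # predicted waveform sample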
CN201910590941.8A 2019-07-02 2019-07-02 Bone conduction voice enhancement waveform generation method based on WaveNet Active CN110648684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910590941.8A CN110648684B (en) 2019-07-02 2019-07-02 Bone conduction voice enhancement waveform generation method based on WaveNet

Publications (2)

Publication Number Publication Date
CN110648684A true CN110648684A (en) 2020-01-03
CN110648684B CN110648684B (en) 2022-02-18

Family

ID=69009424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910590941.8A Active CN110648684B (en) 2019-07-02 2019-07-02 Bone conduction voice enhancement waveform generation method based on WaveNet

Country Status (1)

Country Link
CN (1) CN110648684B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105023580A (en) * 2015-06-25 2015-11-04 中国人民解放军理工大学 Unsupervised noise estimation and speech enhancement method based on separable deep automatic encoding technology
US20180336880A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
CN107886967A (en) * 2017-11-18 2018-04-06 中国人民解放军陆军工程大学 A kind of bone conduction sound enhancement method of depth bidirectional gate recurrent neural network
CN108986834A (en) * 2018-08-22 2018-12-11 中国人民解放军陆军工程大学 The blind Enhancement Method of bone conduction voice based on codec framework and recurrent neural network
CN109817239A (en) * 2018-12-24 2019-05-28 龙马智芯(珠海横琴)科技有限公司 The noise-reduction method and device of voice

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DARIO RETHAGE ET AL: "A WaveNet for Speech Denoising", ICASSP 2018 *
ZHANG Xiongwei et al.: "Research status and prospects of blind enhancement technology for bone-conducted microphone speech", Journal of Data Acquisition and Processing *
FAN Cunhang et al.: "An end-to-end speech separation method based on convolutional neural networks", Journal of Signal Processing *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583904A (en) * 2020-05-13 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111583904B (en) * 2020-05-13 2021-11-19 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112562710A (en) * 2020-11-27 2021-03-26 天津大学 Stepped voice enhancement method based on deep learning
CN112562710B (en) * 2020-11-27 2022-09-30 天津大学 Stepped voice enhancement method based on deep learning
CN112599145A (en) * 2020-12-07 2021-04-02 天津大学 Bone conduction voice enhancement method based on generation of countermeasure network
CN113823314A (en) * 2021-08-12 2021-12-21 荣耀终端有限公司 Voice processing method and electronic equipment
CN114548221A (en) * 2022-01-17 2022-05-27 苏州大学 Generation type data enhancement method and system for small sample unbalanced voice database

Also Published As

Publication number Publication date
CN110648684B (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN110648684B (en) Bone conduction voice enhancement waveform generation method based on WaveNet
CN109767778B (en) Bi-LSTM and WaveNet fused voice conversion method
US7792672B2 (en) Method and system for the quick conversion of a voice signal
Wali et al. Generative adversarial networks for speech processing: A review
Tachibana et al. An investigation of noise shaping with perceptual weighting for WaveNet-based speech generation
CN1815552B (en) Frequency spectrum modelling and voice reinforcing method based on line spectrum frequency and its interorder differential parameter
Huang et al. Refined wavenet vocoder for variational autoencoder based voice conversion
JP4382808B2 (en) Method for analyzing fundamental frequency information, and voice conversion method and system implementing this analysis method
Matsubara et al. Full-band LPCNet: A real-time neural vocoder for 48 kHz audio with a CPU
Ben Othmane et al. Enhancement of esophageal speech obtained by a voice conversion technique using time dilated fourier cepstra
Katsir et al. Speech bandwidth extension based on speech phonetic content and speaker vocal tract shape estimation
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Katsir et al. Evaluation of a speech bandwidth extension algorithm based on vocal tract shape estimation
Al-Radhi et al. Continuous wavelet vocoder-based decomposition of parametric speech waveform synthesis
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Lian et al. Whisper to normal speech based on deep neural networks with MCC and F0 features
Othmane et al. Enhancement of esophageal speech using voice conversion techniques
Ou et al. Probabilistic acoustic tube: a probabilistic generative model of speech for speech analysis/synthesis
Narendra et al. Parameterization of excitation signal for improving the quality of HMM-based speech synthesis system
Akhter et al. An analysis of performance evaluation metrics for voice conversion models
Xie et al. Pitch transformation in neural network based voice conversion
CN114913844A (en) Broadcast language identification method for pitch normalization reconstruction
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
Lv et al. Objective evaluation method of broadcasting vocal timbre based on feature selection
Wang et al. Beijing opera synthesis based on straight algorithm and deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant