CN112562704B

CN112562704B - Frequency division topological anti-noise voice conversion method based on BLSTM

Info

Publication number: CN112562704B
Application number: CN202011288173.XA
Authority: CN
Inventors: 孙蒙; 苗晓孔; 张雄伟; 曹铁勇; 郑昌艳; 李莉
Original assignee: Army Engineering University of PLA
Current assignee: Army Engineering University of PLA
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2023-08-18
Anticipated expiration: 2040-11-17
Also published as: CN112562704A

Abstract

The invention discloses a frequency division topological anti-noise voice conversion method based on BLSTM, which comprises the following specific steps: filtering the source voice and the target voice, and extracting voice characteristic parameters, wherein the voice characteristic parameters comprise fundamental frequency, sound channel spectrum and non-periodic components; carrying out dynamic time warping alignment on the extracted sound channel spectrums of the source voice and the target voice; respectively inputting the aligned source voice channel spectrum and the aligned target voice channel spectrum into a frequency division converted BLSTM network model for training to obtain a corresponding characteristic conversion network; constructing a global statistical variance consistency filtering model; after filtering the voice to be converted, extracting characteristic parameters of the voice to be converted, and preprocessing; and carrying out parameterized voice synthesis on the characteristic parameters of the preprocessed voice to be converted to generate final converted voice. The invention designs a brand new fusion rule, fuses the frequency division converted parts, and further obtains the sound channel spectrum which is closer to the target, thereby improving the similarity of voice conversion.

Description

Frequency division topological anti-noise voice conversion method based on BLSTM

Technical Field

The invention belongs to a voice signal processing technology, and particularly relates to a frequency division topological anti-noise voice conversion method based on BLSTM.

Background

Speech conversion refers to changing the speech personality of one speaker (source speaker) to have the speech personality of another speaker (target speaker), which is a speech-to-speech technique. Speech conversion can be divided into two categories: the voice conversion of the non-specific person is realized only by changing the voice of the source speaker, so that the voice of the source speaker is used for enabling the opposite party to hear the scene without self-identity; another type is person-specific voice conversion, which is a scenario in which the voice of a source speaker is converted into the voice of a specific target person, for masquerading the identity of the target person. The conversion of specific human voice meets the technical requirements of personalized voice generation and is one of the main hot spots of current research.

Speaker-specific speech conversion can also be divided into: at present, a system with higher conversion quality and similarity is generally a conversion method based on parallel corpus, and the current research status of the technology is briefly summarized as follows:

the earliest speech conversion dates back to the fifth sixty of the last century, from the most classical gaussian mixture model (Gaussian Mixture Model, GMM) to the now efficient representation of deep neural networks etc. models that represent high dimensional sequence data, such as: a full convolutional neural network (Fully Convolutional Network, FCN), a generative antagonism network (Kaneko t., kameoka H, hiramatsu K, kashino K, sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks, interspeech 2017;Kaneko T,Kameoka H,Hojo N,Kashino K,Generative adversarial network-based post-filter for statistical parameter synthesis, ica ssp 2017), a bidirectional long short time memory network (Huang Z, xu W and yu.k. bidirect LSTM-CRF models for Sequence table, available, https:// arxiv.org/abs/1508.01991,2015), and the like. With the continuous holding of international events, namely voice conversion challenge (Voice Conversion Challenge, VCC), the voice conversion method is continuously improved in recent years, and the quality and similarity of the converted voice are further improved. Although these speech conversion schemes are reasonable and effective, a better conversion effect is obtained, most of the speech conversion methods are performed under experimental conditions, so that the larger the training sample data size is, the purer the training corpus is, the better the obtained converted speech effect is, and the conversion effect of the model is limited for small sample data and noisy speech data, and the quality of the converted speech is also greatly reduced.

Disclosure of Invention

The invention aims to provide a frequency division topological anti-noise voice conversion method based on BLSTM.

The technical solution for realizing the purpose of the invention is as follows: a frequency division topological anti-noise voice conversion method based on BLSTM comprises the following specific steps:

step 1: filtering the source voice and the target voice, and extracting voice characteristic parameters, wherein the voice characteristic parameters comprise fundamental frequency, sound channel spectrum and non-periodic components; carrying out dynamic time warping alignment on the extracted sound channel spectrums of the source voice and the target voice;

step 2: respectively inputting the aligned source voice channel spectrum and the aligned target voice channel spectrum into a frequency division converted BLSTM network model for training to obtain a corresponding characteristic conversion network;

step 3, constructing a global statistical variance consistency filtering model;

step 4, after filtering the voice to be converted, extracting characteristic parameters of the voice to be converted, and preprocessing;

step 5: and carrying out parameterized voice synthesis on the characteristic parameters of the preprocessed voice to be converted to generate final converted voice.

Preferably, the extracted vocal tract spectrum features are mel-frequency cepstrum.

Preferably, the frequency-division-input-conversion BLSTM network model includes two BLSTM networks with identical structures, each of the two BLSTM networks is composed of 3 hidden layers, and the hidden node numbers of the three layers are respectively: 128 256, 128, wherein one BLSTM network has no dropout layer and the dropout layer parameter of the other BLSTM network is 0.5.

Preferably, the specific method for constructing the global statistical variance consistency filter model is as follows:

step 3-1, calculating the mean value and variance of each dimension of mel cepstrum of the cepstrum coefficient of the target sentence;

step 3-2, calculating the mean value and variance of the mel cepstrum of each dimension of all frames of the sentence obtained by converting the BLSTM network model of the source voice through frequency division conversion;

step 3-3: constructing a global statistical variance consistency primary filter, wherein the global statistical variance consistency primary filter specifically comprises the following components:

wherein ,σ² _tar Meier cepstrum mean value representing each dimension of target voiceVectors, sigma of the constitution ² _con Represents the mean value +.about.L of the Meier cepstrum of each dimension of the sentence obtained after the conversion of the BLSTM network model of the source voice through frequency division conversion>The vector formed, y represents the mel-cepstrum of the sentence to be converted, < >>The method comprises the steps of (1) setting a vector formed by the mean value of the mel cepstrum of each dimension of all frames of a source voice to be converted sentence in a test stage;

step 3-4, setting adjustment parameters to obtain an adjusted global statistical variance consistency filter, which specifically comprises the following steps:

wherein ,the method is characterized in that the Mel cepstrum obtained after filtering is performed, y represents the Mel cepstrum of the sentence to be converted, and alpha is an adjustment parameter.

Preferably, the specific formula for calculating the mean and variance of each one-dimensional mel-frequency cepstrum of the target voice cepstrum coefficient is as follows:

wherein N represents the number of target sentences in the training stage, M represents the number of frames contained in each sentence, T represents the dimension of the mel-frequency cepstrum, i represents the index of the mel-frequency cepstrum dimension, and />Represents the mean and variance of the mel-cepstrum, x, of each dimension, obtained from all frames of all training sentences, respectively _i Representing the i-th wimeyer spectrum.

Preferably, the specific method for preprocessing the characteristic parameters of the voice to be converted comprises the following steps:

the non-periodic components remain unchanged;

performing log-linear transformation on the fundamental frequency;

stretching and dividing the channel spectrum to obtain a high-frequency part and a full-frequency band part, converting the high-frequency part and the full-frequency band part by using a frequency division converted BLSTM network model, dividing the converted full-frequency band part again to obtain a high-frequency channel spectrum and a low-frequency channel spectrum, fusing the channel spectrums of different frequency bands by using a fusion model, and sending the fused converted channel spectrum characteristics to a global statistical variance consistency filtering model obtained in the step 3 for filtering to obtain the channel spectrum characteristics after conversion and filtering.

Preferably, the fusion model is specifically:

MC _{high fusion} ＝α*mcep _{High 1} +(1-α)*mcep _{High 2}

MC _{Low 1} ＝mcep _{Low 1}

MC _{Full fusion} ＝[mcep _{Low 1} +MC _{High fusion} ]

Wherein a is a fusion coefficient, mcep _{High 1} High frequency channel spectrum, mcep, which is a full band channel spectrum _{Low 1} Low frequency channel spectrum, mcep, which is a full band channel spectrum _{High 2} Is the high frequency part of the channel spectrum.

Preferably, the fusion coefficient is specifically:

in the formula, mcep _{High height} and mcep_{Low and low} And the frequency division statistics are used for obtaining the parameter information of each part of frequency bands.

Compared with the prior art, the invention has the remarkable advantages that: 1) According to the invention, two filtering modules with brand new designs are introduced, training data is filtered before feature extraction, global statistical variance consistency filtering is carried out on sound channel spectrums after feature conversion, noise is jointly restrained through time-frequency filtering, and the quality of converted voice is improved; 2) According to the method, the dimensions of the channel spectrum characteristics are expanded, the high-dimensional channel spectrum is extracted, the BLSTM network is improved, the frequency division conversion and fusion of the high-dimensional channel spectrum are realized by designing two different BLSTM networks, the problems of over fitting or under fitting and the like caused by small sample data are solved, and the conversion precision and the adaptability of the model are improved; 3) The invention designs a brand new fusion rule, uses the fusion rule obtained by statistics of the training stage in the conversion process, fuses the frequency division converted part, and further obtains the vocal tract spectrum which is closer to the target, thereby improving the similarity of voice conversion.

The present invention will be described in further detail with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a frequency-divided topology anti-noise speech conversion method based on BLSTM.

Fig. 2 is a flow chart of spectral broadening, frequency division, and weight fusion.

Fig. 3 is a schematic diagram of a frequency-divided converted BLSTM network.

Detailed Description

As shown in FIG. 1, the anti-noise voice conversion method for the frequency division topological spectrum based on the BLSTM comprises the following specific steps:

step 1, filtering the trained source voice and target voice, removing part of noise, extracting voice characteristic parameters, and entering the channel spectrums of the extracted source voice and target voiceDynamic time warping alignment and statistics of log fundamental frequencies (i.e., log f) of source and target voices, respectively ₀ ) Mean u and variance sigma of ² The method comprises the steps of (1) calculating linear conversion of logarithmic fundamental frequency in the step (4);

the voice characteristic parameters comprise fundamental frequency, sound channel spectrum (the invention mainly adopts mel cepstrum) and non-periodic components;

step 2, sending the aligned source voice and target voice sound channel spectrums into the frequency division converted BLSTM network model training of the figure 3 to obtain a corresponding characteristic conversion network;

the frequency-divided converted BLSTM network is composed of two BLSTM networks of identical structure, namely BLSTM1 and BLSTM2. Both BLSTM networks consist of 3 hidden layers, the hidden node numbers of the three layers being: 128, 256, 128.BLSTM1 has no dropout layer and BLSTM2 has a dropout layer parameter of 0.5.

Step 3, counting the mean variance of the target voice channel spectrum characteristics and the mean variance of the converted channel spectrum characteristics, and constructing a global statistical variance consistency filtering model; the global statistical variance consistency filtering model is used for obtaining the sound channel spectrum of the voice to be converted, and the specific implementation process is as follows:

step 3-1, calculating the mean value and variance of each dimension of mel cepstrum of the target voice cepstrum coefficient, wherein the calculation formula is as follows:

wherein N represents the number of target sentences in the training stage, M represents the number of frames contained in each sentence, T represents the dimension of the mel-frequency cepstrum, i represents the index of the mel-frequency cepstrum dimension, and />The mel-cepstrum mean and variance of each dimension obtained from all frames of all training sentences are respectively represented, and tar is an abbreviation of the conversion target.

Step 3-2, calculating the Meier cepstrum average value of each dimension of all frames of the sentence obtained by converting the BLSTM network model of the source voice in the training data through frequency division conversion by using the formula (1)Sum of variances->And vector formed by the mean value of the Meier cepstrum of each dimension of all frames of the source voice to be converted in the test stage>Vector sigma formed by sum of variances _y ² Where con is an abbreviation for conversion of overt.

Step 3-3, constructing a primary filter with global statistical variance consistency to obtain primary filtered data

In the formula (2), sigma ² _tar Meier cepstrum mean of each dimension of sentences representing target speakers in training setVectors, sigma of the constitution ² _con Represents the mean value +.about.L of the Meier cepstrum of each dimension of the sentence obtained by the conversion of the BLSTM network model of the frequency division conversion of the source voice in the training set>The vector, y, represents the mel-cepstrum of the sentence to be converted.

Step 3-4, setting a parameter alpha (alpha is usually set according to the actual effect, and is set to be 0.2 in the experiment) according to a formula (3), and adjusting to obtain a global statistical variance consistency filter:

wherein ,is the last mel-cepstrum obtained after filtering, which will be used to generate the converted speech.

Step 4, after filtering the voice to be converted, extracting characteristic parameters of the voice to be converted, wherein the characteristic parameters of the voice to be converted comprise logarithmic fundamental frequency, sound channel spectrum and non-periodic components, and the characteristic parameters of the voice to be converted are preprocessed in the following specific preprocessing modes:

the non-periodic components remain unchanged;

the fundamental frequency F0 is subjected to logarithmic linear transformation, and the formula is shown in formula (4):

p _t ^(Y) and p_t ^(X) Respectively represent the converted logF ₀ And the original logF ₀ ,u ^(X) and u^(Y) Representing the mean value, sigma, of the logarithmic fundamental frequencies of the source and target voices counted in step 1 ^(X) and σ^(Y) Is the standard deviation of the logarithmic fundamental frequencies of the source and target voices counted in step 1.

Stretching and dividing the channel spectrum to obtain a high-frequency part and a full-frequency band part, converting the high-frequency part and the full-frequency band part by using a frequency division converted BLSTM network model, dividing the converted full-frequency band part again to obtain a high-frequency channel spectrum and a low-frequency channel spectrum, fusing the channel spectrums of different frequency bands by using a fusion model, and sending the fused converted channel spectrum characteristics to a global statistical variance consistency filtering model obtained in the step 3 for filtering to obtain converted and filtered channel spectrum characteristics;

in a further embodiment, the specific method for widening and dividing the channel spectrum of the voice to be converted includes:

in the traditional voice conversion model, in order to compress the model and reduce the data training time, the vocal tract spectrum is converted into 24-dimensional or 39-dimensional mel-frequency cepstrum, the voice conversion under the low dimensionality is tried, and the conversion effect is difficult to resist partial noise interference while ensuring the voice conversion quality. Therefore, in the process of obtaining parameters of sound channels such as the mel-frequency cepstrum, the parameters are set to directly obtain the 129-dimensional high-dimensional mel-frequency cepstrum after widening. The high-dimensional information not only reserves more information parameters, but also is beneficial to solving the problems of data overfitting and the like in the training of the corpus of small samples.

The method comprises the steps of stretching and dividing a sound channel spectrum of voice to be converted to obtain a high-frequency part and a full-frequency band part, converting the high-frequency part and the full-frequency band part by using a BLSTM network model of frequency division conversion, dividing the converted full-frequency band part again to obtain a high-frequency sound channel spectrum and a low-frequency sound channel spectrum, and dividing the full-frequency sound channel spectrum into the high-frequency sound channel spectrum and the low-frequency sound channel spectrum by setting half-folding and dividing of a middle dimension.

In a further embodiment, as shown in fig. 2, the channel spectrums of the different frequency bands are fused by a fusion model, and specific fusion parameters involved in the fusion process are extracted as follows:

extracting fusion coefficients

Counting high-frequency channel spectrum and low-frequency channel spectrum parts in the training stage channel spectrum information, and then calculating according to a formula (5) to obtain a fusion coefficient a

Fusion coefficient, mcep, obtained in formula a above _{High height} and mcep_{Low and low} The frequency division statistics is carried out to obtain the parameter information of each part of frequency bands.

Conversion weight fusion

And (3) carrying out weight fusion according to a formula (6) to obtain a final converted channel spectrum:

MC _{high fusion} ＝α*mcep _{High 1} +(1-α)*mcep _{High 2}

MC _{Low 1} ＝mcep _{Low 1}

MC _{Full fusion} ＝[mcep _{Low 1} +MC _{High fusion} ] (6)

Fig. 3 is a specific internal configuration of the frequency division weight fusion network. The BLSTM1 and the BLSTM2 are mainly different in that the BLSTM2 is added with a drop layer, so that the phenomenon of overfitting caused by partial information of a high frequency band is prevented.

And 5, carrying out parameterized voice synthesis on the non-periodic components preprocessed in the step 4, the channel spectrum subjected to conversion and filtering, the fundamental frequency subjected to log linear conversion and other parameters, and generating final converted voice.

Claims

1. A frequency division topological anti-noise voice conversion method based on BLSTM is characterized by comprising the following specific steps:

step 3, constructing a global statistical variance consistency filtering model, wherein the specific method comprises the following steps:

wherein ,the method is characterized in that the Mel cepstrum obtained after filtering is performed, y represents the Mel cepstrum of the sentence to be converted, and alpha is an adjustment parameter;

2. The method for anti-noise conversion of a frequency-divided-topology based on BLSTM of claim 1, wherein the extracted vocal tract spectrum feature is mel-frequency cepstrum.

3. The frequency-divided topological anti-noise voice conversion method based on BLSTM according to claim 1, wherein the frequency-divided-conversion-entering BLSTM network model comprises two BLSTM networks with the same structure, each of the two BLSTM networks is composed of 3 hidden layers, and the hidden node numbers of the three layers are respectively: 128 256, 128, wherein one BLSTM network has no dropout layer and the dropout layer parameter of the other BLSTM network is 0.5.

4. The anti-noise voice conversion method of frequency-divided topological spectrum based on BLSTM according to claim 1, wherein the specific formula for calculating the mean and variance of each dimension of mel cepstrum of the target voice cepstrum coefficient is:

5. The anti-noise voice conversion method based on frequency division topology of BLSTM according to claim 1, wherein the specific method for preprocessing the characteristic parameters of the voice to be converted is as follows:

the non-periodic components remain unchanged;

performing log-linear transformation on the fundamental frequency;

6. The method for anti-noise voice conversion of frequency-divided topology based on BLSTM according to claim 5, wherein the fusion model is specifically:

MC _{high fusion} ＝α*mcep _{High 1} +(1-α)*mcep _{High 2}

MC _{Low 1} ＝mcep _{Low 1}

MC _{Full fusion} ＝[mcep _{Low 1} +MC _{High fusion} ]

7. The method for anti-noise voice conversion of frequency-divided topology based on BLSTM of claim 6, wherein the fusion coefficient is specifically: