CN112562704A

CN112562704A - BLSTM-based frequency division spectrum expansion anti-noise voice conversion method

Info

Publication number: CN112562704A
Application number: CN202011288173.XA
Authority: CN
Inventors: 孙蒙; 苗晓孔; 张雄伟; 曹铁勇; 郑昌艳; 李莉
Original assignee: Army Engineering University of PLA
Current assignee: Army Engineering University of PLA
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-03-26
Anticipated expiration: 2040-11-17
Also published as: CN112562704B

Abstract

The invention discloses a BLSTM-based frequency division spectrum expansion anti-noise voice conversion method, which comprises the following specific steps: filtering source voice and target voice, and extracting voice characteristic parameters, wherein the voice characteristic parameters comprise fundamental frequency, vocal tract spectrum and aperiodic components; performing dynamic time warping alignment on the extracted sound channel spectrums of the source speech and the target speech; respectively inputting the aligned source speech and target speech sound track spectrums into a BLSTM network model of frequency division conversion for training to obtain corresponding feature conversion networks; constructing a global statistical variance consistency filtering model; after filtering the voice to be converted, extracting characteristic parameters of the voice to be converted, and preprocessing the characteristic parameters; and carrying out parameterized speech synthesis on the feature parameters of the preprocessed speech to be converted to generate the final converted speech. The invention designs a brand new fusion rule, and fuses the parts after frequency division conversion, thereby obtaining a sound channel spectrum closer to a target, and further improving the similarity of voice conversion.

Description

BLSTM-based frequency division spectrum expansion anti-noise voice conversion method

Technical Field

The invention belongs to a voice signal processing technology, and particularly relates to a frequency division spectrum expansion anti-noise voice conversion method based on BLSTM.

Background

Voice conversion, which is a speech-to-speech technique, refers to changing the speech personality of one speaker (source speaker) to have the speech personality of another speaker (target speaker). Speech conversion can be divided into two categories: one is non-specific person voice conversion, only the voice of the source speaker needs to be changed, and the method is used for ensuring that the opposite side cannot hear the scene of the own identity; the other type is specific person voice conversion, which is to convert the voice of a source speaker into the voice of a specific target person for a scene masquerading the identity of the target person. The voice conversion of a specific person meets the technical requirements of personalized voice generation, and is one of the main hotspots of the current research.

The speech conversion for a particular speaker can also be divided into: the speech conversion of parallel corpora and the conversion of non-parallel corpora, at present, systems with higher conversion quality and similarity are generally based on the parallel corpora conversion method, and the current research situation of the technology is briefly summarized as follows:

speech conversion dates back to the fifth and sixty years of the last century, and has been improved from the most classical Gaussian Mixture Model (GMM) to the models such as deep neural network (deep neural network) which can now effectively represent high-dimensional sequence data, such as: full Convolutional neural Networks (FCN), Generative countermeasure Networks (Kaneko T., Kameoka H, Hiramatsu K, Kashino K, Sequence-to-Sequence Voice Conversion with knowledge Metal raw Using genetic Adversal Networks, Interspeed 2017; Kaneko T, Kameoka H, Hojo N, Kashino K, general adaptive Network-based post-filter for static parameter synthesis, ICA2017), bidirectional short and long memory Networks (Huang Z, Xu W and Yu. K. binary LSmodules for Sequence retrieval SSP/Avage: 1508.01991,2015/1508.01991,2015, etc.). With the continuous establishment of international events, Voice Conversion Challenge (VCC), in recent years, the Voice Conversion method is continuously improved, and the quality and similarity of the converted Voice are further improved. Although these voice conversion schemes are reasonable and effective, and a better conversion effect is obtained, most voice conversion methods are performed under experimental conditions, and have serious dependence on the size and quality of training data, the more the training sample data amount is, the purer the training corpus is, the better the obtained conversion voice effect is, and for small sample data and noisy voice data, the conversion effect of the model is limited, and the quality of the conversion voice is also greatly reduced.

Disclosure of Invention

The invention aims to provide a frequency division spectrum expansion anti-noise voice conversion method based on BLSTM.

The technical solution for realizing the purpose of the invention is as follows: a frequency division spectrum expansion anti-noise voice conversion method based on BLSTM comprises the following specific steps:

step 1: filtering source voice and target voice, and extracting voice characteristic parameters, wherein the voice characteristic parameters comprise fundamental frequency, vocal tract spectrum and aperiodic components; performing dynamic time warping alignment on the extracted sound channel spectrums of the source speech and the target speech;

step 2: respectively inputting the aligned source speech and target speech sound track spectrums into a BLSTM network model of frequency division conversion for training to obtain corresponding feature conversion networks;

step 3, constructing a global statistical variance consistency filtering model;

step 4, after filtering the voice to be converted, extracting characteristic parameters of the voice to be converted, and preprocessing the characteristic parameters;

and 5: and carrying out parameterized speech synthesis on the feature parameters of the preprocessed speech to be converted to generate the final converted speech.

Preferably, the extracted vocal tract spectral feature is a mel-frequency cepstrum.

Preferably, the BLSTM network model of the input-frequency-division conversion includes two BLSTM networks with the same structure, each of the two BLSTM networks is composed of 3 hidden layers, and the number of hidden nodes in the three layers is respectively: 128, 256, 128, one of the BLSTM networks has no dropout layer, and the other BLSTM network has a dropout layer parameter of 0.5.

Preferably, the specific method for constructing the global statistical variance consistency filtering model comprises the following steps:

step 3-1, calculating the mean and variance of each one-dimensional Mel cepstrum of the cepstrum coefficient of the target statement;

step 3-2, calculating the mean and variance of each dimensionality Mel cepstrum of all frames of the statement obtained after the source speech is converted by the BLSTM network model of frequency division conversion;

step 3-3: constructing a global statistical variance consistency primary filter, wherein the global statistical variance consistency primary filter specifically comprises the following steps:

wherein ,σ² _tarRepresenting the mean of the Mel cepstrum of each dimension of the target speech

Constructed vector, σ² _conRepresenting the Mel cepstrum mean value of each dimension of the sentence obtained after the BLSTM network model conversion of the source speech through frequency division conversion

The constructed vector, y represents the Mel cepstrum of the sentence to be converted,

a vector formed by the mean value of the Mel cepstrum of each dimension of all frames of the source speech to-be-converted sentence at the testing stage;

step 3-4, setting adjustment parameters to obtain an adjusted global statistical variance consistency filter, specifically:

wherein ,

is the Mel cepstrum obtained after filtering, y represents the Mel cepstrum of the sentence to be converted, and alpha is the adjusting parameter.

Preferably, the specific formula for calculating the mean and variance of each one-dimensional mel cepstrum of the target speech cepstrum coefficients is as follows:

wherein N represents the number of target sentences in the training stage, M represents the number of frames contained in each sentence, T represents the dimension of the Mel cepstrum, i represents the index of the dimension of the Mel cepstrum,

and

respectively representing the Merr cepstrum mean and variance, x, of each dimension obtained from all frames of all training sentences_iRepresenting the ith vimel spectrum.

Preferably, the specific method for preprocessing the feature parameters of the speech to be converted is as follows:

the non-periodic components remain unchanged;

carrying out logarithmic linear transformation on the fundamental frequency;

widening and frequency dividing the sound channel spectrum to obtain a high-frequency part and a full-band part, converting the high-frequency part and the full-band part by using a BLSTM network model for frequency division conversion, carrying out frequency division on the full-band part obtained by conversion again to obtain a high-frequency sound channel spectrum and a low-frequency sound channel spectrum, fusing the obtained sound channel spectrums of different frequency bands through a fusion model, and sending the fused converted sound channel spectrum characteristics into a global statistical variance consistency filtering model obtained in the step 3 for filtering to obtain the sound channel spectrum characteristics after the conversion and the filtering.

Preferably, the fusion model is specifically:

MC_{high fusion}＝α*mcep_{Height 1}+(1-α)*mcep_{Height 2}

MC_{Low 1}＝mcep_{Low 1}

MC_{Full fusion}＝[mcep_{Low 1}+MC_{High fusion}]

Wherein a is the fusion coefficient, mcep_{Height 1}High frequency channel spectrum, mecep, being a full band channel spectrum_{Low 1}Low frequency channel spectrum, mecep, being a full band channel spectrum_{Height 2}The high frequency part of the vocal tract spectrum.

Preferably, the fusion coefficient is specifically:

in the formula, mecep_{Height of} and mcep_{Is low in}The frequency band parameter information of each part is obtained after frequency division statistics.

Compared with the prior art, the invention has the following remarkable advantages: 1) the invention introduces two newly designed filtering modules, the training data is filtered before the characteristic extraction, the sound channel spectrum is filtered after the characteristic conversion, the noise is suppressed through the time-frequency filtering, and the quality of the converted voice is improved; 2) according to the method, dimensionality expansion is performed on the vocal tract spectrum characteristics, a high-dimensional vocal tract spectrum is extracted, the network of the BLSTM is improved, frequency division conversion and fusion of the high-dimensional vocal tract spectrum are realized by designing two different BLSTM networks, the problems of over-fitting or under-fitting and the like caused by small sample data are solved, and the conversion precision of a model and the adaptability of the model are improved; 3) the invention designs a brand new fusion rule, uses the fusion rule obtained by statistics in the training stage in the conversion process, and fuses the parts after frequency division conversion, thereby obtaining a sound channel spectrum closer to a target, and further improving the similarity of voice conversion.

The present invention is described in further detail below with reference to the attached drawings.

Drawings

FIG. 1 is a flow chart of a BLSTM-based voice conversion method with spectrum extension and noise immunity.

Fig. 2 is a flow chart of spectral broadening, frequency division, and weight fusion.

Fig. 3 is a schematic diagram of a frequency division converted BLSTM network architecture.

Detailed Description

As shown in fig. 1, a frequency division spectrum expansion anti-noise speech conversion method based on BLSTM includes the following steps:

step 1, filtering the trained source speech and target speech, extracting speech characteristic parameters after removing partial noise, performing dynamic time warping alignment on the extracted sound channel spectrums of the source speech and the target speech, and respectively counting logarithmic fundamental frequencies (logF) of the source speech and the target speech₀) Mean u and variance σ of²For step 4, calculating linear conversion of logarithmic fundamental frequency;

the voice characteristic parameters comprise fundamental frequency, vocal tract spectrum (the invention mainly adopts Mel cepstrum) and non-periodic components;

step 2, sending the aligned source speech and target speech sound track spectrums into the BLSTM network model of frequency division conversion in the figure 3 for training to obtain corresponding feature conversion networks;

the frequency-division converted BLSTM network is composed of two BLSTM networks with the same structure, namely BLSTM1 and BLSTM 2. Two BLSTM networks are constituteed by 3 hidden layers, and the hidden node number of three-layer is respectively: 128, 256, 128. BLSTM1 has no dropout layer, and BLSTM2 has a dropout layer parameter of 0.5.

Step 3, counting the mean variance of the target voice sound channel spectrum characteristics and the mean variance of the converted sound channel spectrum characteristics, and constructing a global statistical variance consistency filtering model; the global statistical variance consistency filtering model is used for obtaining a sound channel spectrum of the voice to be converted, and the specific implementation process is as follows:

step 3-1, calculating the mean value and the variance of each one-dimensional Mel cepstrum of the target voice cepstrum coefficient, wherein the calculation formula is as follows:

wherein N represents the number of target sentences in the training phase, M represents the number of frames contained in each sentence, T represents the dimension of the Mel cepstrum, i represents the index of the dimension of the Mel cepstrum,

and

respectively, the mel cepstrum mean and variance of each dimension obtained from all frames of all training sentences, and tar is an abbreviation of the conversion target.

Step 3-2, calculating the mean value of each dimension Mel cepstrum of all frames of sentences obtained after the BLSTM network model conversion of source speech in the training data through frequency division conversion by using formula (1)

Sum variance

And a vector formed by all dimensionality Mel cepstrum mean values of all frames of the source speech to-be-converted sentence in the testing stage

Vector sigma formed by the sum variance_y ²Where con is an abbreviation for convert.

Step 3-3, constructing a primary filter with global statistical variance consistency to obtain primary filtered data

In the formula (2), σ² _tarMel cepstrum mean of each dimension of sentence representing target speaker in training set

Constructed vector, σ² _conRepresenting each dimension Mel cepstrum mean value of sentences obtained by BLSTM network model conversion of source speech in training set through frequency division conversion

The constructed vector, y, represents the mel cepstrum of the sentence to be converted.

Step 3-4, setting a parameter α (α is usually set according to an actual effect, and is set to 0.2 in the experiment) according to the formula (3), and adjusting to obtain a global statistical variance consistency filter:

wherein ,

is the resulting mel cepstrum after the final filtering, and this parameter will be used to generate the converted speech.

Step 4, after filtering the voice to be converted, extracting the characteristic parameters of the voice to be converted, wherein the characteristic parameters of the voice to be converted comprise logarithmic fundamental frequency, vocal tract spectrum and non-periodic components, and the characteristic parameters of the voice to be converted are preprocessed in the specific preprocessing modes that:

the non-periodic components remain unchanged;

the fundamental frequency F0 is logarithmically and linearly transformed, and its formula is shown in equation (4):

p_t ^(Y) and p_t ^(X)Respectively represent the logF after conversion₀And original logF₀,u^(X) and u^(Y)Means, σ, representing the logarithmic fundamental frequencies of the source and target speech as counted in step 1^(X) and σ^(Y)Is the standard deviation of the logarithmic fundamental frequencies of the source speech and the target speech as counted in step 1.

Widening and frequency dividing the sound channel spectrum to obtain a high-frequency part and a full-frequency band part, converting the high-frequency part and the full-frequency band part by using a BLSTM network model for frequency division conversion, frequency dividing the full-frequency band part obtained by conversion again to obtain a high-frequency sound channel spectrum and a low-frequency sound channel spectrum, fusing the obtained sound channel spectrums of different frequency bands through a fusion model, and sending the fused converted sound channel spectrum characteristics into the global statistical variance consistency filtering model obtained in the step 3 for filtering to obtain the sound channel spectrum characteristics after the conversion and the filtering;

in a further embodiment, the specific method for widening and dividing the sound channel spectrum of the speech to be converted is as follows:

in the traditional voice conversion model, in order to compress the model and reduce the data training time, the vocal tract spectrum is converted into 24-dimensional or 39-dimensional Mel cepstrum, the invention tries the voice conversion under low dimension, and the conversion effect is difficult to resist partial noise interference while ensuring the voice conversion quality. Therefore, in the process of acquiring the vocal tract spectrum parameters such as the Mel cepstrum, parameters are set to directly obtain the widened 129-dimensional high-dimensional Mel cepstrum. The high-dimensional information not only keeps more information parameters, but also is beneficial to solving the problems of data overfitting and the like during corpus training of the small samples.

The method comprises the steps of widening and frequency dividing a vocal tract spectrum of voice to be converted to obtain a high-frequency part and a full-frequency band part, converting the high-frequency part and the full-frequency band part by using a BLSTM network model for frequency dividing conversion, and frequency dividing the full-frequency band part obtained by conversion again to obtain a high-frequency vocal tract spectrum and a low-frequency vocal tract spectrum.

In a further embodiment, as shown in fig. 2, the obtained vocal tract spectrums of different frequency bands are fused by a fusion model, and the specific fusion parameters involved in the fusion process are extracted as follows:

extracting fusion coefficients

Counting the high-frequency vocal tract spectrum and the low-frequency vocal tract spectrum in the vocal tract spectrum information in the training stage, and then calculating according to a formula (5) to obtain a fusion coefficient a

The fusion coefficient obtained in a in the above formula, mcep_{Height of} and mcep_{Is low in}Namely the parameter information of each partial frequency band obtained after frequency division statistics.

Transition weight fusion

And (3) performing weight fusion according to the formula (6) to obtain a final converted sound channel spectrum:

MC_{high fusion}＝α*mcep_{Height 1}+(1-α)*mcep_{Height 2}

MC_{Low 1}＝mcep_{Low 1}

MC_{Full fusion}＝[mcep_{Low 1}+MC_{High fusion}] (6)

Fig. 3 is a specific internal structure of the frequency-division weight fusion network. The main difference between BLSTM1 and BLSTM2 is that BLSTM2 adds a drop layer, preventing the occurrence of the overfitting phenomenon due to less high-band portion information.

And 5, carrying out parameterized speech synthesis on the non-periodic components preprocessed in the step 4, the sound channel spectrum after conversion and filtering, the fundamental frequency after logarithmic linear conversion and other parameters to generate final converted speech.

Claims

1. A frequency division spectrum expansion anti-noise voice conversion method based on BLSTM is characterized by comprising the following specific steps:

step 3, constructing a global statistical variance consistency filtering model;

2. The BLSTM-based frequency division extension spectrum anti-noise speech conversion method according to claim 1, wherein the extracted vocal tract spectrum features are Mel cepstrum.

3. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 1, wherein the BLSTM network model of the input frequency division conversion comprises two BLSTM networks with the same structure, each of the two BLSTM networks consists of 3 hidden layers, and the number of hidden nodes of the three layers is respectively: 128, 256, 128, one of the BLSTM networks has no dropout layer, and the other BLSTM network has a dropout layer parameter of 0.5.

4. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 1, wherein the specific method for constructing the global statistical variance consistency filtering model is as follows:

wherein ,

5. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 4, wherein the specific formula for calculating the mean and variance of the Mel cepstrum of each dimension of the target voice cepstrum coefficient is as follows:

wherein, N represents the number of target sentences in the training phase, M represents the number of frames contained in each sentence, and T represents the number of framesThe dimension of the mel-frequency cepstrum, i represents the index of the mel-frequency cepstrum dimension,

and

6. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 1, wherein the specific method for preprocessing the characteristic parameters of the voice to be converted is as follows:

the non-periodic components remain unchanged;

carrying out logarithmic linear transformation on the fundamental frequency;

7. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 6, wherein the fusion model is specifically:

MC_{high fusion}＝α*mcep_{Height 1}+(1-α)*mcep_{Height 2}

MC_{Low 1}＝mcep_{Low 1}

MC_{Full fusion}＝[mcep_{Low 1}+MC_{High fusion}]

Wherein a is the fusion coefficient, mcep_{Height 1}High frequency channel spectrum, mecep, being a full band channel spectrum_{Low 1}For full frequency band vocal tract spectrumLow frequency channel spectrum of, mecep_{Height 2}The high frequency part of the vocal tract spectrum.

8. The BLSTM-based frequency division spectrum anti-noise voice conversion method according to claim 6, wherein the fusion coefficients are specifically: