CN114023343A - Voice conversion method based on semi-supervised feature learning - Google Patents
Voice conversion method based on semi-supervised feature learning
- Publication number
- CN114023343A (application CN202111277502.5A)
- Authority
- CN
- China
- Prior art keywords
- encoder
- voice
- network
- speaker
- voice data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The invention provides a voice conversion method based on semi-supervised feature learning. First, the voice data in the training set are preprocessed with the open-source audio library librosa to obtain an expanded set of acoustic feature segments, and acoustic features characterizing speaker identity are extracted in advance by an encoder trained with the generalized end-to-end loss. Next, a voice conversion network is constructed, consisting of a variational autoencoder, a decoder and a post-network, and is trained on the constructed data set, with the network loss constrained by a mean square error under the supervision information. Finally, the trained network processes the source and target speech data to be converted and outputs the converted speech. Building on the variational autoencoder structure, the invention introduces semi-supervised feature learning, extracts speaker identity information accurately, solves voice conversion among multiple speakers on non-parallel corpus data, and generalizes well.
Description
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a voice conversion method based on semi-supervised feature learning.
Background
Voice conversion aims to convert the timbre of a source speaker into that of a target speaker while keeping the linguistic content of the source speech unchanged. A typical voice conversion system works as follows: given one utterance from the target speaker and one from the source speaker, the system automatically extracts the linguistic content from the source data, extracts the speaker embedding (i.e., the speaker's timbre information) from the target data, and recombines the two to generate new target speech. In voice conversion, data recorded by different speakers uttering different content constitute non-parallel corpus data. Because such data is cheap, easy to collect and close to real application scenarios, it is widely used in voice conversion. Since the linguistic content of the target and source speakers differs in non-parallel data, the source timbre must be converted to the target timbre while the content is preserved. Two challenges therefore arise when converting with non-parallel corpus data: first, a model built on a non-parallel corpus can hardly learn an accurate mapping from the source speaker to the target speaker, which degrades the conversion quality; second, for conversion among multiple speakers, if the speech of the speaker under test never appeared in the training set, the converted speech scores poorly in both naturalness and similarity. These two points are the urgent problems of non-parallel voice conversion.
To learn the mapping between the source and target speakers accurately, Takuhiro et al. first introduced the CycleGAN (cycle-consistent generative adversarial network) method into the voice conversion task in "T. Kaneko and H. Kameoka, CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks, European Signal Processing Conference (EUSIPCO), pp. 2100-2104, 2018". The cycle-consistent generative adversarial network alleviates the difficulty of mapping between the source and target domains. The method uses the adversarial loss and the cycle-consistency loss as criteria to convert the source speaker's timbre into the target speaker's.
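The cycle-consistency criterion at the core of this line of work can be sketched in a few lines of numpy; the forward and backward mappings below are illustrative stand-ins, not the actual CycleGAN-VC conversion networks.

```python
import numpy as np

def cycle_consistency_loss(x, g_xy, g_yx):
    """L1 cycle-consistency term: mapping a source feature to the target
    domain and back should reproduce the original input."""
    return float(np.mean(np.abs(g_yx(g_xy(x)) - x)))

# A pair of mutually inverse toy mappings yields zero cycle loss.
x = np.linspace(-1.0, 1.0, 8)
loss = cycle_consistency_loss(x, lambda v: 2.0 * v, lambda v: v / 2.0)
```

In training, this term is minimized jointly with the adversarial loss so that the learned mappings stay content-preserving.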
To achieve voice conversion between arbitrary speakers, Chou et al. proposed a variational-autoencoder-based method in "Ju-chieh Chou, Cheng-chieh Yeh, and Hung-yi Lee, One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization, in Proc. Interspeech, 2019". Taking the mean square error as the criterion, the method adopts a variational autoencoder structure: two encoders at the input end extract, through unsupervised learning, the linguistic content and the speaker embedding from speech respectively, and a decoder at the output end combines them to generate new speech, i.e., the voice of the target speaker. Because the encoders learn through training to separate the speech content from the speaker embedding, given target and source speech data the content encoder automatically extracts the content representation from the source speech, the speaker encoder extracts the speaker embedding from the target speech, and the decoder combines the two representations into new speech data. This method extends to voice conversion among multiple speakers.
The two methods above solve part of the non-parallel voice conversion problem but remain limited. The first can learn the source-to-target mapping accurately, but it usually handles conversion between only two speakers, its training is complex and prone to vanishing gradients, and it does not extend to conversion among multiple speakers. In the second, although the variational autoencoder is structurally simple and easy to implement, the content representation it extracts still contains a small amount of speaker embedding information, so the similarity of the conversion result is poor.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a voice conversion method based on semi-supervised feature learning. First, the voice data in the training set are preprocessed with the open-source audio library librosa to obtain an expanded set of acoustic feature segments, and acoustic features characterizing speaker identity are extracted in advance by an encoder trained with the generalized end-to-end loss. Next, a voice conversion network consisting of a variational autoencoder, a decoder and a post-network is constructed and trained on the constructed data set, with the network loss constrained by a mean square error under the supervision information. Finally, the trained network processes the source and target speech to be converted and outputs the converted speech. Building on the variational autoencoder structure, the invention introduces semi-supervised feature learning, extracts speaker identity information accurately, solves voice conversion among multiple speakers on non-parallel corpus data, and generalizes well.
A voice conversion method based on semi-supervised feature learning is characterized by comprising the following steps:
step 1: preprocess each piece of voice data in the training set with the open-source audio library librosa: read in the voice data; apply pre-emphasis, windowing and framing to each piece; take the short-time Fourier transform of each frame to convert it from a time-domain signal to a frequency-domain signal; screen the frequency-domain data to retain speech segments of the required length; all preprocessed speech segments in the training set form the acoustic feature segment set;
randomly select fewer than half of the speakers in the training set, input their voice data to an encoder designed with the generalized end-to-end loss, and extract acoustic features characterizing speaker identity; the encoder consists of a long short-term memory (LSTM) network layer and a linear layer, where the input, output and hidden dimensions of the LSTM layer are 80, 256 and 256 respectively, the input and output dimensions of the linear layer are both 256, the activation function of the linear layer is the ReLU function, and the encoder is trained under the end-to-end loss constraint;
step 2: construct a voice conversion network comprising a variational autoencoder, a decoder and a post-network, where the variational autoencoder has two branches, a speaker encoder and a content encoder; the speaker encoder consists of two LSTM layers with 768 units each and extracts speaker identity information from the input speech; the content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units each and one instance normalization layer, and extracts the speech content representation from the input speech; the decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units each, and produces new speech data from the speaker identity information extracted by the speaker encoder and the content representation extracted by the content encoder; the post-network consists of five 5×1 convolutional layers, extracts a residual signal from the decoder output and adds it to the decoder output to obtain the reconstructed speech data;
the loss function of the voice conversion network is set as follows:
L=Lcon+Lspe+Lreco (1)
wherein L represents the total loss of the network, LconRepresents content encoder loss, LspeIndicating a loss of speaker identity information, LrecoExpressing the self-weight loss, and respectively calculating according to the following formulas:
wherein, E [. C]Which represents the mathematical expectation of the calculation,which represents the output of the decoder, and,is expressed as inputOutput of the temporal content encoder, ZcIn the representationOutput of the capacity encoder, ZsiRepresents the output of the speaker's encoder,representing the speaker identity information extracted by a generalized end-to-end method, i representing the speaker serial number, x representing the initial input voice data of the network, Es(x) Representing the output of the speaker's encoder when the input is x, D (E)s(x),Zc) Denotes a reaction of Es(x) And ZcInputting the output after the decoder;
the specific processing procedure of the example normalization layer is as follows:
first, the mean value of each channel represented by the speech content is calculated as follows:
wherein, mucMean of the c-th channel, W the array dimension of each channel, Mc[ω]Represents the ω -th element in the c-th channel; c is 1,2, …, C and C represents the number of channels;
then, the variance of each channel is calculated as follows:
wherein σcThe variance of the channel c is shown, epsilon represents an adjusting parameter, and the value range is (0, 1);
finally, the channel array M is expressed by the following formulacEach element in (a) is normalized:
wherein, M'c[ω]Represents the value of the ω -th element in the normalized c-th channel; c1, 2, …, C, ω 1,2, …, W;
step 3: set the network parameters: the batch size for data reading is 32, the initial learning rate is 0.001, and the number of training iterations is 500,000; input the speech segments in the acoustic feature segment set obtained in step 1 into the voice conversion network constructed in step 2 for training, obtaining the trained voice conversion network;
step 4: input the source and target speech data to be converted into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target speech, the content encoder extracts the speech content representation from the source speech, and the decoder and post-network output the converted speech data.
The beneficial effects of the invention are as follows. Building on the variational autoencoder structure, it introduces semi-supervised feature learning and combines it with the simple structure of the variational autoencoder. The speaker embedding extracted by the model pre-trained under the generalized end-to-end loss serves as the supervision information for the speaker encoder in the voice conversion network, with the mean square error loss as the constraint, so the speaker encoder can extract speaker identity information accurately. Because the content encoder adds an instance normalization layer to remove speaker information, it extracts more accurate speech content information. The invention solves voice conversion among multiple speakers on non-parallel corpus data; using the pre-extracted speaker timbre information to supervise the features extracted by the speaker encoder lets the encoder extract timbre information accurately and complete conversion with higher similarity, extends to timbre conversion among multiple speakers, and converts well even for speakers absent from the training data. The trained network can change the speaker identity independently and can also perform conversion when speaker data is scarce.
Drawings
FIG. 1 is a flow chart of the speech conversion method based on semi-supervised feature learning according to the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and embodiments; the scope of the invention includes, but is not limited to, the following embodiments. As shown in FIG. 1, the speech conversion method based on semi-supervised feature learning provided by the invention is implemented as follows:
1. extracting acoustic features of source and target speech data
Each piece of voice data in the training set is preprocessed with the open-source audio library librosa: the voice data is read in; each piece undergoes pre-emphasis, windowing and framing; each frame is converted from a time-domain signal to a frequency-domain signal by the short-time Fourier transform; the frequency-domain data is screened to retain speech segments of the required length; and all preprocessed segments in the training set form the acoustic feature segment set. The library librosa is described in detail in "Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, Oriol Nieto, librosa: Audio and Music Signal Analysis in Python, Proceedings of the 14th Python in Science Conference, 2015".
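As an illustration of this preprocessing, the sketch below reproduces the pipeline with plain numpy instead of librosa (librosa's `effects.preemphasis` and `stft` cover the same ground); the frame length, hop size, pre-emphasis coefficient and minimum segment length are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

def preprocess(wave, frame_len=400, hop=160, min_frames=128):
    """Pre-emphasis, framing, Hann windowing, magnitude STFT, and length
    screening. A numpy approximation of the librosa-based preprocessing;
    the parameter values here are illustrative choices.
    """
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1]
    emphasized = np.append(wave[0], wave[1:] - 0.97 * wave[:-1])
    # Framing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing + short-time Fourier transform (time -> frequency domain)
    window = np.hanning(frame_len)
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))
    # Screening: keep only utterances long enough to yield a full segment
    return spectrum if n_frames >= min_frames else None

# A 2-second waveform at 16 kHz passes the length screen
spec = preprocess(np.random.randn(32000))
```

Segments that survive the screen are what populate the acoustic feature segment set.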
No more than half of the speakers are randomly selected from the training set, and an encoder is designed with the generalized end-to-end loss method: it consists of an LSTM layer with input dimension 80, output dimension 256 and hidden dimension 256, followed by a linear layer with input and output dimension 256. The activation function of the linear layer is the ReLU function. The selected speakers' data is fed to the encoder, which, under the end-to-end loss constraint, extracts speaker embeddings that characterize speaker identity; in subsequent processing these acoustic features serve as the supervision information for the speaker encoder, guiding it to produce correct speaker identity information. The generalized end-to-end loss method is described in detail in "Wan L., Wang Q., Papir A., and Moreno I.L., Generalized End-to-End Loss for Speaker Verification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883, 2018".
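The core of the generalized end-to-end loss can be sketched as follows: each utterance embedding is pulled toward its own speaker's centroid and pushed away from other speakers' centroids through a softmax over scaled cosine similarities. This is a minimal numpy sketch; the scale `w` and offset `b` are learned parameters in the original method but fixed here, and the LSTM encoder that produces the embeddings is omitted.

```python
import numpy as np

def ge2e_loss(emb, w=10.0, b=-5.0):
    """Generalized end-to-end (GE2E) softmax loss.

    emb: array of shape (N_speakers, M_utterances, D); embeddings are
    L2-normalized below. Returns the mean per-utterance loss.
    """
    N, M, D = emb.shape
    emb = emb / np.linalg.norm(emb, axis=2, keepdims=True)
    centroids = emb.mean(axis=1)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    loss = 0.0
    for j in range(N):
        for i in range(M):
            # The true speaker's centroid excludes the utterance itself
            c_self = (emb[j].sum(axis=0) - emb[j, i]) / (M - 1)
            c_self = c_self / np.linalg.norm(c_self)
            sims = w * centroids @ emb[j, i] + b      # similarity to all centroids
            sims[j] = w * c_self @ emb[j, i] + b      # self-excluding own centroid
            # Softmax cross-entropy against the true speaker j
            loss += np.log(np.exp(sims).sum()) - sims[j]
    return loss / (N * M)
```

Embeddings that cluster tightly by speaker drive this loss toward zero, which is what makes the resulting vectors usable as speaker identity supervision.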
2. Building a voice conversion network
The voice conversion network comprises a variational autoencoder, a decoder and a post-network. The variational autoencoder has two branches, a speaker encoder and a content encoder. The speaker encoder consists of two LSTM layers with 768 units each; it extracts speaker identity information from the input speech, and its output is a 256-dimensional vector. The content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units each and one instance normalization layer; it extracts the speech content representation from the input speech, and its output is a 64-dimensional vector. The decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units each; the speaker identity information extracted by the speaker encoder and the content representation extracted by the content encoder are fed to the decoder to obtain new speech data. The post-network consists of five 5×1 convolutional layers; it extracts a residual signal from the decoder output and adds it to the decoder output to enrich the spectral details and generate a high-quality spectrogram, i.e., the reconstructed speech data.
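The layer configuration just described can be collected into a single sketch. The field names are illustrative; the numbers are taken from the text, and reading the 64-dimensional content output as two directions × 32 BiLSTM units is an inference, not something the text states.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VCNetConfig:
    """Layer configuration of the voice conversion network as described
    in the text; field names are illustrative."""
    # Speaker encoder: 2 LSTM layers, 768 units each -> 256-d embedding
    speaker_lstm_layers: int = 2
    speaker_lstm_units: int = 768
    speaker_embedding_dim: int = 256
    # Content encoder: 3 conv (5x1) + 2 BiLSTM (32 units) + instance norm
    content_conv: List[Tuple[int, int]] = field(default_factory=lambda: [(5, 1)] * 3)
    content_bilstm_layers: int = 2
    content_bilstm_units: int = 32
    content_dim: int = 64  # assumed: 2 directions x 32 units
    # Decoder: 3 conv (5x1) + 3 LSTM (1024 units)
    decoder_conv: List[Tuple[int, int]] = field(default_factory=lambda: [(5, 1)] * 3)
    decoder_lstm_layers: int = 3
    decoder_lstm_units: int = 1024
    # Post-network: 5 conv (5x1); its residual output is added back
    # to the decoder output to form the reconstructed speech
    postnet_conv: List[Tuple[int, int]] = field(default_factory=lambda: [(5, 1)] * 5)

cfg = VCNetConfig()
```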
The loss function of the voice conversion network is set as follows:

L = L_con + L_spe + L_reco (8)

where L denotes the total network loss, L_con the content encoder loss, L_spe the speaker identity loss, and L_reco the self-reconstruction loss, computed respectively as:

L_con = E[ || E_c(x̂) − Z_c ||₂² ] (9)

L_spe = E[ || Z_si − Z̃_si ||₂² ] (10)

L_reco = E[ || x̂ − x ||₂² ], with x̂ = D(E_s(x), Z_c) (11)

where E[·] denotes the mathematical expectation, x̂ the output of the decoder, E_c(x̂) the output of the content encoder when its input is x̂, Z_c the output of the content encoder, Z_si the output of the speaker encoder, and Z̃_si the speaker identity information extracted by the generalized end-to-end method, which serves as the supervision information of the speaker encoder; the mean square error between Z_si and Z̃_si is computed so as to maximize their similarity, with i the speaker index. x denotes the initial input speech data of the network, E_s(x) the output of the speaker encoder when the input is x, and D(E_s(x), Z_c) the output of the decoder when E_s(x) and Z_c are fed to it.
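Assuming all three terms take the mean-square-error form that the surrounding text prescribes for the supervision constraint (the exact norms used in the patent's formulas are not fully legible), the total loss can be sketched as:

```python
import numpy as np

def total_loss(x, x_hat, z_c, z_c_hat, z_s, z_s_tilde):
    """Total loss L = L_con + L_spe + L_reco, all in MSE form (assumed).

    x         : network input speech features
    x_hat     : decoder/post-network output
    z_c       : content encoder output for x
    z_c_hat   : content encoder output when fed x_hat
    z_s       : speaker encoder output
    z_s_tilde : pre-extracted GE2E speaker embedding (supervision)
    """
    l_con = np.mean((z_c_hat - z_c) ** 2)    # content consistency
    l_spe = np.mean((z_s - z_s_tilde) ** 2)  # semi-supervised speaker term
    l_reco = np.mean((x_hat - x) ** 2)       # self-reconstruction
    return l_con + l_spe + l_reco
```

Perfect reconstruction with matching content and speaker representations drives the loss to zero; any mismatch in one branch contributes additively.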
The instance normalization layer normalizes the content extracted by the content encoder to remove the small amount of speaker information it contains. The specific procedure is as follows:

first, compute the mean of each channel of the speech content representation:

μ_c = (1/W) Σ_{ω=1}^{W} M_c[ω] (12)

where μ_c denotes the mean of the c-th channel, W the array dimension of each channel, and M_c[ω] the ω-th element of the c-th channel; c = 1, 2, …, C, where C is the number of channels;

then, compute the standard deviation of each channel:

σ_c = sqrt( (1/W) Σ_{ω=1}^{W} (M_c[ω] − μ_c)² + ε ) (13)

where σ_c denotes the standard deviation of the c-th channel and ε is an adjusting parameter with value range (0, 1);

finally, normalize each element of the channel array M_c:

M′_c[ω] = (M_c[ω] − μ_c) / σ_c (14)

where M′_c[ω] denotes the value of the ω-th element of the c-th channel after normalization; c = 1, 2, …, C, ω = 1, 2, …, W.
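This normalization writes directly in numpy; `eps` plays the role of the adjusting parameter ε:

```python
import numpy as np

def instance_norm(M, eps=1e-5):
    """Instance normalization over the W (position) axis of each channel:
    subtract the per-channel mean and divide by the per-channel standard
    deviation regularized by eps. M has shape (C, W)."""
    mu = M.mean(axis=1, keepdims=True)                           # mu_c
    sigma = np.sqrt(((M - mu) ** 2).mean(axis=1, keepdims=True) + eps)
    return (M - mu) / sigma                                      # M'_c[w]
```

After normalization every channel has zero mean and (up to the eps regularization) unit standard deviation, which strips the channel-wise statistics carrying speaker information.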
3. Network training
The network parameters are set as follows: the batch size for data reading is 32, the initial learning rate is 0.001, and the number of training iterations is 500,000. The speech segments in the acoustic feature segment set obtained in step 1 are input into the voice conversion network constructed in step 2 for training, yielding the trained voice conversion network.
4. Speech conversion
The source and target speech data to be converted are input into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker features from the target speech, the content encoder extracts the speech content representation of the source speech, and the decoder and post-network output the converted speech data.
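The conversion data flow can be sketched end to end with random linear maps standing in for the trained encoders, decoder and post-network. Everything here is a placeholder; only the wiring mirrors the text: content from the source, a time-averaged speaker embedding from the target, and a residual post-network addition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in networks: random linear maps used only to show the data flow.
W_c = rng.standard_normal((80, 64))        # content encoder stub
W_s = rng.standard_normal((80, 256))       # speaker encoder stub
W_d = rng.standard_normal((64 + 256, 80))  # decoder stub
W_p = rng.standard_normal((80, 80))        # post-network stub

def convert(source_frames, target_frames):
    """Conversion step: content from the source, speaker identity from
    the target, recombined by the decoder, refined by the post-network."""
    z_c = source_frames @ W_c                    # per-frame content
    z_s = (target_frames @ W_s).mean(axis=0)     # utterance-level speaker embedding
    z = np.concatenate([z_c, np.tile(z_s, (len(z_c), 1))], axis=1)
    decoded = z @ W_d
    return decoded + decoded @ W_p               # residual post-network addition

out = convert(rng.standard_normal((100, 80)), rng.standard_normal((60, 80)))
```

The output keeps the source utterance's frame count while the speaker embedding, broadcast over time, carries the target timbre.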
To verify the effectiveness of the method, experiments were conducted on the open-source speech data set AISHELL-3, which contains 174 speakers with about 300 utterances each, each lasting 3-8 seconds. The 174 speakers are split into 75 whose timbre information is extracted in advance, 75 whose speaker information is extracted during training, and 24 whose speech is used for testing. The objective evaluation index of the conversion result is the MCD (Mel-Cepstral Distortion), which measures the distortion between the speech before and after conversion. The subjective evaluation uses the MOS (mean opinion score): recruited volunteers score the results, first for similarity on a 1-5 scale (1 completely dissimilar, 5 completely similar) and second for naturalness on a 1-5 scale (1 poor naturalness, 5 excellent sound quality). These two evaluations are important indicators of the performance of a voice conversion system. The comparison methods are the CycleGAN (cycle-consistent generative adversarial network) method and the Seq2seqVC (variational autoencoder) method; the evaluation results are shown in Table 1.
TABLE 1
| Method | Naturalness score | Similarity score | MCD (dB) |
| --- | --- | --- | --- |
| CycleGAN method | 3.8 | 3.9 | 3.3 |
| Seq2seqVC method | 3.6 | 3.5 | 2.8 |
| Method of the invention | 3.9 | 4.1 | 3.2 |
Claims (1)
1. A voice conversion method based on semi-supervised feature learning is characterized by comprising the following steps:
step 1: preprocess each piece of voice data in the training set with the open-source audio library librosa: read in the voice data; apply pre-emphasis, windowing and framing to each piece; take the short-time Fourier transform of each frame to convert it from a time-domain signal to a frequency-domain signal; screen the frequency-domain data to retain speech segments of the required length; all preprocessed speech segments in the training set form the acoustic feature segment set;
randomly select fewer than half of the speakers in the training set, input their voice data to an encoder designed with the generalized end-to-end loss, and extract acoustic features characterizing speaker identity; the encoder consists of a long short-term memory (LSTM) network layer and a linear layer, where the input, output and hidden dimensions of the LSTM layer are 80, 256 and 256 respectively, the input and output dimensions of the linear layer are both 256, the activation function of the linear layer is the ReLU function, and the encoder is trained under the end-to-end loss constraint;
step 2: construct a voice conversion network comprising a variational autoencoder, a decoder and a post-network, where the variational autoencoder has two branches, a speaker encoder and a content encoder; the speaker encoder consists of two LSTM layers with 768 units each and extracts speaker identity information from the input speech; the content encoder consists of three 5×1 convolutional layers, two bidirectional LSTM layers with 32 units each and one instance normalization layer, and extracts the speech content representation from the input speech; the decoder consists of three 5×1 convolutional layers and three LSTM layers with 1024 units each, and produces new speech data from the speaker identity information extracted by the speaker encoder and the content representation extracted by the content encoder; the post-network consists of five 5×1 convolutional layers, extracts a residual signal from the decoder output and adds it to the decoder output to obtain the reconstructed speech data;
the loss function of the voice conversion network is set as follows:
L=Lcon+Lspe+Lreco (1)
wherein L represents the total loss of the network, LconRepresents content encoder loss, LspeIndicating a loss of speaker identity information, LrecoExpressing the self-weight loss, and respectively calculating according to the following formulas:
wherein, E [. C]Which represents the mathematical expectation of the calculation,which represents the output of the decoder, and,is expressed as inputOutput of time content encoderOut, ZcRepresenting the output of a content encoder, ZsiRepresents the output of the speaker's encoder,representing the speaker identity information extracted by a generalized end-to-end method, i representing the speaker serial number, x representing the initial input voice data of the network, Es(x) Representing the output of the speaker's encoder when the input is x, D (E)s(x),Zc) Denotes a reaction of Es(x) And ZcInputting the output after the decoder;
the specific processing procedure of the example normalization layer is as follows:
first, the mean value of each channel represented by the speech content is calculated as follows:
wherein, mucMean of the c-th channel, W the array dimension of each channel, Mc[ω]Represents the ω -th element in the c-th channel; c is 1,2, …, C and C represents the number of channels;
then, the variance of each channel is calculated as follows:
wherein σcThe variance of the channel c is shown, epsilon represents an adjusting parameter, and the value range is (0, 1);
finally, the channel array M is expressed by the following formulacEach element in (a) is normalized:
wherein M isc′[ω]Represents the second in the c channel after normalizationOmega element values; c1, 2, …, C, ω 1,2, …, W;
and step 3: setting the network parameters, including a batch size of 32 for data reading, an initial learning rate of 0.001 and 500,000 network training iterations; the voice segments in the acoustic feature segment set obtained in step 1 are input into the voice conversion network constructed in step 2 for training, to obtain a trained voice conversion network;
and step 4: inputting the source voice data and the target voice data to be converted into the trained voice conversion network obtained in step 3; the speaker encoder extracts the speaker characteristics from the target voice data, the content encoder extracts the voice content representation of the source voice data, and the converted voice data is output through the decoder and the post-network.
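Step 4's data flow can be sketched with every trained module replaced by a hypothetical stand-in callable; the names, shapes, and arithmetic below are illustrative only, not the patent's networks:

```python
import numpy as np

rng = np.random.default_rng(7)

def speaker_encoder(mel):   # stand-in: target speaker -> identity vector
    return mel.mean(axis=1)[:48]

def content_encoder(mel):   # stand-in: source speech -> content representation
    return mel[:32, :]

def decoder(spk, content):  # stand-in: combine identity + content -> voice frames
    T = content.shape[1]
    return np.vstack([content, np.tile(spk[:, None], (1, T))])

def postnet(y):             # stand-in: residual refinement of decoder output
    return 0.1 * (y.mean(axis=0, keepdims=True) - y)

source_mel = rng.standard_normal((80, 120))  # speech whose content is kept
target_mel = rng.standard_normal((80, 90))   # speech whose speaker identity is taken

spk = speaker_encoder(target_mel)        # identity from the TARGET speaker
content = content_encoder(source_mel)    # content from the SOURCE utterance
decoded = decoder(spk, content)
converted = decoded + postnet(decoded)   # decoder output plus post-network residual
```

The key point of step 4 is the cross-pairing: identity comes from the target utterance while content comes from the source utterance, so the converted output keeps the source's words in the target's voice.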
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111277502.5A CN114023343B (en) | 2021-10-30 | 2021-10-30 | Voice conversion method based on semi-supervised feature learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114023343A true CN114023343A (en) | 2022-02-08 |
CN114023343B CN114023343B (en) | 2024-04-30 |
Family
ID=80059050
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114283824A (en) * | 2022-03-02 | 2022-04-05 | 清华大学 | Voice conversion method and device based on cyclic loss |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111712874A (en) * | 2019-10-31 | 2020-09-25 | Alipay (Hangzhou) Information Technology Co., Ltd. | System and method for determining sound characteristics |
WO2020205233A1 (en) * | 2019-03-29 | 2020-10-08 | Google Llc | Direct speech-to-speech translation via machine learning |
Non-Patent Citations (1)
Title |
---|
HUANG, Guojie; JIN, Hui; YU, Yibiao: "Enhanced variational autoencoder for non-parallel corpus voice conversion", Signal Processing, no. 10, 25 October 2018 (2018-10-25) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||