CN112562686B - Zero-sample voice conversion corpus preprocessing method using neural network

Info

Publication number
CN112562686B
CN112562686B
Authority
CN
China
Prior art keywords
speaker
generator
identity
output
vector
Prior art date
Legal status
Expired - Fee Related
Application number
CN202011433778.3A
Other languages
Chinese (zh)
Other versions
CN112562686A (en)
Inventor
Wei Jianguo (魏建国)
Geng Taijia (更太加)
Current Assignee
Qinghai Nationalities University
Original Assignee
Qinghai Nationalities University
Priority date
Filing date
Publication date
Application filed by Qinghai Nationalities University filed Critical Qinghai Nationalities University
Priority to CN202011433778.3A
Publication of CN112562686A
Application granted
Publication of CN112562686B


Classifications

    • G10L 15/26 Speech to text systems
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G10L 17/04 Training, enrolment or model building
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a zero-sample voice conversion corpus preprocessing method using a neural network, which improves the effectiveness of speaker identity coding vectors in the new field of zero-sample voice conversion and improves the quality of the converted speech to a certain extent. The speaker identity codes are preprocessed by a neural network: a speaker identity encoder extracts the speaker identity coding vectors from the corpus, the extracted vectors are input into a generator together with content codes extracted by a content encoder, and the generator produces adjusted speaker identity coding vectors. The result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.

Description

Zero-sample voice conversion corpus preprocessing method using neural network
Technical Field
The application relates to the technical field of speech processing, in particular to a zero-sample voice conversion corpus preprocessing method using a neural network.
Background
The most central application of voice conversion technology is to change the timbre of speech so that it sounds as if the target speaker is speaking.
In recent years, the field of voice conversion has developed rapidly, moving from parallel systems that require parallel corpora and manual alignment to non-parallel systems that require neither. The advantage of a non-parallel system is that its demands on the training corpus are low: training is flexible and data are easy to obtain, which has broadened the application range of voice conversion technology.
However, conventional non-parallel systems can only realize voice conversion under one-to-one, one-to-many, many-to-one, or many-to-many conditions; that is, the source and target speakers in the conversion task must be speakers in the training set. For a speaker not in the training set, using that speaker as the source or target requires retraining the model with the corresponding voice data. The neural network models used for voice conversion are generally complex, so retraining inevitably costs a great deal of time and effort, and the training parameters must often be tuned repeatedly before conversion works properly. Zero-sample voice conversion has therefore become a new research direction in the voice conversion field in recent years.
Zero sample means that the source or target speaker in the conversion task need not be in the training set but can be any speaker; that is, zero-sample voice conversion allows a single conversion model to perform voice conversion between arbitrary speakers, breaking the limitation to speakers in the training set.
One core idea of zero-sample voice conversion is to use a speaker identity coding vector to represent the speaker's identity label, but the following problems arise:
1) when very little corpus is available for the source or target speaker, the resulting speaker identity coding vector is not necessarily reliable;
2) when speaker identity coding vectors are used for model training, an average must be computed so that a fixed vector represents a fixed speaker;
3) for speakers not in the training set, the identity coding vector used at conversion time does not match the voice conversion model well.
Therefore, a zero-sample speech conversion corpus preprocessing method using a neural network is urgently needed.
Disclosure of Invention
The invention aims to provide a zero-sample voice conversion corpus preprocessing method using a neural network, which improves the effectiveness of speaker identity coding vectors in the new field of zero-sample voice conversion and improves the quality of the converted speech to a certain extent.
In a first aspect, the present application provides a method for preprocessing zero-sample speech conversion corpus using a neural network, the method comprising:
a generator built from a neural network preprocesses the identity coding vectors of speakers not in the training set; a 256-dimensional vector represents the speaker's personalized characteristics such as timbre and corresponds to the speaker's identity label;
speaker-related information and speaker-independent information in the speech are separated by an encoder, the extracted speaker-related information being 32-dimensional or 64-dimensional;
the generator consists of 7 neural network layers: the first three layers are one-dimensional convolution layers with kernel size 5, each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512; the next three layers are a recurrent LSTM network, and after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768; the last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code;
the generator is trained independently of the voice conversion model, so that for an input speaker identity coding vector it outputs a result that is close in value but not fully identical; the closer the generator's output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better;
the speaker identity code is preprocessed based on the neural network: a speaker identity encoder extracts the speaker identity coding vector from the corpus, the extracted vector is input into the generator together with the content code extracted by the content encoder, the generator produces the adjusted speaker identity coding vector, and the result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
With reference to the first aspect, in a first possible implementation manner of the first aspect, a final training target of the generator is as follows:
S_nA = E_s(X_nA)

where X_nA represents the audio features of the original speech, E_s represents the speaker identity encoder, and S_nA represents the resulting initial speaker identity coding vector;

S'_A = G(S_nA)

where G represents the generator and S'_A represents the adjusted speaker identity coding vector, i.e. the corresponding output of the generator;

L_adjust = ‖G(S_nA) − S̄_A‖

where L_adjust represents the loss function used in training the generator: the gap between the generator's output and the mean identity coding vector S̄_A of the corresponding speaker used in the voice conversion model is minimized, and the generator is thereby trained.
With reference to the first aspect, in a second possible implementation manner of the first aspect, a generative adversarial network consists of a generator and a discriminator, which are continuously optimized and iterated against each other according to a given objective function, finally yielding the model.
In a second aspect, the present application provides a zero-sample speech conversion corpus preprocessing system using a neural network, the system comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method steps of any possible implementation of the first aspect according to instructions in the program code.
In a third aspect, the present application provides a computer-readable storage medium for storing program code for performing the method steps of any possible implementation of the first aspect.
In a fourth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method steps of any possible implementation of the first aspect.
The invention provides a zero-sample voice conversion corpus preprocessing method using a neural network, which improves the effectiveness of speaker identity coding vectors in the new field of zero-sample voice conversion and improves the quality of the converted speech to a certain extent. The speaker identity codes are preprocessed by a neural network: a speaker identity encoder extracts the speaker identity coding vectors from the corpus, the extracted vectors are input into a generator together with content codes extracted by a content encoder, and the generator produces adjusted speaker identity coding vectors. The result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments are briefly described below; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a speech conversion system of the present invention.
FIG. 2 is a diagram illustrating the preprocessing flow of the present invention.
FIG. 3 is a diagram of a generator network architecture design of the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of protection of the invention is more clearly defined.
The application provides a zero-sample voice conversion corpus preprocessing method using a neural network, which comprises the following steps:
a generator built from a neural network preprocesses the identity coding vectors of speakers not in the training set; a 256-dimensional vector represents the speaker's personalized characteristics such as timbre and corresponds to the speaker's identity label;
speaker-related information and speaker-independent information in the speech are separated by an encoder, the extracted speaker-related information being 32-dimensional or 64-dimensional;
the generator consists of 7 neural network layers: the first three layers are one-dimensional convolution layers with kernel size 5, each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512; the next three layers are a recurrent LSTM network, and after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768; the last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code;
the generator is trained independently of the voice conversion model, so that for an input speaker identity coding vector it outputs a result that is close in value but not fully identical; the closer the generator's output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better;
the speaker identity code is preprocessed based on the neural network: a speaker identity encoder extracts the speaker identity coding vector from the corpus, the extracted vector is input into the generator together with the content code extracted by the content encoder, the generator produces the adjusted speaker identity coding vector, and the result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
The core of the method is the neural-network generator, which preprocesses the identity coding vectors of speakers not in the training set and thereby improves their effectiveness.
The generator is described as follows:
1) in current zero-sample voice conversion technology, a 256-dimensional vector is generally used to represent the speaker's personalized features such as timbre, and can be regarded as the identity label corresponding to that speaker;
2) in the currently most popular zero-sample voice conversion frameworks based on autoencoders, feature separation is the core processing step: speaker-related information and speaker-independent information in the speech are separated by an encoder, and the extracted speaker-related information is generally 32-dimensional or 64-dimensional. The method provided by the invention is generally applicable; only the 64-dimensional case is described as an example, and zero-sample voice conversion systems based on other dimensions or other frameworks are equally applicable;
3) the generator's purpose is to preprocess the identity coding vectors of speakers not in the training set; its input dimension is 256+64 and its output dimension is 256, so that the identity coding vector is processed and its usability improved;
4) the generator consists of 7 neural network layers. The first three layers are one-dimensional convolution layers with kernel size 5; each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512. The next three layers are a recurrent LSTM network; after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768. The last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code; a sketch of this architecture is given below;
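For illustration, a minimal PyTorch sketch of this 7-layer generator follows. It is one reading of the description above, not the patent's implementation: the text does not state how the 256-dimensional speaker vector is combined with the 64-dimensional per-frame content code, nor the convolution padding, so here the speaker vector is assumed to be broadcast along time and concatenated channel-wise (giving the stated 256+64 input dimension), with padding=2 assumed to preserve sequence length.

```python
import torch
import torch.nn as nn

class SpeakerEmbeddingGenerator(nn.Module):
    """Sketch of the 7-layer generator: 3x Conv1d(k=5) + BatchNorm + ReLU,
    a 3-layer LSTM (only the last time step is kept), and a final
    fully connected layer projecting back to 256 dimensions."""

    def __init__(self, spk_dim=256, content_dim=64, conv_dim=512, lstm_dim=768):
        super().__init__()
        blocks = []
        in_ch = spk_dim + content_dim                # assumed 256 + 64 = 320 input
        for _ in range(3):                           # three convolution blocks
            blocks += [
                nn.Conv1d(in_ch, conv_dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(conv_dim),            # batch normalization after each conv
                nn.ReLU(),
            ]
            in_ch = conv_dim
        self.convs = nn.Sequential(*blocks)
        self.lstm = nn.LSTM(conv_dim, lstm_dim, num_layers=3, batch_first=True)
        self.fc = nn.Linear(lstm_dim, spk_dim)       # limit output back to 256 dims

    def forward(self, spk_vec, content_code):
        # spk_vec: (B, 256); content_code: (B, T, 64)
        T = content_code.size(1)
        spk = spk_vec.unsqueeze(1).expand(-1, T, -1)        # broadcast along time
        x = torch.cat([spk, content_code], dim=-1)          # (B, T, 320)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (B, C, T)
        out, _ = self.lstm(x)                               # (B, T, 768)
        return self.fc(out[:, -1])                          # last time step -> (B, 256)
```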
5) the generator is trained independently of the voice conversion model. The main idea of the training is that, for an input speaker identity coding vector, the generator should output a result that is close in value but not fully identical; the closer its output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better.
6) Reliable, usable speaker identity coding vectors are essential for zero-sample voice conversion. Speaker identity coding technology originates in the field of speaker recognition, but the two fields use it differently and place different demands on it. In speaker recognition, the identity coding vector is mainly used to judge whether two speech segments belong to the same speaker, or to determine the identity of the speaker of a segment; the vector produced by the speaker encoder is therefore directly usable, a moderate deviation does not affect the final result, and practical requirements can be met. In voice conversion, however, the speaker identity coding vector serves as the speaker's identity label, so each speaker should ideally correspond to one fully determined, accurate identity coding vector; only then can a better voice conversion result be obtained.
In order to solve the above problems, there are the following ideas:
first, it is not practical to improve the performance of the speaker ID encoder sufficiently so that the output speaker ID code vector can converge to a relatively precise point, because the current state of the art cannot meet this requirement.
Second, ensure that each speaker not in the training set has a sufficiently large amount of audio corpus data as a reference. A speaker encoder can then extract a large number of identity coding vectors from this corpus, yielding a relatively stable average that approaches the optimal point used by the voice conversion model (a short sketch of this averaging idea follows this list).
Third, treat each sentence of corpus data as a separate, distinct speaker, so that when training the voice conversion model, different utterances of the same speaker use identity coding vectors extracted from the corresponding sentence audio as input to the conversion system. But this method is completely infeasible: experiments fully show that training the voice conversion model this way does not converge, and the training of the whole model ultimately collapses.
Fourth, preprocess the speaker identity code with a neural network, namely the method provided by the invention.
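As a sketch of the averaging idea in the second point above (the function and its inputs are illustrative, not part of the patent), the stable identity vector is simply the mean of many per-utterance embeddings:

```python
import torch

def mean_identity_vector(speaker_encoder, utterances):
    """Average many per-utterance identity vectors into one stable vector."""
    vecs = [speaker_encoder(u) for u in utterances]  # one 256-dim vector per utterance
    return torch.stack(vecs).mean(dim=0)             # relatively stable average
```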
7) The steps for using the generator in the conversion phase are as follows (a sketch of this pipeline follows the list):
first, a speaker identity encoder extracts the speaker identity coding vector from the corpus;
second, the extracted vector is input into the generator together with the content code extracted by the content encoder;
third, the generator produces the adjusted speaker identity coding vector;
fourth, the result obtained from the generator is input into the decoder as the speaker's final identity coding vector, i.e. as the speaker's identity label;
fifth, the decoder generates the converted audio feature sequence.
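A minimal sketch of this conversion-phase wiring, assuming trained speaker_encoder, content_encoder, generator, and decoder modules (the function name and signatures are hypothetical, chosen only to mirror the five steps above):

```python
def convert(src_feats, tgt_feats, speaker_encoder, content_encoder,
            generator, decoder):
    s_raw = speaker_encoder(tgt_feats)     # step 1: raw identity code vector
    content = content_encoder(src_feats)   # step 2: content code of source speech
    s_adj = generator(s_raw, content)      # steps 2-3: adjusted identity vector
    return decoder(content, s_adj)         # steps 4-5: converted feature sequence
```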
8) The final training objectives for this generator are as follows:
S_nA = E_s(X_nA)

where X_nA represents the audio features of the original speech, E_s represents the speaker identity encoder, and S_nA represents the resulting initial speaker identity coding vector.

S'_A = G(S_nA)

where G represents the generator according to the present invention and S'_A represents the adjusted speaker identity coding vector, i.e. the corresponding output of the generator.

L_adjust = ‖G(S_nA) − S̄_A‖

where L_adjust represents the loss function used in training the generator: the gap between the generator's output and the mean identity coding vector S̄_A of the corresponding speaker used in the voice conversion model is minimized, and the generator is thereby trained.
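A minimal sketch of this objective, assuming the squared L2 distance as the concrete reading of "minimizing the gap" (the patent text does not fix the norm) and hypothetical module names matching the earlier sketches:

```python
import torch.nn.functional as F

def generator_loss(generator, speaker_encoder, content_encoder, utt_feats, s_bar):
    """L_adjust for one batch of utterances of a training speaker.

    s_bar is the mean identity coding vector of that speaker as used by
    the voice conversion model (a 256-dim tensor)."""
    s_n = speaker_encoder(utt_feats)        # S_nA = E_s(X_nA)
    content = content_encoder(utt_feats)
    s_adj = generator(s_n, content)         # S'_A = G(S_nA)
    return F.mse_loss(s_adj, s_bar.expand_as(s_adj))
```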
After training, the generator has the ability to reasonably adjust identity coding vectors of speakers not in the training set. Experiments show that this method effectively improves the usability of the speaker identity coding vector; in particular, in the zero-sample voice conversion task it greatly improves how well the identity coding vector matches the voice conversion model, ultimately improving the naturalness and similarity of the converted speech.
The present application provides a zero-sample voice conversion corpus preprocessing system using a neural network, the system comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method steps of any of the embodiments of the first aspect in accordance with instructions in the program code.
The present application provides a computer readable storage medium for storing program code for performing the method steps of any of the embodiments of the first aspect.
The present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method steps of any of the embodiments of the first aspect.
In specific implementations, the present invention further provides a computer storage medium, where the computer storage medium may store a program which, when executed, may perform some or all of the steps of the embodiments of the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented using software plus any required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
Identical and similar parts of the various embodiments in this specification may be cross-referenced. In particular, since the other embodiments are substantially similar to the method embodiments, their description is brief; for the relevant points, refer to the description in the method embodiments.
The above-described embodiments of the present invention do not limit the scope of the present invention.

Claims (5)

1. A method for preprocessing zero-sample speech conversion corpus using neural networks, the method comprising:
a generator built from a neural network preprocesses the identity coding vectors of speakers not in the training set; a 256-dimensional vector represents the speaker's personalized timbre characteristics and corresponds to the speaker's identity label;
speaker-related information and speaker-independent information in the speech are separated by an encoder, the extracted speaker-related information being 32-dimensional or 64-dimensional;
the generator consists of 7 neural network layers: the first three layers are one-dimensional convolution layers with kernel size 5, each convolution is followed by batch normalization, the output is activated by the ReLU activation function, and the output dimension of the convolutional layers is 512; the next three layers are a recurrent LSTM network, and after all three LSTM layers have run, the output at the final time step is taken as the LSTM's final output, with dimension 768; the last layer is a fully connected layer (FullConnect) that reduces the output dimension back to 256, finally yielding the preprocessed speaker identity code;
the generator is trained independently of the voice conversion model, so that for an input speaker identity coding vector it outputs a result that is close to that vector but not fully identical; the closer the generator's output is to the identity coding vector of the corresponding speaker used when training the voice conversion model, the better;
the speaker identity code is preprocessed based on the neural network: a speaker identity encoder extracts the speaker identity coding vector from the corpus, the extracted vector is input into the generator together with the content code extracted by the content encoder, the generator produces the adjusted speaker identity coding vector, and the result obtained from the generator is taken as the speaker's final identity coding vector, i.e. it is input into a decoder as the speaker's identity label, and the decoder generates the converted audio feature sequence.
2. The method of claim 1, wherein: the final training goals for this generator are as follows:
S_nA = E_s(X_nA)

where X_nA represents the audio features of the original speech, E_s represents the speaker identity encoder, and S_nA represents the resulting initial speaker identity coding vector;

S'_A = G(S_nA)

where G represents the generator of the invention and S'_A represents the adjusted speaker identity coding vector, i.e. the corresponding output of the generator;

L_adjust = ‖G(S_nA) − S̄_A‖

where L_adjust represents the loss function used in training the generator: the gap between the generator's output and the mean identity coding vector S̄_A of the speaker used in the voice conversion model is minimized, and the generator is thereby trained.
3. The method according to any one of claims 1-2, characterized in that: a generative adversarial network consists of a generator and a discriminator, which are continuously optimized and iterated against each other according to a given objective function, finally yielding the model.
4. A system for zero-sample speech conversion corpus preprocessing using neural networks, the system comprising a processor and a memory: the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method steps of any of claims 1-3 according to instructions in the program code.
5. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program code for performing the method steps of any of claims 1-3.
CN202011433778.3A 2020-12-10 2020-12-10 Zero-sample voice conversion corpus preprocessing method using neural network Expired - Fee Related CN112562686B (en)

Priority Applications (1)

Application Number: CN202011433778.3A (CN112562686B)
Priority Date: 2020-12-10
Filing Date: 2020-12-10
Title: Zero-sample voice conversion corpus preprocessing method using neural network

Applications Claiming Priority (1)

Application Number: CN202011433778.3A (CN112562686B)
Priority Date: 2020-12-10
Filing Date: 2020-12-10
Title: Zero-sample voice conversion corpus preprocessing method using neural network

Publications (2)

Publication Number Publication Date
CN112562686A CN112562686A (en) 2021-03-26
CN112562686B true CN112562686B (en) 2022-07-15

Family

ID=75060199

Family Applications (1)

Application Number: CN202011433778.3A (CN112562686B, Expired - Fee Related)
Priority Date: 2020-12-10
Filing Date: 2020-12-10
Title: Zero-sample voice conversion corpus preprocessing method using neural network

Country Status (1)

Country Link
CN (1) CN112562686B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018157703A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Natural language semantic extraction method and device, and computer storage medium
CN110537222A (en) * 2017-04-21 2019-12-03 高通股份有限公司 Anharmonic wave speech detection and bandwidth expansion in multi-source environment
WO2019047703A1 (en) * 2017-09-06 2019-03-14 腾讯科技(深圳)有限公司 Audio event detection method and device, and computer-readable storage medium
WO2019096149A1 (en) * 2017-11-15 2019-05-23 中国科学院自动化研究所 Auditory selection method and device based on memory and attention model
WO2019196196A1 (en) * 2018-04-12 2019-10-17 科大讯飞股份有限公司 Whispering voice recovery method, apparatus and device, and readable storage medium
CN111144124A (en) * 2018-11-02 2020-05-12 华为技术有限公司 Training method of machine learning model, intention recognition method, related device and equipment
CN111816156A (en) * 2020-06-02 2020-10-23 南京邮电大学 Many-to-many voice conversion method and system based on speaker style feature modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"零样本图像识别";兰红;《电子与信息学报》;20200330;全文 *
潘崇煜." 融合零样本学习和小样本学习的弱监督学习方法综述".《系统工程与电子技术》.2020, *

Also Published As

Publication number Publication date
CN112562686A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
Meng et al. Internal language model estimation for domain-adaptive end-to-end speech recognition
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
Li et al. Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion
Tjandra et al. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019
Kameoka et al. Stargan-vc: Non-parallel many-to-many voice conversion using star generative adversarial networks
Liu et al. Any-to-many voice conversion with location-relative sequence-to-sequence modeling
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN111312245B (en) Voice response method, device and storage medium
Park et al. Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data
Kameoka et al. Many-to-many voice transformer network
Nguyen et al. Nvc-net: End-to-end adversarial voice conversion
CN112712813B (en) Voice processing method, device, equipment and storage medium
Tüske et al. Advancing Sequence-to-Sequence Based Speech Recognition.
CN111862934A (en) Method for improving speech synthesis model and speech synthesis method and device
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN113360610A (en) Dialog generation method and system based on Transformer model
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
Saito et al. DNN-based speaker embedding using subjective inter-speaker similarity for multi-speaker modeling in speech synthesis
Chandak et al. Streaming language identification using combination of acoustic representations and ASR hypotheses
CN117765959A (en) Voice conversion model training method and voice conversion system based on pitch
Park et al. The Second DIHARD Challenge: System Description for USC-SAIL Team.
Zhao et al. Research on voice cloning with a few samples
CN116564330A (en) Weak supervision voice pre-training method, electronic equipment and storage medium
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium
CN112562686B (en) Zero-sample voice conversion corpus preprocessing method using neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2022-07-15