CN111968617B - Voice conversion method and system for non-parallel data - Google Patents

Voice conversion method and system for non-parallel data

Info

Publication number
CN111968617B
CN111968617B (application CN202010860346.4A)
Authority
CN
China
Prior art keywords
voice
speaker
data
target speaker
model
Prior art date
Legal status
Active
Application number
CN202010860346.4A
Other languages
Chinese (zh)
Other versions
CN111968617A (en)
Inventor
孙见青
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010860346.4A
Publication of CN111968617A
Application granted
Publication of CN111968617B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/69: Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice conversion method and system for non-parallel data. The method comprises the following steps: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker; generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data; training a spectral parameter conversion model using the source speaker's voice data and the parallel data; and generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model. The method uses the target speaker speech synthesis model to synthesize high-quality pseudo-parallel data, trains the spectral parameter conversion model on that data, and then performs voice conversion with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.

Description

Voice conversion method and system for non-parallel data
Technical Field
The present invention relates to the field of speech conversion technologies, and in particular, to a method and a system for speech conversion of non-parallel data.
Background
Speech conversion is a technique that modifies a source speaker's speech signal so that it takes on the target speaker's voice characteristics while keeping the linguistic content unchanged. Its main tasks are to extract and convert the characteristic parameters that represent speaker identity, and then to reconstruct speech from the converted parameters. The process must preserve both the intelligibility of the converted speech and the similarity of its speaker characteristics to the target.
Most existing voice conversion methods require parallel data from the two speakers (recordings whose text content is identical); the main drawback of these methods is that parallel data is difficult to acquire. Other methods require only non-parallel data, but their main drawback is poor conversion quality.
To achieve higher-quality voice conversion with non-parallel data, a voice conversion method and system for non-parallel data are needed.
Disclosure of Invention
The invention provides a voice conversion method and system for non-parallel data, intended to achieve higher-quality voice conversion using non-parallel data.
The voice conversion method for non-parallel data provided by the invention comprises the following steps:
step 1: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
step 2: generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
step 3: training a spectral parameter conversion model using the source speaker's voice data and the parallel data;
step 4: generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
Further, step 1, training the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S11: training a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker;
step S12: training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data.
Further, step S11, training the basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S111: training a basic phoneme duration prediction model based on a deep neural network, taking the phoneme representation corresponding to the text data in the corpus as input and the phoneme durations obtained through a forced alignment algorithm as output;
step S112: performing frame expansion on the phoneme representation using the phoneme durations obtained by the forced alignment algorithm;
step S113: training a basic spectral parameter prediction model, taking the frame-expanded phoneme representation as input and the spectral parameters corresponding to the speech data in the corpus as output;
step S114: training a basic vocoder model, taking the spectral parameters as input and the speech in the corpus as output.
Further, in step S113, the basic spectral parameter prediction model adopts a Tacotron model based on an encoder-decoder framework.
Further, step S12, training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data, comprises the following steps:
step S121: retraining the basic phoneme duration prediction model with the target speaker's text data and voice data to obtain a target speaker phoneme duration prediction model;
step S122: retraining the basic spectral parameter prediction model with the target speaker's text data and voice data to obtain a target speaker spectral parameter prediction model;
step S123: retraining the basic vocoder model with the target speaker's voice data to obtain a target vocoder model;
step S124: taking the target speaker spectral parameter prediction model and the target vocoder model as the target speaker speech synthesis model.
Further, step 2, generating parallel data corresponding to the target speaker with the target speaker speech synthesis model based on the text corresponding to the source speaker's voice data, comprises the following steps:
step S21: inputting the phoneme representation corresponding to the source speaker's text data into the target speaker phoneme duration prediction model to obtain the target speaker phoneme durations;
step S22: performing frame expansion on the phoneme representation according to the target speaker phoneme durations to obtain a frame-expanded phoneme representation;
step S23: inputting the frame-expanded phoneme representation into the target speaker spectral parameter prediction model to obtain the target speaker spectral parameters;
step S24: taking the target speaker spectral parameters as the parallel data.
Further, step 3, training the spectral parameter conversion model using the source speaker's voice data and the parallel data, comprises the following steps:
step S31: extracting source speaker spectral parameters from the source speaker's voice data;
step S32: frame-aligning the source speaker spectral parameters with the parallel data using dynamic time warping;
step S33: training the spectral parameter conversion model, taking the frame-aligned source speaker spectral parameters as input and the frame-aligned parallel data as output.
Further, in step S33, the spectral parameter conversion model adopts a Tacotron model based on an encoder-decoder framework.
Further, step 4, generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model, comprises the following steps:
step S41: extracting source speaker spectral parameters from the input source speaker voice data;
step S42: inputting the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters;
step S43: inputting the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
The voice conversion method for non-parallel data provided by the embodiment of the invention has the following beneficial effects: the target speaker speech synthesis model is used to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model is trained on that parallel data, and voice conversion is then performed with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
The invention also provides a voice conversion system for non-parallel data, comprising:
a target speaker speech synthesis model training module, configured to train a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
a parallel data generation module, configured to generate parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
a spectral parameter conversion model training module, configured to train a spectral parameter conversion model using the source speaker's voice data and the parallel data;
a voice conversion module, configured to generate the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The voice conversion system for non-parallel data provided by the embodiment of the invention has the following beneficial effects: the parallel data generation module uses the target speaker speech synthesis model to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model training module trains the spectral parameter conversion model on that parallel data, and the voice conversion module performs voice conversion with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate the invention and, together with the embodiments, serve to explain it. In the drawings:
FIG. 1 is a flow chart of a voice conversion method for non-parallel data according to an embodiment of the invention;
FIG. 2 is a block diagram of a speech conversion system for non-parallel data in accordance with an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are only intended to illustrate and explain the present invention, not to limit it.
An embodiment of the invention provides a voice conversion method for non-parallel data. As shown in FIG. 1, the method comprises the following steps:
step 1: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
step 2: generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
step 3: training a spectral parameter conversion model using the source speaker's voice data and the parallel data;
step 4: generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The working principle of this technical scheme is as follows: there is no parallel speech data between the source speaker's voice data and the target speaker's voice data. Illustratively, speaker A is the source speaker and speaker B is the target speaker; the invention obtains speech in speaker B's voice from speaker A's voice data, and this speech preserves speaker A's spoken content while carrying speaker B's timbre.
In the invention, a target speaker speech synthesis model is first trained using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker; parallel data corresponding to the target speaker is then generated with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data; next, a spectral parameter conversion model is trained using the source speaker's voice data and the parallel data; finally, the converted target speaker speech is generated from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The beneficial effects of this technical scheme are as follows: the target speaker speech synthesis model is used to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model is trained on that parallel data, and voice conversion is then performed with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
In one embodiment, step 1, training the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S11: training a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker;
step S12: training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data.
The working principle of this technical scheme is as follows: a basic speech synthesis model is first trained on the large-scale speech synthesis corpus data; the target speaker speech synthesis model is then obtained by further training this basic model on the target speaker's voice data.
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker.
In one embodiment, step S11, training the basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S111: training a basic phoneme duration prediction model based on a deep neural network, taking the phoneme representation corresponding to the text data in the corpus as input and the phoneme durations obtained through a forced alignment algorithm as output;
step S112: performing frame expansion on the phoneme representation using the phoneme durations obtained by the forced alignment algorithm;
step S113: training a basic spectral parameter prediction model, taking the frame-expanded phoneme representation as input and the spectral parameters corresponding to the speech data in the corpus as output;
step S114: training a basic vocoder model, taking the spectral parameters as input and the speech in the corpus as output.
The working principle of this technical scheme is as follows: a basic phoneme duration prediction model is first trained for duration prediction, a basic spectral parameter prediction model is then trained for spectral prediction, and a basic vocoder model is finally trained.
In step S113, the basic spectral parameter prediction model adopts a Tacotron model based on an encoder-decoder framework; the attention module is removed from the Tacotron model because the input and output have already been force-aligned in advance.
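The frame expansion and the attention-free spectral prediction can be illustrated with a minimal PyTorch sketch of steps S111 to S113. The phoneme inventory size, embedding width, 80-dimensional mel-style spectral parameters, network sizes, and the shared embedding are illustrative assumptions, not values taken from the patent:

import torch
import torch.nn as nn

N_PHONEMES, EMB_DIM, N_MELS = 100, 128, 80   # assumed sizes, not specified by the patent

class DurationPredictor(nn.Module):
    """Basic phoneme duration prediction model (step S111): phoneme ids -> frames per phoneme."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_PHONEMES, EMB_DIM)
        self.mlp = nn.Sequential(nn.Linear(EMB_DIM, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, phoneme_ids):                          # (T_phonemes,)
        return self.mlp(self.emb(phoneme_ids)).squeeze(-1)   # predicted durations in frames

def expand_frames(phoneme_vectors, durations):
    """Frame expansion (step S112): repeat each phoneme vector for its duration in frames."""
    return torch.repeat_interleave(phoneme_vectors, durations, dim=0)   # (T_frames, EMB_DIM)

class SpectralPredictor(nn.Module):
    """Basic spectral parameter prediction model (step S113): an attention-free encoder-decoder
    stand-in for the Tacotron-style model, usable because input and output are frame-aligned."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(EMB_DIM, 256, batch_first=True)
        self.decoder = nn.GRU(256, 256, batch_first=True)
        self.proj = nn.Linear(256, N_MELS)

    def forward(self, expanded):                  # (T_frames, EMB_DIM)
        h, _ = self.encoder(expanded.unsqueeze(0))
        h, _ = self.decoder(h)
        return self.proj(h).squeeze(0)            # (T_frames, N_MELS)

# Toy forward pass: 5 phonemes with forced-alignment durations of 3, 5, 4, 6, and 2 frames.
dur_model, spec_model = DurationPredictor(), SpectralPredictor()
phonemes = torch.tensor([3, 17, 42, 8, 25])
durations = torch.tensor([3, 5, 4, 6, 2])
expanded = expand_frames(dur_model.emb(phonemes), durations)   # embedding reused here for brevity
mel = spec_model(expanded)                                     # spectral parameters, shape (20, 80)

Because the phoneme representation is already expanded to one vector per output frame, the spectral predictor maps input to output frame by frame, which is why the Tacotron-style model can drop its attention module here.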
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the basic speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker.
In one embodiment, step S12, training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data, comprises the following steps:
step S121: retraining the basic phoneme duration prediction model with the target speaker's text data and voice data to obtain a target speaker phoneme duration prediction model;
step S122: retraining the basic spectral parameter prediction model with the target speaker's text data and voice data to obtain a target speaker spectral parameter prediction model;
step S123: retraining the basic vocoder model with the target speaker's voice data to obtain a target vocoder model;
step S124: taking the target speaker spectral parameter prediction model and the target vocoder model as the target speaker speech synthesis model.
The working principle of this technical scheme is as follows: the basic phoneme duration prediction model, the basic spectral parameter prediction model, and the basic vocoder model are each retrained on the target speaker's data, yielding the target speaker phoneme duration prediction model, the target speaker spectral parameter prediction model, and the target vocoder model, respectively.
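The retraining in steps S121 to S123 amounts to fine-tuning each base model on the target speaker's data. A minimal sketch, assuming PyTorch models such as those in the previous sketch and an existing DataLoader of target speaker (input, target) pairs; the checkpoint path, learning rate, and epoch count are illustrative:

import torch
import torch.nn as nn

def finetune(model, target_loader, loss_fn, epochs=10, lr=1e-4):
    """Retrain a base model on target speaker data (used analogously in steps S121, S122 and S123)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # a small learning rate keeps the adapted
    model.train()                                        # model close to the base model
    for _ in range(epochs):
        for inputs, targets in target_loader:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()
    return model

# Hypothetical usage (checkpoint path and data loader are assumptions):
# spec_model.load_state_dict(torch.load("base_spectral_predictor.pt"))
# target_spec_model = finetune(spec_model, target_speaker_loader, nn.L1Loss())
# Analogous calls adapt the duration predictor (step S121) and the vocoder (step S123).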
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data.
In one embodiment, step 2, generating parallel data corresponding to the target speaker with the target speaker speech synthesis model based on the text corresponding to the source speaker's voice data, comprises the following steps:
step S21: inputting the phoneme representation corresponding to the source speaker's text data into the target speaker phoneme duration prediction model to obtain the target speaker phoneme durations;
step S22: performing frame expansion on the phoneme representation according to the target speaker phoneme durations to obtain a frame-expanded phoneme representation;
step S23: inputting the frame-expanded phoneme representation into the target speaker spectral parameter prediction model to obtain the target speaker spectral parameters;
step S24: taking the target speaker spectral parameters as the parallel data.
The working principle of this technical scheme is as follows: the phoneme representation corresponding to the source speaker's text data is fed into the target speaker phoneme duration prediction model; the phoneme representation is frame-expanded according to the predicted target speaker phoneme durations and then fed into the target speaker spectral parameter prediction model, yielding the target speaker spectral parameters, i.e. the parallel data. The parallel data has the same text content as the source speaker's voice data.
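A minimal sketch of steps S21 to S24, assuming the fine-tuned target speaker models take the form of the DurationPredictor and SpectralPredictor classes from the earlier sketch; rounding and clamping the predicted durations are illustrative choices, not steps taken from the patent:

import torch

def make_parallel_data(phoneme_ids, target_dur_model, target_spec_model):
    """Generate target speaker spectral parameters for the source speaker's text (steps S21-S24)."""
    with torch.no_grad():
        durations = target_dur_model(phoneme_ids).round().clamp(min=1).long()  # S21: target durations
        emb = target_dur_model.emb(phoneme_ids)                                # phoneme representation
        expanded = torch.repeat_interleave(emb, durations, dim=0)              # S22: frame expansion
        return target_spec_model(expanded)                                     # S23/S24: the parallel data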
The beneficial effects of this technical scheme are as follows: specific steps are provided for generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data.
In one embodiment, step 3, training the spectral parameter conversion model using the source speaker's voice data and the parallel data, comprises the following steps:
step S31: extracting source speaker spectral parameters from the source speaker's voice data;
step S32: frame-aligning the source speaker spectral parameters with the parallel data using dynamic time warping;
step S33: training the spectral parameter conversion model, taking the frame-aligned source speaker spectral parameters as input and the frame-aligned parallel data as output.
The working principle of this technical scheme is as follows: the extracted source speaker spectral parameters and the parallel data generally differ in length, so the two sequences must be frame-aligned, specifically using dynamic time warping (DTW).
In step S33, the spectral parameter conversion model adopts a Tacotron model based on an encoder-decoder framework; the attention module is removed from the Tacotron model because the input and output have already been frame-aligned in advance.
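The frame alignment of step S32 can be written as a plain dynamic-programming DTW over the Euclidean distance between spectral frames. This is a self-contained NumPy sketch; the feature dimensionality, the symmetric step pattern, and the random test data are assumptions made for illustration:

import numpy as np

def dtw_align(src, tgt):
    """Frame-align two spectral parameter sequences of shape (T_src, D) and (T_tgt, D) (step S32)."""
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)   # (n, m) Euclidean distances
    cost = np.full((n + 1, m + 1), np.inf)                              # accumulated-cost matrix
    cost[0, 0] = 0.0
    for i in range(1, n + 1):                                           # standard 3-way step pattern
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m                                               # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i = i - 1
        else:
            j = j - 1
    path.reverse()
    idx_src, idx_tgt = zip(*path)
    return src[list(idx_src)], tgt[list(idx_tgt)]                       # equal-length aligned sequences

# Example: align 120 source frames with 135 synthesized target frames of 80-dim spectra.
src = np.random.randn(120, 80).astype(np.float32)
tgt = np.random.randn(135, 80).astype(np.float32)
aligned_src, aligned_tgt = dtw_align(src, tgt)
assert len(aligned_src) == len(aligned_tgt)

The equal-length pairs returned here are exactly the input/output pairs used to train the conversion model in step S33.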
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the spectral parameter conversion model using the source speaker's voice data and the parallel data.
In one embodiment, step 4, generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model, comprises the following steps:
step S41: extracting source speaker spectral parameters from the input source speaker voice data;
step S42: inputting the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters;
step S43: inputting the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
The working principle of this technical scheme is as follows: in step S43, the converted spectral parameters are input to the target vocoder model obtained in step S123, producing the converted speech, which preserves the content of the input speech while carrying the target speaker's timbre. In this way, the target speaker's speech is obtained from the input source speaker voice data through the above conversion steps.
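A minimal end-to-end sketch of steps S41 to S43 under stated assumptions: log-mel spectrograms stand in for the patent's spectral parameters, an untrained GRU network stands in for the trained spectral parameter conversion model, and librosa's Griffin-Lim mel inversion stands in for the trained target vocoder model; the file name and analysis settings are illustrative:

import librosa
import numpy as np
import torch
import torch.nn as nn

SR, N_FFT, HOP, N_MELS = 16000, 1024, 256, 80      # assumed analysis settings

class SpectralConverter(nn.Module):
    """Stand-in for the trained spectral parameter conversion model (source frames -> target frames)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, 256, batch_first=True)
        self.proj = nn.Linear(256, N_MELS)

    def forward(self, mel):                          # (T, N_MELS)
        h, _ = self.rnn(mel.unsqueeze(0))
        return self.proj(h).squeeze(0)

# Step S41: extract spectral parameters from the input source speaker speech (assumed input file).
wav, _ = librosa.load("source_speaker.wav", sr=SR)
mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
log_mel = torch.from_numpy(np.log(mel.T + 1e-6)).float()       # (T, N_MELS)

# Step S42: convert the spectral parameters with the (here untrained) conversion model.
converter = SpectralConverter()
with torch.no_grad():
    converted = converter(log_mel)

# Step S43: the patent feeds the converted parameters to the target vocoder model; Griffin-Lim
# inversion is used here only as a runnable placeholder for that trained vocoder.
converted_mel = np.exp(converted.numpy().T)
out = librosa.feature.inverse.mel_to_audio(converted_mel, sr=SR, n_fft=N_FFT, hop_length=HOP)
# 'out' is the converted waveform; with the trained models it carries the target speaker's timbre.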
The beneficial effects of this technical scheme are as follows: specific steps are provided for generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
As shown in FIG. 2, an embodiment of the invention provides a voice conversion system for non-parallel data, comprising:
a target speaker speech synthesis model training module 201, configured to train a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
a parallel data generation module 202, configured to generate parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
a spectral parameter conversion model training module 203, configured to train a spectral parameter conversion model using the source speaker's voice data and the parallel data;
a voice conversion module 204, configured to generate the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The working principle of this technical scheme is as follows: there is no parallel speech data between the source speaker's voice data and the target speaker's voice data. Illustratively, speaker A is the source speaker and speaker B is the target speaker; the system obtains speech in speaker B's voice from speaker A's voice data, and this speech preserves speaker A's spoken content while carrying speaker B's timbre.
In the invention, the target speaker speech synthesis model training module 201 trains the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker; the parallel data generation module 202 generates parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data; the spectral parameter conversion model training module 203 trains the spectral parameter conversion model using the source speaker's voice data and the parallel data; and the voice conversion module 204 generates the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The beneficial effects of this technical scheme are as follows: the parallel data generation module uses the target speaker speech synthesis model to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model training module trains the spectral parameter conversion model on that parallel data, and the voice conversion module performs voice conversion with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from its spirit or scope. It is therefore intended that the present invention cover such modifications and variations provided they fall within the scope of the appended claims and their equivalents.

Claims (8)

1. A voice conversion method for non-parallel data, the method comprising the following steps:
step 1: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data; wherein step 1 further comprises,
step S11: training a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker;
step S12: training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data;
step 2: generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
step 3: training a spectral parameter conversion model using the source speaker's voice data and the parallel data;
step 4: generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model; wherein step 4 further comprises,
step S41: extracting source speaker spectral parameters from the input source speaker voice data;
step S42: inputting the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters;
step S43: inputting the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
2. The method according to claim 1, wherein step S11, training the basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S111: training a basic phoneme duration prediction model based on a deep neural network, taking the phoneme representation corresponding to the text data in the corpus as input and the phoneme durations obtained through a forced alignment algorithm as output;
step S112: performing frame expansion on the phoneme representation using the phoneme durations obtained by the forced alignment algorithm;
step S113: training a basic spectral parameter prediction model, taking the frame-expanded phoneme representation as input and the spectral parameters corresponding to the speech data in the corpus as output;
step S114: training a basic vocoder model, taking the spectral parameters as input and the speech in the corpus as output.
3. The method according to claim 2, wherein in the step S113, the basic spectral parameter prediction model uses a Tacotron model based on an encoder-decoder framework.
4. The method according to claim 2, wherein step S12, training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data, comprises the following steps:
step S121: retraining the basic phoneme duration prediction model with the target speaker's text data and voice data to obtain a target speaker phoneme duration prediction model;
step S122: retraining the basic spectral parameter prediction model with the target speaker's text data and voice data to obtain a target speaker spectral parameter prediction model;
step S123: retraining the basic vocoder model with the target speaker's voice data to obtain a target vocoder model;
step S124: taking the target speaker spectral parameter prediction model and the target vocoder model as the target speaker speech synthesis model.
5. The method according to claim 4, wherein step 2, generating parallel data corresponding to the target speaker with the target speaker speech synthesis model based on the text corresponding to the source speaker's voice data, comprises the following steps:
step S21: inputting the phoneme representation corresponding to the source speaker's text data into the target speaker phoneme duration prediction model to obtain the target speaker phoneme durations;
step S22: performing frame expansion on the phoneme representation according to the target speaker phoneme durations to obtain a frame-expanded phoneme representation;
step S23: inputting the frame-expanded phoneme representation into the target speaker spectral parameter prediction model to obtain the target speaker spectral parameters;
step S24: taking the target speaker spectral parameters as the parallel data.
6. The method according to claim 1, wherein step 3, training the spectral parameter conversion model using the source speaker's voice data and the parallel data, comprises the following steps:
step S31: extracting source speaker spectral parameters from the source speaker's voice data;
step S32: frame-aligning the source speaker spectral parameters with the parallel data using dynamic time warping;
step S33: training the spectral parameter conversion model, taking the frame-aligned source speaker spectral parameters as input and the frame-aligned parallel data as output.
7. The method according to claim 6, wherein in step S33, the spectral parameter conversion model adopts a Tacotron model based on an encoder-decoder framework.
8. A voice conversion system for non-parallel data, comprising:
a target speaker speech synthesis model training module, configured to train a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data; the module is further configured to train a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, and to train the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data;
a parallel data generation module, configured to generate parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
a spectral parameter conversion model training module, configured to train a spectral parameter conversion model using the source speaker's voice data and the parallel data;
a voice conversion module, configured to generate the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model; the module is further configured to extract source speaker spectral parameters from the input source speaker voice data, to input the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters, and to input the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
CN202010860346.4A, filed 2020-08-25, priority 2020-08-25: Voice conversion method and system for non-parallel data (Active; granted as CN111968617B)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010860346.4A CN111968617B (en) 2020-08-25 2020-08-25 Voice conversion method and system for non-parallel data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010860346.4A CN111968617B (en) 2020-08-25 2020-08-25 Voice conversion method and system for non-parallel data

Publications (2)

Publication Number Publication Date
CN111968617A CN111968617A (en) 2020-11-20
CN111968617B (en) 2024-03-15

Family

ID=73391241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010860346.4A Active CN111968617B (en) 2020-08-25 2020-08-25 Voice conversion method and system for non-parallel data

Country Status (1)

Country Link
CN (1) CN111968617B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530400A (en) * 2020-11-30 2021-03-19 清华珠三角研究院 Method, system, device and medium for generating voice based on text of deep learning
CN112489629A (en) * 2020-12-02 2021-03-12 北京捷通华声科技股份有限公司 Voice transcription model, method, medium, and electronic device
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care
CN114708849A (en) * 2022-04-27 2022-07-05 网易(杭州)网络有限公司 Voice processing method and device, computer equipment and computer readable storage medium


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751922A (en) * 2009-07-22 2010-06-23 中国科学院自动化研究所 Text-independent speech conversion system based on HMM model state mapping
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
WO2017067206A1 (en) * 2015-10-20 2017-04-27 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and device
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109584893A (en) * 2018-12-26 2019-04-05 南京邮电大学 Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition
CN110136686A (en) * 2019-05-14 2019-08-16 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN110136691A (en) * 2019-05-28 2019-08-16 广州多益网络股份有限公司 A kind of speech synthesis model training method, device, electronic equipment and storage medium
CN111223474A (en) * 2020-01-15 2020-06-02 武汉水象电子科技有限公司 Voice cloning method and system based on multi-neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Voice conversion under non-parallel corpus conditions based on CycleGAN networks; Li Tao; China Master's Theses Full-text Database (Information Science and Technology), No. 2; I136-351 *

Also Published As

Publication number Publication date
CN111968617A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111968617B (en) Voice conversion method and system for non-parallel data
Wang et al. Neural source-filter waveform models for statistical parametric speech synthesis
Weng et al. Deep learning enabled semantic communications with speech recognition and synthesis
Han et al. Semantic-preserved communication system for highly efficient speech transmission
CN101751922B (en) Text-independent speech conversion system based on HMM model state mapping
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
Ai et al. A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis
CN110767210A (en) Method and device for generating personalized voice
KR102505927B1 (en) Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation
CN112002303B (en) End-to-end speech synthesis training method and system based on knowledge distillation
CN111986646B (en) Dialect synthesis method and system based on small corpus
CN111128211B (en) Voice separation method and device
US20230282202A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
CN116018638A (en) Synthetic data enhancement using voice conversion and speech recognition models
Takamichi et al. Sampling-based speech parameter generation using moment-matching networks
CN111326170B (en) Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution
Shah et al. Novel MMSE DiscoGAN for cross-domain whisper-to-speech conversion
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
Ai et al. Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders
CN105023574A (en) Method and system of enhancing TTS
CN102196100A (en) Instant call translation system and method
CN107464569A (en) Vocoder
CN117079637A (en) Mongolian emotion voice synthesis method based on condition generation countermeasure network
CN112242134A (en) Speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant