CN111968617B - Voice conversion method and system for non-parallel data - Google Patents

Voice conversion method and system for non-parallel data

Info

Publication number
CN111968617B
CN111968617B (application CN202010860346.4A)
Authority
CN
China
Prior art keywords
voice
speaker
data
target speaker
model
Prior art date
Legal status
Active
Application number
CN202010860346.4A
Other languages
Chinese (zh)
Other versions
CN111968617A (en)
Inventor
孙见青
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010860346.4A
Publication of CN111968617A
Application granted
Publication of CN111968617B
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/69: Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice conversion method and system for non-parallel data. The method comprises the following steps: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker; generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data; training a spectral parameter conversion model using the source speaker's voice data and the parallel data; and generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model. The method uses the target speaker speech synthesis model to synthesize high-quality pseudo-parallel data, trains the spectral parameter conversion model on that data, and then performs voice conversion with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.

Description

Voice conversion method and system for non-parallel data
Technical Field
The present invention relates to the field of speech conversion technologies, and in particular, to a method and a system for speech conversion of non-parallel data.
Background
Speech conversion is a technique that modifies a source speaker's speech signal so that it takes on the target speaker's voice characteristics while keeping the linguistic content unchanged. Its main tasks are to extract and convert the characteristic parameters that represent speaker identity, and then to reconstruct speech from the converted parameters. The process must preserve both the intelligibility of the converted speech and the similarity of its speaker characteristics to the target.
Most existing voice conversion methods require parallel data from the two speakers (recordings whose text content is identical); the main drawback of these methods is that parallel data is difficult to acquire. Other methods require only non-parallel data, but their main drawback is poor conversion quality.
To achieve higher-quality voice conversion with non-parallel data, a voice conversion method and system for non-parallel data are needed.
Disclosure of Invention
The invention provides a voice conversion method and system for non-parallel data, intended to achieve higher-quality voice conversion using non-parallel data.
The voice conversion method for non-parallel data provided by the invention comprises the following steps:
step 1: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
step 2: generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
step 3: training a spectral parameter conversion model using the source speaker's voice data and the parallel data;
step 4: generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
Further, step 1, training the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S11: training a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker;
step S12: training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data.
Further, step S11, training the basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S111: training a basic phoneme duration prediction model based on a deep neural network, taking the phoneme representation corresponding to the text data in the corpus as input and the phoneme durations obtained through a forced alignment algorithm as output;
step S112: performing frame expansion on the phoneme representation using the phoneme durations obtained by the forced alignment algorithm;
step S113: training a basic spectral parameter prediction model, taking the frame-expanded phoneme representation as input and the spectral parameters corresponding to the speech data in the corpus as output;
step S114: training a basic vocoder model, taking the spectral parameters as input and the speech in the corpus as output.
Further, in step S113, the basic spectral parameter prediction model adopts a Tacotron model based on an encoder-decoder framework.
Further, step S12, training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data, comprises the following steps:
step S121: retraining the basic phoneme duration prediction model with the target speaker's text data and voice data to obtain a target speaker phoneme duration prediction model;
step S122: retraining the basic spectral parameter prediction model with the target speaker's text data and voice data to obtain a target speaker spectral parameter prediction model;
step S123: retraining the basic vocoder model with the target speaker's voice data to obtain a target vocoder model;
step S124: taking the target speaker spectral parameter prediction model and the target vocoder model as the target speaker speech synthesis model.
Further, step 2, generating parallel data corresponding to the target speaker with the target speaker speech synthesis model based on the text corresponding to the source speaker's voice data, comprises the following steps:
step S21: inputting the phoneme representation corresponding to the source speaker's text data into the target speaker phoneme duration prediction model to obtain the target speaker phoneme durations;
step S22: performing frame expansion on the phoneme representation according to the target speaker phoneme durations to obtain a frame-expanded phoneme representation;
step S23: inputting the frame-expanded phoneme representation into the target speaker spectral parameter prediction model to obtain the target speaker spectral parameters;
step S24: taking the target speaker spectral parameters as the parallel data.
Further, step 3, training the spectral parameter conversion model using the source speaker's voice data and the parallel data, comprises the following steps:
step S31: extracting source speaker spectral parameters from the source speaker's voice data;
step S32: frame-aligning the source speaker spectral parameters with the parallel data using dynamic time warping;
step S33: training the spectral parameter conversion model, taking the frame-aligned source speaker spectral parameters as input and the frame-aligned parallel data as output.
Further, in step S33, the spectral parameter conversion model adopts a Tacotron model based on an encoder-decoder framework.
Further, step 4, generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model, comprises the following steps:
step S41: extracting source speaker spectral parameters from the input source speaker voice data;
step S42: inputting the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters;
step S43: inputting the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
The voice conversion method for non-parallel data provided by the embodiment of the invention has the following beneficial effects: the target speaker speech synthesis model is used to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model is trained on that parallel data, and voice conversion is then performed with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
The invention also provides a voice conversion system for non-parallel data, comprising:
a target speaker speech synthesis model training module, configured to train a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
a parallel data generation module, configured to generate parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
a spectral parameter conversion model training module, configured to train a spectral parameter conversion model using the source speaker's voice data and the parallel data;
a voice conversion module, configured to generate the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The voice conversion system for non-parallel data provided by the embodiment of the invention has the following beneficial effects: the parallel data generation module uses the target speaker speech synthesis model to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model training module trains the spectral parameter conversion model on that parallel data, and the voice conversion module performs voice conversion with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate the invention and, together with the embodiments, serve to explain it. In the drawings:
FIG. 1 is a flow chart of a voice conversion method for non-parallel data according to an embodiment of the invention;
FIG. 2 is a block diagram of a speech conversion system for non-parallel data in accordance with an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are only intended to illustrate and explain the present invention, not to limit it.
An embodiment of the invention provides a voice conversion method for non-parallel data. As shown in FIG. 1, the method comprises the following steps:
step 1: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
step 2: generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
step 3: training a spectral parameter conversion model using the source speaker's voice data and the parallel data;
step 4: generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The working principle of this technical scheme is as follows: there is no parallel speech data between the source speaker's voice data and the target speaker's voice data. Illustratively, speaker A is the source speaker and speaker B is the target speaker; the invention obtains speech in speaker B's voice from speaker A's voice data, and this speech preserves speaker A's spoken content while carrying speaker B's timbre.
In the invention, a target speaker speech synthesis model is first trained using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker; parallel data corresponding to the target speaker is then generated with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data; next, a spectral parameter conversion model is trained using the source speaker's voice data and the parallel data; finally, the converted target speaker speech is generated from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The beneficial effects of this technical scheme are as follows: the target speaker speech synthesis model is used to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model is trained on that parallel data, and voice conversion is then performed with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
In one embodiment, step 1, training the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S11: training a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker;
step S12: training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data.
The working principle of this technical scheme is as follows: a basic speech synthesis model is first trained on the large-scale speech synthesis corpus data; the target speaker speech synthesis model is then obtained by further training this basic model on the target speaker's voice data.
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker.
In one embodiment, step S11, training the basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S111: training a basic phoneme duration prediction model based on a deep neural network, taking the phoneme representation corresponding to the text data in the corpus as input and the phoneme durations obtained through a forced alignment algorithm as output;
step S112: performing frame expansion on the phoneme representation using the phoneme durations obtained by the forced alignment algorithm;
step S113: training a basic spectral parameter prediction model, taking the frame-expanded phoneme representation as input and the spectral parameters corresponding to the speech data in the corpus as output;
step S114: training a basic vocoder model, taking the spectral parameters as input and the speech in the corpus as output.
The working principle of this technical scheme is as follows: a basic phoneme duration prediction model is first trained for duration prediction, a basic spectral parameter prediction model is then trained for spectral prediction, and a basic vocoder model is finally trained.
In step S113, the basic spectral parameter prediction model adopts a Tacotron model based on an encoder-decoder framework; the attention module is removed from the Tacotron model because the input and output have already been force-aligned in advance.
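The frame expansion and the attention-free spectral prediction can be illustrated with a minimal PyTorch sketch of steps S111 to S113. The phoneme inventory size, embedding width, 80-dimensional mel-style spectral parameters, network sizes, and the shared embedding are illustrative assumptions, not values taken from the patent:

import torch
import torch.nn as nn

N_PHONEMES, EMB_DIM, N_MELS = 100, 128, 80   # assumed sizes, not specified by the patent

class DurationPredictor(nn.Module):
    """Basic phoneme duration prediction model (step S111): phoneme ids -> frames per phoneme."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_PHONEMES, EMB_DIM)
        self.mlp = nn.Sequential(nn.Linear(EMB_DIM, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, phoneme_ids):                          # (T_phonemes,)
        return self.mlp(self.emb(phoneme_ids)).squeeze(-1)   # predicted durations in frames

def expand_frames(phoneme_vectors, durations):
    """Frame expansion (step S112): repeat each phoneme vector for its duration in frames."""
    return torch.repeat_interleave(phoneme_vectors, durations, dim=0)   # (T_frames, EMB_DIM)

class SpectralPredictor(nn.Module):
    """Basic spectral parameter prediction model (step S113): an attention-free encoder-decoder
    stand-in for the Tacotron-style model, usable because input and output are frame-aligned."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(EMB_DIM, 256, batch_first=True)
        self.decoder = nn.GRU(256, 256, batch_first=True)
        self.proj = nn.Linear(256, N_MELS)

    def forward(self, expanded):                  # (T_frames, EMB_DIM)
        h, _ = self.encoder(expanded.unsqueeze(0))
        h, _ = self.decoder(h)
        return self.proj(h).squeeze(0)            # (T_frames, N_MELS)

# Toy forward pass: 5 phonemes with forced-alignment durations of 3, 5, 4, 6, and 2 frames.
dur_model, spec_model = DurationPredictor(), SpectralPredictor()
phonemes = torch.tensor([3, 17, 42, 8, 25])
durations = torch.tensor([3, 5, 4, 6, 2])
expanded = expand_frames(dur_model.emb(phonemes), durations)   # embedding reused here for brevity
mel = spec_model(expanded)                                     # spectral parameters, shape (20, 80)

Because the phoneme representation is already expanded to one vector per output frame, the spectral predictor maps input to output frame by frame, which is why the Tacotron-style model can drop its attention module here.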
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the basic speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker.
In one embodiment, step S12, training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data, comprises the following steps:
step S121: retraining the basic phoneme duration prediction model with the target speaker's text data and voice data to obtain a target speaker phoneme duration prediction model;
step S122: retraining the basic spectral parameter prediction model with the target speaker's text data and voice data to obtain a target speaker spectral parameter prediction model;
step S123: retraining the basic vocoder model with the target speaker's voice data to obtain a target vocoder model;
step S124: taking the target speaker spectral parameter prediction model and the target vocoder model as the target speaker speech synthesis model.
The working principle of this technical scheme is as follows: the basic phoneme duration prediction model, the basic spectral parameter prediction model, and the basic vocoder model are each retrained on the target speaker's data, yielding the target speaker phoneme duration prediction model, the target speaker spectral parameter prediction model, and the target vocoder model, respectively.
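The retraining in steps S121 to S123 amounts to fine-tuning each base model on the target speaker's data. A minimal sketch, assuming PyTorch models such as those in the previous sketch and an existing DataLoader of target speaker (input, target) pairs; the checkpoint path, learning rate, and epoch count are illustrative:

import torch
import torch.nn as nn

def finetune(model, target_loader, loss_fn, epochs=10, lr=1e-4):
    """Retrain a base model on target speaker data (used analogously in steps S121, S122 and S123)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # a small learning rate keeps the adapted
    model.train()                                        # model close to the base model
    for _ in range(epochs):
        for inputs, targets in target_loader:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()
    return model

# Hypothetical usage (checkpoint path and data loader are assumptions):
# spec_model.load_state_dict(torch.load("base_spectral_predictor.pt"))
# target_spec_model = finetune(spec_model, target_speaker_loader, nn.L1Loss())
# Analogous calls adapt the duration predictor (step S121) and the vocoder (step S123).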
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data.
In one embodiment, step 2, generating parallel data corresponding to the target speaker with the target speaker speech synthesis model based on the text corresponding to the source speaker's voice data, comprises the following steps:
step S21: inputting the phoneme representation corresponding to the source speaker's text data into the target speaker phoneme duration prediction model to obtain the target speaker phoneme durations;
step S22: performing frame expansion on the phoneme representation according to the target speaker phoneme durations to obtain a frame-expanded phoneme representation;
step S23: inputting the frame-expanded phoneme representation into the target speaker spectral parameter prediction model to obtain the target speaker spectral parameters;
step S24: taking the target speaker spectral parameters as the parallel data.
The working principle of this technical scheme is as follows: the phoneme representation corresponding to the source speaker's text data is fed into the target speaker phoneme duration prediction model; the phoneme representation is frame-expanded according to the predicted target speaker phoneme durations and then fed into the target speaker spectral parameter prediction model, yielding the target speaker spectral parameters, i.e. the parallel data. The parallel data has the same text content as the source speaker's voice data.
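A minimal sketch of steps S21 to S24, assuming the fine-tuned target speaker models take the form of the DurationPredictor and SpectralPredictor classes from the earlier sketch; rounding and clamping the predicted durations are illustrative choices, not steps taken from the patent:

import torch

def make_parallel_data(phoneme_ids, target_dur_model, target_spec_model):
    """Generate target speaker spectral parameters for the source speaker's text (steps S21-S24)."""
    with torch.no_grad():
        durations = target_dur_model(phoneme_ids).round().clamp(min=1).long()  # S21: target durations
        emb = target_dur_model.emb(phoneme_ids)                                # phoneme representation
        expanded = torch.repeat_interleave(emb, durations, dim=0)              # S22: frame expansion
        return target_spec_model(expanded)                                     # S23/S24: the parallel data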
The beneficial effects of this technical scheme are as follows: specific steps are provided for generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data.
In one embodiment, step 3, training the spectral parameter conversion model using the source speaker's voice data and the parallel data, comprises the following steps:
step S31: extracting source speaker spectral parameters from the source speaker's voice data;
step S32: frame-aligning the source speaker spectral parameters with the parallel data using dynamic time warping;
step S33: training the spectral parameter conversion model, taking the frame-aligned source speaker spectral parameters as input and the frame-aligned parallel data as output.
The working principle of this technical scheme is as follows: the extracted source speaker spectral parameters and the parallel data generally differ in length, so the two sequences must be frame-aligned, specifically using dynamic time warping (DTW).
In step S33, the spectral parameter conversion model adopts a Tacotron model based on an encoder-decoder framework; the attention module is removed from the Tacotron model because the input and output have already been frame-aligned in advance.
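The frame alignment of step S32 can be written as a plain dynamic-programming DTW over the Euclidean distance between spectral frames. This is a self-contained NumPy sketch; the feature dimensionality, the symmetric step pattern, and the random test data are assumptions made for illustration:

import numpy as np

def dtw_align(src, tgt):
    """Frame-align two spectral parameter sequences of shape (T_src, D) and (T_tgt, D) (step S32)."""
    n, m = len(src), len(tgt)
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)   # (n, m) Euclidean distances
    cost = np.full((n + 1, m + 1), np.inf)                              # accumulated-cost matrix
    cost[0, 0] = 0.0
    for i in range(1, n + 1):                                           # standard 3-way step pattern
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m                                               # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i = i - 1
        else:
            j = j - 1
    path.reverse()
    idx_src, idx_tgt = zip(*path)
    return src[list(idx_src)], tgt[list(idx_tgt)]                       # equal-length aligned sequences

# Example: align 120 source frames with 135 synthesized target frames of 80-dim spectra.
src = np.random.randn(120, 80).astype(np.float32)
tgt = np.random.randn(135, 80).astype(np.float32)
aligned_src, aligned_tgt = dtw_align(src, tgt)
assert len(aligned_src) == len(aligned_tgt)

The equal-length pairs returned here are exactly the input/output pairs used to train the conversion model in step S33.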
The beneficial effects of this technical scheme are as follows: specific steps are provided for training the spectral parameter conversion model using the source speaker's voice data and the parallel data.
In one embodiment, step 4, generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model, comprises the following steps:
step S41: extracting source speaker spectral parameters from the input source speaker voice data;
step S42: inputting the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters;
step S43: inputting the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
The working principle of this technical scheme is as follows: in step S43, the converted spectral parameters are input to the target vocoder model obtained in step S123, producing the converted speech, which preserves the content of the input speech while carrying the target speaker's timbre. In this way, the target speaker's speech is obtained from the input source speaker voice data through the above conversion steps.
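A minimal end-to-end sketch of steps S41 to S43 under stated assumptions: log-mel spectrograms stand in for the patent's spectral parameters, an untrained GRU network stands in for the trained spectral parameter conversion model, and librosa's Griffin-Lim mel inversion stands in for the trained target vocoder model; the file name and analysis settings are illustrative:

import librosa
import numpy as np
import torch
import torch.nn as nn

SR, N_FFT, HOP, N_MELS = 16000, 1024, 256, 80      # assumed analysis settings

class SpectralConverter(nn.Module):
    """Stand-in for the trained spectral parameter conversion model (source frames -> target frames)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_MELS, 256, batch_first=True)
        self.proj = nn.Linear(256, N_MELS)

    def forward(self, mel):                          # (T, N_MELS)
        h, _ = self.rnn(mel.unsqueeze(0))
        return self.proj(h).squeeze(0)

# Step S41: extract spectral parameters from the input source speaker speech (assumed input file).
wav, _ = librosa.load("source_speaker.wav", sr=SR)
mel = librosa.feature.melspectrogram(y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
log_mel = torch.from_numpy(np.log(mel.T + 1e-6)).float()       # (T, N_MELS)

# Step S42: convert the spectral parameters with the (here untrained) conversion model.
converter = SpectralConverter()
with torch.no_grad():
    converted = converter(log_mel)

# Step S43: the patent feeds the converted parameters to the target vocoder model; Griffin-Lim
# inversion is used here only as a runnable placeholder for that trained vocoder.
converted_mel = np.exp(converted.numpy().T)
out = librosa.feature.inverse.mel_to_audio(converted_mel, sr=SR, n_fft=N_FFT, hop_length=HOP)
# 'out' is the converted waveform; with the trained models it carries the target speaker's timbre.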
The beneficial effects of this technical scheme are as follows: specific steps are provided for generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
As shown in FIG. 2, an embodiment of the invention provides a voice conversion system for non-parallel data, comprising:
a target speaker speech synthesis model training module 201, configured to train a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data;
a parallel data generation module 202, configured to generate parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
a spectral parameter conversion model training module 203, configured to train a spectral parameter conversion model using the source speaker's voice data and the parallel data;
a voice conversion module 204, configured to generate the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The working principle of this technical scheme is as follows: there is no parallel speech data between the source speaker's voice data and the target speaker's voice data. Illustratively, speaker A is the source speaker and speaker B is the target speaker; the system obtains speech in speaker B's voice from speaker A's voice data, and this speech preserves speaker A's spoken content while carrying speaker B's timbre.
In the invention, the target speaker speech synthesis model training module 201 trains the target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker; the parallel data generation module 202 generates parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data; the spectral parameter conversion model training module 203 trains the spectral parameter conversion model using the source speaker's voice data and the parallel data; and the voice conversion module 204 generates the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model.
The beneficial effects of this technical scheme are as follows: the parallel data generation module uses the target speaker speech synthesis model to synthesize high-quality pseudo-parallel data, the spectral parameter conversion model training module trains the spectral parameter conversion model on that parallel data, and the voice conversion module performs voice conversion with the spectral parameter conversion model and the target speaker speech synthesis model, which ensures conversion quality.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from its spirit or scope. It is therefore intended that the present invention cover such modifications and variations provided they fall within the scope of the appended claims and their equivalents.

Claims (8)

1. A voice conversion method for non-parallel data, the method comprising the following steps:
step 1: training a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data; wherein step 1 further comprises,
step S11: training a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker;
step S12: training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data;
step 2: generating parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
step 3: training a spectral parameter conversion model using the source speaker's voice data and the parallel data;
step 4: generating the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model; wherein step 4 further comprises,
step S41: extracting source speaker spectral parameters from the input source speaker voice data;
step S42: inputting the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters;
step S43: inputting the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
2. The method according to claim 1, wherein step S11, training the basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, comprises the following steps:
step S111: training a basic phoneme duration prediction model based on a deep neural network, taking the phoneme representation corresponding to the text data in the corpus as input and the phoneme durations obtained through a forced alignment algorithm as output;
step S112: performing frame expansion on the phoneme representation using the phoneme durations obtained by the forced alignment algorithm;
step S113: training a basic spectral parameter prediction model, taking the frame-expanded phoneme representation as input and the spectral parameters corresponding to the speech data in the corpus as output;
step S114: training a basic vocoder model, taking the spectral parameters as input and the speech in the corpus as output.
3. The method according to claim 2, wherein in the step S113, the basic spectral parameter prediction model uses a Tacotron model based on an encoder-decoder framework.
4. The method according to claim 2, wherein step S12, training the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data, comprises the following steps:
step S121: retraining the basic phoneme duration prediction model with the target speaker's text data and voice data to obtain a target speaker phoneme duration prediction model;
step S122: retraining the basic spectral parameter prediction model with the target speaker's text data and voice data to obtain a target speaker spectral parameter prediction model;
step S123: retraining the basic vocoder model with the target speaker's voice data to obtain a target vocoder model;
step S124: taking the target speaker spectral parameter prediction model and the target vocoder model as the target speaker speech synthesis model.
5. The method according to claim 4, wherein step 2, generating parallel data corresponding to the target speaker with the target speaker speech synthesis model based on the text corresponding to the source speaker's voice data, comprises the following steps:
step S21: inputting the phoneme representation corresponding to the source speaker's text data into the target speaker phoneme duration prediction model to obtain the target speaker phoneme durations;
step S22: performing frame expansion on the phoneme representation according to the target speaker phoneme durations to obtain a frame-expanded phoneme representation;
step S23: inputting the frame-expanded phoneme representation into the target speaker spectral parameter prediction model to obtain the target speaker spectral parameters;
step S24: taking the target speaker spectral parameters as the parallel data.
6. The method according to claim 1, wherein step 3, training the spectral parameter conversion model using the source speaker's voice data and the parallel data, comprises the following steps:
step S31: extracting source speaker spectral parameters from the source speaker's voice data;
step S32: frame-aligning the source speaker spectral parameters with the parallel data using dynamic time warping;
step S33: training the spectral parameter conversion model, taking the frame-aligned source speaker spectral parameters as input and the frame-aligned parallel data as output.
7. The method according to claim 6, wherein in step S33, the spectral parameter conversion model adopts a Tacotron model based on an encoder-decoder framework.
8. A voice conversion system for non-parallel data, comprising:
a target speaker speech synthesis model training module, configured to train a target speaker speech synthesis model using large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, wherein the corpus data comprises paired text and speech data; the module is further configured to train a basic speech synthesis model using the large-scale speech synthesis corpus data from speakers other than the source speaker and the target speaker, and to train the target speaker speech synthesis model from the basic speech synthesis model using the target speaker's voice data;
a parallel data generation module, configured to generate parallel data corresponding to the target speaker with the target speaker speech synthesis model, based on the text corresponding to the source speaker's voice data, wherein the parallel data and the source speaker's voice data correspond to the same text content;
a spectral parameter conversion model training module, configured to train a spectral parameter conversion model using the source speaker's voice data and the parallel data;
a voice conversion module, configured to generate the converted target speaker speech from the source speaker's voice data using the spectral parameter conversion model and the target speaker speech synthesis model; the module is further configured to extract source speaker spectral parameters from the input source speaker voice data, to input the source speaker spectral parameters into the spectral parameter conversion model to obtain converted spectral parameters, and to input the converted spectral parameters into the target speaker speech synthesis model to obtain the converted target speaker voice data.
CN202010860346.4A, filed 2020-08-25, priority 2020-08-25: Voice conversion method and system for non-parallel data (Active; granted as CN111968617B)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010860346.4A CN111968617B (en) 2020-08-25 2020-08-25 Voice conversion method and system for non-parallel data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010860346.4A CN111968617B (en) 2020-08-25 2020-08-25 Voice conversion method and system for non-parallel data

Publications (2)

Publication Number Publication Date
CN111968617A CN111968617A (en) 2020-11-20
CN111968617B (en) 2024-03-15

Family

ID=73391241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010860346.4A Active CN111968617B (en) 2020-08-25 2020-08-25 Voice conversion method and system for non-parallel data

Country Status (1)

Country Link
CN (1) CN111968617B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530400A (en) * 2020-11-30 2021-03-19 清华珠三角研究院 Method, system, device and medium for generating voice based on text of deep learning
CN112489629A (en) * 2020-12-02 2021-03-12 北京捷通华声科技股份有限公司 Voice transcription model, method, medium, and electronic device
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care
CN114708849A (en) * 2022-04-27 2022-07-05 网易(杭州)网络有限公司 Voice processing method and device, computer equipment and computer readable storage medium


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751922A (en) * 2009-07-22 2010-06-23 中国科学院自动化研究所 Text-independent speech conversion system based on HMM model state mapping
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
WO2017067206A1 (en) * 2015-10-20 2017-04-27 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and device
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN109584893A (en) * 2018-12-26 2019-04-05 南京邮电大学 Based on the multi-to-multi speech conversion system of VAE and i-vector under non-parallel text condition
CN110136686A (en) * 2019-05-14 2019-08-16 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu i vector
CN110136691A (en) * 2019-05-28 2019-08-16 广州多益网络股份有限公司 A kind of speech synthesis model training method, device, electronic equipment and storage medium
CN111223474A (en) * 2020-01-15 2020-06-02 武汉水象电子科技有限公司 Voice cloning method and system based on multi-neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Voice conversion under non-parallel corpus conditions based on CycleGAN networks; Li Tao; China Master's Theses Full-text Database (Information Science and Technology), No. 2; I136-351 *

Also Published As

Publication number Publication date
CN111968617A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111968617B (en) Voice conversion method and system for non-parallel data
Wang et al. Neural source-filter waveform models for statistical parametric speech synthesis
Weng et al. Deep learning enabled semantic communications with speech recognition and synthesis
Han et al. Semantic-preserved communication system for highly efficient speech transmission
CN101751922B (en) Text-independent speech conversion system based on HMM model state mapping
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
Ai et al. A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis
CN110767210A (en) Method and device for generating personalized voice
KR102505927B1 (en) Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation
CN112002303B (en) End-to-end speech synthesis training method and system based on knowledge distillation
CN111986646B (en) Dialect synthesis method and system based on small corpus
CN111128211B (en) Voice separation method and device
US20230282202A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
CN116018638A (en) Synthetic data enhancement using voice conversion and speech recognition models
Takamichi et al. Sampling-based speech parameter generation using moment-matching networks
CN111326170B (en) Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution
Shah et al. Novel MMSE DiscoGAN for cross-domain whisper-to-speech conversion
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
Tobing et al. Voice conversion with CycleRNN-based spectral mapping and finely tuned WaveNet vocoder
Ai et al. Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders
CN105023574A (en) Method and system of enhancing TTS
CN102196100A (en) Instant call translation system and method
CN107464569A (en) Vocoder
CN117079637A (en) Mongolian emotion voice synthesis method based on condition generation countermeasure network
CN112242134A (en) Speech synthesis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant