CN114495898A - Training method and system for unified speech synthesis and speech conversion - Google Patents

Training method and system for unified speech synthesis and speech conversion

Info

Publication number
CN114495898A
Authority
CN
China
Prior art keywords
speech
information
speaker
conversion
content information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210395964.5A
Other languages
Chinese (zh)
Other versions
CN114495898B (en)
Inventor
陶建华
汪涛
易江燕
傅睿博
张震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202210395964.5A
Publication of CN114495898A
Application granted
Publication of CN114495898B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a training method and system for unified speech synthesis and speech conversion. The method comprises the following steps: decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information, where the content information is speaker-independent linguistic information, the speaker information comprises the characteristics of the speaker, and the prosody information indicates how the speaker utters the content and reflects the rhythm of the speech; and feeding the extracted content information, speaker information, and prosody information into a decoding task to reconstruct the speech. The proposed scheme unifies the speech synthesis and speech conversion models, avoiding the difficulty of building them separately, and its use of unlabeled speech improves the performance of both speech synthesis and speech conversion.

Description

Training method and system for unified speech synthesis and speech conversion
Technical Field
The invention belongs to the technical field of voice cloning, and particularly relates to a training method and system for unified speech synthesis and speech conversion.
Background
Cloning the voice of a target speaker is an attractive technology with applications in many scenarios, such as entertainment content creation, personalized mobile assistants, and the security field. The ideal voice-cloning setting references only a single utterance from a previously unseen target speaker and then synthesizes arbitrary speech in that speaker's voice; this is known as one-shot (single-sample) voice cloning. In speech research, speech synthesis and voice conversion are the two mainstream ways to implement voice cloning, and the two technologies have historically been studied and developed separately as independent tasks.
TTS: text-to-speech, i.e., speech synthesis;
VC: voice conversion;
Although TTS and VC are two important methods for voice cloning, they have historically been studied and developed separately as independent tasks, with little interaction between them. The difficulty lies in the fact that the two tasks use different representations of speech content. Specifically, the speech content in TTS is obtained from text, and text and speech are two sequences of unequal length, which usually require an attention mechanism to align them. However, the attention mechanism is often influenced by speaker information, so a speaker-independent representation of the speech content cannot be learned. In VC, by contrast, the source and target speech are aligned in content, so the speech content can be extracted directly from the source speech, unlike in TTS.
Disclosure of Invention
In order to solve the above technical problems, the present invention provides a training method and system for unified speech synthesis and speech conversion.
The invention discloses a training method for unified speech synthesis and speech conversion in a first aspect, which comprises the following steps:
step S1, decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and step S2, inputting the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
According to the method of the first aspect of the invention, in the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech.
According to the method of the first aspect of the present invention, in the subtask of extracting the content information, for speech conversion the content information of the speech is extracted from the source speech directly using a content encoder, since the source speech is aligned with the target speech.
According to the method of the first aspect of the invention, the encoder for extracting the speaker information is shared between the tasks of speech synthesis and speech conversion, and the speaker information is extracted directly from speech without text.
According to the method of the first aspect of the present invention, the encoder for extracting the prosody information is shared between the tasks of speech synthesis and speech conversion, and fundamental frequency information is directly extracted from speech as the prosody information.
According to the method of the first aspect of the present invention, in the training phase, the content information of the speech and the speaker information are used as input to predict the fundamental frequency information.
According to the method of the first aspect of the invention, the overall loss function of speech synthesis and speech conversion comprises three parts, the loss function of speech synthesis, the loss function of speech conversion and the additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the method for constructing the additional content information loss function comprises the following steps:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
The second aspect of the present invention discloses a training system for unified speech synthesis and speech conversion, the system comprising:
the first processing module is configured to decouple the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and the second processing module is configured to input the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
According to the system of the second aspect of the present invention, the first processing module is configured such that, in the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech.
According to the system of the second aspect of the present invention, the first processing module is configured such that, in the subtask of extracting the content information, for speech conversion the content information of the speech is extracted from the source speech directly using a content encoder, since the source speech is aligned with the target speech.
According to the system of the second aspect of the present invention, the first processing module is configured such that the encoder for extracting the prosody information is shared between the tasks of speech synthesis and speech conversion, and fundamental frequency information is directly extracted from speech as the prosody information.
According to the system of the second aspect of the present invention, the first processing module is configured to predict the fundamental frequency information by using the content information of the speech and the speaker information as input in the training stage.
According to the system of the second aspect of the invention, the first processing module is configured such that the total loss function of speech synthesis and speech conversion comprises three parts, namely a loss function of speech synthesis, a loss function of speech conversion, and an additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the constructing of the additional content information loss function comprises:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor executes the computer program to realize the steps of the training method for unified speech synthesis and speech conversion in any one of the first aspect of the disclosure.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for training unified speech synthesis and speech conversion according to any one of the first aspect of the present disclosure.
The scheme provided by the invention has the following beneficial effects:
1. The speech synthesis and speech conversion models are unified, avoiding the difficulty of building them separately.
2. The use of unlabeled speech improves the performance of both speech synthesis and speech conversion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flow chart of a training method for unified speech synthesis and speech conversion according to an embodiment of the present invention;
FIG. 2 is a block diagram of a training method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a subtask according to an embodiment of the invention;
FIG. 4 is a block diagram of a unified speech synthesis and speech conversion training system according to an embodiment of the present invention;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first aspect of the invention discloses a training method for unified speech synthesis and speech conversion. FIG. 1 is a flowchart of a training method for unified speech synthesis and speech conversion according to an embodiment of the present invention; as shown in FIG. 1, the method includes:
step S1, decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and step S2, inputting the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
In step S1, the encoding task of speech synthesis and speech conversion is decoupled into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information. The content information is speaker-independent linguistic information; the speaker information comprises speaker characteristics such as timbre and volume; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech.
In some embodiments, modeling accurate speech content information is important for generating intelligible speech signals. Since TTS and VC take different types of input signals, their methods of extracting content information differ. In the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech, as sketched below. For speech conversion, a content encoder is used to extract the content information directly from the source speech, since the source speech is aligned with the target speech.
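As a non-limiting illustration, the duration-based upsampling can be sketched in Python/PyTorch as follows; the function and variable names are illustrative and not taken from the patent:

import torch

def upsample_by_duration(context, durations):
    # context:   (num_phonemes, hidden_dim) output of the text encoder
    # durations: (num_phonemes,) integer number of frames per phoneme
    # returns:   (total_frames, hidden_dim) frame-level content information
    # Each phoneme vector is repeated `duration` times, aligning the
    # text-derived content with the frame rate of the speech features.
    return torch.repeat_interleave(context, durations, dim=0)

# Example: 3 phonemes lasting 2, 4 and 3 frames yield 9 content frames.
context = torch.randn(3, 256)
durations = torch.tensor([2, 4, 3])
assert upsample_by_duration(context, durations).shape == (9, 256)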
In some embodiments, the encoder for extracting the speaker information is shared between the speech synthesis and speech conversion tasks, and the speaker information is extracted directly from speech without text; the large amount of speech without text labels in the VC task can therefore be used for training, which helps improve the transfer-learning capability of the model.
In some embodiments, fundamental frequency information is extracted as the prosody information, since the fundamental frequency reflects the rhythm of the speech.
In some embodiments, the encoder for extracting the prosody information is shared between the speech synthesis and speech conversion tasks, and fundamental frequency information is extracted directly from speech as the prosody information; in the training stage, the content information of the speech and the speaker information are used as input to predict the fundamental frequency information, as sketched below.
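A minimal sketch of such a pitch predictor follows; the patent fixes only its inputs (content plus speaker information) and its target (F0), so the two-layer convolutional architecture here is an assumption for illustration:

import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    # Predicts a frame-level F0 contour from content plus speaker information.
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(dim, 1)  # one F0 value per frame

    def forward(self, content, speaker):
        # The speaker embedding is broadcast over time and added, mirroring
        # the additive combination of representations used in the method.
        x = content + speaker.unsqueeze(1)               # (B, T, dim)
        x = self.net(x.transpose(1, 2)).transpose(1, 2)  # 1-D convs over time
        return self.proj(x).squeeze(-1)                  # (B, T)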
In some embodiments, the overall loss function for speech synthesis and speech conversion includes three parts, a loss function for speech synthesis, a loss function for speech conversion, and an additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the method for constructing the additional content information loss function comprises the following steps:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
Detailed description of the preferred embodiment
A training process:
1. First, train on the input TTS data, i.e., text x and its corresponding speech y.
Extract the fundamental frequency information F0 and the speaker information S from y, and derive the prosody information P from F0;
extract the spoken content information C from x.
The extraction process of these three kinds of information can be formulated as:
C = VQ(text_encoder(x))
S = speaker_encoder(y)
P = prosody_encoder(y)
VQ(·) denotes vector quantization;
text_encoder(·) denotes the text encoder;
speaker_encoder(·) denotes the speaker information encoder;
prosody_encoder(·) denotes the prosody information encoder.
Finally, the three representations are added together and fed into the decoder module to reconstruct y:
y’ = decoder(C + S + P)
F0’ = pitch_predictor(C+S)
The loss of this process consists of two parts: the reconstruction loss of y and the reconstruction loss of the fundamental frequency:
L_TTS = L_recon(y, y') + L_recon(F0, F0')
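A minimal sketch of this TTS training step, assuming mel-spectrogram targets and L1/MSE reconstruction distances (the patent states the loss terms but not the distance measures); m is a dictionary holding the modules named in the formulas above:

import torch.nn.functional as F

def tts_step(x, y, f0, m):
    C = m["vq"](m["text_encoder"](x))     # C = VQ(text_encoder(x))
    S = m["speaker_encoder"](y)           # S = speaker_encoder(y)
    P = m["prosody_encoder"](y)           # P = prosody_encoder(y)
    y_hat = m["decoder"](C + S + P)       # y' = decoder(C + S + P)
    f0_hat = m["pitch_predictor"](C + S)  # F0' = pitch_predictor(C + S)
    # L_TTS = L_recon(y, y') + L_recon(F0, F0')
    loss = F.l1_loss(y_hat, y) + F.mse_loss(f0_hat, f0)
    return loss, C  # C is reused for the content loss on labeled data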
2. Then train on the VC data, using only the speech data y (y includes both unlabeled and labeled data).
Unlike in TTS, the speech content information is here extracted directly from y:
C’ = VQ(content_encoder(y))
S’ = speaker_encoder(y)
P’ = prosody_encoder(y)
Finally, the three representations are added together and fed into the decoder module to reconstruct y:
y’ = decoder(C’ + S’ + P’)
F0’ = pitch_predictor(C’+S’)
The loss of this process likewise consists of two parts: the reconstruction loss of y and the reconstruction loss of the fundamental frequency:
L_VC = L_recon(y, y') + L_recon(F0, F0')
3. Content information loss for TTS and VC.
For labeled data y, different spoken content representations C and C' are obtained in the TTS flow and the VC flow. To unify the TTS and VC frameworks, two measures are taken: on the one hand, the same VQ codebook is used when extracting C and C'; on the other hand, an extra content loss supervises the spoken content on the labeled speech (see the sketch below). The loss is as follows:
L_content = ||C - C'||
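The shared vector quantization can be sketched as follows; the nearest-neighbour lookup with a straight-through gradient is one common realization and is assumed here for illustration, since the patent specifies only that a single codebook is shared by both flows:

import torch
import torch.nn as nn

class SharedVQ(nn.Module):
    # One codebook quantizes the content from both the TTS and the VC flow.
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                        # z: (B, T, dim)
        w = self.codebook.weight
        flat = z.reshape(-1, z.size(-1))
        # squared Euclidean distance from every frame to every codebook entry
        d = flat.pow(2).sum(1, keepdim=True) - 2 * flat @ w.t() + w.pow(2).sum(1)
        q = self.codebook(d.argmin(1)).view_as(z)  # nearest code per frame
        return z + (q - z).detach()                # straight-through gradient

vq = SharedVQ()
z_tts = torch.randn(2, 100, 256)  # stand-in for the text-encoder output
z_vc = torch.randn(2, 100, 256)   # stand-in for the content-encoder output
C, C_prime = vq(z_tts), vq(z_vc)  # the same codebook serves both flows
content_loss = (C - C_prime).abs().mean()  # L_content = ||C - C'||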
Therefore, the total loss consists of three parts:
L_total = L_TTS + L_VC + L_content
testing phase
Depending on the speech generation task required, i.e., whether a speech synthesis task or a voice conversion task is needed, different inference paths are used.
Speech synthesis task: inference is performed by synthesizing the speech of the target speaker in the manner shown on the left side of fig. 2.
Voice conversion task: inference is performed by synthesizing the speech of the target speaker in the manner shown on the right side of fig. 2.
Implementation example:
the proposed unified TTS and VC training framework is shown in fig. 2. Specifically, the structure of each sub-module can be as shown in fig. 3. The number of feedforward transformer (FFT) blocks in the text encoder is 2 and in the decoder block is 6. In each FFT block, the dimension of the hidden state is 256. The kernel size of all one-dimensional convolutions is set to 3. The rate of conjugate was set at 0.5. The last linear layer in the decoder has a dimension of 80, which is consistent with the Mel spectral dimension. The size of the last linear layer in the encoder (text encoder, prosody information encoder, content information encoder) is 256. The Adam optimizer is used to update the parameters. The initial learning rate was 0.001, and the learning rate decreased exponentially. In the inference phase, hifigan is used as a vocoder.
In addition, a duration model needs to be trained separately, which is very common in speech synthesis tasks; in this example it can be implemented with 3 fully connected layers, as sketched below.
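A hedged sketch of such a duration model; the patent specifies only the three fully connected layers, so the layer widths and the per-phoneme scalar output are assumptions:

import torch.nn as nn

class DurationModel(nn.Module):
    # Predicts one duration value per phoneme from the text-encoder output.
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, context):               # (B, num_phonemes, dim)
        return self.net(context).squeeze(-1)  # (B, num_phonemes)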
After the model is constructed as described above, it is trained according to the training process described in the first part of this implementation. Then, following the testing phase described in the second part, both the speech synthesis and the speech conversion tasks can be realized within the same framework.
In conclusion, the solution proposed by the present invention:
1. unifies the speech synthesis and speech conversion models, avoiding the difficulty of building them separately;
2. uses unlabeled speech to improve the performance of both speech synthesis and speech conversion.
The second aspect of the invention discloses a training system for unified speech synthesis and speech conversion. FIG. 4 is a block diagram of a training system for unified speech synthesis and speech conversion according to an embodiment of the present invention; as shown in FIG. 4, the system 100 includes:
a first processing module 101 configured to decouple the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and a second processing module 102 configured to input the extracted content information, speaker information, and prosody information into a decoding task to obtain the reconstructed speech.
According to the system of the second aspect of the present invention, the first processing module 101 is configured such that, in the subtask of extracting the content information, for speech synthesis the source of the speech content is text: the text is encoded using a text encoder to obtain a context representation, and the context representation is upsampled according to the duration information of the phonemes to obtain the content information of the speech.
According to the system of the second aspect of the present invention, the first processing module 101 is configured such that, in the subtask of extracting the content information, for speech conversion the content information of the speech is extracted from the source speech directly using a content encoder, since the source speech is aligned with the target speech.
According to the system of the second aspect of the present invention, the first processing module 101 is configured such that the encoder for extracting the prosody information is shared between the tasks of speech synthesis and speech conversion, and fundamental frequency information is directly extracted from speech as the prosody information.
According to the system of the second aspect of the present invention, the first processing module 101 is configured to predict the fundamental frequency information by using the content information of the speech and the speaker information as input in the training stage.
According to the system of the second aspect of the invention, the first processing module 101 is configured such that the total loss function of speech synthesis and speech conversion comprises three parts, namely a loss function of speech synthesis, a loss function of speech conversion, and an additional content information loss function, i.e.,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the constructing of the additional content information loss function comprises:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor executes the computer program to realize the steps of the training method for unified speech synthesis and speech conversion in any one of the first aspect of the disclosure.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 5, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device, which are connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the electronic device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, Near Field Communication (NFC) or other technologies. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the electronic equipment, an external keyboard, a touch pad or a mouse and the like.
It will be understood by those skilled in the art that the structure shown in fig. 5 is only a partial block diagram related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the solution of the present application is applied, and a specific electronic device may include more or less components than those shown in the drawings, or combine some components, or have a different arrangement of components.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a method for training unified speech synthesis and speech conversion according to any of the first aspect of the present disclosure.
It should be noted that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this description. The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for unified speech synthesis and speech conversion training, the method comprising:
step S1, decoupling the encoding task of speech synthesis and speech conversion into three subtasks, namely extracting content information, extracting speaker information, and extracting prosody information; wherein the content information is speaker-independent linguistic information; the speaker information comprises the characteristics of the speaker; and the prosody information indicates how the speaker utters the content information and reflects the rhythm of the speech;
and step S2, inputting the extracted content information, speaker information, and prosody information into a decoding task to obtain reconstructed speech.
2. The method of claim 1, wherein in the subtask of extracting the content information, for speech synthesis the source of the speech content is text, and a text encoder is used to encode the text to obtain a context representation; the context representation is upsampled according to duration information of the phonemes to obtain the content information of the speech.
3. The method of claim 2, wherein in the sub-task of extracting content information, the content information is extracted from the source speech directly using a content encoder for speech conversion because the source speech is aligned with the target speech.
4. The method of claim 1, wherein the encoder for extracting the speaker information is shared between the speech synthesis and speech conversion tasks, and the speaker information is extracted directly from speech without text.
5. The method of claim 1, wherein the encoder for extracting the prosody information is shared between the speech synthesis and speech conversion tasks, and fundamental frequency information is extracted directly from speech as the prosody information.
6. The method as claimed in claim 5, wherein the fundamental frequency information is predicted by using the content information of the speech and the speaker information as input in the training stage.
7. The method of claim 1, wherein the total loss function of the speech synthesis and the speech conversion comprises three parts, namely a loss function of the speech synthesis, a loss function of the speech conversion and an additional loss function of the content information,
total loss function = loss function of speech synthesis + loss function of speech conversion + additional content information loss function;
the method for constructing the additional content information loss function comprises the following steps:
encoding with a vector quantization codebook to extract the content information of speech synthesis and the content information of speech conversion, respectively;
then, the difference between the content information of speech synthesis and the content information of speech conversion is used to construct the additional content information loss function;
the loss function for speech synthesis consists of two parts, one part is the reconstruction loss of the speech for speech synthesis, and the other part is the reconstruction loss of the fundamental frequency for speech synthesis, i.e.,
loss function of speech synthesis = reconstruction loss of speech synthesis + reconstruction loss of fundamental frequency of speech synthesis;
the loss function for a speech conversion consists of two parts, one part is the reconstruction loss of the speech for the speech conversion, and the other part is the reconstruction loss of the fundamental frequency for the speech conversion, i.e.,
loss function of speech conversion = reconstruction loss of speech conversion + reconstruction loss of fundamental frequency of speech conversion.
8. A training system for unified speech synthesis and speech conversion, the system comprising:
the first processing module is configured to decouple the encoding tasks of voice synthesis and voice conversion into three subtasks, namely extraction of content information, extraction of speaker information and extraction of prosody information; the content information is language information irrelevant to a speaker; the speaker information includes: a characteristic of the speaker; the prosodic information indicates how the speaker speaks the content information and reflects the rhythm of the voice;
and the second processing module is configured to input the extracted content information, speaker information and prosody information into a decoding task to obtain restored voice information.
9. An electronic device, comprising a memory storing a computer program and a processor, wherein the processor, when executing the computer program, implements the steps of a training method for unified speech synthesis and speech conversion according to any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of a method for training unified speech synthesis and speech conversion according to any one of claims 1 to 7.
CN202210395964.5A 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system Active CN114495898B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210395964.5A CN114495898B (en) 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210395964.5A CN114495898B (en) 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system

Publications (2)

Publication Number Publication Date
CN114495898A 2022-05-13
CN114495898B CN114495898B (en) 2022-07-01

Family

ID=81489542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210395964.5A Active CN114495898B (en) 2022-04-15 2022-04-15 Unified speech synthesis and speech conversion training method and system

Country Status (1)

Country Link
CN (1) CN114495898B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064103A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Chinese voice synthetic method and system based on syllable rhythm restricting relationship
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN101751922A (en) * 2009-07-22 2010-06-23 中国科学院自动化研究所 Text-independent speech conversion system based on HMM model state mapping
CN103247293A (en) * 2013-05-14 2013-08-14 中国科学院自动化研究所 Coding method and decoding method for voice data
CN104112444A (en) * 2014-07-28 2014-10-22 中国科学院自动化研究所 Text message based waveform concatenation speech synthesis method
CN112365878A (en) * 2020-10-30 2021-02-12 广州华多网络科技有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112750446A (en) * 2020-12-30 2021-05-04 标贝(北京)科技有限公司 Voice conversion method, device and system and storage medium
CN112786018A (en) * 2020-12-31 2021-05-11 科大讯飞股份有限公司 Speech conversion and related model training method, electronic equipment and storage device
CN112820268A (en) * 2020-12-29 2021-05-18 深圳市优必选科技股份有限公司 Personalized voice conversion training method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114495898B (en) 2022-07-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant