CN116364058A - Speech synthesis method based on a variational autoencoder - Google Patents

Speech synthesis method based on a variational autoencoder

Info

Publication number
CN116364058A
CN116364058A (application number CN202310195823.3A)
Authority
CN
China
Prior art keywords
training
voice
target text
latent variable
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310195823.3A
Other languages
Chinese (zh)
Inventor
谭可华
吕慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Original Assignee
Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyun Rongchuang Data Science & Technology Beijing Co ltd filed Critical Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Priority to CN202310195823.3A priority Critical patent/CN116364058A/en
Publication of CN116364058A publication Critical patent/CN116364058A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech synthesis method based on a variational autoencoder, relating to the technical field of natural language processing. The method comprises the following steps: acquiring the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the speech duration corresponding to the target text; inputting the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain a prior latent variable corresponding to the target text, wherein the speech synthesis model is obtained by training a variational autoencoder; mapping the prior latent variable according to a preset mapping relation to obtain a prior latent variable mapping result, and inputting the mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text; and resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating speech data corresponding to the target text according to the voiceprint features.

Description

Speech synthesis method based on a variational autoencoder
Technical Field
The application relates to the technical field of natural language processing, and in particular to a speech synthesis method based on a variational autoencoder.
Background
With the development of artificial intelligence, speech synthesis technology has advanced rapidly. In the prior art, speech synthesis typically proceeds as follows: the text is first converted into its corresponding linear spectrum, the linear spectrum is then converted into voiceprint data, and the voiceprint data is finally resampled to generate speech.
However, in the above synthesis method, because the voiceprint data used to generate the resampled speech is derived from the original linear spectrum, the timbre of the generated speech is identical to the timbre of the input speech from which the text was obtained, which may raise voice-copyright problems.
Disclosure of Invention
To solve the problem that, in the existing method, the timbre of the synthesized speech is identical to that of the input speech of the text, the application provides a speech synthesis method, apparatus, electronic device, and storage medium based on a variational autoencoder.
In a first aspect, the present application provides a speech synthesis method based on a variational autoencoder, comprising:
acquiring the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the speech duration corresponding to the target text;
inputting the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain a prior latent variable corresponding to the target text, wherein the speech synthesis model is obtained by training a variational autoencoder;
mapping the prior latent variable according to a preset mapping relation to obtain a prior latent variable mapping result, and inputting the mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text;
resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating speech data corresponding to the target text according to the voiceprint features.
As an optional implementation of the embodiment of the present application, before the character identifiers, the phonemes, and the speech duration are input into the prior encoding module of the speech synthesis model to obtain the prior latent variable, the method further includes:
acquiring a training data set, wherein the training data set comprises training speech data, the speech duration corresponding to the training speech data, the training text corresponding to the training speech data, and the phonemes corresponding to the training speech data;
and training the variational autoencoder based on the training data set to obtain the speech synthesis model.
As an optional implementation of the embodiment of the present application, training the variational autoencoder based on the training data set to obtain the speech synthesis model includes:
acquiring the linear spectrum corresponding to the training speech data, and inputting the linear spectrum into the posterior encoding module of the variational autoencoder to obtain a posterior latent variable;
inputting the speech duration, the training text, and the phonemes corresponding to the training speech data into the prior encoding module of the variational autoencoder to obtain the prior latent variable corresponding to the training text;
and adjusting the parameters of the variational autoencoder based on the posterior latent variable and the prior latent variable corresponding to the training text until the KL divergence is smaller than a preset threshold, to obtain the speech synthesis model.
As an optional implementation of the embodiment of the present application, adjusting the parameters of the variational autoencoder based on the posterior latent variable and the prior latent variable corresponding to the training text until the KL divergence is smaller than a preset threshold, to obtain the speech synthesis model, includes:
applying an invertible transformation to the posterior latent variable to obtain an invertible-transformation result of the posterior latent variable;
mapping the prior latent variable corresponding to the training text to obtain a prior latent variable mapping result;
calculating the KL divergence between the invertible-transformation result of the posterior latent variable and the prior latent variable mapping result;
and if the KL divergence is smaller than the preset threshold, determining that the variational autoencoder is the speech synthesis model.
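The training criterion above can be sketched numerically. Assuming, as is common for variational autoencoders but not stated in the patent, that both the transformed posterior and the mapped prior are diagonal Gaussians, the KL divergence has a closed form; the threshold value below is an illustrative assumption.

```python
import math

# Hedged sketch: closed-form KL divergence between two diagonal Gaussians
# N(mu_q, sigma_q^2) (transformed posterior) and N(mu_p, sigma_p^2)
# (mapped prior). The preset threshold value is an assumed example.
def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    return sum(
        math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2 * sp**2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )

THRESHOLD = 0.05  # assumed preset threshold
kl = gaussian_kl([0.0], [1.0], [0.0], [1.0])
print(kl, kl < THRESHOLD)  # identical Gaussians -> KL = 0.0, below threshold
```

When the two distributions coincide the KL divergence is zero, so training drives the prior encoder's output toward the (transformed) posterior.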
As an optional implementation of the embodiment of the present application, the method further includes:
inputting the posterior latent variable into the decoding module of the variational autoencoder to obtain the voiceprint features corresponding to the training speech data;
inputting the voiceprint features corresponding to the training speech data into the generator of a generative adversarial network to generate a speech signal;
and inputting the speech signal into the discriminator of the generative adversarial network, and optimizing the decoding module of the variational autoencoder based on the discrimination result output by the discriminator.
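The adversarial optimization of the decoding module can be sketched with least-squares GAN losses. The patent only says the discriminator's output optimizes the decoder; LSGAN is an assumed, commonly used choice, and the score lists below stand in for discriminator outputs on real and generated speech signals.

```python
# Hedged sketch: least-squares GAN losses, an assumed choice for
# optimizing a TTS decoder against a discriminator.
def discriminator_loss(real_scores, fake_scores):
    # Push scores on real speech toward 1 and on generated speech toward 0.
    return (sum((r - 1.0) ** 2 for r in real_scores) / len(real_scores)
            + sum(f ** 2 for f in fake_scores) / len(fake_scores))

def generator_loss(fake_scores):
    # The decoder is optimized so its outputs score as real (toward 1).
    return sum((f - 1.0) ** 2 for f in fake_scores) / len(fake_scores)

print(generator_loss([1.0, 1.0]))  # perfectly fooled discriminator -> 0.0
```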
As an optional implementation of the embodiment of the present application, resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating the speech data corresponding to the target text according to the voiceprint features, includes:
inputting the prior latent variable corresponding to the target text and the speech duration corresponding to the target text into the duration prediction module of the speech synthesis model to obtain a target speech duration;
and generating the speech data corresponding to the target text according to the voiceprint features and the target speech duration.
As an optional implementation of the embodiment of the present application, acquiring the character identifier corresponding to each character in the target text includes:
acquiring the target text, and determining whether the target text is mixed text of Chinese characters and English characters;
if so, separating the English characters from the Chinese characters to form an English character sequence and a Chinese character sequence;
and converting each character in the English character sequence and the Chinese character sequence into its corresponding character identifier based on a mapping dictionary.
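A minimal sketch of the mixed-text check and separation described above; the Unicode ranges used for "Chinese" and "English" characters are conventional assumptions, not taken from the patent.

```python
import re

# Hedged sketch: detect mixed Chinese/English text and split it into an
# English character sequence and a Chinese character sequence.
# The character ranges below are a common convention (assumption).
def split_mixed(text: str):
    english = "".join(re.findall(r"[A-Za-z]", text))
    chinese = "".join(re.findall(r"[\u4e00-\u9fff]", text))
    is_mixed = bool(english) and bool(chinese)
    return is_mixed, english, chinese

print(split_mixed("我的name"))  # (True, 'name', '我的')
```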
In a second aspect, the present application provides a speech synthesis apparatus based on a variational autoencoder, comprising:
an acquisition module, configured to acquire the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the speech duration corresponding to the target text;
an input module, configured to input the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain a prior latent variable corresponding to the target text, wherein the speech synthesis model is obtained by training a variational autoencoder;
a processing module, configured to map the prior latent variable according to a preset mapping relation to obtain a prior latent variable mapping result, and to input the mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text;
and a generation module, configured to resample the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and to generate speech data corresponding to the target text according to the voiceprint features.
As an optional implementation of the embodiment of the present application, the apparatus further includes:
a training module, configured to acquire a training data set, wherein the training data set comprises training speech data, the speech duration corresponding to the training speech data, the training text corresponding to the training speech data, and the phonemes corresponding to the training speech data;
and to train the variational autoencoder based on the training data set to obtain the speech synthesis model.
As an optional implementation of the embodiment of the present application, the training module is specifically configured to acquire the linear spectrum corresponding to the training speech data, and to input the linear spectrum into the posterior encoding module of the variational autoencoder to obtain a posterior latent variable;
to input the speech duration, the training text, and the phonemes corresponding to the training speech data into the prior encoding module of the variational autoencoder to obtain the prior latent variable corresponding to the training text;
and to adjust the parameters of the variational autoencoder based on the posterior latent variable and the prior latent variable corresponding to the training text until the KL divergence is smaller than a preset threshold, to obtain the speech synthesis model.
As an optional implementation of the embodiment of the present application, the training module is specifically configured to apply an invertible transformation to the posterior latent variable to obtain an invertible-transformation result of the posterior latent variable;
to map the prior latent variable corresponding to the training text to obtain a prior latent variable mapping result;
to calculate the KL divergence between the invertible-transformation result of the posterior latent variable and the prior latent variable mapping result;
and, if the KL divergence is smaller than the preset threshold, to determine that the variational autoencoder is the speech synthesis model.
As an optional implementation of the embodiment of the present application, the apparatus further includes:
an optimization module, configured to input the posterior latent variable into the decoding module of the variational autoencoder to obtain the voiceprint features corresponding to the training speech data;
to input the voiceprint features corresponding to the training speech data into the generator of a generative adversarial network to generate a speech signal;
and to input the speech signal into the discriminator of the generative adversarial network, and to optimize the decoding module of the variational autoencoder based on the discrimination result output by the discriminator.
As an optional implementation of the embodiment of the present application, the generation module is specifically configured to input the prior latent variable corresponding to the target text and the speech duration corresponding to the target text into the duration prediction module of the speech synthesis model to obtain a target speech duration;
and to generate the speech data corresponding to the target text according to the voiceprint features and the target speech duration.
As an optional implementation of the embodiment of the present application, the acquisition module is specifically configured to acquire the target text and determine whether it is mixed text of Chinese characters and English characters;
if so, to separate the English characters from the Chinese characters to form an English character sequence and a Chinese character sequence;
and to convert each character in the English character sequence and the Chinese character sequence into its corresponding character identifier based on a mapping dictionary.
In a third aspect, an embodiment of the present application provides an electronic device, comprising: a memory for storing a computer program, and a processor for performing the variational-autoencoder-based speech synthesis method of the first aspect or any of its optional implementations when the computer program is invoked.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the variational-autoencoder-based speech synthesis method of the first aspect or any of its optional implementations.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages:
The embodiments of the present application provide a speech synthesis method, apparatus, electronic device, and storage medium based on a variational autoencoder. The method comprises: acquiring the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the speech duration corresponding to the target text; inputting the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain a prior latent variable corresponding to the target text, wherein the speech synthesis model is obtained by training a variational autoencoder; mapping the prior latent variable according to a preset mapping relation to obtain a prior latent variable mapping result, and inputting the mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text; and resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating speech data corresponding to the target text according to the voiceprint features. In the embodiments of the present application, the voiceprint features used to generate the speech data are produced by the decoding module from the prior latent variable mapping result, and the prior latent variable is derived from the character identifiers, the phonemes, and the speech duration of the target text rather than from any original input speech of the target text. The timbre of the generated speech therefore differs from the timbre of the original input speech, which avoids voice-copyright problems.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application and the technical solutions of the prior art, the drawings required for their description are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flowchart of the steps of a variational-autoencoder-based speech synthesis method according to one embodiment of the present application;
FIG. 2 is a flowchart of the steps of a variational-autoencoder-based speech synthesis method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a variational-autoencoder-based speech synthesis apparatus according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a variational-autoencoder-based speech synthesis apparatus according to another embodiment of the present application;
FIG. 5 is an internal structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For clarity regarding the purposes, embodiments, and advantages of the present application, the exemplary embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described exemplary embodiments are only some, not all, of the embodiments of the present application.
Based on the exemplary embodiments described herein, all other embodiments obtained by those of ordinary skill in the art without inventive effort fall within the scope of the appended claims. Furthermore, while the disclosure is presented in the context of one or more exemplary embodiments, it should be appreciated that individual aspects of the disclosure may each constitute a complete embodiment. The brief explanations of terms in the present application are provided only to ease understanding of the embodiments described below and are not intended to limit them; unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The variational-autoencoder-based speech synthesis method provided by the embodiments of the present application can be executed by the variational-autoencoder-based speech synthesis apparatus or by the electronic device provided by the embodiments of the present application. The electronic device may be a terminal device or another type of electronic device; its specific type is not limited.
The variational-autoencoder-based speech synthesis method is described below through several embodiments. In order to make the above objects, features, and advantages of the present application more apparent and understandable, the optional embodiments are described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a variational-autoencoder-based speech synthesis method according to an embodiment of the present application. Referring to Fig. 1, the method includes the following steps:
s110, acquiring character identifiers corresponding to the characters in the target text, phonemes included in the target text and voice duration corresponding to the target text.
The target text may be text obtained from input speech or text that is input directly. When the target text is obtained from input speech, the speech duration corresponding to the target text is the duration of the input speech; when the target text is input directly, the audio data corresponding to the target text can be acquired first and the speech duration determined from it.
In this embodiment, the target text may be entirely Chinese text, entirely English text, or mixed Chinese-English text.
For example, acquiring the character identifier corresponding to each character in the target text may be implemented as follows: acquire the target text and determine whether it is mixed text of Chinese and English characters; if so, separate the English characters from the Chinese characters to form an English character sequence and a Chinese character sequence; and convert each character in the two sequences into its corresponding character identifier based on a mapping dictionary.
After the target text is obtained, characters that are neither Chinese nor English are filtered out, and the punctuation in the target text is preprocessed and normalized: only the target punctuation marks are retained, and all other punctuation marks are replaced by target punctuation marks. The target punctuation marks may be those requiring a pause in the speech corresponding to the target text, such as one or more of the comma, period, and semicolon; a mark that does not require a pause, such as "#", is converted into one of the target punctuation marks according to a specific correspondence, for example converting "#" into a comma. Other cases are not enumerated here.
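The filtering and punctuation normalization described above can be sketched as follows; the target punctuation set and the "#"-to-comma correspondence are illustrative assumptions taken from the example, not an exhaustive specification.

```python
import re

# Hedged sketch: keep only Chinese/English characters and a small set of
# "target" pause punctuation; replace any other punctuation with a comma.
# The target set and replacement mark are assumed examples.
TARGET_PUNCT = set("，。；,.;")
PUNCT_REPLACEMENT = "，"  # assumed default: replace e.g. '#' with a comma

def normalize(text: str) -> str:
    out = []
    for ch in text:
        if re.match(r"[\u4e00-\u9fffA-Za-z]", ch):
            out.append(ch)               # Chinese or English character: keep
        elif ch in TARGET_PUNCT:
            out.append(ch)               # target punctuation: keep
        elif not ch.isspace():
            out.append(PUNCT_REPLACEMENT)  # other punctuation: replace
    return "".join(out)

print(normalize("你好#world。"))  # 你好，world。
```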
Converting each character in the English character sequence and the Chinese character sequence into its corresponding character identifier based on a mapping dictionary includes: obtaining the phoneme identifiers corresponding to each character based on the mapping dictionary, wherein the mapping dictionary contains the correspondence between phonemes and phoneme identifiers; and forming the character identifier of each character from the phoneme identifiers corresponding to that character.
For example, each English character in the English character sequence is converted into its corresponding phonetic symbols, the phonemes included in each phonetic symbol are converted into phoneme identifiers based on the mapping dictionary, and the one or more phoneme identifiers corresponding to an English character form that character's identifier. Each Chinese character in the Chinese character sequence is converted into its pinyin plus the digit corresponding to its tone; for example, "中" is converted into "zhong1", "和" into "he2", "码" into "ma3", and "是" into "shi4". The phonemes included in the pinyin are converted into phoneme identifiers based on the mapping dictionary, and the one or more phoneme identifiers of a Chinese character's pinyin, together with the digit corresponding to its tone, constitute that character's identifier.
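The dictionary lookup above can be sketched with a toy mapping dictionary. The phoneme inventory, the identifier values, and the syllable decomposition below are all invented for illustration; the patent does not disclose the actual dictionary contents.

```python
# Hedged sketch with a toy mapping dictionary: convert a pinyin syllable
# with tone digit (e.g. "zhong1") into phoneme IDs plus the tone digit.
# All IDs and decompositions here are assumptions for illustration.
TOY_PHONEME_IDS = {"zh": 1, "ong": 2, "h": 3, "e": 4, "m": 5, "a": 6}

TOY_PINYIN_PHONEMES = {  # syllable -> its constituent phonemes (assumed)
    "zhong": ["zh", "ong"],
    "he": ["h", "e"],
    "ma": ["m", "a"],
}

def char_identifier(pinyin_with_tone: str) -> list:
    """Map e.g. 'zhong1' to its phoneme IDs followed by the tone digit."""
    syllable, tone = pinyin_with_tone[:-1], int(pinyin_with_tone[-1])
    ids = [TOY_PHONEME_IDS[p] for p in TOY_PINYIN_PHONEMES[syllable]]
    return ids + [tone]

print(char_identifier("zhong1"))  # [1, 2, 1]
```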
S120, inputting the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain the prior latent variable corresponding to the target text.
The speech synthesis model is obtained by training a variational autoencoder, and the prior encoding module is a Transformer-based encoder.
The prior latent variable can be understood as the text feature data corresponding to the target text.
The prior latent variable corresponding to the target text and the speech duration corresponding to the target text can be input into the duration prediction module of the speech synthesis model to obtain the target speech duration. The target speech duration controls the length of the generated speech data corresponding to the target text and thereby indirectly controls the speaking rate of the output speech.
It should be noted that, when the target text is mixed Chinese-English text, the order of the character identifiers input to the speech synthesis model matches the order of the characters in the target text. That is, before the character identifiers are input into the speech synthesis model, the identifiers corresponding to the English characters and the Chinese characters are first ordered according to the position of each character in the target text. For example, if the third character of the target text is "my" and the fifth is "and", then in the character-identifier sequence input to the speech synthesis model, the third identifier corresponds to "my" and the fifth to "and".
S130, mapping the prior latent variable according to a preset mapping relation to obtain the prior latent variable mapping result, and inputting the mapping result into the decoding module of the speech synthesis model to obtain the voiceprint data corresponding to the target text.
The prior latent variable is mapped according to a preset mapping relation; specifically, it can be mapped according to the latent-variable mapping of a normalizing flow used in TTS (Text-to-Speech) flow-based models.
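As a toy stand-in for the normalizing-flow mapping, the sketch below uses a single invertible affine map. This is an assumption for illustration only: an actual flow in a VITS-style system is a stack of coupling layers, whose details the patent does not give. What matters for the method is that the mapping is invertible.

```python
# Hedged sketch: an invertible affine map as a minimal normalizing-flow
# step. The scale/shift values are arbitrary illustrative assumptions.
def flow_forward(z, scale=2.0, shift=0.5):
    return [zi * scale + shift for zi in z]

def flow_inverse(z_mapped, scale=2.0, shift=0.5):
    return [(zi - shift) / scale for zi in z_mapped]

z = [0.1, -0.3, 0.7]
roundtrip = flow_inverse(flow_forward(z))
# Invertibility (up to floating-point error) is the defining property.
assert all(abs(a - b) < 1e-12 for a, b in zip(roundtrip, z))
```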
The decoding module of the speech synthesis model is a decoder optimized with a GAN (Generative Adversarial Network), so that the speech signal output by the optimized decoder is almost identical to the speech signal input into the posterior encoding module.
S140, resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating the speech data corresponding to the target text according to the voiceprint features.
Specifically, the prior latent variable corresponding to the target text and the speech duration corresponding to the target text are input into the duration prediction module of the speech synthesis model to obtain the target speech duration; the speech data corresponding to the target text is then generated according to the voiceprint features and the target speech duration.
Illustratively, a mel spectrogram is generated according to the voiceprint features and the target speech duration, and the speech data corresponding to the target text is generated from the mel spectrogram.
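Building a mel spectrogram involves mapping linear frequencies onto the mel scale. The patent does not specify which mel formula it uses; the HTK-style formula below is a common convention and is shown here only as an assumed example.

```python
import math

# Hedged sketch: the HTK-style mel-scale conversion commonly used when
# constructing mel spectrograms (an assumption; the patent does not
# specify its mel variant).
def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # close to 1000 mel by construction
```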
After the speech data corresponding to the target text is generated, it is output as audio.
The variational-autoencoder-based speech synthesis method provided by this embodiment comprises: acquiring the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the speech duration corresponding to the target text; inputting the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain a prior latent variable corresponding to the target text, wherein the speech synthesis model is obtained by training a variational autoencoder; mapping the prior latent variable according to a preset mapping relation to obtain a prior latent variable mapping result, and inputting the mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text; and resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating speech data corresponding to the target text according to the voiceprint features. In this embodiment, the voiceprint features used to generate the speech data are produced by the decoding module from the prior latent variable mapping result, and the prior latent variable is derived from the character identifiers, the phonemes, and the speech duration of the target text rather than from any original input speech of the target text. The timbre of the generated speech therefore differs from that of the original input speech, which avoids voice-copyright problems.
In step S110, before the character identifiers, the phonemes, and the voice duration are input into the prior encoding module of the voice synthesis model, the variational autoencoder needs to be trained to obtain the voice synthesis model. As shown in fig. 2, the model training process may include the following steps S210 to S220. Details already described and illustrated in the embodiment of fig. 1 are not repeated in the embodiment of fig. 2.
S210, acquiring a training data set, wherein the training data set comprises training voice data, voice duration corresponding to the training voice data, training text corresponding to the training voice data and phonemes corresponding to the training voice data.
The training voice data is the audio corresponding to the training text. The training data set includes multiple pieces of training data: one piece of training data may be a sentence, the voice corresponding to the sentence, the audio duration of that voice, and the phonemes included in the sentence; alternatively, one piece of training data may be a passage, the voice corresponding to the passage, the audio duration of that voice, and the phonemes included in the passage; and so on.
The training text in the training data set may include both Chinese-character text and English-character text.
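One piece of training data as described above might be represented as follows; the field names and values are illustrative, not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    text: str                 # sentence or passage (Chinese and/or English)
    audio_path: str           # waveform of a speaker reading the text
    duration_s: float         # audio duration in seconds
    phonemes: list = field(default_factory=list)  # phoneme sequence of the text

rec = TrainingRecord(
    text="你好 world",
    audio_path="clips/0001.wav",      # hypothetical path
    duration_s=1.8,
    phonemes=["n", "i", "h", "ao", "w", "er", "l", "d"],
)
```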
S220, training the variational autoencoder based on the training data set to obtain the voice synthesis model.
Illustratively, the variational autoencoder may be trained as follows:
acquiring the linear spectrum corresponding to the training voice data, and inputting the linear spectrum into the posterior encoding module of the variational autoencoder to obtain a posterior hidden variable; inputting the voice duration corresponding to the training voice data, the training text corresponding to the training voice data, and the phonemes corresponding to the training voice data into the prior encoding module of the variational autoencoder to obtain the prior hidden variable corresponding to the training text; and adjusting the parameters of the variational autoencoder based on the posterior hidden variable and the prior hidden variable corresponding to the training text until the KL divergence is smaller than a preset threshold, thereby obtaining the voice synthesis model.
The posterior encoding module is an encoder built on the WaveNet residual network, and the posterior hidden variable is the spectral feature data of the linear spectrum of the training voice data.
The parameters of the variational autoencoder may be adjusted based on the posterior hidden variable and the prior hidden variable corresponding to the training text, until the KL divergence is smaller than the preset threshold, to obtain the voice synthesis model as follows:
performing a reversible transformation on the posterior hidden variable to obtain a reversible transformation result of the posterior hidden variable; mapping the prior hidden variable corresponding to the training text to obtain a prior hidden variable mapping result; calculating the KL divergence between the reversible transformation result of the posterior hidden variable and the prior hidden variable mapping result; and if the KL divergence is smaller than the preset threshold, determining the variational autoencoder to be the voice synthesis model.
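The KL-divergence criterion can be made concrete when both the transformed posterior and the mapped prior are modeled as diagonal Gaussians, the usual choice for variational autoencoders (the patent does not specify the distribution family). A minimal sketch:

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

z = np.zeros(8)
same = kl_diag_gaussians(z, z, z, z)           # identical Gaussians -> 0
shifted = kl_diag_gaussians(z + 1.0, z, z, z)  # unit mean shift -> positive KL

THRESHOLD = 0.05  # the "preset threshold"; its value is not given in the patent
converged = shifted < THRESHOLD
```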
In this embodiment, the posterior hidden variable obtained from the training voice data (the audio of the training text) is the spectral feature data of the linear spectrum of that audio, and a reversible transformation is applied to it to obtain the reversible transformation result of the posterior hidden variable. The prior hidden variable obtained from the training text is text feature data, and it is mapped to obtain the prior hidden variable mapping result. Training only drives the KL divergence between the reversible transformation result and the prior hidden variable mapping result below the preset threshold, so the former approaches the latter arbitrarily closely without ever being equal to it. The voiceprint data output by the decoder of the voice synthesis model is therefore not identical to the voiceprint data input into the posterior encoding module, which ensures that the timbre of the voice data generated by the voice synthesis model differs from the timbre of the voice originally recorded for the target text.
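Reversible transformations of this kind are typically realized as normalizing flows. An affine coupling layer in the RealNVP style (the patent does not name a specific flow; the scale/shift networks below are toy stand-ins) shows why such a transform is exactly invertible:

```python
import numpy as np

def s_net(h):
    # Toy scale network; a real flow would use a neural network here.
    return np.tanh(h) * 0.5

def t_net(h):
    # Toy shift network.
    return 0.1 * h

def coupling_forward(z):
    """Transform the second half of z conditioned on the untouched first half."""
    z1, z2 = np.split(z, 2)
    y2 = z2 * np.exp(s_net(z1)) + t_net(z1)
    return np.concatenate([z1, y2])

def coupling_inverse(y):
    """Exact inverse: the first half is unchanged, so s and t can be recomputed."""
    y1, y2 = np.split(y, 2)
    z2 = (y2 - t_net(y1)) * np.exp(-s_net(y1))
    return np.concatenate([y1, z2])

z = np.array([0.3, -1.2, 0.7, 2.0])
recovered = coupling_inverse(coupling_forward(z))  # equals z up to float error
```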
During model training, the decoding module of the variational autoencoder also needs to be optimized. Illustratively, the decoding module can be optimized with a generative adversarial network (GAN): inputting the posterior hidden variable into the decoding module of the variational autoencoder to obtain the voiceprint features corresponding to the training voice data; inputting the voiceprint features corresponding to the training voice data into the generator of the generative adversarial network to generate a voice signal; and inputting the voice signal into the discriminator of the generative adversarial network, and optimizing the decoding module of the variational autoencoder based on the discrimination result output by the discriminator.
The decoding module is optimized so that the voice signal output by the optimized decoder preserves the timbre of the voice signal input into the posterior encoding module.
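A single adversarial update of the kind used to optimize the decoder can be sketched with linear stand-ins for the generator (playing the role of the decoding module) and the discriminator; the shapes, learning rate, and least-squares loss are all assumptions, and the hand-written gradients fold constant factors of 2 into the learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((64, 16)) * 0.1  # "generator": voiceprint features -> audio frame
D = rng.standard_normal(64) * 0.1        # "discriminator": scores a frame (real vs. fake)

def gan_step(real_frame, feat, G, D, lr=1e-3):
    """One least-squares-GAN update: discriminator first, then generator."""
    fake = G @ feat
    # Discriminator loss (D(real) - 1)^2 + D(fake)^2 -> push the scores apart.
    grad_D = ((D @ real_frame) - 1.0) * real_frame + (D @ fake) * fake
    D = D - lr * grad_D
    # Generator loss (D(fake) - 1)^2 -> make the fake frame score like a real one.
    grad_G = np.outer(((D @ (G @ feat)) - 1.0) * D, feat)
    G = G - lr * grad_G
    return G, D

real = rng.standard_normal(64)  # a frame of real training audio (stand-in)
feat = rng.standard_normal(16)  # voiceprint features from the decoding module
for _ in range(5):
    G, D = gan_step(real, feat, G, D)
```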
Based on the same inventive concept, and as an implementation of the above method, an embodiment of the present application further provides a speech synthesis apparatus based on a variational autoencoder. The apparatus can perform the speech synthesis method based on the variational autoencoder corresponding to the foregoing method embodiments.
Fig. 3 is a schematic structural diagram of a speech synthesis apparatus based on a variational autoencoder according to an embodiment of the present application. As shown in fig. 3, the speech synthesis apparatus 300 based on a variational autoencoder provided in this embodiment includes:
an obtaining module 310, configured to obtain a character identifier corresponding to each character in a target text, a phoneme included in the target text, and a voice duration corresponding to the target text;
the input module 320 is configured to input the character identifiers, the phonemes, and the speech duration into the prior encoding module of a speech synthesis model to obtain the prior hidden variable corresponding to the target text, where the speech synthesis model is a model obtained by training a variational autoencoder;
the processing module 330 is configured to map the prior hidden variable according to a preset mapping relationship to obtain a mapping result of the prior hidden variable, and input the mapping result of the prior hidden variable to a decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text;
and the generating module 340 is configured to resample the voiceprint data to obtain voiceprint features corresponding to the voiceprint data, and generate voice data corresponding to the target text according to the voiceprint features.
Referring to fig. 4, fig. 4 is a schematic structural diagram of another speech synthesis apparatus based on a variational autoencoder according to an embodiment of the present application, which further includes, on the basis of the apparatus shown in fig. 3:
a training module 410, configured to obtain a training data set, where the training data set includes training speech data, the speech duration corresponding to the training speech data, the training text corresponding to the training speech data, and the phonemes corresponding to the training speech data; and to train the variational autoencoder based on the training data set to obtain the speech synthesis model.
As an optional implementation manner of this embodiment of the present application, the training module 410 is specifically configured to obtain the linear spectrum corresponding to the training speech data, and input the linear spectrum into the posterior encoding module of the variational autoencoder to obtain a posterior hidden variable; input the speech duration corresponding to the training speech data, the training text corresponding to the training speech data, and the phonemes corresponding to the training speech data into the prior encoding module of the variational autoencoder to obtain the prior hidden variable corresponding to the training text; and adjust the parameters of the variational autoencoder based on the posterior hidden variable and the prior hidden variable corresponding to the training text until the KL divergence is smaller than a preset threshold, to obtain the speech synthesis model.
As an optional implementation manner of the embodiment of the present application, the training module 410 is specifically configured to perform a reversible transformation on the posterior hidden variable to obtain a reversible transformation result of the posterior hidden variable; map the prior hidden variable corresponding to the training text to obtain a prior hidden variable mapping result; calculate the KL divergence between the reversible transformation result of the posterior hidden variable and the prior hidden variable mapping result; and if the KL divergence is smaller than the preset threshold, determine the variational autoencoder to be the speech synthesis model.
As an optional implementation manner of the embodiment of the present application, the apparatus further includes:
the optimizing module 420 is configured to input the posterior hidden variable into the decoding module of the variational autoencoder to obtain the voiceprint features corresponding to the training speech data; input the voiceprint features corresponding to the training speech data into the generator of a generative adversarial network to generate a speech signal; and input the speech signal into the discriminator of the generative adversarial network, and optimize the decoding module of the variational autoencoder based on the discrimination result output by the discriminator.
As an optional implementation manner of the embodiment of the present application, the generating module 340 is specifically configured to input, into a duration prediction module of the speech synthesis model, a priori hidden variable corresponding to the target text and a speech duration corresponding to the target text, to obtain a target speech duration; and generating voice data corresponding to the target text according to the voiceprint characteristics and the target voice duration.
As an optional implementation manner of the embodiment of the present application, the obtaining module 310 is specifically configured to determine whether the target text is a mixed text of Chinese characters and English characters; if so, separate the English characters from the Chinese characters to form an English character sequence and a Chinese character sequence; and convert all characters in the English character sequence and the Chinese character sequence into the character identifiers respectively corresponding to the characters based on a mapping dictionary.
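Separating a mixed Chinese/English text and mapping each character to its identifier might look like the following minimal sketch; the mapping dictionary here is a tiny hypothetical vocabulary.

```python
def split_and_encode(text, mapping):
    """Separate English letters from Chinese (CJK) characters, then look up
    each character's identifier in the mapping dictionary."""
    eng = [c for c in text if c.isascii() and c.isalpha()]
    chi = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    return [mapping[c] for c in eng], [mapping[c] for c in chi]

# Hypothetical mapping dictionary; a real one would cover the full vocabulary.
vocab = {ch: i for i, ch in enumerate("abc你好")}
eng_ids, chi_ids = split_and_encode("a你b好c", vocab)  # ([0, 1, 2], [3, 4])
```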
The speech synthesis apparatus based on the variational autoencoder provided in this embodiment may perform the speech synthesis method based on the variational autoencoder provided in the above method embodiments; its implementation principle and technical effects are similar and will not be described here again. The modules in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. Each module may be embedded in hardware, be independent of a processor in the computer device, or be stored as software in a memory in the computer device, so that the processor can call and execute the operations corresponding to the module.
In an embodiment, an electronic device is provided, including a memory storing a computer program and a processor that, when executing the computer program, implements the steps of any of the speech synthesis methods based on a variational autoencoder described in the method embodiments above.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device provided in this embodiment includes: a memory 51 and a processor 52, the memory 51 being configured to store a computer program, and the processor 52 being configured to execute, when invoking the computer program, the steps of the speech synthesis method based on the variational autoencoder provided in the above method embodiment; the implementation principle and technical effects are similar and will not be described here. It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present application and does not limit the electronic device to which the present application is applied; a particular electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of any of the speech synthesis methods based on a variational autoencoder described in the method embodiments above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may comprise the steps of the above-described embodiments of the methods. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the application to enable one skilled in the art to understand or practice the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech synthesis based on a variational autoencoder, comprising:
acquiring the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the voice duration corresponding to the target text;
inputting the character identifiers, the phonemes, and the voice duration into the prior encoding module of a speech synthesis model to obtain the prior hidden variable corresponding to the target text, wherein the speech synthesis model is a model obtained by training a variational autoencoder;
mapping the prior hidden variable according to a preset mapping relation to obtain a prior hidden variable mapping result, and inputting the prior hidden variable mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text; and
resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generating voice data corresponding to the target text according to the voiceprint features.
2. The method of claim 1, wherein before inputting the character identifiers, the phonemes, and the voice duration into the prior encoding module of the speech synthesis model to obtain the prior hidden variable, the method further comprises:
acquiring a training data set, wherein the training data set comprises training voice data, the voice duration corresponding to the training voice data, the training text corresponding to the training voice data, and the phonemes corresponding to the training voice data; and
training the variational autoencoder based on the training data set to obtain the speech synthesis model.
3. The method of claim 2, wherein training the variational autoencoder based on the training data set to obtain the speech synthesis model comprises:
acquiring the linear spectrum corresponding to the training voice data, and inputting the linear spectrum into the posterior encoding module of the variational autoencoder to obtain a posterior hidden variable;
inputting the voice duration corresponding to the training voice data, the training text corresponding to the training voice data, and the phonemes corresponding to the training voice data into the prior encoding module of the variational autoencoder to obtain the prior hidden variable corresponding to the training text; and
adjusting the parameters of the variational autoencoder based on the posterior hidden variable and the prior hidden variable corresponding to the training text until the KL divergence is smaller than a preset threshold, to obtain the speech synthesis model.
4. The method according to claim 3, wherein the adjusting the parameters of the variational autoencoder based on the posterior hidden variable and the prior hidden variable corresponding to the training text until the KL divergence is smaller than a preset threshold, to obtain the speech synthesis model, comprises:
performing a reversible transformation on the posterior hidden variable to obtain a reversible transformation result of the posterior hidden variable;
mapping the prior hidden variable corresponding to the training text to obtain a prior hidden variable mapping result;
calculating the KL divergence between the reversible transformation result of the posterior hidden variable and the prior hidden variable mapping result; and
if the KL divergence is smaller than the preset threshold, determining the variational autoencoder to be the speech synthesis model.
5. The method according to claim 3, further comprising:
inputting the posterior hidden variable into the decoding module of the variational autoencoder to obtain the voiceprint features corresponding to the training voice data;
inputting the voiceprint features corresponding to the training voice data into the generator of a generative adversarial network to generate a voice signal; and
inputting the voice signal into the discriminator of the generative adversarial network, and optimizing the decoding module of the variational autoencoder based on the discrimination result output by the discriminator.
6. The method according to claim 1, wherein resampling the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data and generating the voice data corresponding to the target text according to the voiceprint features comprises:
inputting the prior hidden variable corresponding to the target text and the voice duration corresponding to the target text into a duration prediction module of the voice synthesis model to obtain target voice duration;
and generating voice data corresponding to the target text according to the voiceprint characteristics and the target voice duration.
7. The method according to any one of claims 1-6, wherein the acquiring the character identifier corresponding to each character in the target text comprises:
acquiring the target text;
preprocessing the punctuation marks in the target text to obtain a normalized target text; and
acquiring each character included in the normalized target text, and converting each character into the character identifier corresponding to that character based on a mapping dictionary.
8. A speech synthesis apparatus based on a variational autoencoder, comprising:
an acquisition module, configured to acquire the character identifier corresponding to each character in a target text, the phonemes included in the target text, and the voice duration corresponding to the target text;
an input module, configured to input the character identifiers, the phonemes, and the voice duration into the prior encoding module of a speech synthesis model to obtain the prior hidden variable corresponding to the target text, wherein the speech synthesis model is a model obtained by training a variational autoencoder;
a processing module, configured to map the prior hidden variable according to a preset mapping relation to obtain a prior hidden variable mapping result, and input the prior hidden variable mapping result into the decoding module of the speech synthesis model to obtain voiceprint data corresponding to the target text; and
a generation module, configured to resample the voiceprint data to obtain the voiceprint features corresponding to the voiceprint data, and generate voice data corresponding to the target text according to the voiceprint features.
9. An electronic device, comprising: a memory storing a computer program, and a processor, wherein the processor implements the speech synthesis method based on the variational autoencoder of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech synthesis method based on the variational autoencoder of any of claims 1 to 7.
CN202310195823.3A 2023-02-24 2023-02-24 Voice synthesis method based on variation self-encoder Pending CN116364058A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310195823.3A CN116364058A (en) 2023-02-24 2023-02-24 Voice synthesis method based on variation self-encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310195823.3A CN116364058A (en) 2023-02-24 2023-02-24 Voice synthesis method based on variation self-encoder

Publications (1)

Publication Number Publication Date
CN116364058A true CN116364058A (en) 2023-06-30

Family

ID=86917947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310195823.3A Pending CN116364058A (en) 2023-02-24 2023-02-24 Voice synthesis method based on variation self-encoder

Country Status (1)

Country Link
CN (1) CN116364058A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117995165A * 2024-04-03 2024-05-07 中国科学院自动化研究所 Speech synthesis method, device and equipment based on hidden variable space watermark addition
CN117995165B (en) * 2024-04-03 2024-05-31 中国科学院自动化研究所 Speech synthesis method, device and equipment based on hidden variable space watermark addition


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination