WO2023279976A1 - Speech synthesis method, apparatus, device and storage medium - Google Patents

Speech synthesis method, apparatus, device and storage medium

Info

Publication number
WO2023279976A1
WO2023279976A1 · PCT/CN2022/100747 · CN2022100747W
Authority
WO
WIPO (PCT)
Prior art keywords
text
training
synthesized
features
acoustic
Prior art date
Application number
PCT/CN2022/100747
Other languages
English (en)
French (fr)
Inventor
方鹏
刘恺
陈伟
Original Assignee
北京搜狗科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京搜狗科技发展有限公司 filed Critical 北京搜狗科技发展有限公司
Publication of WO2023279976A1 publication Critical patent/WO2023279976A1/zh
Priority to US18/201,105 priority Critical patent/US20230298564A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • G10L2013/105Duration

Definitions

  • the present application relates to the technical field of speech processing, in particular to a speech synthesis method, device, equipment and storage medium.
  • Speech synthesis is a technology that generates corresponding audio based on text, and is widely used in application scenarios such as video dubbing.
  • speech synthesis can generally be implemented based on phonemes.
  • Phoneme-based speech synthesis needs to collect a large number of words and their corresponding phonemes as materials in advance to realize text-to-speech conversion; it also needs to collect a large number of words and their corresponding pause information as materials in advance to realize the prosody prediction of speech.
  • the embodiments of the present application provide a speech synthesis method, apparatus, device and storage medium, which can reduce the difficulty of speech synthesis, and the solutions include:
  • the embodiment of the present application discloses a speech synthesis method, the method is executed by an electronic device, and the method includes:
  • a text speech corresponding to the text to be synthesized is generated.
  • the embodiment of the present application also discloses a speech synthesis apparatus, the apparatus includes:
  • a text obtaining module, configured to obtain the text to be synthesized;
  • the first feature generation module is used to generate hidden layer features and prosodic features of the text to be synthesized, and predict the pronunciation duration of characters in the text to be synthesized;
  • the second feature generation module is used to generate acoustic features corresponding to the text to be synthesized based on the hidden layer features of the text to be synthesized, the prosodic features and the pronunciation duration;
  • a speech synthesis module configured to generate text speech corresponding to the text to be synthesized according to the acoustic features.
  • the embodiment of the present application also discloses a readable storage medium; when the instructions in the storage medium are executed by the processor of an electronic device, the electronic device executes the speech synthesis method as described in the above aspects.
  • the embodiment of the present application also discloses an electronic device, including:
  • the embodiment of the present application also discloses a computer program product, the computer program product includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; the processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the speech synthesis method as described in the above aspect.
  • through the speech synthesis method of the embodiment of the present application, the text to be synthesized is obtained, and the hidden layer features and prosodic features of the text to be synthesized are generated, so as to extract, based on the text to be synthesized, feature information associated with text characteristics and feature information associated with speech prosody; the duration of each character in the text to be synthesized is predicted for subsequent character-based speech synthesis; based on the hidden layer features and prosodic features of the text to be synthesized and the duration of each character in the text to be synthesized, the acoustic features corresponding to the text to be synthesized that are required for synthesizing speech are generated; and the acoustic features corresponding to the text to be synthesized are used to generate the text speech corresponding to the text to be synthesized. In this way, character-level speech synthesis is achieved by extracting the hidden layer features and prosodic features of the text and predicting speech duration based on characters, without preprocessing a large amount of material; moreover, the quality of the synthesized speech is good and the difficulty of speech synthesis can be reduced, so that the method can be applied to different scenarios according to actual needs and the personalized requirements of users can be met.
  • with the speech synthesis method of the embodiment of the present application, speech synthesis at the character level is realized by extracting the hidden layer features and prosodic features of the text and predicting speech duration based on characters.
  • while the quality of speech synthesis is guaranteed, the speech synthesis scheme does not need to preprocess a large amount of material compared with phoneme-level schemes, which helps to reduce the difficulty of speech synthesis.
  • Fig. 1 is a flow chart of the steps of a speech synthesis method provided by an embodiment of the present application
  • Fig. 2 is a flow chart of the steps of another speech synthesis method provided by the embodiment of the present application.
  • Fig. 3 is a flow chart of the steps of another speech synthesis method according to the embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of an acoustic model provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of the training of the acoustic model provided by the embodiment of the present application.
  • FIG. 6 is a structural block diagram of an embodiment of a speech synthesis device provided in an embodiment of the present application.
  • Fig. 7 is a structural block diagram of an electronic device shown in an exemplary embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of an electronic device shown in another exemplary embodiment of the present application.
  • in order to reduce the difficulty of speech synthesis, the embodiment of the present application adopts a character-level speech synthesis method, which does not need phoneme input but can directly predict the duration of each character in the text to be synthesized and generate the prosodic features corresponding to the text to be synthesized. Then, based on the duration of each character in the text to be synthesized and the prosodic features of the text to be synthesized, the acoustic features corresponding to the text to be synthesized are generated, and the speech corresponding to the text to be synthesized is finally synthesized based on the acoustic features, which can make the process of speech synthesis simpler and reduce the difficulty of speech synthesis.
  • when facing a variety of different personalized requirements, it is also relatively convenient to support such individual needs.
  • the speech synthesis method provided in the embodiment of the present application can be executed by an electronic device that has speech synthesis requirements; the electronic device can be a mobile terminal with relatively weak computing power, such as a smart phone, a tablet computer, a vehicle-mounted terminal, a smart TV, a wearable device or a portable personal computer, or a non-mobile terminal with relatively strong computing power, such as a personal computer or a server.
  • the speech synthesis method provided in the embodiment of the present application can be applied to a video dubbing scenario.
  • the video editing application can adopt this method to realize automatic video dubbing. For example, after the video editing application obtains the narration copy corresponding to the video, it performs speech synthesis on the narration copy to obtain the narration voice, and then synthesizes the video and narration voice based on the time axis to realize automatic dubbing of the video.
  • the speech synthesis method provided in the embodiment of the present application can be applied to barrier-free scenarios.
  • in this scenario, an assistive device used by visually impaired people (such as glasses for the visually impaired) can integrate a speech synthesis function.
  • during operation, the visually impaired assistive device collects environment images through a camera and recognizes the environment images to obtain an environment description text describing the current environment; the environment description text is then further converted into environment description speech through speech synthesis technology and played through the speaker of the visually impaired assistive device to remind the visually impaired person.
  • the speech synthesis method provided in the embodiment of the present application can also be used in the scenario of listening to books.
  • the server uses the method provided by the embodiment of the present application to perform speech synthesis on the audiobook text to obtain the audiobook audio, and publish the audiobook audio to the audiobook application so that users can choose to listen to it.
  • the server can synthesize audiobook audios in different languages, timbres, and styles based on audiobook texts for users to choose to play.
  • the speech synthesis method provided by the embodiment of the present application can be applied to other scenarios that need to convert text into speech, which is not limited by the embodiment of the present application.
  • FIG. 1 shows a flow chart of the steps of a speech synthesis method embodiment provided by the embodiment of the present application, which may include the following steps:
  • Step 101 obtaining the text to be synthesized
  • the user may submit the text to be synthesized, and the electronic device may thereby obtain the text to be synthesized that requires speech synthesis.
  • the text to be synthesized may be words, short sentences, long sentences, articles, etc. in one language, and this application does not limit it.
  • Step 102 generating hidden layer features and prosodic features of the text to be synthesized, and predicting the pronunciation duration of characters in the text to be synthesized;
  • the electronic device may extract feature information related to speech synthesis in the text to be synthesized, and generate hidden layer features of the text to be synthesized.
  • the hidden layer features can be related to text characteristics such as character part of speech, character context association, and character emotion of characters in the text to be synthesized, and can usually be expressed in the form of vectors.
  • generally, after the hidden layer features are obtained, the pronunciation, duration, tone and intonation of the characters in the text, as well as the overall speaking rhythm of the text, can usually be determined based on the character part of speech, character context association, character emotion and other features of the text to be synthesized implied in the hidden layer features, so as to generate the sound waveform features corresponding to the text to be synthesized and obtain the acoustic features. However, acoustic features generated from the hidden layer features alone usually cannot produce synthesized speech with a good effect, and the synthesized speech may still lack naturalness and expressiveness.
  • therefore, the electronic device can further generate prosodic features associated with prosodic characteristics such as tone, intonation, stress and rhythm, and predict the pronunciation duration of characters in the text to be synthesized, so that in the subsequent speech synthesis process more natural and more expressive synthesized speech can be obtained, and at the same time the prediction of the pronunciation duration of characters can be more accurate.
  • optionally, the electronic device predicts the pronunciation duration of each character in the text to be synthesized, or the electronic device predicts the pronunciation duration of some characters in the text to be synthesized, and those characters may include key characters.
  • for ease of description, the following takes the prediction of the pronunciation duration of each character as an example for schematic illustration.
  • a character can be an abstract graphical symbol recognizable in linguistics, and the smallest distinguishing unit in text. For example, the letters “a, b, c” in English, the Chinese characters “你 (you), 我 (me), 他 (him)” in Chinese, and the hiragana characters “あ, い, う” in Japanese.
  • characters may have corresponding pronunciation durations according to factors such as part of speech, context, and emotion. If there is a situation where the character does not need to be pronounced, the pronunciation duration can also be 0.
  • the electronic device can predict the time required for its pronunciation in units of characters, so that the synthesized voice can have a more accurate pronunciation time, so that the final synthesized voice has a better effect.
  • in some embodiments, since the hidden layer features extracted based on the text to be synthesized can be associated with text characteristics such as character part of speech, character context association and character emotion, the pronunciation duration of a character can be predicted based on the hidden layer features, so that the duration is predicted according to factors such as part of speech, context and emotion and a relatively accurate duration prediction effect is obtained.
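As a minimal illustration of what "character-level" input means in practice, the sketch below builds a character inventory directly from training texts and maps a text to one ID per character; no word-to-phoneme lexicon or pause dictionary is involved. All names in the snippet are illustrative and not taken from the patent.

```python
# Minimal sketch: building a character inventory and encoding text at the
# character level. The vocabulary is simply every distinct character seen in
# the training corpus; no pronunciation lexicon is required.

PAD, UNK = "<pad>", "<unk>"

def build_char_vocab(training_texts):
    """Collect every distinct character in the training texts."""
    chars = sorted({ch for text in training_texts for ch in text})
    return {ch: i for i, ch in enumerate([PAD, UNK] + chars)}

def encode(text, vocab):
    """Map a text to a sequence of character IDs (one ID per character)."""
    return [vocab.get(ch, vocab[UNK]) for ch in text]

if __name__ == "__main__":
    vocab = build_char_vocab(["你好世界", "hello world", "こんにちは"])
    print(encode("你好 hello", vocab))   # one integer per character, spaces included
```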
  • Step 103 based on hidden layer features, prosody features and pronunciation duration, generate acoustic features corresponding to the text to be synthesized;
  • in this embodiment of the present application, after the pronunciation duration of the characters in the text to be synthesized and the hidden layer features and prosodic features of the text to be synthesized are obtained, the sound waveform features corresponding to the text to be synthesized can be generated based on the text-related features implied in the hidden layer features of the text to be synthesized, the prosody-related features implied in the prosodic features, and the pronunciation duration of the characters in the text to be synthesized, so as to obtain the acoustic features.
  • since the prosodic features and the pronunciation duration of characters are further considered on the basis of the hidden layer features, the generated sound waveform features can have more accurate prosody and pronunciation duration, so that the synthesized speech can have better naturalness and expressiveness.
  • the acoustic feature may be waveform feature information of sound, for example, loudness and frequency information that vary with time.
  • Acoustic features can be represented by spectrograms, for example, a Mel spectrum, a linear spectrum, etc.
  • Step 104 according to the acoustic features, generate text speech corresponding to the text to be synthesized.
  • the sound since the sound is a wave generated by the vibration of an object, the sound signal can be restored after obtaining the waveform characteristics of the sound.
  • the electronic device can use the acoustic features corresponding to the text to be synthesized to restore the sound signal, generate text speech corresponding to the text to be synthesized, and complete the speech synthesis of the text to be synthesized.
  • since the duration of the text to be synthesized is predicted based on characters during speech synthesis, the hidden layer features and prosodic features are generated at the same time, and the acoustic features corresponding to the text to be synthesized are finally generated based on the hidden layer features and prosodic features of the text to be synthesized and the pronunciation duration of the characters in the text to be synthesized, speech synthesis at the character level is completed; a speech library does not need to be built by extracting a large amount of information such as words, phonemes and pauses, which makes the process of speech synthesis simpler and reduces the difficulty of speech synthesis.
  • in addition, because the prosodic features and the pronunciation duration of characters are further referred to on the basis of the hidden layer features in the process of generating the acoustic features, the quality of speech synthesis can be further improved.
  • moreover, when users have personalized requirements such as synthesizing the speech of different characters, personalized support for speech synthesis can also be completed relatively easily.
  • with the speech synthesis method of the embodiment of the present application, speech synthesis at the character level is realized by extracting the hidden layer features and prosodic features of the text and predicting speech duration based on characters.
  • while the speech synthesis quality is guaranteed, compared with a phoneme-level speech synthesis scheme it does not need to preprocess a large amount of material, which helps to reduce the difficulty of speech synthesis.
  • Referring to FIG. 2, it shows a flow chart of the steps of a speech synthesis method embodiment provided by the embodiment of the present application, which may include the following steps:
  • Step 201 obtaining the text to be synthesized
  • Step 202 using the acoustic model corresponding to the text to be synthesized to generate hidden layer features and prosodic features of the text to be synthesized, and predict the pronunciation duration of characters in the text to be synthesized;
  • in a possible implementation, the acoustic model may contain multiple sub-models, wherein one sub-model may be used to predict the pronunciation duration of characters in the text to be synthesized, one sub-model may be used to generate prosodic features, and one sub-model may be used to generate hidden layer features of the text to be synthesized.
  • the sub-model for predicting character duration can take the text to be synthesized as input, and output the pronunciation duration of each character in the text to be synthesized.
  • the sub-model for generating prosodic features takes the text to be synthesized as input and the prosodic features of the text to be synthesized as output.
  • the model for generating hidden features takes the text to be synthesized as input and outputs the hidden features of the text to be synthesized.
  • the acoustic model may have various types.
  • the acoustic model can be adapted to different languages, for example, an acoustic model applicable to Chinese, an acoustic model applicable to English, an acoustic model applicable to Japanese, and the like.
  • the acoustic model can also have a personalized voice style, for example, soprano, baritone, alto, bass, child voice, voice style of a specific cartoon character, voice style of a specific star, etc.
  • since the acoustic model performs speech synthesis based on characters, there is no need to extract a large amount of information such as words, phonemes and pauses to build a speech library, so the training process of the acoustic model can be relatively simple. Therefore, it is relatively easy to deploy corresponding acoustic models according to the different needs of users, so as to meet the needs of multilingual and personalized speech.
  • in some embodiments, the electronic device may also select, according to at least one of the language and the speech style corresponding to the text to be synthesized, an acoustic model suitable for that language and/or speech style to process the text to be synthesized, generate the hidden layer features and prosodic features of the text to be synthesized, and predict the pronunciation duration of characters in the text to be synthesized.
  • in other words, when facing texts of different languages and different personalized requirements, an acoustic model suitable for the language and/or speech style can be used for processing, so as to meet the individual needs of speech synthesis.
  • Step 203 using an acoustic model to generate acoustic features corresponding to the text to be synthesized based on the hidden layer features, prosodic features of the text to be synthesized, and the pronunciation duration of the characters in the text to be synthesized;
  • in this embodiment of the present application, an acoustic model can be used to generate the acoustic features corresponding to the text to be synthesized based on the hidden layer features, prosodic features and pronunciation duration of the text to be synthesized.
  • the acoustic features may be waveform feature information of sound, for example, loudness and frequency information that vary with time. Acoustic features can be represented by spectrograms, for example, a Mel spectrum, a linear spectrum, etc.
  • the acoustic model may further include a sub-model for generating the corresponding acoustic features of the text to be synthesized.
  • during model training, the sub-model used to synthesize the acoustic features corresponding to the text to be synthesized can take the hidden layer features, prosodic features and character pronunciation durations of the text to be synthesized as input and output the acoustic features corresponding to the text to be synthesized, so that a sub-model for synthesizing the acoustic features corresponding to the text to be synthesized can be obtained.
  • Step 204 according to the acoustic features, generate text speech corresponding to the text to be synthesized.
  • the acoustic model is trained as follows:
  • the training text in that language and the training audio corresponding to the training text can be obtained.
  • the training language can be a language used in different regions, for example, Chinese, English, Japanese, Korean, French, etc.; it can also be a local dialect under a certain language branch, for example, Hakka, Cantonese, etc.
  • the electronic device may use the training text corresponding to the training language and the training audio corresponding to the training text to train the acoustic model to be trained to obtain the acoustic model of the training language after training.
  • the trained acoustic model can be applied to the speech synthesis of the training language.
  • the acoustic model may be in the form of an end-to-end model, the sub-models contained in the acoustic model are interrelated rather than independent, and the input of a sub-model may be the output of other sub-models.
  • during the training process, the acoustic model can also be trained as a whole: after the acoustic features finally output by the acoustic model are obtained, each sub-model in the acoustic model is adjusted based on that final output, so as to obtain the trained acoustic model of the training language.
  • in some embodiments, the training text in the training language and the training audio corresponding to the training text include training texts from several people and the corresponding training audio, that is, the training text and the training audio come from different pronunciation objects, so as to improve the generalization of the trained acoustic model to different pronunciation objects in the same language;
  • using the training text and the training audio corresponding to the training text to train the acoustic model to be trained to obtain the trained acoustic model of the training language including the following steps:
  • the electronic device uses the training texts of several pronunciation objects in the training language and the training audio corresponding to the training texts to train the acoustic model to be trained to obtain the trained acoustic model of the training language, so that the acoustic model can learn the common pronunciation rules of the training language, which reduces the error rate of the acoustic model and improves the feature quality of the acoustic features output by the acoustic model.
  • in some embodiments, in the case that speech of a target speech style needs to be synthesized, the electronic device can further use the training text of the target speech style in the training language and the training audio corresponding to the training text to train the acoustic model, so as to obtain an acoustic model of the target speech style.
  • the speech synthesized based on the acoustic model of the target speech style clearly has the target speech style and a relatively high speech accuracy, and the speech quality can also be improved to a certain extent.
  • in this embodiment of the present application, the electronic device first trains the acoustic model based on training text and training audio from different pronunciation objects, and then, on the basis of that acoustic model, further trains an acoustic model of the corresponding speech style based on training text and training audio of different speech styles, so that, on the premise of ensuring the quality of speech synthesis, the synthesized speech has a specific speech style.
  • FIG. 3 shows a flow chart of the steps of a speech synthesis method embodiment provided by the embodiment of the present application, which may include the following steps:
  • Step 301 obtaining the text to be synthesized
  • an acoustic model may be used to complete speech synthesis.
  • the acoustic model may include multiple sub-models.
  • in a possible implementation, the acoustic model may include an encoder, a duration model, a variational autoencoder (Variational AutoEncoder, VAE) and a decoder.
  • in the process of speech synthesis, the text to be synthesized needs to be processed by the encoder, the duration model, the variational autoencoder and the decoder, and finally the acoustic features corresponding to the text to be synthesized are obtained.
  • the acoustic model can be an end-to-end model, in which the encoder, the duration model, the variational autoencoder and the decoder are interconnected rather than independent.
  • the encoder and the variational autoencoder may not output independent results but output intermediate vectors generated during model processing, and the intermediate vectors are then input into the decoder to obtain the acoustic features of the text to be synthesized.
  • for the user of the acoustic model, the text to be synthesized can be input into the acoustic model, and the acoustic features output by the acoustic model can be obtained directly.
  • the structure of the acoustic model can be further simplified, and the efficiency of the acoustic model to convert the text to be synthesized into acoustic features can be improved.
  • Step 302 extracting features of the text to be synthesized by the encoder to obtain hidden layer features of the text to be synthesized
  • the encoder can learn the latent information of the text to be synthesized, and output hidden layer features associated with text characteristics such as character part of speech, character context association, and character emotion, so that subsequent models can be further processed based on hidden layer features .
  • the hidden layer features output by the encoder can be expressed in vector form. Since the hidden layer features of the text to be synthesized output by the encoder can be considered an intermediate output of the model, they may not be directly interpretable. One plausible form of such an encoder is sketched below.
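The patent does not fix a concrete encoder architecture, so the following is only a hedged sketch of one plausible character encoder that turns character IDs into one hidden-layer feature vector per character; the specific layers (embedding, convolution, bidirectional GRU) and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CharacterEncoder(nn.Module):
    """Illustrative character encoder: embeds character IDs and produces one
    hidden-layer feature vector per character. The layer choices are an
    assumption; the patent does not specify the encoder structure."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(emb_dim, hidden_dim // 2, batch_first=True,
                          bidirectional=True)

    def forward(self, char_ids):                 # char_ids: (batch, T_chars)
        x = self.embedding(char_ids)             # (batch, T_chars, emb_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        hidden, _ = self.rnn(x)                  # (batch, T_chars, hidden_dim)
        return hidden                            # one vector per character


encoder = CharacterEncoder(vocab_size=100)
hidden = encoder(torch.randint(1, 100, (2, 12)))   # two texts of 12 characters each
print(hidden.shape)                                 # torch.Size([2, 12, 256])
```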
  • Step 303 based on the hidden layer features, predict the pronunciation duration of the characters in the text to be synthesized through the duration model;
  • in a possible implementation, a decoder corresponding to the encoder can be used to generate the acoustic features corresponding to the text to be synthesized according to text characteristics such as character part of speech, character context association, character emotion and character duration implied by the hidden layer features.
  • the electronic device can use a duration model to predict the pronunciation duration of the characters, so as to further improve the accuracy of the pronunciation duration of the characters in the synthesized speech and improve the naturalness of the pronunciation.
  • after the hidden layer features are obtained, they can be input into the duration model, and the duration model can predict, through the information associated with text characteristics such as character part of speech, character context association and character emotion implied by the hidden layer features, the duration of the speech corresponding to each character in the text to be synthesized, that is, the pronunciation duration of the character.
  • Step 304 based on the hidden layer features, extracting prosodic features of the text to be synthesized through a variational autoencoder
  • in order to further improve the naturalness and expressiveness of the synthesized speech, the electronic device can further derive the prosodic features from the hidden layer features through the variational autoencoder, so that in the subsequent speech synthesis process the naturalness and expressiveness of the synthesized speech can be improved based on the prosodic features.
  • in some embodiments, the hidden layer features can also be input into the variational autoencoder, and the variational autoencoder can learn the latent representation of the speaker state in the text to be synthesized and output prosodic features associated with prosodic characteristics such as tone, intonation, stress and rhythm. Prosodic features can be expressed in vector form. A common construction of such a module is sketched below.
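The embodiments describe the variational autoencoder only at this functional level, so the snippet below is a hedged sketch of one common way such a prosody module is built: a reference encoder summarizes the target acoustic features into a mean and log-variance, a latent prosody vector is sampled with the reparameterization trick, and at inference time a latent is derived from the hidden-layer features instead. The layer types, dimensions and KL term are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

class ProsodyVAE(nn.Module):
    """Illustrative variational prosody module. At training time it encodes the
    target acoustic features into a mean/log-variance, samples a latent prosody
    vector via the reparameterization trick, and returns a KL penalty towards a
    standard normal prior. At synthesis time, when no target audio exists, a
    latent is predicted from the text hidden-layer features instead."""

    def __init__(self, mel_dim=80, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.ref_encoder = nn.GRU(mel_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.from_text = nn.Linear(hidden_dim, latent_dim)  # used at inference

    def forward(self, target_mel):               # (batch, T_frames, mel_dim)
        _, state = self.ref_encoder(target_mel)  # state: (1, batch, hidden_dim)
        summary = state[-1]
        mu, logvar = self.to_mu(summary), self.to_logvar(summary)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl                             # prosody latent + KL penalty

    def infer(self, text_hidden):                # (batch, T_chars, hidden_dim)
        return self.from_text(text_hidden.mean(dim=1))
```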
  • Step 305 adjusting the feature length of the hidden layer feature based on the pronunciation duration of the character
  • the length of the hidden layer feature may be related to the sounding duration of the characters in the speech.
  • the electronic device can adjust the length of the hidden layer features based on the utterance duration of characters in the text to be synthesized.
  • the characteristic length of the hidden layer feature is positively correlated with the pronunciation duration, that is, the longer the pronunciation duration, the longer the corresponding characteristic length of the hidden layer feature.
  • for example, if the hidden layer feature is “abc”, it can be adjusted to “aaabbbccc” based on the pronunciation durations of the characters in the text to be synthesized, as in the sketch below.
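A minimal sketch of this length-regulation step, assuming per-character durations are already expressed as integer frame counts; the function name and the use of torch.repeat_interleave are choices made for the sketch, not details from the patent.

```python
import torch

def length_regulate(hidden, durations):
    """Expand per-character hidden features by their durations (in frames),
    mirroring the "abc" -> "aaabbbccc" example above.
    hidden: (T_chars, dim); durations: (T_chars,) integer frame counts
    (a duration of 0 simply drops the character, e.g. a silent one)."""
    return torch.repeat_interleave(hidden, durations, dim=0)


hidden = torch.eye(3)                       # stand-in features for "a", "b", "c"
durations = torch.tensor([3, 3, 3])
print(length_regulate(hidden, durations).shape)   # torch.Size([9, 3]) -> "aaabbbccc"
```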
  • Step 306 input the adjusted hidden layer features and prosodic features into the decoder to obtain the corresponding acoustic features of the text to be synthesized;
  • in this embodiment of the present application, the adjusted hidden layer features and the prosodic features of the text to be synthesized can be input into the decoder, and the decoder generates the acoustic features corresponding to the text to be synthesized based on text characteristics such as character part of speech, character context association, character emotion and character duration implied by the adjusted hidden layer features, as well as prosodic characteristics such as tone, intonation, stress and rhythm implied by the prosodic features.
  • the process of decoding by the decoder is the process of feature restoration.
  • since the decoder, on the basis of the adjusted hidden layer features, further refers to the prosodic features output by the variational autoencoder when generating the acoustic features, the prosody of the synthesized speech can be made more accurate, and the quality of the synthesized speech can be further improved.
  • Step 307 input the acoustic features into the vocoder, and obtain the text speech corresponding to the text to be synthesized outputted by the vocoder.
  • in this embodiment of the present application, the electronic device can input the acoustic features corresponding to the text to be synthesized into a vocoder, and the vocoder generates the text speech based on the acoustic features, completing the speech synthesis of the text to be synthesized.
  • the vocoder may be a trained model for converting acoustic features into speech.
  • the vocoder may be a recurrent neural network, a model based on a source-filter structure, etc., which is not limited in this embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of an acoustic model provided by an embodiment of the present application.
  • the acoustic model may include an encoder 401 , a duration model 402 , a variational autoencoder 403 and a decoder 404 .
  • the electronic device inputs the text to be synthesized into the encoder 401 and obtains the hidden layer features output by the encoder 401 . Thereafter, the hidden layer features may be input into the duration model 402, and the pronunciation duration of each character in the text to be synthesized outputted by the duration model 402 is obtained.
  • the hidden layer features output by the encoder 401 can be input into the variational autoencoder 403, and the prosodic features output by the variational autoencoder 403 can be obtained.
  • in this embodiment of the present application, the hidden layer features can be adjusted by using the pronunciation duration of each character in the text to be synthesized, and the adjusted hidden layer features and the prosodic features of the text to be synthesized are input into the decoder 404 to obtain the acoustic features, corresponding to the text to be synthesized, output by the decoder 404.
  • thereafter, a pre-trained vocoder can be used to process the acoustic features to obtain the text speech corresponding to the text to be synthesized. The overall flow is sketched below.
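Putting the pieces together, the following hedged sketch wires up the inference flow of FIG. 4, reusing the hypothetical CharacterEncoder, ProsodyVAE and length_regulate sketches given earlier; the concrete duration-model and decoder layers, and the treatment of predicted durations as log-frame counts, are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Illustrative wiring: encoder -> duration model -> length regulation,
    plus a prosody latent, then a decoder that outputs mel frames."""

    def __init__(self, vocab_size, hidden_dim=256, mel_dim=80, latent_dim=16):
        super().__init__()
        self.encoder = CharacterEncoder(vocab_size, hidden_dim=hidden_dim)
        self.duration_model = nn.Sequential(           # predicts log-durations (assumption)
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))
        self.prosody = ProsodyVAE(mel_dim=mel_dim, hidden_dim=hidden_dim,
                                  latent_dim=latent_dim)
        self.decoder = nn.GRU(hidden_dim + latent_dim, hidden_dim, batch_first=True)
        self.to_mel = nn.Linear(hidden_dim, mel_dim)

    def synthesize(self, char_ids):                    # char_ids: (1, T_chars)
        hidden = self.encoder(char_ids)                          # (1, T, H)
        log_dur = self.duration_model(hidden).squeeze(-1)        # (1, T)
        frames = log_dur.exp().round().clamp(min=1).long()[0]    # frames per character
        expanded = length_regulate(hidden[0], frames)            # (T_frames, H)
        z = self.prosody.infer(hidden)                           # (1, latent_dim)
        z = z.expand(expanded.size(0), -1)                       # broadcast over frames
        out, _ = self.decoder(torch.cat([expanded, z], dim=-1).unsqueeze(0))
        return self.to_mel(out)                                  # (1, T_frames, mel_dim)


model = AcousticModel(vocab_size=100)
mel = model.synthesize(torch.randint(1, 100, (1, 12)))
print(mel.shape)   # (1, total predicted frames, 80); a vocoder would turn this into audio
```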
  • the acoustic model is trained as follows:
  • the electronic device may obtain the training text and the training audio corresponding to the training text, and the training audio may be audio from multiple different characters in the same language or from the same character.
  • the training text may be words, short sentences, long sentences, articles, etc. in one language, which is not limited in this embodiment of the present application.
  • the electronic device may extract target acoustic features in the training audio as a training target for the overall acoustic model.
  • the target acoustic feature may be waveform feature information of sound, for example, loudness and frequency information that vary with time.
  • Acoustic features can be represented by spectrograms, for example, a Mel spectrum, a linear spectrum, etc.
  • the electronic device may use an acoustic feature extraction algorithm to extract target acoustic features in the training audio from the training audio.
  • for example, an acoustic feature extraction algorithm such as MFCC (Mel Frequency Cepstral Coefficients), FBank (Filter Banks) or LogFBank (Log Filter Banks) can be used, which is not limited in this embodiment of the present application.
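As an illustration, target acoustic features of this kind can be extracted with an off-the-shelf library such as librosa; the frame parameters below are typical values, not values specified by the embodiments.

```python
# Illustrative extraction of target acoustic features (a log mel / FBank-style
# representation) from training audio using librosa.
import librosa
import numpy as np

def extract_target_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # log-compressed filter-bank energies
    return log_mel.T                                  # (n_frames, n_mels)
```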
  • the electronic device can input the training text into the acoustic model to be trained, and the model can output a model acoustic feature after being processed by the encoder, duration model, variational autoencoder, and decoder in the acoustic model.
  • the training process of the acoustic model is the process of approaching the acoustic features of the model to the target acoustic features.
  • during training, both the hidden layer features output by the encoder and the target acoustic features can be used as inputs of the variational autoencoder.
  • the variational autoencoder can fit the target acoustic features and the hidden layer features output by the encoder into a latent value through its own two neural networks, and the variational autoencoder can learn this value; then, in the application stage, after the hidden layer features output by the encoder are obtained, the prosodic features of the text to be synthesized can be output accordingly based on the hidden layer features and the learned value.
  • the electronic device may determine whether the model acoustic feature is close to the target acoustic feature by calculating the feature similarity between the model acoustic feature and the target acoustic feature, and then determine whether the acoustic model has been trained.
  • the electronic device may calculate a vector distance between the model acoustic feature and the target acoustic feature, so as to determine the vector distance as feature similarity.
  • in this embodiment of the present application, the electronic device can adjust the model parameters in the acoustic model to be trained based on the feature similarity between the model acoustic features and the target acoustic features (as a loss function), so that the model acoustic features output by the acoustic model constantly approach the target acoustic features.
  • the electronic device may adjust model parameters of the acoustic model by using a gradient descent or backpropagation algorithm.
  • the preset condition can be that the feature similarity between the model acoustic features and the target acoustic features is higher than a preset threshold, or that the similarity between the model acoustic features and the target acoustic features basically no longer changes, etc.; this embodiment of the present application does not limit this.
  • in some embodiments, the acoustic model can be trained together with a discriminator in the manner of a Generative Adversarial Network (GAN).
  • the electronic device inputs the synthesized audio and the training audio into the discriminator to obtain a first discrimination result corresponding to the synthesized audio and a second discrimination result corresponding to the training audio, wherein the discriminator is used to discriminate whether the input audio is training audio or synthesized audio, that is, to distinguish real audio from generated audio.
  • in addition to the feature similarity as part of the loss, the electronic device also takes the discrimination loss of the discriminator as part of the loss, so that the model parameters in the acoustic model and the discriminator are adjusted based on the feature similarity, the first discrimination result and the second discrimination result to complete the acoustic model training.
  • the model parameters in the acoustic model are updated with gradients based on the feature similarity loss, so as to improve the accuracy of the acoustic features of the acoustic model generation model.
  • the discriminator is adjusted based on the discriminative loss to improve its ability to distinguish between model acoustic features and target acoustic features.
  • the acoustic model and the discriminator can compete with each other to improve the accuracy of the model output, and finally an acoustic model with higher accuracy can be obtained.
  • in this embodiment of the present application, the sub-models in the acoustic model are trained based on the final output of the acoustic model, so that each sub-model in the acoustic model can have the same training target; the sub-models in the acoustic model can therefore fit together better, and a better speech synthesis effect can be obtained.
  • in addition, using a generative adversarial network to train the acoustic model can further improve the effect of the acoustic model and make the finally generated synthetic speech more realistic; a sketch of such a combined training step follows.
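The snippet below is a hedged sketch of one training step that combines the two loss terms described above: a feature-similarity (reconstruction) loss between model acoustic features and target acoustic features, and an adversarial loss from a discriminator that tries to tell real features from generated ones. The toy frame-level discriminator, loss weighting and optimizer usage are assumptions, and `acoustic_model` stands for any module with a `synthesize(char_ids)` method such as the earlier sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameDiscriminator(nn.Module):
    """Toy frame-level discriminator: one real/generated logit per mel frame."""
    def __init__(self, mel_dim=80, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mel_dim, hidden), nn.LeakyReLU(0.2),
                                 nn.Linear(hidden, 1))
    def forward(self, mel):                 # mel: (batch, T_frames, mel_dim)
        return self.net(mel)                # (batch, T_frames, 1)

def train_step(acoustic_model, discriminator, opt_g, opt_d, char_ids, target_mel):
    # 1) Discriminator update: real (target) features vs. generated features.
    with torch.no_grad():
        fake_mel = acoustic_model.synthesize(char_ids)
    d_real = discriminator(target_mel)
    d_fake = discriminator(fake_mel)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Acoustic-model update: feature-similarity loss plus adversarial loss.
    fake_mel = acoustic_model.synthesize(char_ids)
    T = min(fake_mel.size(1), target_mel.size(1))    # predicted length may differ from target
    recon_loss = F.l1_loss(fake_mel[:, :T], target_mel[:, :T])   # feature-similarity term
    d_gen = discriminator(fake_mel)
    adv_loss = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    g_loss = recon_loss + 0.1 * adv_loss             # loss weight is an assumption
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In a full setup the duration model and the variational autoencoder would also receive their own supervision (the standard durations and the KL term), as the surrounding paragraphs describe.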
  • the above-mentioned duration model is trained in the following manner:
  • the electronic device can further train the duration model on the basis of the overall training of the acoustic model, so as to improve the accuracy of the duration model in predicting the pronunciation duration of characters, so that the output of the acoustic model can be more accurate.
  • the standard duration of characters in the training audio extracted by the electronic device can be regarded as the correct pronunciation duration of the characters.
  • the standard duration of the characters in the training audio can be extracted using a model or manually, which is not limited in this embodiment of the present application.
  • the input of the duration model may be the output of the encoder, thus, the training text may be input into the encoder, and the hidden layer features output by the encoder may be obtained to train the duration model.
  • the electronic device can input the hidden layer features into the duration model to obtain the predicted durations output by the duration model, use the standard durations of the characters in the training audio as supervision for the predicted durations, and train the duration model to obtain the trained duration model.
  • the accuracy rate of the duration model output can be further improved, so that the final synthesized speech can have better quality.
  • in this embodiment of the present application, the electronic device may input the hidden layer features into the duration model, and the duration model may output predicted durations of the characters in the training text. Thereafter, the electronic device can determine the duration difference between the predicted durations output by the duration model and the standard durations of the characters in the training audio, and adjust the model parameters in the duration model according to the duration difference until the output of the duration model meets the preset condition, at which point the duration model training is completed.
  • the preset condition can be that the difference between the predicted duration output by the duration model and the standard duration is less than a preset threshold, or that the difference between the predicted duration output by the duration model and the standard duration basically no longer changes, etc.; this embodiment of the present application does not limit this.
  • extracting the standard duration of characters in the training audio includes:
  • the duration model may be trained by using the segmentation model.
  • the segmentation model can be used to segment each character in the training text, and correspondingly mark the pronunciation starting point and pronunciation ending point of each character in the training audio, so that the corresponding pronunciation duration of each character in the training text can be known. It can be considered that the character duration output by the segmentation model is the correct character duration, so that the duration model can be trained based on the output of the segmentation model.
  • the electronic device can input the training audio and hidden layer features into the segmentation model to obtain the output of the segmentation model.
  • the segmentation model can predict characters corresponding to each frame of the training audio based on the hidden layer features. Thereafter, the earliest frame corresponding to the character can be used as the starting point of pronunciation of the character in the training audio, and the latest frame corresponding to the character can be used as the end point of pronunciation of the character in the training audio, so as to realize the labeling of each character in the training audio Pronunciation start and pronunciation end.
  • the target acoustic feature may record changes in frequency and loudness of the training audio over a continuous period of time. Therefore, the segmentation model can predict the character corresponding to each frame in the target acoustic feature on the basis of the target acoustic feature, and mark the start and end points of the character.
  • the electronic device may use the time difference between the pronunciation start point and the pronunciation end point of the character as the standard duration corresponding to the character, so as to obtain the standard duration of each character in the training audio.
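As an illustration of how per-character "standard durations" can be derived once the segmentation model has assigned a character to every frame, the sketch below takes a hypothetical frame-level alignment and returns each character's start point, end point and duration; the frame shift value is an assumption (for example, a 256-sample hop at 22.05 kHz).

```python
# frame_chars[i] is the character index the segmentation model assigned to frame i.

def standard_durations(frame_chars, frame_shift_s=256 / 22050):
    """Return {character index: (start_s, end_s, duration_s)} using the earliest
    and latest frame assigned to each character as its start and end points."""
    spans = {}
    for frame_idx, char_idx in enumerate(frame_chars):
        first, last = spans.get(char_idx, (frame_idx, frame_idx))
        spans[char_idx] = (min(first, frame_idx), max(last, frame_idx))
    return {c: (f * frame_shift_s, (l + 1) * frame_shift_s, (l - f + 1) * frame_shift_s)
            for c, (f, l) in spans.items()}

# Frames 0-2 belong to character 0, frames 3-8 to character 1, frame 9 to character 2.
print(standard_durations([0, 0, 0, 1, 1, 1, 1, 1, 1, 2]))
```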
  • the model parameters of the segmentation model may also be adjusted based on the feature similarity between the model acoustic features and the target acoustic features. Therefore, during the training process, the segmentation model can also continuously improve the accuracy of segmenting each character in the training text and determining the duration of each character. Therefore, during the training process, the duration model can also obtain a more accurate training target, which can improve the accuracy rate of the output of the duration model, and make the acoustic features finally output by the acoustic model have a higher accuracy rate.
  • the acoustic model can realize end-to-end learning, and each sub-model and segmentation model in the acoustic model can be trained based on the acoustic characteristics of the final output of the overall acoustic model to obtain an acoustic model with high accuracy.
  • in addition, the training of the acoustic model can be completed without manual supervision or with relatively little manual supervision, so that the acoustic model can be easily adapted to the needs of multiple languages and different pronunciation objects.
  • Fig. 5 is a schematic diagram of training an acoustic model provided by an embodiment of the present application.
  • the training text can be input into the encoder 501, and the hidden layer features output by the encoder 501 can be obtained.
  • thereafter, the hidden layer features can be input into the duration model 502, and the duration of each character output by the duration model 502 can be obtained.
  • the hidden layer features and the target acoustic features may also be input into the segmentation model 505 to obtain the standard durations output by the segmentation model 505.
  • the standard duration output by the segmentation model 505 can be used as the training target of the duration model 502
  • the hidden layer features can be used as the input of the duration model 502 to train the duration model 502 .
  • the hidden layer features and the target acoustic features extracted from the training audio can also be input into the variational autoencoder 503, and the prosodic features output by the variational autoencoder 503 can be obtained. Thereafter, the decoder 504 may output model acoustic features based on hidden layer features, duration of each character, and prosody features.
  • in the training process, the discriminator 506 can be used to discriminate between the synthesized audio corresponding to the model acoustic features and the training audio, the feature similarity between the model acoustic features and the target acoustic features is determined, and at the same time the parameters of each sub-model in the acoustic model to be trained and the parameters of the discriminator are adjusted, so as to finally obtain the trained acoustic model.
  • Referring to FIG. 6, it shows a structural block diagram of a speech synthesis device provided by an embodiment of the present application.
  • the device may include the following modules:
  • the first feature generation module 602 is used to generate hidden layer features and prosodic features of the text to be synthesized, and predict the pronunciation duration of characters in the text to be synthesized;
  • the second feature generating module 603 is configured to generate acoustic features corresponding to the text to be synthesized based on the hidden layer features, the prosodic features and the pronunciation duration;
  • the speech synthesis module 604 is configured to generate text speech corresponding to the text to be synthesized according to the acoustic features.
  • the first feature generating module 602 is configured to:
  • use the acoustic model corresponding to the text to be synthesized to generate the hidden layer features and the prosodic features, and predict the pronunciation duration, wherein the acoustic model is determined based on at least one of the language and the speech style corresponding to the text to be synthesized;
  • the acoustic model includes an encoder, a duration model, and a variational automatic encoder
  • the first feature generation module 602 is configured to:
  • the prosodic features of the text to be synthesized are extracted by the variational autoencoder
  • the acoustic model includes a decoder
  • the second feature generation module 603 is configured to:
  • the speech synthesis module 604 is configured to:
  • the acoustic features are input into a vocoder, and the text speech corresponding to the text to be synthesized outputted by the vocoder is obtained.
  • the acoustic model is obtained by training with the following modules:
  • a training module configured to obtain training text and training audio corresponding to the training text, the training text adopts a training language
  • the acoustic model to be trained is trained to obtain the trained acoustic model of the training language.
  • the training text and the training audio are from different pronunciation objects
  • the training module is used for:
  • the acoustic model to be trained is trained by using the training text of the target speech style in the training language and the training audio to obtain the trained acoustic model of the target speech style.
  • the training module is used for:
  • the training module is also used for:
  • the acoustic model to be trained includes an encoder and a duration model
  • the training module is also used for:
  • the hidden layer feature is used as an input of the duration model, and the standard duration of characters in the training audio is used as a training target to train the duration model.
  • the training module is also used for:
  • the standard duration of the characters in the training audio is determined.
  • as for the device embodiment, the description is relatively simple; for related parts, please refer to the description of the method embodiment.
  • an electronic device 700 may include one or more of the following components: a processing component 702 and a memory 704 .
  • the processing component 702 generally controls the overall operations of the electronic device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing element 702 may include one or more processors 720 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 702 may include one or more modules that facilitate interaction between processing component 702 and other components. For example, processing component 702 may include a multimedia module to facilitate interaction between multimedia component 708 and processing component 702 .
  • Memory 704 is configured to store various types of data to support operations at device 700 . Examples of such data include instructions for any application or method operating on the electronic device 700, contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 704 can be realized by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • in some embodiments, the electronic device 700 may further include a power component 706, a multimedia component 708, an audio component 710, an input/output (I/O) interface 712, a sensor component 714, and a communication component 716, which is not limited in this embodiment of the present application.
  • in an exemplary embodiment, the electronic device 700 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the methods described above.
  • non-transitory computer-readable storage medium including instructions, such as the memory 704 including instructions, which can be executed by the processor 720 of the electronic device 700 to complete the above method.
  • the non-transitory computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.
  • in an exemplary embodiment, a non-transitory computer-readable storage medium is also provided; when the instructions in the storage medium are executed by the processor of the electronic device, the electronic device can execute the speech synthesis method provided by the above-mentioned embodiments.
  • Fig. 8 is a schematic structural diagram of an electronic device 800 shown in another exemplary embodiment of the present application.
  • the electronic device 800 may be a server, which may have relatively large differences due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 822 (for example, one or more processors) And memory 832, one or more storage media 830 (such as one or more mass storage devices) for storing application programs 842 or data 844.
  • the memory 832 and the storage medium 830 may be temporary storage or persistent storage.
  • the program stored in the storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 822 may be configured to communicate with the storage medium 830, and execute a series of instruction operations in the storage medium 830 on the server.
  • the server may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input and output interfaces 858, one or more keyboards 856, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
  • An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors to realize the speech synthesis method provided by each of the foregoing embodiments.

Abstract

A speech synthesis method, apparatus, device and storage medium. The method comprises: obtaining a text to be synthesized (101); generating hidden layer features and prosodic features of the text to be synthesized, and predicting the pronunciation duration of characters in the text to be synthesized (102); generating, based on the hidden layer features, the prosodic features and the pronunciation duration, acoustic features corresponding to the text to be synthesized (103); and generating, according to the acoustic features, text speech corresponding to the text to be synthesized (104). The method helps to reduce the difficulty of speech synthesis.

Description

Speech synthesis method, apparatus, device and storage medium
This application claims priority to the Chinese patent application filed on July 7, 2021 with application number 202110769530.2 and entitled "Speech synthesis method and apparatus", the entire contents of which are incorporated by reference into the embodiments of this application.
Technical Field
The present application relates to the technical field of speech processing, and in particular to a speech synthesis method, apparatus, device and storage medium.
Background
Speech synthesis is a technology for generating corresponding audio based on text, and is widely used in application scenarios such as video dubbing.
In the related art, speech synthesis can generally be implemented based on phonemes. Phoneme-based speech synthesis needs to collect, in advance, a large number of words and their corresponding phonemes as material in order to realize text-to-speech conversion; it also needs to collect, in advance, a large number of words and their corresponding pause information as material in order to realize prosody prediction for the speech.
However, preprocessing material such as words, phonemes and pause information requires a large amount of work, and speech synthesis based on a large amount of material usually places high requirements on the processing capability of the electronic device, resulting in relatively high difficulty of speech synthesis.
Summary
The embodiments of the present application provide a speech synthesis method, apparatus, device and storage medium, which can reduce the difficulty of speech synthesis. The solutions include:
An embodiment of the present application discloses a speech synthesis method, the method being executed by an electronic device and comprising:
obtaining a text to be synthesized;
generating hidden layer features and prosodic features of the text to be synthesized, and predicting the pronunciation duration of characters in the text to be synthesized;
generating, based on the hidden layer features, the prosodic features and the pronunciation duration, acoustic features corresponding to the text to be synthesized;
generating, according to the acoustic features, text speech corresponding to the text to be synthesized.
An embodiment of the present application further discloses a speech synthesis apparatus, the apparatus comprising:
a text obtaining module, configured to obtain a text to be synthesized;
a first feature generation module, configured to generate hidden layer features and prosodic features of the text to be synthesized, and predict the pronunciation duration of characters in the text to be synthesized;
a second feature generation module, configured to generate, based on the hidden layer features of the text to be synthesized, the prosodic features and the pronunciation duration, acoustic features corresponding to the text to be synthesized;
a speech synthesis module, configured to generate, according to the acoustic features, text speech corresponding to the text to be synthesized.
An embodiment of the present application further discloses a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device executes the speech synthesis method described in the above aspects.
An embodiment of the present application further discloses an electronic device, comprising:
a memory, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors to perform the speech synthesis method described in the above aspects.
An embodiment of the present application further discloses a computer program product, the computer program product comprising computer instructions stored in a computer-readable storage medium; a processor of an electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the speech synthesis method described in the above aspects.
通过本申请实施例的语音合成方法,获取待合成文本,生成待合成文本的隐层特征以及韵律特征,以基于待合成文本提取与文本特性关联的特征信息以及与语音韵律关联的特征信息,并预测待合成文本中每一字符的时长,以便后续基于字符进行语音合成;基于待合成文本的隐层特征、韵律特征、以及待合成文本中每一字符的时长,生成合成语音所需要的待合成文本对应的声学特征;采用待合成文本对应的声学特征,生成待合成文本对应的文本语音,从而实现无需预处理大量素材,而通过提取文本中的隐层特征以及韵律特征,并基于字符预 测语音时长,实现字符级别的语音合成。而且合成语音质量较好,同时可以降低语音合成的难度,以便用户可以根据实际需要应用于不同场景中,满足用户的个性化需求。
采用本申请实施例的语音合成方法，通过提取文本中的隐层特征以及韵律特征，并基于字符预测语音时长，实现字符级别的语音合成，在保证语音合成质量的情况下，相较于音素级别的语音合成方案无需预处理大量素材，有助于降低语音合成的难度。
附图说明
图1是本申请实施例提供的一种语音合成方法的步骤流程图;
图2是本申请实施例提供的另一种语音合成方法的步骤流程图;
图3是本申请实施例的另一种语音合成方法的步骤流程图;
图4是本申请实施例提供的声学模型的结构示意图;
图5是本申请实施例提供的声学模型的训练示意图;
图6是本申请实施例提供的语音合成装置实施例的结构框图;
图7是本申请一示例性实施例示出的电子设备的结构框图;
图8是本申请另一示例性实施例示出的电子设备的结构示意图。
具体实施方式
为使本申请的上述目的、特征和优点能够更加明显易懂,下面结合附图和具体实施方式对本申请作进一步详细的说明。
本申请实施例为了降低语音合成的难度,采用字符级别的语音合成方式,不需要获取音素输入,而可以直接预测待合成文本中每一字符的时长,生成待合成文本对应的韵律特征。其后基于待合成文本中每一字符的时长以及待合成文本的韵律特征,生成待合成文本对应的声学特征,并最终基于声学特征合成待合成文本对应的语音,可以使语音合成的流程更加简单,降低语音合成的难度。面对多种不同的个性化需求,也可以较为便利地实现对个性化需求的支持。
本申请实施例提供的语音合成方法,可以由具有语音合成需求的电子设备执行,该电子设备可以是智能手机、平板电脑、车载终端、智能电视、可穿戴式设备、便携式个人计算机等算力较弱的移动终端,也可以是个人计算机、服务器等算力较强的非移动终端。在一种可能的应用场景下,本申请实施例提供的语音合成方法可以应用于视频配音场景。在该场景下,视频编辑应用可以采用该方法实现视频自动配音。比如,视频编辑应用获取到视频对应的旁白文案后,对旁白文案进行语音合成,得到旁白语音,从而基于时间轴对视频和旁白语音进行合成,实现视频自动配音。
在另一种可能的应用场景下,本申请实施例提供的语音合成方法可以应用于无障碍场景。在该场景下,视力障碍人士使用的视障辅助设备(比如视障眼镜)可以集成语音合成功能。工作过程中,视障辅助设备通过摄像头进行环境图像采集,并对环境图像进行识别,得到描述当前所处环境的环境描述文本,从而进一步通过语音合成技术将环境描述文本转换为环境描述语音,进而通过视障辅助设备的扬声器进行播放,以实现对视障人士的提醒。
在其他可能的应用场景下,本申请实施例提供的语音合成方法还可以用于听书场景。在该场景下,服务器采用本申请实施例提供的方法对听书文本进行语音合成,得到听书音频,并将听书音频发布至听书应用,以便用户选择收听。其中,服务器在进行语音合成时,可以基于听书文本合成不同语种、不同音色、不同风格的听书音频,供用户选择播放。
当然,除了上述几种可能的应用场景外,本申请实施例提供的语音合成方法可以应用于其他需要将文本转换为语音的场景,本申请实施例并不对此构成限定。
参照图1,示出了本申请实施例提供的一种语音合成方法实施例的步骤流程图,可以包括如下步骤:
步骤101,获取待合成文本;
在本申请实施例中,在需要进行语音合成的情况下,用户可以提交待合成文本,电子设备从而可以获取需要进行语音合成的待合成文本。
其中,待合成文本可以为一种语言的单词、短句、长句、文章等,本申请对此不做限制。
步骤102,生成待合成文本的隐层特征以及韵律特征,并预测待合成文本中字符的发音时长;
在本申请实施例中,在获取待合成文本之后,电子设备可以提取待合成文本中的与语音合成相关的特征信息,生成待合成文本的隐层特征。其中,隐层特征可以与待合成文本中字符的字符词性、字符上下文关联、字符情感等文本特性存在关联,通常可以采用向量形式表达。
一般来说，在得到待合成文本的隐层特征之后，通常可以基于隐层特征中隐含的待合成文本的字符词性、字符上下文关联、字符情感等特征，确定文本中的字符的发音、时长、声调、语调、以及文本整体的发声节奏等，生成待合成文本对应的声音波形特征，得到声学特征。但是，仅采用隐层特征生成声学特征，通常无法获得效果较好的合成语音，合成语音可能仍然存在发声自然度不足、表现力不足的情况。
由此,电子设备可以进一步生成与声调、语调、重音、节奏等韵律特性存在关联的韵律特征,并预测待合成文本中字符的发音时长,以便在后续的语音合成过程中,可以得到更加自然、表现力更好的合成语音,同时字符的发声时长预测可以更加准确。
可选的,电子设备预测得到待合成文本中的每个字符的发音时长,或者,电子设备预测得到待合成文本中部分字符的发音时长,该部分字符可以包括关键字符。为了方便表述,下述实施例中以预测每个字符的发音时长为例进行示意性说明。
其中,字符可以为语言学中可以辨认的抽象图形符号,文字中最小的区别性单位。例如,英语中的字母“a、b、c”等,中文中的汉字“你、我、他”等,日语中的平假名“あ、い、う”等。
在一些实施例中,在文本中,根据词性、上下文、情感等因素,字符可以分别具有对应的发音时长。若存在字符不需要发音的情况,发音时长也可以为0。电子设备可以以字符为单位,预测其发音所需的时长,以便合成的语音可以具有更加准确的发音时长,使最终的合成语音具有较好的效果。
在一些实施例中,由于基于待合成文本提取得到的隐层特征可以与字符词性、字符上下文关联、字符情感等文本特性存在关联,因此在预测字符的发音时长时,可以基于隐层特征进行字符发音时长预测,以根据词性、上下文、情感等因素预测字符时长,得到较为准确的时长预测效果。
步骤103,基于隐层特征、韵律特征以及发音时长,生成待合成文本对应的声学特征;
在本申请实施例中，在得到待合成文本中字符的发音时长以及待合成文本的隐层特征以及韵律特征之后，可以基于待合成文本的隐层特征中隐含的文本相关特征、韵律特征中隐含的韵律相关特征、以及待合成文本中字符的发音时长，生成待合成文本对应的声音波形特征，得到声学特征。由于在语音合成的过程中在隐层特征的基础上进一步考虑了韵律特征以及字符的发音时长，因此生成的声音波形特征可以具有更加准确的韵律以及发音时长，使合成的语音可以具有较好的发音自然度以及表现力。
其中，声学特征可以为声音的波形特征信息，例如，随时间变化的响度、频率信息。声学特征可以采用频谱图表达，例如，梅尔谱、线性谱等。
步骤104,根据声学特征,生成待合成文本对应的文本语音。
在本申请实施例中,由于声音为物体振动产生的波,在得到声音的波形特征之后,即可以还原声音信号。由此,在得到待合成文本对应的声学特征之后,电子设备即可以采用待合成文本对应的声学特征还原声音信号,生成待合成文本对应的文本语音,完成待合成文本的语音合成。
由于语音合成过程中基于字符预测待合成文本的时长，同时生成隐层特征以及韵律特征，并最终基于待合成文本的隐层特征、韵律特征以及待合成文本中字符的发音时长，生成待合成文本对应的声学特征，完成基于字符级别的语音合成，可以无需提取大量的单词、音素、停顿等信息构建语音库，使语音合成的流程更加简单，降低语音合成的难度，且由于生成声学特征的过程中在隐层特征的基础上进一步参考了韵律特征以及字符的发音时长，可以使语音合成的质量进一步提高。此外，面对用户需要合成不同人物的语音等个性化需求时，也可以较为简单地完成语音合成的个性化支持。
综上所述，采用本申请实施例的语音合成方法，通过提取文本中的隐层特征以及韵律特征，并基于字符预测语音时长，实现字符级别的语音合成，在保证语音合成质量的情况下，相较于音素级别的语音合成方案无需预处理大量素材，有助于降低语音合成的难度。
参照图2,示出了本申请实施例提供的一种语音合成方法实施例的步骤流程图,可以包括如下步骤:
步骤201,获取待合成文本;
步骤202,采用待合成文本对应的声学模型,生成待合成文本的隐层特征以及韵律特征,并预测所述待合成文本中字符的发音时长;
在一种可能的实施方式中，声学模型中可以包含多个子模型，其中一个子模型可以用于预测待合成文本中字符的发音时长，一个子模型可以用于生成待合成文本的韵律特征，一个子模型可以用于生成待合成文本的隐层特征。
在模型训练过程中,用于预测字符时长的子模型可以将待合成文本作为输入,并将待合成文本中每一字符的发音时长作为输出。用于生成韵律特征的子模型将待合成文本作为输入,并将待合成文本的韵律特征作为输出。用于生成隐层特征的模型将待合成文本作为输入,并将待合成文本的隐层特征作为输出。
在本申请实施例中,根据训练过程中使用的语音样本的区别,声学模型可以具有多种类型。在一种可能的实施方式中,声学模型可以适配多种不同的语种,例如适用于中文的声学模型、适用于英语的声学模型、适用于日语的声学模型等。声学模型还可以具有个性化的语音风格,例如,女高音、男中音、女低音、男低音、儿童音、特定卡通人物的语音风格、特定明星的语音风格等。同时,由于声学模型基于字符进行语音合成,无需提取大量的单词、音素、停顿等信息构建语音库,因此声学模型的训练过程可以较为简单。由此,可以较为容易根据用户不同的需求,部署相应的声学模型,满足多语种、个性化风格语音的需求。
可选的,在获取待合成文本后,电子设备还可以根据待合成文本对应的语种和语音风格中的至少一种要求,选取适用于语种和/或语音风格的声学模型对待合成文本进行处理,生成待合成文本的隐层特征以及韵律特征,并预测待合成文本中字符的发音时长。后续在生成声学特征的过程中,也可以采用适用于语种和/或语音风格的声学模型进行处理,从而可以满足语音合成的个性化需求。
步骤203,采用声学模型,基于待合成文本的隐层特征、韵律特征以及待合成文本中字符的发音时长,生成待合成文本对应的声学特征;
在本申请实施例中，在得到待合成文本中字符的发音时长以及待合成文本的隐层特征以及韵律特征之后，可以采用声学模型，基于待合成文本的隐层特征、韵律特征以及发音时长，生成待合成文本对应的声学特征。声学特征可以为声音的波形特征信息，例如，随时间变化的响度、频率信息。声学特征可以采用频谱图表达，例如，梅尔谱、线性谱等。
在一种可能的实施方式中,声学模型还可以包含用于生成待合成文本对应声学特征的子模型。在模型训练过程中,用于合成待合成文本对应声学特征的子模型可以以待合成文本的隐层特征、韵律特征以及字符的发音时长作为输入,并将待合成文本对应的声学特征作为输出,从而可以得到用于合成待合成文本对应声学特征的子模型。
步骤204,根据声学特征,生成待合成文本对应的文本语音。
本步骤的实施方式可以参考上述步骤104,本实施例在此不作赘述。
在本申请的一种实施例中,声学模型采用如下方式训练得到:
S11,获取训练文本以及所述训练文本对应的训练音频,训练文本采用训练语种;
在本申请实施例中，在需要训练某一语种的声学模型的情况下，可以获取采用该语种的训练文本以及训练文本对应的训练音频。
其中,训练语种可以为不同地区使用的语言,例如,中文、英语、日语、韩语、法语等;也可以为某种语言分支下的地方方言,例如,客家语、粤语等。
S12,采用训练文本以及训练文本对应的训练音频,对待训练的声学模型进行训练,得到训练完成的训练语种的声学模型。
在本申请实施例中,电子设备可以采用训练语种对应的训练文本以及训练文本对应的训练音频,对待训练的声学模型进行训练,得到训练完成的训练语种的声学模型。其中,训练完成的声学模型可以适用于该训练语种的语音合成。
在一种可能的实施方式中，声学模型可以采用端到端模型的形式，声学模型中包含的子模型是相互联系而非独立的，子模型的输入可以为其他子模型的输出。同时声学模型在训练过程中，也可以整体地进行训练，在获取声学模型的最终输出的声学特征之后，基于声学模型的最终输出，对声学模型中的各个子模型进行调整，得到训练完成的训练语种的声学模型。
在一种可能的实施方式中,采用训练语种的训练文本以及训练文本对应的训练音频包括来自若干人的训练文本以及训练文本对应的训练音频,即训练文本和训练音频来自不同发音对象,以此提高训练得到的声学模型对同一语种下不同发音对象的泛化性;
可选的,采用训练文本以及训练文本对应的训练音频,对待训练的声学模型进行训练,得到训练完成的训练语种的声学模型,包括如下步骤:
S21，采用来自不同发音对象的训练文本以及训练音频，对待训练的声学模型进行训练，得到训练完成的训练语种的声学模型；
在本申请实施例中，电子设备采用训练语种中若干发音对象的训练文本以及训练文本对应的训练音频，对待训练的声学模型进行训练，得到训练完成的训练语种的声学模型，可以使声学模型学习到训练语种普遍的发声规律，降低声学模型的错误率，提高声学模型输出的声学特征的特征质量。
S22,采用训练语种中目标语音风格的训练文本以及训练音频,对待训练的声学模型进行训练,得到训练完成的目标语音风格的声学模型。
在本申请实施例中,在基于采用若干发音对象的语音训练完成的声学模型的基础上,电子设备可以进一步采用训练语种中目标语音风格的训练文本以及该训练文本对应的训练音频,对声学模型进行训练,得到目标语音风格的声学模型。
由于目标语音风格的声学模型在采用若干发音对象的语音训练完成的声学模型的基础上进一步训练得到,因此,基于目标语音风格的声学模型合成的语音,在明显具有目标语音风格的同时,具有较高的发声准确率,同时发声音质也可以得到一定程度的提高。
本实施例中,电子设备首先基于来自不同发音对象的训练文本以及训练音频训练声学模型,然后在该声学模型的基础上,进一步基于不同语音风格的训练文本以及训练音频,训练该语音风格对应的声学模型,在保证语音合成质量的前提下,使合成语音具备特定的语音风格。
参照图3,示出了本申请实施例提供的一种语音合成方法实施例的步骤流程图,可以包括如下步骤:
步骤301,获取待合成文本;
在本申请实施例中，可以采用声学模型完成语音合成。该声学模型可以包括多个子模型。在一种可能的实施方式中，该声学模型可以包括编码器（encoder）、时长模型、变分自动编码器（Variational AutoEncoder，VAE）以及解码器（decoder）。待合成文本需要经过编码器、时长模型、变分自动编码器以及解码器的处理，最终得到待合成文本对应的声学特征。
在一种可能的设计中，声学模型可以为端到端模型，声学模型中编码器、时长模型、变分自动编码器以及解码器是相互联系而非独立的。编码器以及变分自动编码器可以不输出独立的结果，而是输出模型处理过程中产生的中间向量，中间向量再输入解码器中，得到待合成文本的声学特征。对于用户来说，其可以将待合成文本输入声学模型中，即可以直接获取声学模型输出的声学特征。通过采用端到端模型构建声学模型，可以进一步简化声学模型的结构，提高声学模型将待合成文本转换为声学特征的效率。
步骤302,通过编码器对待合成文本进行特征提取,得到待合成文本的隐层特征;
在本申请实施例中,编码器可以学习待合成文本的潜在信息,输出与字符词性、字符上下文关联、字符情感等文本特性存在关联的隐层特征,以便后续模型可以基于隐层特征做进一步处理。其中,编码器输出的隐层特征可以采用向量形式进行表达。由于编码器输出的待合成文本的隐层特征可以认为是模型中间处理过程中的输出,其可以不具备可解释性。
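为便于理解编码器“文本到隐层特征”这一步的输入输出关系，下面给出一个基于PyTorch的最小示意（假设性实现，并非本申请实际采用的网络结构，词表大小、向量维度等均为任意取值）：

```python
import torch
import torch.nn as nn

class CharacterEncoder(nn.Module):
    """字符级编码器示意：字符嵌入 + 双向 LSTM，输出隐层特征序列（假设性结构）。"""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: [batch, num_chars] -> 隐层特征: [batch, num_chars, hidden_dim]
        x = self.embedding(char_ids)
        hidden, _ = self.rnn(x)
        return hidden

# 用法示意：batch=2，每条文本 8 个字符
encoder = CharacterEncoder(vocab_size=6000)
hidden = encoder(torch.randint(1, 6000, (2, 8)))
print(hidden.shape)  # torch.Size([2, 8, 256])
```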
步骤303,基于隐层特征,通过时长模型预测待合成文本中字符的发音时长;
在一种可能的实施方式中,在获得待合成文本的隐层特征之后,即可以采用与编码器对应的解码器,根据隐层特征隐含的字符词性、字符上下文关联、字符情感、字符时长等文本特性,生成待合成文本对应的声学特征。
但是，在仅采用编码器与解码器生成待合成文本对应的声学特征的情况下，通常无法获得效果较好的合成语音，合成语音可能仍然存在发声自然度不足、表现力不足的情况。
由此,为了提高合成语音的质量,在另一种可能的实施方式中,电子设备可以采用时长模型对字符的发音时长进行预测,以进一步提高合成语音中字符发音时长的准确率以提高发声自然度。在获取所述隐层特征之后,可以将隐层特征输入时长模型中,时长模型可以通过隐层特征隐含的与字符词性、字符上下文关联、字符情感等文本特性存在关联的信息,预测待合成文本中字符对应语音的持续时间,即字符的发音时长。
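下面是时长模型的一个最小示意（假设性实现）：它以编码器输出的隐层特征为输入，为每个字符回归一个非负的发音时长（以帧数计），卷积层数、通道数等超参数均为示例取值。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DurationModel(nn.Module):
    """时长模型示意：一维卷积 + 线性层，在隐层特征上预测每个字符的发音时长（帧数）。"""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, num_chars, hidden_dim]
        x = self.conv(hidden.transpose(1, 2)).transpose(1, 2)
        # softplus 保证预测时长非负；不需要发音的字符可被预测为接近 0
        return F.softplus(self.proj(x)).squeeze(-1)  # [batch, num_chars]
```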
步骤304,基于隐层特征,通过变分自动编码器提取待合成文本的韵律特征;
在本申请实施例中,为了进一步提高合成语音的发声自然度和表现力,电子设备还可以进一步将隐层特征通过变分自动编码器输出韵律特征,从而在后续的语音合成的过程中,可以基于韵律特征提高合成语音的发声自然度和表现力。
在一种可能的实施方式中,在将隐层特征输入时长模型的同时,还可以将隐层特征输入变分自动编码器,变分自动编码器可以学习待合成文本中说话人状态的潜在表示,并输出与声调、语调、重音、节奏等韵律特性存在关联的韵律特征。韵律特征可以采用向量形式进行表达。
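下面给出变分自动编码器提取韵律特征的一个最小示意（假设性实现）：通过重参数化从隐变量分布中采样得到句子级韵律向量，并给出训练时常用的KL散度项；维度与结构均为示例取值。

```python
import torch
import torch.nn as nn

class ProsodyVAE(nn.Module):
    """变分自动编码器示意：由隐层特征推断韵律隐变量，输出句子级韵律特征向量。"""
    def __init__(self, hidden_dim: int = 256, latent_dim: int = 16):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, hidden: torch.Tensor):
        # hidden: [batch, num_chars, hidden_dim] -> 先做时间维平均，得到句子级表示
        summary = hidden.mean(dim=1)
        mu, logvar = self.mu(summary), self.logvar(summary)
        std = torch.exp(0.5 * logvar)
        prosody = mu + std * torch.randn_like(std)   # 重参数化采样
        # KL 散度项，训练时与重建损失一并优化
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return prosody, kl
```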
步骤305,基于字符的发音时长,调整隐层特征的特征长度;
在本申请实施例中,隐层特征的长度可以与语音中字符的发声时长存在关联。为了在生成声学特征的过程中,使解码器可以生成发声时长准确率高的声学特征,电子设备可以基于待合成文本中字符的发音时长,对隐层特征的长度进行调整。
在一种可能的实施方式中,隐层特征的特征长度与发音时长呈正相关关系,即发音时长越长,对应的隐层特征的特征长度越长。
例如,若隐层特征为“abc”,则可以基于待合成文本中字符的发音时长,将隐层特征调整为“aaabbbccc”。
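上述按发音时长扩展隐层特征的操作（常称为长度调节）可用如下几行代码示意，其中帧数、特征维度均为假设取值：

```python
import torch

def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """按字符发音时长扩展隐层特征：时长越长，对应特征被复制的帧数越多。
    对应正文中 "abc" -> "aaabbbccc" 的例子（示意实现，忽略 batch 内变长对齐）。"""
    # hidden: [num_chars, hidden_dim]，durations: [num_chars]，单位为帧数
    return torch.repeat_interleave(hidden, durations, dim=0)

hidden = torch.eye(3)                              # 三个字符 "a" "b" "c" 的示意特征
frames = length_regulate(hidden, torch.tensor([3, 3, 3]))
print(frames.shape)                                # torch.Size([9, 3])，即 "aaabbbccc"
```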
步骤306,将调整后的隐层特征以及韵律特征输入解码器中,得到待合成文本对应的声学特征;
在本申请实施例中，可以将调整后的隐层特征以及待合成文本的韵律特征输入所述解码器中，解码器根据调整后的隐层特征隐含的字符词性、字符上下文关联、字符情感、字符时长等文本特性，以及韵律特征隐含的声调、语调、重音、节奏等韵律特性，生成待合成文本对应的声学特征，其中，解码器进行解码的过程即为特征还原过程。由于解码器在参考调整后的隐层特征的基础上，进一步参考了变分自动编码器输出的韵律特征生成声学特征，可以使合成的语音的韵律特征更加准确，进一步提高了合成语音的质量。
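下面给出解码器的一个最小示意（假设性实现）：把按时长扩展后的隐层特征与句子级韵律特征拼接后，逐帧回归梅尔谱形式的声学特征；实际解码器结构通常复杂得多。

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """解码器示意：输入按时长扩展后的隐层特征与韵律特征，输出每帧的梅尔谱。"""
    def __init__(self, hidden_dim: int = 256, prosody_dim: int = 16, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.LSTM(hidden_dim + prosody_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, n_mels)

    def forward(self, frames: torch.Tensor, prosody: torch.Tensor) -> torch.Tensor:
        # frames: [batch, num_frames, hidden_dim]；prosody: [batch, prosody_dim]
        prosody = prosody.unsqueeze(1).expand(-1, frames.size(1), -1)  # 广播到每一帧
        x, _ = self.rnn(torch.cat([frames, prosody], dim=-1))
        return self.proj(x)  # [batch, num_frames, n_mels] 的声学特征
```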
步骤307,将声学特征输入声码器中,获取声码器输出的待合成文本对应的文本语音。
在本申请实施例中，在得到待合成文本对应的声学特征之后，电子设备可以将待合成文本对应的声学特征输入声码器（vocoder）中，由声码器基于声学特征生成文本语音，完成待合成文本的语音合成。
其中,声码器可以为经过训练的,用于将声学特征转换为语音的模型。声码器可以为循环神经网络、基于源-滤波器模型等,本申请实施例对此不做限制。
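为说明“声学特征到语音波形”这一步的输入输出，这里用librosa的Griffin-Lim近似代替经过训练的神经声码器做一个示意（音质不代表实际系统；采样率、帧长等参数均为假设取值）：

```python
import numpy as np
import librosa

def mel_to_wav(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """示意：把功率谱形式的梅尔谱还原为波形。实际系统中这一步通常由训练得到的
    神经声码器完成，这里仅用 librosa 内置的 Griffin-Lim 近似说明输入输出关系。"""
    # mel: [n_mels, frames]
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft, hop_length=hop)

# 用法示意（mel 为声学模型输出的梅尔谱）：
# wav = mel_to_wav(mel)
# import soundfile as sf; sf.write("output.wav", wav, 22050)
```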
图4为本申请实施例提供的一种声学模型的结构示意图。声学模型可以包括编码器401、时长模型402、变分自动编码器403以及解码器404。电子设备将待合成文本输入编码器401中,并获取编码器401输出的隐层特征。其后,可以将隐层特征输入时长模型402中,获取时长模型402输出的待合成文本中每一字符的发音时长。同时,可以将编码器401输出的隐层特征输入变分自动编码器403中,并获取变分自动编码器403输出的韵律特征。其后,可以采用待合成文本中每一字符的发音时长对隐层特征进行调整,并将调整后的隐层特征以及待合成文本的韵律特征输入解码器404中,并获取解码器404输出的待合成文本对应的声学特征。其后,可以采用预先训练的声码器,对声学特征进行处理,得到待合成文本对应的文本语音。
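结合图4的流程，可以把前文各个示意模块按“编码器、时长模型与变分自动编码器、长度调节、解码器”的顺序串起来，得到如下前向计算示意（假设CharacterEncoder、DurationModel、ProsodyVAE、length_regulate、Decoder即为前文给出的示意实现，batch取1以省略变长对齐等细节）：

```python
import torch

# 示意：按照图4的顺序组合前文的示意模块（假设它们已在当前作用域中定义）
encoder, dur_model = CharacterEncoder(vocab_size=6000), DurationModel()
vae, decoder = ProsodyVAE(), Decoder()

char_ids = torch.randint(1, 6000, (1, 8))                        # 待合成文本的 8 个字符
hidden = encoder(char_ids)                                       # 1. 编码器 -> 隐层特征
durations = dur_model(hidden).round().long().clamp(min=1)        # 2. 时长模型 -> 每字符帧数
prosody, _ = vae(hidden)                                         # 3. 变分自动编码器 -> 韵律特征
frames = length_regulate(hidden[0], durations[0]).unsqueeze(0)   # 4. 按时长扩展隐层特征
mel = decoder(frames, prosody)                                   # 5. 解码器 -> 声学特征（梅尔谱）
print(mel.shape)                                                 # [1, 总帧数, 80]
```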
在本申请的一种实施例中,声学模型采用如下方式训练得到:
S31,获取训练文本以及训练文本对应的训练音频;
在本申请实施例中，电子设备可以获取训练文本以及训练文本对应的训练音频，训练音频可以为同一种语言中来自多个不同的人物或来自同一人物的音频。训练文本可以为一种语言的单词、短句、长句、文章等，本申请实施例对此不做限制。
S32,提取训练音频中的目标声学特征;
在本申请实施例中，电子设备可以提取训练音频中的目标声学特征，作为声学模型整体的训练目标。目标声学特征可以为声音的波形特征信息，例如，随时间变化的响度、频率信息。声学特征可以采用频谱图表达，例如，梅尔谱、线性谱等。
在一种可能的实施方式中，电子设备可以采用声学特征提取算法，从训练音频中提取目标声学特征。例如，可以采用MFCC（Mel Frequency Cepstrum Coefficient，梅尔频率倒谱系数算法）、FBank（Filter Banks，滤波器组算法）、LogFBank（Log Filter Banks，对数滤波器组算法）等，本申请实施例对此不做限制。
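下面用librosa示意从训练音频中提取梅尔谱与MFCC作为目标声学特征的过程（音频路径、帧长、帧移、通道数等均为假设取值，并非本申请限定的配置）：

```python
import librosa

# 读取一条训练音频（路径为假设值）
wav, sr = librosa.load("train_utt.wav", sr=16000)

# 梅尔谱：常作为声学模型的训练目标（取对数后数值更稳定）
mel = librosa.feature.melspectrogram(y=wav, sr=sr,
                                     n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)

# MFCC：另一种常见的声学特征
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)   # [80, 帧数], [13, 帧数]
```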
S33,将训练文本输入待训练的声学模型中,获取待训练的声学模型输出的模型声学特征;
在本申请实施例中，电子设备可以将训练文本输入待训练的声学模型中，经过声学模型中编码器、时长模型、变分自动编码器、解码器的处理，模型可以输出一个模型声学特征。对声学模型进行训练的过程，即模型声学特征向目标声学特征逼近的过程。
在一种可能的实施方式中，对于变分自动编码器来说，其在训练中，可以将编码器输出的隐层特征以及目标声学特征皆作为输入。变分自动编码器可以将目标声学特征以及编码器输出的隐层特征通过其自身包含的两个神经网络拟合成一个值，变分自动编码器可以学习该值，其后在应用阶段中，在获取编码器输出的隐层特征之后，即可基于隐层特征以及其自身学习到的值，相应地输出待合成文本的韵律特征。
S34,确定模型声学特征与目标声学特征之间的特征相似度;
在本申请实施例中,电子设备可以通过计算模型声学特征与目标声学特征之间的特征相似度,以确定模型声学特征是否与目标声学特征接近,进而确定声学模型是否已经完成训练。
在一种可能的实施方式中,当声学特征采用向量化表示时,电子设备可以计算模型声学特征与目标声学特征之间的向量距离,从而将向量距离确定为特征相似度。
S35,基于特征相似度,调整待训练的声学模型中的模型参数,完成声学模型训练。
在本申请实施例中,电子设备可以基于模型声学特征以及目标声学特征之间的特征相似度(作为损失函数),调整待训练的声学模型中的模型参数,使声学模型输出的模型声学特征可以不断接近目标声学特征。
在一种可能的实施方式中,电子设备可以采用梯度下降或反向传播算法,调整声学模型的模型参数。
其后,若模型声学特征以及目标声学特征之间的特征相似度满足预设条件,可以认为声学模型训练完成。
预设条件可以为模型声学特征以及目标声学特征之间的特征相似度高于预设阈值;模型声学特征以及目标声学特征之间的相似度基本不再变化等,本申请实施例对此不做限制。
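以模型声学特征与目标声学特征之间的L1距离作为损失、用梯度下降更新模型参数的训练过程，可用如下最小示意表示（acoustic_model、dataloader均为假设对象，省略了变长对齐、学习率调度、停止条件判断等细节）：

```python
import torch
import torch.nn as nn

def train_acoustic_model(acoustic_model, dataloader, num_steps=100_000, lr=1e-4):
    """示意：以特征差异（L1）作为损失训练声学模型，差异越小表示特征相似度越高。"""
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=lr)
    step = 0
    while step < num_steps:
        for char_ids, target_mel in dataloader:        # (训练文本, 目标声学特征)
            pred_mel = acoustic_model(char_ids)        # 模型声学特征
            loss = criterion(pred_mel, target_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= num_steps:
                break
    return acoustic_model
```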
为了进一步提高合成语音的真实度,在训练声学模型过程中,还可以引入对抗生成网络(Generative Adversarial Networks,GAN)的思想。在一种可能的实施方式中,通过声学模型得到模型声学特征后,电子设备将模型声学特征输入声码器中,获取声码器输出的合成音频。
进一步的,电子设备将合成音频以及训练音频输入判别器,得到合成音频对应的第一判别结果,以及训练音频对应的第二判别结果,其中,该判别器用于判别输入的音频为训练音频或合成音频,即用于判别真实音频和生成音频。
在模型训练过程中,电子设备除了以特征相似度作为损失的一部分外,还将判别器的判别损失作为损失的一部分,从而基于特征相似度、第一判别结果以及第二判别结果,调整待训练的声学模型中的模型参数以及判别器,完成声学模型训练。
在一种可能的实施方式中,在声学模型与判别器构成生成对抗网络的情况下,基于特征相似度损失对声学模型中的模型参数进行梯度更新,提高声学模型生成模型声学特征的准确度。同时,基于判别损失对判别器进行调整,提高自身区分模型声学特征以及目标声学特征的能力。声学模型以及判别器可以相互对抗,相互提高模型输出的准确率,最终可以得到具有较高准确率的声学模型。
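下面给出引入判别器后一次对抗训练步骤的最小示意（acoustic_model、discriminator、vocoder均为假设模块，判别器输入音频并输出“为真实训练音频”的logit；损失权重、判别器结构等在实际系统中需另行设计）：

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def adversarial_step(acoustic_model, discriminator, vocoder,
                     char_ids, target_mel, train_audio, opt_g, opt_d):
    """示意：特征相似度损失 + 判别损失的一次对抗训练更新。"""
    # 1. 更新判别器：区分训练音频与合成音频
    with torch.no_grad():
        fake_audio = vocoder(acoustic_model(char_ids))
    real_logit = discriminator(train_audio)
    fake_logit = discriminator(fake_audio)
    d_loss = bce(real_logit, torch.ones_like(real_logit)) + \
             bce(fake_logit, torch.zeros_like(fake_logit))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2. 更新声学模型：特征相似度损失 + "欺骗"判别器的对抗损失
    pred_mel = acoustic_model(char_ids)
    fake_logit = discriminator(vocoder(pred_mel))
    g_loss = nn.functional.l1_loss(pred_mel, target_mel) + \
             bce(fake_logit, torch.ones_like(fake_logit))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```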
在本申请实施例中,通过整体地训练声学模型中的子模型,基于声学模型的最终输出对声学模型中的子模型进行训练,可以使声学模型中每一子模型可以具有相同的训练目标,使声学模型中子模型之间可以具有更好的契合度,获得更好的语音合成效果。同时,采用生成对抗网络对声学模型进行训练,可以进一步提高声学模型的效果,使最终生成的合成语音更加真实。
在本申请的一种实施例中,上述时长模型采用如下方式训练得到:
S41,提取训练音频中字符的标准时长;
在本申请实施例中,电子设备可以在声学模型整体训练的基础上,进一步针对时长模型进行训练,以提高时长模型预测字符发音时长的准确性,使所述声学模型的输出可以更加准确。
其中,电子设备提取得到的训练音频中字符的标准时长,可以认为是字符正确的发音时长。训练音频中字符的标准时长的提取可以采用模型进行提取,也可以采用人工进行提取,本申请实施例对此不做限制。
S42,将训练文本输入编码器中,获取编码器输出的隐层特征;
在本申请实施例中,时长模型的输入可以为编码器的输出,由此,可以将训练文本输入编码器中,并获取编码器输出的隐层特征,以对时长模型进行训练。
S43,将隐层特征作为时长模型的输入,将训练音频中字符的标准时长作为训练目标,对时长模型进行训练。
在本申请实施例中，电子设备可以将隐层特征输入时长模型，得到时长模型输出的预测时长，从而将训练音频中字符的标准时长作为预测时长的监督，对时长模型进行训练，得到训练完成的时长模型。通过对时长模型进一步进行训练，可以进一步提高时长模型输出的准确率，使最终合成的语音可以具有更好的质量。
在一种可能的实施方式中，电子设备可以将隐层特征输入时长模型中，时长模型可以输出训练文本中字符的预测时长。其后，电子设备可以确定时长模型输出的预测时长与训练音频中字符的标准时长之间的时长差值，并根据时长差值对时长模型中的模型参数进行调整，直至时长模型的输出满足预设条件，时长模型训练完成。
预设条件可以为时长模型输出的预测时长与标准时长之间的差值小于预设阈值，也可以为时长模型输出的预测时长与标准时长之间的差值基本不再变化等，本申请实施例对此不做限制。
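以切分模型给出的标准时长为监督、对时长模型做一次参数更新的过程，可用如下示意表示（encoder、duration_model沿用前文示意模块，std_durations为每字符帧数；真实实现中常对时长取对数后再回归）：

```python
import torch
import torch.nn as nn

def duration_step(encoder, duration_model, optimizer, char_ids, std_durations):
    """示意：以标准时长监督时长模型的一次训练更新。"""
    with torch.no_grad():
        hidden = encoder(char_ids)                 # 时长模型的输入来自编码器输出
    pred = duration_model(hidden)                  # 预测每个字符的发音时长
    loss = nn.functional.mse_loss(pred, std_durations.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```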
在本申请的一种实施例中,提取训练音频中字符的标准时长,包括:
S51,基于训练音频以及隐层特征,通过切分模型标注训练音频中字符的发音起点与发音终点;
在本申请实施例中,在训练过程中,可以采用切分模型对时长模型进行训练。切分模型可以用于切分训练文本中的每一个字符,并相应地标注每一字符在训练音频中的发音起点以及发音终点,从而可以得知训练文本中每一字符对应的发音时长。可以认为切分模型输出的字符时长是正确的字符时长,从而可以基于切分模型的输出,对时长模型进行训练。
由此,电子设备可以将训练音频以及隐层特征输入切分模型中,以获取切分模型的输出。
在本申请实施例中,切分模型可以基于隐层特征,预测训练音频每一帧对应的字符。其后,可以将字符对应的最早一帧作为字符在训练音频中的发音起点,将字符对应的最晚一帧作为字符在训练音频中的发音终点,从而可以实现标注训练音频中每一字符的发音起点与发音终点。
可选的，目标声学特征可以记载有训练音频在连续时长中频率、响度的变化。由此，切分模型可以在目标声学特征的基础上，预测目标声学特征中每一帧对应的字符，对字符的起点与终点进行标注。
S52,基于训练音频中字符的发音起点与发音终点,确定训练音频中字符的标准时长。
在一种可能的实施方式中,电子设备可以将字符的发音起点与发音终点之间的时间差,作为字符对应的标准时长,从而可以得到训练音频中每一字符的标准时长。
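若切分模型给出训练音频每一帧对应的字符序号，则按“最早帧为发音起点、最晚帧为发音终点”统计每个字符的标准时长（帧数），可以简单示意如下：

```python
import torch

def frame_alignment_to_durations(frame_chars: torch.Tensor, num_chars: int) -> torch.Tensor:
    """示意：由每帧对应的字符序号统计每个字符占用的帧数，即其标准时长（单位：帧）。"""
    # frame_chars: [num_frames]，取值为 0..num_chars-1
    return torch.bincount(frame_chars, minlength=num_chars)

# 例：10 帧音频，字符 0 占 3 帧、字符 1 占 5 帧、字符 2 占 2 帧
frames = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1, 2, 2])
print(frame_alignment_to_durations(frames, num_chars=3))   # tensor([3, 5, 2])
```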
可选的,在声学模型的训练过程中,同样可以基于模型声学特征以及目标声学特征之间的特征相似度,调整切分模型的模型参数。由此,在训练过程中,切分模型也可以不断提高自身切分训练文本中的每一个字符并确定每一字符时长的准确率。从而在训练过程中,时长模型也可以获得更加准确的训练目标,可以提高时长模型输出的准确率,并使声学模型最终输出的声学特征可以具有更高的准确率。
由此,声学模型可以实现端到端的学习,可以基于声学模型整体最终输出的声学特征,对声学模型中的每一子模型以及切分模型进行训练,得到具有较高准确率的声学模型。同时,训练过程中,由于采用对抗训练的方式对声学模型以及切分模型进行训练,同时采用切分模型的输出对时长模型进行训练,从而声学模型在训练过程中,可以在无人工监督或者较少人工监督的情况下完成模型的训练,便于声学模型可以较为简便地适配多种语种以及不同发音对象的需求。
图5为本申请实施例提供的一种声学模型的训练示意图。在训练过程中，可以将训练文本输入编码器501中，并获取编码器501输出的隐层特征，其后，可以将隐层特征输入时长模型502中，获取时长模型502输出的每一字符的时长。还可以将隐层特征以及目标声学特征输入切分模型505中，以获取切分模型505输出的标准时长。可以将切分模型505输出的标准时长作为时长模型502的训练目标，将隐层特征作为时长模型502的输入，对时长模型502进行训练。
同时，还可以将隐层特征以及从训练音频提取得到的目标声学特征输入变分自动编码器503中，并获取变分自动编码器503输出的韵律特征。其后，解码器504可以基于隐层特征、每一字符的时长、以及韵律特征，输出模型声学特征。其后，可以采用判别器506对模型声学特征对应的合成音频以及训练音频进行判别，并确定模型声学特征与目标声学特征之间的特征相似度，同时调整待训练的声学模型中每一子模型的模型参数以及判别器的参数，最终得到训练完成的声学模型。
需要说明的是，对于方法实施例，为了简单描述，故将其都表述为一系列的动作组合，但是本领域技术人员应该知悉，本申请实施例并不受所描述的动作顺序的限制，因为依据本申请实施例，某些步骤可以采用其他顺序或者同时进行。其次，本领域技术人员也应该知悉，说明书中所描述的实施例均属于优选实施例，所涉及的动作并不一定是本申请实施例所必须的。
参照图6,示出了本申请实施例提供的一种语音合成装置的结构框图,该装置可以包括如下模块:
文本获取模块601,用于获取待合成文本;
第一特征生成模块602,用于生成所述待合成文本的隐层特征以及韵律特征,并预测所述待合成文本中字符的发音时长;
第二特征生成模块603,用于基于所述隐层特征、所述韵律特征以及所述发音时长,生成所述待合成文本对应的声学特征;
语音合成模块604,用于根据所述声学特征,生成所述待合成文本对应的文本语音。
在本申请一种实施例中,所述第一特征生成模块602,用于:
采用所述待合成文本对应的声学模型，生成所述隐层特征以及韵律特征，并预测所述发音时长，所述声学模型基于所述待合成文本对应的语种和语音风格中的至少一种确定得到。
在本申请一种实施例中,所述声学模型包括编码器、时长模型和变分自动编码器,所述第一特征生成模块602,用于:
通过所述编码器对所述待合成文本进行特征提取,得到所述待合成文本的所述隐层特征;
基于所述隐层特征,通过所述时长模型预测所述待合成文本中字符的所述发音时长;
基于所述隐层特征，通过所述变分自动编码器提取所述待合成文本的所述韵律特征。
在本申请一种实施例中,所述声学模型包括解码器;
所述第二特征生成模块603，用于：
基于所述发音时长,调整所述隐层特征的特征长度;
将调整后的所述隐层特征以及所述韵律特征输入所述解码器,得到所述待合成文本对应的所述声学特征。
在本申请一种实施例中,所述语音合成模块604,用于:
将所述声学特征输入声码器中,获取所述声码器输出的所述待合成文本对应的所述文本语音。
在本申请一种实施例中,所述声学模型采用如下模块训练得到:
训练模块,用于获取训练文本以及所述训练文本对应的训练音频,所述训练文本采用训练语种;
采用所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型。
在本申请一种实施例中,所述训练文本和训练音频来自不同发音对象;
所述训练模块,用于:
采用来自不同发音对象的所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型;
采用所述训练语种中目标语音风格的所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述目标语音风格的声学模型。
在本申请一种实施例中,所述训练模块,用于:
提取所述训练音频中的目标声学特征;
将所述训练文本输入待训练的声学模型中,获取所述待训练的声学模型输出的模型声学特征;
确定所述模型声学特征与所述目标声学特征之间的特征相似度;
基于所述特征相似度,调整所述待训练的声学模型中的模型参数,完成所述声学模型训练。
在本申请一种实施例中,所述训练模块,还用于:
将所述模型声学特征输入声码器中,获取所述声码器输出的合成音频;
将所述合成音频以及所述训练音频输入判别器,得到所述合成音频对应的第一判别结果,以及所述训练音频对应的第二判别结果,所述判别器用于判别输入的音频为训练音频或合成音频;
基于所述特征相似度、所述第一判别结果以及所述第二判别结果,调整所述待训练的声学模型中的模型参数以及所述判别器,完成所述声学模型训练。
在本申请一种实施例中,所述待训练的声学模型中包括编码器和时长模型;
所述训练模块,还用于:
提取所述训练音频中字符的标准时长;
将所述训练文本输入所述编码器中,获取所述编码器输出的所述训练文本的隐层特征;
将所述隐层特征作为所述时长模型的输入,将所述训练音频中字符的所述标准时长作为训练目标,对所述时长模型进行训练。
在本申请一种实施例中,所述训练模块还用于:
基于所述训练音频以及所述隐层特征,通过切分模型标注所述训练音频中字符的发音起点与发音终点;
基于所述训练音频中字符的所述发音起点与发音终点,确定所述训练音频中字符的所述标准时长。
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
参照图7,电子设备700可以包括以下一个或多个组件:处理组件702,存储器704。
处理组件702通常控制电子设备700的整体操作，诸如与显示，电话呼叫，数据通信，相机操作和记录操作相关联的操作。处理组件702可以包括一个或多个处理器720来执行指令，以完成上述的方法的全部或部分步骤。此外，处理组件702可以包括一个或多个模块，便于处理组件702和其他组件之间的交互。例如，处理组件702可以包括多媒体模块，以方便多媒体组件708和处理组件702之间的交互。
存储器704被配置为存储各种类型的数据以支持在设备700的操作。这些数据的示例包括用于在电子设备700上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器704可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
在一些实施例中,该电子设备700还可以包括电力组件706,多媒体组件708,音频组件710,输入/输出(I/O)的接口712,传感器组件714,以及通信组件716,本实施例对此不作限定。
在示例性实施例中,电子设备700可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述方法。
在示例性实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器704,上述指令可由电子设备700的处理器720执行以完成上述方法。例如,所述非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
一种非临时性计算机可读存储介质,当所述存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行上述各个实施例提供的语音合成方法
图8是本申请另一示例性实施例示出的一种电子设备800的结构示意图。该电子设备800可以是服务器，该服务器可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上中央处理器（central processing units，CPU）822（例如，一个或一个以上处理器）和存储器832，一个或一个以上存储应用程序842或数据844的存储介质830（例如一个或一个以上海量存储设备）。其中，存储器832和存储介质830可以是短暂存储或持久存储。存储在存储介质830的程序可以包括一个或一个以上模块（图示没标出），每个模块可以包括对服务器中的一系列指令操作。更进一步地，中央处理器822可以设置为与存储介质830通信，在服务器上执行存储介质830中的一系列指令操作。
服务器还可以包括一个或一个以上电源826,一个或一个以上有线或无线网络接口850,一个或一个以上输入输出接口858,一个或一个以上键盘856,和/或,一个或一个以上操作系统841,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
一种电子设备,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序,以实现上述各个实施例提供的语音合成方法。
以上对本申请所提供的语音合成方法和装置,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (15)

  1. 一种语音合成方法,所述方法由电子设备执行,所述方法包括:
    获取待合成文本;
    生成所述待合成文本的隐层特征以及韵律特征,并预测所述待合成文本中字符的发音时长;
    基于所述隐层特征、所述韵律特征以及所述发音时长,生成所述待合成文本对应的声学特征;
    根据所述声学特征,生成所述待合成文本对应的文本语音。
  2. 根据权利要求1所述的方法,其中,所述生成所述待合成文本的隐层特征以及韵律特征,并预测所述待合成文本中字符的发音时长,包括:
    采用所述待合成文本对应的声学模型,生成所述隐层特征以及韵律特征,并预测所述发音时长,所述声学模型基于所述待合成文本对应的语种和语音风格中的至少一种确定得到。
  3. 根据权利要求2所述的方法,其中,所述声学模型包括编码器、时长模型和变分自动编码器;
    所述采用所述待合成文本对应的声学模型，生成所述隐层特征以及韵律特征，并预测所述发音时长，包括：
    通过所述编码器对所述待合成文本进行特征提取,得到所述待合成文本的所述隐层特征;
    基于所述隐层特征,通过所述时长模型预测所述待合成文本中字符的所述发音时长;
    基于所述隐层特征,通过所述变分自动编码器提取所述待合成文本的所述韵律特征。
  4. 根据权利要求2所述的方法,其中,所述声学模型还包括解码器;
    所述基于所述隐层特征、所述韵律特征以及所述发音时长,生成所述待合成文本对应的声学特征,包括:
    基于所述发音时长,调整所述隐层特征的特征长度;
    将调整后的所述隐层特征以及所述韵律特征输入所述解码器,得到所述待合成文本对应的所述声学特征。
  5. 根据权利要求1所述的方法,其中,所述根据所述声学特征,生成所述待合成文本对应的文本语音,包括:
    将所述声学特征输入声码器中,获取所述声码器输出的所述待合成文本对应的所述文本语音。
  6. 根据权利要求2所述的方法,其中,所述声学模型采用如下方式训练得到:
    获取训练文本以及所述训练文本对应的训练音频,所述训练文本采用训练语种;
    采用所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型。
  7. 根据权利要求6所述的方法，其中，所述训练文本和训练音频来自不同发音对象；
    所述采用所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型,包括:
    采用来自不同发音对象的所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型;
    采用所述训练语种中目标语音风格的所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述目标语音风格的声学模型。
  8. 根据权利要求6所述的方法,其中,所述采用所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型,包括:
    提取所述训练音频中的目标声学特征;
    将所述训练文本输入待训练的声学模型中,获取所述待训练的声学模型输出的模型声学特征;
    确定所述模型声学特征与所述目标声学特征之间的特征相似度;
    基于所述特征相似度,调整所述待训练的声学模型中的模型参数,完成所述声学模型训练。
  9. 根据权利要求8所述的方法,其中,所述方法还包括:
    将所述模型声学特征输入声码器中,获取所述声码器输出的合成音频;
    将所述合成音频以及所述训练音频输入判别器,得到所述合成音频对应的第一判别结果,以及所述训练音频对应的第二判别结果,所述判别器用于判别输入的音频为训练音频或合成音频;
    基于所述特征相似度、所述第一判别结果以及所述第二判别结果,调整所述待训练的声学模型中的模型参数以及所述判别器,完成所述声学模型训练。
  10. 根据权利要求8所述的方法,其中,所述待训练的声学模型中包括编码器和时长模型;
    所述采用所述训练文本以及所述训练音频,对待训练的所述声学模型进行训练,得到训练完成的所述训练语种的声学模型,还包括:
    提取所述训练音频中字符的标准时长;
    将所述训练文本输入所述编码器中,获取所述编码器输出的所述训练文本的隐层特征;
    将所述隐层特征作为所述时长模型的输入,将所述训练音频中字符的所述标准时长作为训练目标,对所述时长模型进行训练。
  11. 根据权利要求10所述的方法，其中，所述提取所述训练音频中字符的标准时长，包括：
    基于所述训练音频以及所述隐层特征,通过切分模型标注所述训练音频中字符的发音起点与发音终点;
    基于所述训练音频中字符的所述发音起点与发音终点,确定所述训练音频中字符的所述标准时长。
  12. 一种语音合成装置,所述装置包括:
    文本获取模块,用于获取待合成文本;
    第一特征生成模块,用于生成所述待合成文本的隐层特征以及韵律特征,并预测所述待合成文本中字符的发音时长;
    第二特征生成模块,用于基于所述隐层特征、所述韵律特征以及所述发音时长,生成所述待合成文本对应的声学特征;
    语音合成模块,用于根据所述声学特征,生成所述待合成文本对应的文本语音。
  13. 一种可读存储介质,当所述存储介质中的指令由电子设备的处理器执行时,使得所述电子设备执行如方法权利要求1-11任一所述的语音合成方法。
  14. 一种电子设备,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于所述存储器中,且经配置以由一个或者一个以上处理器执行如方法权利要求1-11任一所述的语音合成方法。
  15. 一种计算机程序产品,所述计算机程序产品包括计算机指令,所述计算机指令存储在计算机可读存储介质中;电子设备的处理器从所述计算机可读存储介质读取所述计算机指令,所述处理器执行所述计算机指令,使得所述电子设备执行如权利要求1至11任一所述的语音合成方法。
PCT/CN2022/100747 2021-07-07 2022-06-23 语音合成方法、装置、设备及存储介质 WO2023279976A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/201,105 US20230298564A1 (en) 2021-07-07 2023-05-23 Speech synthesis method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110769530.2A CN113488022A (zh) 2021-07-07 2021-07-07 一种语音合成方法和装置
CN202110769530.2 2021-07-07

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/201,105 Continuation US20230298564A1 (en) 2021-07-07 2023-05-23 Speech synthesis method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023279976A1 true WO2023279976A1 (zh) 2023-01-12

Family

ID=77935691

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100747 WO2023279976A1 (zh) 2021-07-07 2022-06-23 语音合成方法、装置、设备及存储介质

Country Status (3)

Country Link
US (1) US20230298564A1 (zh)
CN (1) CN113488022A (zh)
WO (1) WO2023279976A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488022A (zh) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 一种语音合成方法和装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597492A (zh) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 语音合成方法和装置
CN111862938A (zh) * 2020-05-07 2020-10-30 北京嘀嘀无限科技发展有限公司 一种智能应答方法与终端、计算机可读存储介质
CN112435650A (zh) * 2020-11-11 2021-03-02 四川长虹电器股份有限公司 一种多说话人、多语言的语音合成方法及系统
WO2021118543A1 (en) * 2019-12-10 2021-06-17 Google Llc Attention-based clockwork hierarchical variational encoder
CN113488022A (zh) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 一种语音合成方法和装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5105468A (en) * 1991-04-03 1992-04-14 At&T Bell Laboratories Time delay neural network for printed and cursive handwritten character recognition
CN109003601A (zh) * 2018-08-31 2018-12-14 北京工商大学 一种针对低资源土家语的跨语言端到端语音识别方法
CN110070852B (zh) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 合成中文语音的方法、装置、设备及存储介质
CN112289304A (zh) * 2019-07-24 2021-01-29 中国科学院声学研究所 一种基于变分自编码器的多说话人语音合成方法
WO2021127978A1 (zh) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 语音合成方法、装置、计算机设备和存储介质
CN112151009A (zh) * 2020-09-27 2020-12-29 平安科技(深圳)有限公司 一种基于韵律边界的语音合成方法及装置、介质、设备
CN112214653A (zh) * 2020-10-29 2021-01-12 Oppo广东移动通信有限公司 字符串识别方法、装置、存储介质及电子设备
CN112735378A (zh) * 2020-12-29 2021-04-30 科大讯飞股份有限公司 泰语语音合成方法、装置以及设备
CN112786005B (zh) * 2020-12-30 2023-12-01 科大讯飞股份有限公司 信息合成方法、装置、电子设备和计算机可读存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597492A (zh) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 语音合成方法和装置
WO2021118543A1 (en) * 2019-12-10 2021-06-17 Google Llc Attention-based clockwork hierarchical variational encoder
CN111862938A (zh) * 2020-05-07 2020-10-30 北京嘀嘀无限科技发展有限公司 一种智能应答方法与终端、计算机可读存储介质
CN112435650A (zh) * 2020-11-11 2021-03-02 四川长虹电器股份有限公司 一种多说话人、多语言的语音合成方法及系统
CN113488022A (zh) * 2021-07-07 2021-10-08 北京搜狗科技发展有限公司 一种语音合成方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MENG FANBO, WANG RUIMIN, FANG PENG, ZOU SHUANGYUAN, DUAN WENJUN, ZHOU MING, LIU KAI, CHEN WEI: "The Sogou System for Blizzard Challenge 2020", JOINT WORKSHOP FOR THE BLIZZARD CHALLENGE AND VOICE CONVERSION CHALLENGE 2020, 30 October 2020 (2020-10-30), ISCA, pages 49 - 53, XP093020488, DOI: 10.21437/VCC_BC.2020-8 *

Also Published As

Publication number Publication date
CN113488022A (zh) 2021-10-08
US20230298564A1 (en) 2023-09-21

Similar Documents

Publication Publication Date Title
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
KR102581346B1 (ko) 다국어 음성 합성 및 언어간 음성 복제
US11514888B2 (en) Two-level speech prosody transfer
JP2022107032A (ja) 機械学習を利用したテキスト音声合成方法、装置およびコンピュータ読み取り可能な記憶媒体
KR20190085883A (ko) 다중 언어 텍스트-음성 합성 모델을 이용한 음성 번역 방법 및 시스템
KR20220000391A (ko) 순차적 운율 특징을 기초로 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체
CN115485766A (zh) 使用bert模型的语音合成韵律
US11810471B2 (en) Computer implemented method and apparatus for recognition of speech patterns and feedback
JP2015068897A (ja) 発話の評価方法及び装置、発話を評価するためのコンピュータプログラム
CN114242033A (zh) 语音合成方法、装置、设备、存储介质及程序产品
Chittaragi et al. Acoustic-phonetic feature based Kannada dialect identification from vowel sounds
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Seong et al. Multilingual speech synthesis for voice cloning
Cahyaningtyas et al. Synthesized speech quality of Indonesian natural text-to-speech by using HTS and CLUSTERGEN
CN113948062A (zh) 数据转换方法及计算机存储介质
CN115700871A (zh) 模型训练和语音合成方法、装置、设备及介质
Stan et al. The MARA corpus: Expressivity in end-to-end TTS systems using synthesised speech data
Sulír et al. Development of the Slovak HMM-based tts system and evaluation of voices in respect to the used vocoding techniques
Houidhek et al. Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic
Huckvale 14 An Introduction to Phonetic Technology
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
Dalva Automatic speech recognition system for Turkish spoken language
Louw Neural speech synthesis for resource-scarce languages

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22836730

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE