WO2022105553A1 - Speech synthesis method, apparatus, readable medium and electronic device - Google Patents

Speech synthesis method, apparatus, readable medium and electronic device

Info

Publication number
WO2022105553A1
WO2022105553A1 (PCT/CN2021/126431)
Authority
WO
WIPO (PCT)
Prior art keywords
text
speech synthesis
specified
training
synthesis model
Prior art date
Application number
PCT/CN2021/126431
Other languages
English (en)
French (fr)
Inventor
潘俊杰
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Priority to US18/020,198 priority Critical patent/US20230306954A1/en
Publication of WO2022105553A1 publication Critical patent/WO2022105553A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers

Definitions

  • the present disclosure relates to the technical field of electronic information processing, and in particular, to a speech synthesis method, apparatus, readable medium, and electronic device.
  • Speech synthesis refers to synthesizing the text specified by the user into audio.
  • the audio corresponding to the text needs to be generated with the help of the original sound library.
  • the data in the original sound library usually has no emotion, and correspondingly, the audio obtained by the speech synthesis processing does not have emotion either, and the expressive power of the audio is weak.
  • the present disclosure provides a speech synthesis method, the method comprising:
  • the present disclosure provides a speech synthesis device, the device comprising:
  • the acquisition module is used to acquire the text to be synthesized and the specified emotion type
  • a determination module configured to determine the specified acoustic feature corresponding to the specified emotion type
  • a synthesis module configured to input the text to be synthesized and the specified acoustic features into a pre-trained speech synthesis model, so as to obtain the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type, wherein the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is obtained by training on corpus that does not have the specified emotion type.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, comprising:
  • a processing device is configured to execute the computer program in the storage device to implement the steps of the method in the first aspect of the present disclosure.
  • the present disclosure provides a computer program product comprising instructions that, when executed by a computer, cause the computer to implement the steps of the method in the first aspect.
  • the present disclosure first obtains the text to be synthesized and the specified emotion type, then determines the corresponding specified acoustic feature according to the specified emotion type, and finally inputs the text to be synthesized and the specified acoustic feature into the pre-trained speech synthesis model.
  • The output of the speech synthesis model is the target audio that corresponds to the text to be synthesized and has the specified emotion type, wherein the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is trained on corpus that does not have the specified emotion type.
  • the present disclosure can control the speech synthesis of text through the acoustic features corresponding to the emotion types, so that the target audio output by the speech synthesis model can correspond to the acoustic features, and the expressiveness of the target audio is improved.
  • Fig. 1 is a flowchart of a speech synthesis method according to an exemplary embodiment;
  • Fig. 2 is a schematic diagram of an association relationship according to an exemplary embodiment;
  • Fig. 3 is a block diagram of a speech synthesis model according to an exemplary embodiment;
  • Fig. 4 is a flowchart of another speech synthesis method according to an exemplary embodiment;
  • Fig. 5 is a flowchart of training a speech synthesis model according to an exemplary embodiment;
  • Fig. 6 is a flowchart of another way of training a speech synthesis model according to an exemplary embodiment;
  • Fig. 7 is a flowchart of yet another way of training a speech synthesis model according to an exemplary embodiment;
  • Fig. 8 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment;
  • Fig. 9 is a block diagram of another speech synthesis apparatus according to an exemplary embodiment;
  • Fig. 10 is a block diagram of an electronic device according to an exemplary embodiment.
  • the term “including” and variations thereof are open-ended inclusions, ie, "including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a flowchart of a speech synthesis method according to an exemplary embodiment. As shown in Fig. 1 , the method includes:
  • Step 101 Acquire the text to be synthesized and the specified emotion type.
  • the text to be synthesized can be, for example, one or more sentences in a text file specified by a user, one or more paragraphs in a text file, or one or more chapters in a text file.
  • the text file may be, for example, an e-book, or other types of files, such as news, articles on official accounts, blogs, and the like.
  • In addition, a specified emotion type can also be obtained. The specified emotion type can be understood as being specified by the user, who desires the text to be synthesized into audio that conforms to the specified emotion type (that is, the target audio mentioned later).
  • the specified emotion type may be, for example, happy, surprised, disgusted, angry, shy, fearful, sad, disdain, and the like.
  • Step 102 Determine the specified acoustic feature corresponding to the specified emotion type.
  • the sounds made by people in different emotional states will have different acoustic features, so the specified acoustic features that conform to the specified emotion type can be determined according to the specified emotion type.
  • the acoustic feature can be understood as the property of sound in multiple dimensions, for example, it may include: volume (ie energy), fundamental frequency (ie pitch), speech rate (ie duration) and so on.
  • the specified acoustic feature corresponding to the specified emotion type may be determined according to the corresponding relationship between the emotion type and the acoustic feature, and the corresponding relationship between the emotion type and the acoustic feature may be established in advance, for example, may be established according to historical statistical data.
  • Alternatively, a recognition model capable of identifying acoustic features from an emotion type may be pre-trained, so that the specified emotion type is input into the recognition model and the output of the recognition model is the specified acoustic feature. The recognition model can be, for example, a neural network such as an RNN (Recurrent Neural Network), a CNN (Convolutional Neural Network), or an LSTM (Long Short-Term Memory network), which is not specifically limited in the present disclosure.
  • Step 103: input the text to be synthesized and the specified acoustic features into the pre-trained speech synthesis model, to obtain the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type, wherein the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is trained on corpus without the specified emotion type.
  • a speech synthesis model can be pre-trained.
  • The speech synthesis model can be understood as a TTS (Text To Speech) model, which can generate, according to the text to be synthesized and the specified acoustic features, the corresponding target audio that has the specified emotion type, i.e. whose acoustic features match the specified acoustic features.
  • the text to be synthesized and the specified acoustic features are used as the input of the speech synthesis model, and the output of the speech synthesis model is the target audio.
  • the speech synthesis model may be obtained by training based on the Tacotron model, the Deepvoice 3 model, the Tacotron 2 model, the Wavenet model, etc., which is not specifically limited in the present disclosure.
  • the corpus with the specified emotion type (which can be understood as a speech database) is not required, and the existing corpus without the specified emotion type can be directly used for training.
  • the acoustic feature corresponding to the specified emotion type is also considered, so that the target audio can have the specified emotion type.
  • An existing corpus without the specified emotion type can be used to realize explicit control of the emotion type in the process of speech synthesis, without spending a lot of time and labor to create an emotional corpus in advance. This improves the efficiency of speech synthesis and the expressiveness of the target audio, while also improving the user's listening experience.
  • the present disclosure first obtains the text to be synthesized and the specified emotion type, then determines the corresponding specified acoustic feature according to the specified emotion type, and finally inputs the to-be-synthesized text and the specified acoustic feature together into a pre-trained speech synthesis model.
  • The output of the speech synthesis model is the target audio that corresponds to the text to be synthesized and has the specified emotion type, wherein the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is trained on corpus that does not have the specified emotion type.
  • the present disclosure can control the speech synthesis of text through the acoustic features corresponding to the emotion types, so that the target audio output by the speech synthesis model can correspond to the acoustic features, and the expressiveness of the target audio is improved.
  • the specified acoustic characteristics include at least one of fundamental frequency, volume, and speech rate.
  • Step 102 can be implemented in the following ways:
  • According to the specified emotion type and the preset association relationship between emotion types and acoustic features, the corresponding specified acoustic feature is determined.
  • the association of emotion types with acoustic features can be determined in various suitable ways.
  • audio that conforms to a certain emotion type may be obtained first, and then the acoustic features in the audio may be determined by processing methods such as signal processing and labeling, so as to obtain the acoustic features corresponding to the emotion type.
  • the acoustic features may include at least one of fundamental frequency, volume, and speech rate, and may also include pitch, timbre, loudness, etc., which are not specifically limited in the present disclosure.
  • The association relationship can be as shown in Figure 2, where an emotion type is represented in the three dimensions of fundamental frequency, volume, and speech rate. Part (a) of Figure 2 shows the four emotion types corresponding to the low-volume (Low Energy) scene: shy, fearful, sad, and disdainful; part (b) of Figure 2 shows the four emotion types corresponding to the high-volume (High Energy) scene: surprised, happy, angry, and disgusted. Further, the association relationship can also be quantified. For example, part (a) of Figure 2 shows that shyness is located in the second quadrant with lower volume, so the acoustic feature corresponding to shyness can be determined as (volume: -2, fundamental frequency: +3, speech rate: -3).
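  • As an illustration of such a preset association relationship, the quantified mapping could be kept in a simple lookup table, as in the minimal Python sketch below. Only the "shy" entry uses the offsets quoted above; the other rows are hypothetical placeholders, not values taken from the disclosure.

```python
# Sketch of a preset emotion-type -> acoustic-feature association table (step 102).
# Only the "shy" entry reuses the offsets quoted above (volume -2, pitch +3, rate -3);
# the remaining values are hypothetical placeholders for illustration.
EMOTION_TO_ACOUSTIC = {
    "shy":      {"volume": -2, "pitch": +3, "rate": -3},
    "sad":      {"volume": -3, "pitch": -2, "rate": -2},  # placeholder values
    "surprised": {"volume": +2, "pitch": +3, "rate": +1},  # placeholder values
    "angry":    {"volume": +3, "pitch": +1, "rate": +2},  # placeholder values
}

def specified_acoustic_feature(emotion_type: str) -> dict:
    """Return the specified acoustic feature for a specified emotion type."""
    try:
        return EMOTION_TO_ACOUSTIC[emotion_type]
    except KeyError:
        raise ValueError(f"no acoustic feature associated with emotion type {emotion_type!r}")
```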
  • the target audio may be obtained by a speech synthesis model as follows:
  • text features corresponding to the text to be synthesized and predicted acoustic features corresponding to the text to be synthesized are obtained from the text to be synthesized.
  • Through the specified acoustic features, the predicted acoustic features and the text features, the target audio with the specified emotion type is obtained.
  • text features corresponding to the text to be synthesized may be extracted first, and acoustic features corresponding to the text to be synthesized may be predicted.
  • the text feature can be understood as a text vector that can represent the text to be synthesized.
  • the predicted acoustic features can be understood as the acoustic features predicted by the speech synthesis model according to the text to be synthesized, and the predicted acoustic features may include: at least one of fundamental frequency, volume, and speed of speech, and may also include: pitch, tone, loudness, etc.
  • Then, the text features and the predicted acoustic features can be combined with the specified acoustic features to generate the target audio with the specified emotion type.
  • An implementation method can superimpose the specified acoustic feature and the predicted acoustic feature to obtain an acoustic feature vector, and then generate the target audio according to the acoustic feature vector and the text vector.
  • the specified acoustic feature, the predicted acoustic feature and the text vector can also be superimposed to obtain a combined vector, and then the target audio is generated according to the combined vector, which is not specifically limited in this disclosure.
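  • The two combination options described above can be summarized in the following minimal sketch. It assumes PyTorch tensors and treats "superimposing" the two acoustic features as element-wise addition and the final combination as concatenation; these concrete choices and the tensor shapes are assumptions for illustration, not the patent's exact implementation.

```python
import torch

def combine_features(text_vec, predicted_acoustic, specified_acoustic, concat_all=False):
    # Assumed shapes: text_vec [T, D_text]; predicted_acoustic and
    # specified_acoustic [T, D_ac] (or broadcastable to it).
    if not concat_all:
        # Option 1: superimpose the specified and predicted acoustic features
        # into one acoustic feature vector, then pair it with the text vector.
        acoustic_vec = predicted_acoustic + specified_acoustic
        return torch.cat([text_vec, acoustic_vec], dim=-1)
    # Option 2: combine the specified acoustic feature, the predicted acoustic
    # feature and the text vector directly into one combined vector.
    return torch.cat([text_vec, predicted_acoustic, specified_acoustic], dim=-1)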
  • Fig. 3 is a block diagram of a speech synthesis model according to an exemplary embodiment.
  • the speech synthesis model includes: a first encoder, a second encoder and a synthesizer.
  • the structure of the first encoder can be the same as the structure of the encoder (ie Encoder) in the Tacotron model.
  • The synthesizer can be understood as a combination of the attention network (Attention), the decoder (Decoder) and the post-processing network (Post-processing) in the Tacotron model.
  • the second encoder (which can be expressed as Feature Extractor) can be understood as an extraction model, which can predict the acoustic features corresponding to the text according to the input text (that is, the predicted acoustic features mentioned later).
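  • The block structure of Fig. 3 can be summarized with the skeleton below. This is a minimal sketch assuming PyTorch; the sub-module internals are placeholders passed in from outside, since the disclosure only names the three components and their roles.

```python
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    """Skeleton of the model in Fig. 3: a first (text) encoder, a second encoder
    acting as the acoustic Feature Extractor, and a synthesizer
    (attention + decoder + post-processing). Sub-module internals are placeholders."""

    def __init__(self, first_encoder: nn.Module, second_encoder: nn.Module,
                 synthesizer: nn.Module):
        super().__init__()
        self.first_encoder = first_encoder
        self.second_encoder = second_encoder
        self.synthesizer = synthesizer

    def forward(self, text, specified_acoustic):
        text_features = self.first_encoder(text)                 # Tacotron-style encoder
        predicted_acoustic = self.second_encoder(text_features)  # Feature Extractor
        # The synthesizer combines the specified acoustic features, the predicted
        # acoustic features and the text features to produce the target audio.
        return self.synthesizer(specified_acoustic, predicted_acoustic, text_features)
```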
  • Fig. 4 is a flowchart of another speech synthesis method according to an exemplary embodiment. As shown in Fig. 4, step 103 may include:
  • Step 1031 Extract text features corresponding to the text to be synthesized through the first encoder.
  • The first encoder may include an embedding layer (Character Embedding layer), a pre-net sub-model, and a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit) sub-model.
  • Step 1032 extract the predicted acoustic features corresponding to the text to be synthesized by the second encoder.
  • the text feature determined in step 1031 may be input to the second encoder, so that the second encoder predicts the predicted acoustic feature corresponding to the text to be synthesized according to the text vector.
  • the second encoder can be, for example, a Transformer with 3 layers, 256 units, and 8 heads.
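  • For the configuration mentioned here (3 layers, 256 units, 8 heads), a Transformer encoder could be instantiated roughly as follows. This is a sketch assuming PyTorch; the final linear projection to the acoustic-feature dimension is an added assumption, not specified by the disclosure.

```python
import torch.nn as nn

d_model, n_heads, n_layers = 256, 8, 3          # "3 layers, 256 units, 8 heads"
second_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads),
    num_layers=n_layers,
)
# Hypothetical projection from the 256-dim Transformer output to the predicted
# acoustic features (e.g. fundamental frequency, volume, speech rate).
to_acoustic = nn.Linear(d_model, 3)
```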
  • Step 1033 through the synthesizer, generate the target audio according to the specified acoustic feature, the predicted acoustic feature and the text feature.
  • the synthesizer can include an attention network, a decoder, and a post-processing network.
  • The text feature can be input into the attention network first, and the attention network can add an attention weight to each element in the text vector, so that the fixed-length text feature becomes a variable-length semantic vector, where the semantic vector can represent the text to be synthesized.
  • The attention network may be, for example, a location-sensitive attention network, a GMM (Gaussian Mixture Model) attention network, or a Multi-Head Attention network, which is not specifically limited in the present disclosure.
  • the specified acoustic feature, the predicted acoustic feature and the semantic vector can be input into the decoder.
  • For example, the specified acoustic feature and the predicted acoustic feature can be superimposed to obtain an acoustic feature vector, and then the acoustic feature vector and the semantic vector can be combined as the input to the decoder.
  • the specified acoustic feature, the predicted acoustic feature and the semantic vector can be superimposed to obtain a combined vector, and then the combined vector can be used as the input of the decoder.
  • the decoder may include a pre-processing network sub-model (which may be the same as the pre-processing network sub-model included in the first encoder), Attention-RNN, Decoder-RNN.
  • the preprocessing network sub-model is used to perform nonlinear transformation on the input specified acoustic features, predicted acoustic features and semantic vectors.
  • The structure of the Attention-RNN is a layer of unidirectional, zoneout-based LSTM (Long Short-Term Memory network), which takes the output of the pre-processing network sub-model as input and, after passing through the LSTM units, outputs it to the Decoder-RNN.
  • The Decoder-RNN is a two-layer, unidirectional, zoneout-based LSTM, which outputs Mel spectrum information through the LSTM units, and the Mel spectrum information can include one or more Mel spectrum features.
  • the mel spectral information is input into the post-processing network, which can include a vocoder (eg, Wavenet vocoder, Griffin-Lim vocoder, etc.) to transform the mel spectral feature information to obtain the target audio.
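  • As one concrete possibility for this post-processing step, a Griffin-Lim style inversion of the Mel spectrogram can be sketched with librosa. The disclosure only names the vocoder options; the use of librosa, the sampling rate and the FFT parameters below are assumptions.

```python
import numpy as np
import librosa

def mel_to_waveform(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
                    hop_length: int = 256) -> np.ndarray:
    """Invert a (power) mel spectrogram [n_mels, T] to a waveform via Griffin-Lim."""
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
```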
  • the text feature may include multiple text elements, and the implementation of step 1033 may include:
  • Step 1) Through the synthesizer, determine the Mel spectrum feature at the current moment according to the current text element, the historical Mel spectrum feature, the specified acoustic feature and the predicted acoustic feature, where the current text element is the text element in the text feature that is input to the synthesizer at the current moment.
  • the historical Mel spectrum feature is the Mel spectrum feature at the last moment determined by the synthesizer.
  • Step 2 Through the synthesizer, the target audio is generated according to the Mel spectrum features at each moment.
  • The text feature may include a first number of text elements (the first number being greater than 1); correspondingly, the semantic vector output by the attention network in the synthesizer may include a second number of semantic elements, and the Mel spectrum information output by the decoder in the synthesizer may include a third number of Mel spectrum features.
  • the first quantity, the second quantity and the third quantity may be the same or different, which are not specifically limited in the present disclosure.
  • The first number of text elements are input to the attention network in the synthesizer according to a preset timestep (time step); the text element input to the attention network at the current moment is the current text element, and the historical Mel spectrum feature output by the decoder at the previous moment is also input to the attention network, so as to obtain the current semantic element output by the attention network (the current semantic element can be one or more semantic elements output by the attention network at the current moment).
  • the specified acoustic features, predicted acoustic features, historical mel spectral features and current semantic elements can be input into the decoder in the synthesizer to obtain the current mel spectral features output by the decoder.
  • After all the text features have been input to the attention network, the decoder will have sequentially output the third number of Mel spectrum features, that is, the Mel spectrum information. Finally, the Mel spectrum information (i.e., the Mel spectrum features at each moment) is input to the post-processing network in the synthesizer to obtain the target audio generated by the post-processing network.
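  • The per-timestep flow just described can be summarized as a loop. The sketch below is a minimal illustration in which `attention_net`, `decoder_step`, and `postnet` are hypothetical callables standing in for the synthesizer's attention network, decoder, and post-processing network; shapes and the initial "go" frame are assumptions.

```python
def synthesize(text_elements, specified_acoustic, predicted_acoustic,
               attention_net, decoder_step, postnet, n_mels=80):
    """Sketch of the synthesizer's per-timestep loop (callables and shapes assumed)."""
    mel_frames = []
    history_mel = [0.0] * n_mels          # assumed "go" frame: no history at step one
    for current_element in text_elements:  # current text element at each timestep
        # attention uses the current text element and the previous Mel spectrum feature
        current_semantic = attention_net(current_element, history_mel)
        # decoder combines specified/predicted acoustic features, history and semantics
        history_mel = decoder_step(specified_acoustic, predicted_acoustic,
                                   history_mel, current_semantic)
        mel_frames.append(history_mel)
    return postnet(mel_frames)            # post-processing network -> target audio
```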
  • Fig. 5 is a flowchart illustrating a training process of a speech synthesis model according to an exemplary embodiment.
  • The training process of the speech synthesis model may be included in the speech synthesis method of the present disclosure, or may be performed outside the speech synthesis method, with the trained model then applied in it.
  • the speech synthesis model is obtained by training as follows:
  • Step A: through the training audio that corresponds to the training text and does not have the specified emotion type, extract the real acoustic features corresponding to the training audio.
  • Step B: input the real acoustic features and the training text into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • For example, to train the speech synthesis model, it is necessary to first obtain the training text and the training audio corresponding to the training text.
  • the training audio may not have any emotion type.
  • real acoustic features corresponding to training audio without the specified emotion type can be extracted.
  • the real acoustic features corresponding to the training audio can be obtained by means of signal processing, labeling, etc.
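  • For the signal-processing route mentioned here, fundamental frequency, energy and a rough speech-rate statistic can be extracted with librosa. The disclosure does not name a specific tool, so the use of librosa, the parameters, and the "voiced frames per second" proxy below are assumptions for illustration.

```python
import numpy as np
import librosa

def extract_real_acoustic_features(wav_path: str):
    y, sr = librosa.load(wav_path, sr=None)
    # fundamental frequency (pitch) via probabilistic YIN
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    # volume (energy) as frame-wise RMS
    energy = librosa.feature.rms(y=y)[0]
    # a crude speech-rate proxy: voiced frames per second (placeholder definition)
    duration_s = len(y) / sr
    voiced_rate = np.count_nonzero(voiced_flag) / duration_s
    return {"pitch": float(np.nanmean(f0)), "volume": float(energy.mean()),
            "rate": float(voiced_rate)}
```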
  • the training text and real acoustic features are used as input to the speech synthesis model, and the speech synthesis model is trained based on the output of the speech synthesis model and the training audio.
  • the difference between the output of the speech synthesis model and the training audio can be used as the loss function of the speech synthesis model, with the goal of reducing the loss function, and the back-propagation algorithm is used to modify the parameters of the neurons in the speech synthesis model.
  • the parameters may be, for example, the weight (English: Weight) and the bias (English: Bias) of the neuron.
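  • The training step described here (take the difference between the model output and the training audio as the loss, then back-propagate to adjust the weights and biases) might look like the sketch below. PyTorch, Mel-spectrogram targets and an MSE loss are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, train_text, real_acoustic, train_mel_target):
    """One training iteration of the speech synthesis model (sketch)."""
    optimizer.zero_grad()
    predicted_mel = model(train_text, real_acoustic)       # output of the model
    loss = nn.functional.mse_loss(predicted_mel, train_mel_target)
    loss.backward()                                        # back-propagation
    optimizer.step()                                       # adjust weights / biases
    return loss.item()
```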
  • The speech synthesis model may include: a first encoder, a second encoder and a synthesizer. A blocking structure is provided between the first encoder and the second encoder, and the blocking structure is used to prevent the second encoder from returning the gradient to the first encoder.
  • The blocking structure can be understood as stop_gradient(), which can truncate the second loss of the second encoder, thereby preventing the second encoder from returning the gradient to the first encoder. That is, when the second encoder is adjusted based on the second loss, the first encoder will not be affected, thereby avoiding the problem of unstable training of the speech synthesis model.
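  • In code, such a blocking structure can be realized by cutting the gradient path on the tensor the first encoder passes to the second encoder, e.g. with tf.stop_gradient in TensorFlow or Tensor.detach() in PyTorch. A minimal sketch assuming PyTorch:

```python
def blocked_forward(first_encoder, second_encoder, train_text):
    """Sketch of the blocking structure between the two encoders."""
    text_features = first_encoder(train_text)
    # stop_gradient / detach: the second encoder's loss will not return a
    # gradient to the first encoder through this path.
    predicted_acoustic = second_encoder(text_features.detach())
    return text_features, predicted_acoustic
```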
  • FIG. 6 is another flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in FIG. 6 , the implementation of step B may include:
  • Step B1: extract the training text features corresponding to the training text through the first encoder.
  • Step B2: extract the predicted training acoustic features corresponding to the training text through the second encoder.
  • Step B3: generate the output of the speech synthesis model through the synthesizer according to the real acoustic features, the predicted training acoustic features and the training text features.
  • the training text can be input into the first encoder to obtain training text features corresponding to the training text output by the first encoder.
  • the training text features are input into the second encoder to obtain predicted training acoustic features corresponding to the training text features output by the second encoder.
  • the real acoustic features, the predicted training acoustic features and the training text features are input into the synthesizer, so that the output of the synthesizer is used as the output of the speech synthesis model.
  • The loss function of the speech synthesis model is determined by a first loss and a second loss, where the first loss is determined by the output of the speech synthesis model and the training audio, and the second loss is determined by the output of the second encoder and the real acoustic features.
  • the loss function may be jointly determined by the first loss and the second loss, for example, the weighted summation of the first loss and the second loss may be performed.
  • The first loss can be understood as a loss function determined by the difference (which may also be the mean square error) between the output of the speech synthesis model, when the training text and the corresponding real acoustic features are input into it, and the training audio corresponding to the training text.
  • The second loss can be understood as a loss function determined by the difference (which may also be the mean square error) between the output of the second encoder, obtained by inputting the training text into the first encoder to get the corresponding training text features and then inputting those training text features into the second encoder, and the real acoustic features corresponding to the training text.
  • The weighting weights can be set in various appropriate ways, for example according to the characteristics of the output of the second encoder. In this way, during training of the speech synthesis model, the weights and connection relationships of the neurons in the speech synthesis model as a whole can be adjusted, and the weights and connection relationships of the neurons in the second encoder can also be adjusted, which ensures the accuracy and effectiveness of both the speech synthesis model and the second encoder therein.
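  • Combining the two losses by weighted summation, as described above, might look like the following sketch; the weight values are placeholders and the MSE formulation is an assumption consistent with the mean-square-error option mentioned earlier.

```python
import torch.nn.functional as F

def total_loss(model_output, train_mel_target, second_encoder_output, real_acoustic,
               w_first=1.0, w_second=0.1):
    """Weighted combination of the two losses (weights are placeholder values)."""
    first_loss = F.mse_loss(model_output, train_mel_target)         # vs. training audio
    second_loss = F.mse_loss(second_encoder_output, real_acoustic)  # vs. real acoustic features
    return w_first * first_loss + w_second * second_loss
```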
  • FIG. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in FIG. 7 , the speech synthesis model can also be obtained by training in the following manner:
  • Step C: through the training audio, extract the real Mel spectrum information corresponding to the training audio.
  • step B can be:
  • the real acoustic features, training text and real mel spectrum information are used as the input of the speech synthesis model, and the speech synthesis model is trained according to the output of the speech synthesis model and the training audio.
  • the real Mel spectrum information corresponding to the training audio may also be obtained.
  • the real Mel spectrum information corresponding to the training audio can be obtained by means of signal processing.
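  • For example, the real Mel spectrum information could be computed with librosa's mel-spectrogram routine; the use of librosa and the parameter values below are assumptions, since the disclosure only says "by means of signal processing".

```python
import librosa

def extract_real_mel(wav_path: str, n_mels: int = 80):
    y, sr = librosa.load(wav_path, sr=None)
    # real Mel spectrum information corresponding to the training audio
    return librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
```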
  • the real acoustic features, training text and real mel spectrum information can be used as the input of the speech synthesis model, and the speech synthesis model will be trained according to the output of the speech synthesis model and the training audio.
  • the training text may be first input into the first encoder to obtain training text features corresponding to the training text output by the first encoder.
  • the training text features are input into the second encoder to obtain predicted training acoustic features corresponding to the training text features output by the second encoder.
  • the training text features and the real Mel spectrum information corresponding to the training text are input into the attention network to obtain the training semantic vector corresponding to the training text output by the attention network.
  • the predicted training acoustic features, the training semantic vectors, the real acoustic features corresponding to the training text, and the real Mel spectrum information corresponding to the training text are input into the decoder to obtain the training Mel spectrum information output by the decoder.
  • the training mel spectral information is input into the post-processing network, and the output of the post-processing network is used as the output of the synthesizer (ie, the output of the speech synthesis model).
  • the present disclosure first obtains the text to be synthesized and the specified emotion type, then determines the corresponding specified acoustic feature according to the specified emotion type, and finally inputs the to-be-synthesized text and the specified acoustic feature together into a pre-trained speech synthesis model.
  • The output of the speech synthesis model is the target audio that corresponds to the text to be synthesized and has the specified emotion type, wherein the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is trained on corpus that does not have the specified emotion type.
  • the present disclosure can control the speech synthesis of text through the acoustic features corresponding to the emotion types, so that the target audio output by the speech synthesis model can correspond to the acoustic features, and the expressiveness of the target audio is improved.
  • FIG. 8 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment. As shown in FIG. 8 , the apparatus 200 includes:
  • the obtaining module 201 is used for obtaining the text to be synthesized and the specified emotion type.
  • the determining module 202 is configured to determine the specified acoustic feature corresponding to the specified emotion type.
  • The synthesis module 203 is used to input the text to be synthesized and the specified acoustic features into the pre-trained speech synthesis model, so as to obtain the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type, where the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is trained on corpus that does not have the specified emotion type.
  • the specified acoustic features include: at least one of fundamental frequency, volume, and speech rate, and the determining module 202 may be used to:
  • According to the specified emotion type and the preset association relationship between emotion types and acoustic features, the corresponding specified acoustic feature is determined.
  • speech synthesis models can be used to:
  • text features corresponding to the text to be synthesized and predicted acoustic features corresponding to the text to be synthesized are obtained from the text to be synthesized.
  • Through the specified acoustic features, the predicted acoustic features and the text features, the target audio with the specified emotion type is obtained.
  • Fig. 9 is a block diagram of another speech synthesis apparatus according to an exemplary embodiment.
  • the speech synthesis model includes: a first encoder, a second encoder and a synthesizer.
  • the synthesis module 203 may include:
  • the first processing sub-module 2031 is configured to extract text features corresponding to the text to be synthesized through the first encoder.
  • the second processing sub-module 2032 is configured to extract predicted acoustic features corresponding to the text to be synthesized through the second encoder.
  • the third processing sub-module 2033 is configured to generate target audio according to the specified acoustic features, predicted acoustic features and text features through the synthesizer.
  • the third processing sub-module 2033 can be used for:
  • Step 1) Through the synthesizer, determine the Mel spectrum feature at the current moment according to the current text element, the historical Mel spectrum feature, the specified acoustic feature and the predicted acoustic feature, where the current text element is the text element in the text feature that is input to the synthesizer at the current moment.
  • the historical Mel spectrum feature is the Mel spectrum feature at the last moment determined by the synthesizer.
  • Step 2 Through the synthesizer, the target audio is generated according to the Mel spectrum features at each moment.
  • Step A: through the training audio that corresponds to the training text and does not have the specified emotion type, extract the real acoustic features corresponding to the training audio.
  • Step B: input the real acoustic features and the training text into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • The speech synthesis model may include: a first encoder, a second encoder and a synthesizer. A blocking structure is provided between the first encoder and the second encoder, and the blocking structure is used to prevent the second encoder from returning the gradient to the first encoder.
  • step B may include:
  • Step B1: extract the training text features corresponding to the training text through the first encoder.
  • Step B2: extract the predicted training acoustic features corresponding to the training text through the second encoder.
  • Step B3: generate the output of the speech synthesis model through the synthesizer according to the real acoustic features, the predicted training acoustic features and the training text features.
  • The loss function of the speech synthesis model is determined by a first loss and a second loss, where the first loss is determined by the output of the speech synthesis model and the training audio, and the second loss is determined by the output of the second encoder and the real acoustic features.
  • the speech synthesis model can also be obtained by training in the following manner:
  • Step C: through the training audio, extract the real Mel spectrum information corresponding to the training audio.
  • step B can be:
  • the real acoustic features, training text and real mel spectrum information are used as the input of the speech synthesis model, and the speech synthesis model is trained according to the output of the speech synthesis model and the training audio.
  • The specific manner in which each module performs its operations has been described in detail in the method embodiments, and will not be described in detail here.
  • the division of the above modules does not limit the specific implementation, and the above modules may be implemented in software, hardware, or a combination of software and hardware, for example.
  • the above-mentioned modules may be implemented as independent physical entities, or may also be implemented by a single entity (eg, a processor (CPU or DSP, etc.), an integrated circuit, etc.).
  • Although the respective modules are shown as separate modules in the figures, one or more of these modules may also be combined into one module or split into multiple modules.
  • The above-mentioned accent word determination module and speech synthesis model determination module are shown with dotted lines in the drawings to indicate that these modules do not have to be included in the speech synthesis apparatus: they can be implemented outside the speech synthesis apparatus, or implemented by another device outside the speech synthesis apparatus, which then informs the speech synthesis apparatus of the result. Alternatively, the dotted lines indicate that these modules may not actually exist, and the operations/functions they implement can be implemented by the speech synthesis apparatus itself.
  • the present disclosure first obtains the text to be synthesized and the specified emotion type, then determines the corresponding specified acoustic feature according to the specified emotion type, and finally inputs the to-be-synthesized text and the specified acoustic feature together into a pre-trained speech synthesis model.
  • The output of the speech synthesis model is the target audio that corresponds to the text to be synthesized and has the specified emotion type, wherein the acoustic features of the target audio match the specified acoustic features, and the speech synthesis model is trained on corpus that does not have the specified emotion type.
  • the present disclosure can control the speech synthesis of text through the acoustic features corresponding to the emotion types, so that the target audio output by the speech synthesis model can correspond to the acoustic features, and the expressiveness of the target audio is improved.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals) and the like, and stationary terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 10 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • An electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303.
  • In the RAM 303, various programs and data required for the operation of the electronic device 300 are also stored.
  • the processing device 301, the ROM 302, and the RAM 303 are connected to each other through a bus 304.
  • An input/output (I/O) interface 305 is also connected to the bus 304.
  • The following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 308 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While Fig. 10 shows the electronic device 300 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 309, or from the storage device 308, or from the ROM 302.
  • the processing device 301 When the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • Terminal devices and servers can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire the text to be synthesized and the specified emotion type; determine the specified acoustic feature corresponding to the specified emotion type; and input the text to be synthesized and the specified acoustic feature into a pre-trained speech synthesis model, to obtain the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type, where the acoustic features of the target audio match the specified acoustic feature, and the speech synthesis model is trained on corpus without the specified emotion type.
  • Computer program code for performing operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., connected via the Internet using an Internet service provider).
  • Each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or can be implemented by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware.
  • the name of the module does not constitute a limitation of the module itself under certain circumstances, for example, the acquisition module can also be described as "a module for acquiring the text to be synthesized and specifying the emotion type".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Exemplary Embodiment 1 provides a speech synthesis method, including: acquiring text to be synthesized and a specified emotion type; determining a specified acoustic feature corresponding to the specified emotion type; and inputting the text to be synthesized and the specified acoustic feature into a pre-trained speech synthesis model to obtain target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type, where the acoustic features of the target audio match the specified acoustic feature, and the speech synthesis model is obtained by training on corpus without the specified emotion type.
  • Exemplary Embodiment 2 provides the method of Exemplary Embodiment 1, wherein the specified acoustic feature includes at least one of fundamental frequency, volume, and speech rate, and determining the specified acoustic feature corresponding to the specified emotion type includes: determining the corresponding specified acoustic feature according to the specified emotion type and a preset association relationship between emotion types and acoustic features.
  • Exemplary Embodiment 3 provides the method of Exemplary Embodiment 1 or Exemplary Embodiment 2, wherein the speech synthesis model is used for: obtaining, from the text to be synthesized, the text feature corresponding to the text to be synthesized and the predicted acoustic feature corresponding to the text to be synthesized; and obtaining, through the specified acoustic feature, the predicted acoustic feature and the text feature, the target audio having the specified emotion type.
  • Exemplary Embodiment 4 provides the method of Exemplary Embodiment 3, wherein the speech synthesis model includes: a first encoder, a second encoder and a synthesizer; and inputting the text to be synthesized and the specified acoustic feature into a pre-trained speech synthesis model to obtain the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type includes: extracting, by the first encoder, the text feature corresponding to the text to be synthesized; extracting, by the second encoder, the predicted acoustic feature corresponding to the text to be synthesized; and generating, by the synthesizer, the target audio according to the specified acoustic feature, the predicted acoustic feature and the text feature.
  • Exemplary Embodiment 5 provides the method of Exemplary Embodiment 4, wherein the text feature includes a plurality of text elements, and generating, by the synthesizer, the target audio according to the specified acoustic feature, the predicted acoustic feature and the text feature includes: determining, by the synthesizer, the Mel spectrum feature at the current moment according to the current text element, the historical Mel spectrum feature, the specified acoustic feature and the predicted acoustic feature, where the current text element is the text element in the text feature that is input to the synthesizer at the current moment, and the historical Mel spectrum feature is the Mel spectrum feature at the previous moment determined by the synthesizer; and generating, by the synthesizer, the target audio according to the Mel spectrum feature at each moment.
  • Exemplary Embodiment 6 provides the method of Exemplary Embodiment 3, wherein the speech synthesis model is obtained by training in the following manner: extracting, through training audio that corresponds to training text and does not have the specified emotion type, the real acoustic features corresponding to the training audio; and inputting the real acoustic features and the training text into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • Exemplary Embodiment 7 provides the method of Exemplary Embodiment 6, wherein the speech synthesis model includes: a first encoder, a second encoder and a synthesizer, a blocking structure is arranged between the first encoder and the second encoder, and the blocking structure is used to prevent the second encoder from returning the gradient to the first encoder; and inputting the real acoustic features and the training text into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, includes: extracting, by the first encoder, the training text feature corresponding to the training text; extracting, by the second encoder, the predicted training acoustic feature corresponding to the training text; and generating, by the synthesizer, the output of the speech synthesis model according to the real acoustic feature, the predicted training acoustic feature and the training text feature.
  • Exemplary Embodiment 8 provides the method of Exemplary Embodiment 6, wherein the loss function of the speech synthesis model is determined by a first loss and a second loss, the first loss is determined by the output of the speech synthesis model and the training audio, and the second loss is determined by the output of the second encoder and the real acoustic features.
  • Exemplary Embodiment 9 provides the method of Exemplary Embodiment 6, wherein the speech synthesis model is further obtained by training in the following manner: extracting, through the training audio, the real Mel spectrum information corresponding to the training audio; and inputting the real acoustic features and the training text into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio, includes: taking the real acoustic features, the training text and the real Mel spectrum information as the input of the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
  • Exemplary Embodiment 10 provides a speech synthesis apparatus, including: an acquisition module, configured to acquire text to be synthesized and a specified emotion type; a determination module, configured to determine a specified acoustic feature corresponding to the specified emotion type; and a synthesis module, configured to input the text to be synthesized and the specified acoustic feature into a pre-trained speech synthesis model to obtain target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and has the specified emotion type, where the acoustic features of the target audio match the specified acoustic feature, and the speech synthesis model is obtained by training on corpus without the specified emotion type.
  • Exemplary Embodiment 11 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the method described in any one of Exemplary Embodiments 1 to 9.
  • Exemplary Embodiment 12 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the method described in any one of Exemplary Embodiments 1 to 9.

Abstract

A speech synthesis method and apparatus, a readable medium and an electronic device, relating to the technical field of electronic information processing. The method includes: acquiring text to be synthesized and a specified emotion type (101); determining a specified acoustic feature corresponding to the specified emotion type (102); and inputting the text to be synthesized and the specified acoustic feature into a pre-trained speech synthesis model to obtain, as output of the speech synthesis model, target audio that corresponds to the text to be synthesized and has the specified emotion type (103). The acoustic feature of the target audio matches the specified acoustic feature, and the speech synthesis model is trained on a corpus that does not have the specified emotion type. The method can control the speech synthesis of text through the acoustic feature corresponding to the emotion type, so that the target audio output by the speech synthesis model corresponds to that acoustic feature, improving the expressiveness of the target audio.

Description

Speech synthesis method and apparatus, readable medium and electronic device
Cross-Reference to Related Applications
This application is based on, and claims priority to, the Chinese application with application number 202011315115.1 filed on November 20, 2020, the disclosure of which is incorporated into this application by reference in its entirety.
Technical Field
The present disclosure relates to the technical field of electronic information processing, and in particular to a speech synthesis method and apparatus, a readable medium and an electronic device.
Background
With the continuous development of electronic information processing technology, speech, as an important carrier for people to obtain information, has been widely used in daily life and work. Application scenarios involving speech usually include speech synthesis processing, which refers to synthesizing text specified by a user into audio. During speech synthesis, an original sound library is needed to generate the audio corresponding to the text. The data in the original sound library usually carries no emotion, and accordingly the audio obtained by speech synthesis carries no emotion either, so its expressiveness is weak. To make the synthesized audio emotional, an emotional sound library would have to be created, which means a heavy workload and low efficiency for the recording staff and is difficult to achieve.
Summary
This summary is provided to introduce concepts in a brief form; these concepts are described in detail in the detailed description that follows. This summary is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a speech synthesis method, the method including:
acquiring text to be synthesized and a specified emotion type;
determining a specified acoustic feature corresponding to the specified emotion type; and
inputting the text to be synthesized and the specified acoustic feature into a pre-trained speech synthesis model to obtain, as output of the speech synthesis model, target audio that corresponds to the text to be synthesized and has the specified emotion type, where the acoustic feature of the target audio matches the specified acoustic feature, and the speech synthesis model is trained on a corpus that does not have the specified emotion type.
第二方面,本公开提供一种语音合成装置,所述装置包括:
获取模块,用于获取待合成文本和指定情感类型;
确定模块,用于确定所述指定情感类型对应的指定声学特征;
合成模块,用于将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频,所述目标音频的声学特征与所述指定声学特征匹配,所述语音合成模型为根据不具有所述指定情感类型的语料训练得到的。
第三方面,本公开提供一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开第一方面所述方法的步骤。
第四方面,本公开提供一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,用于执行所述存储装置中的所述计算机程序,以实现本公开第一方面所述方法的步骤。
第五方面,本公开提供一种计算机程序产品,包括指令,所述指令在由计算机执行时使得计算机实现第一方面中所述方法的步骤。
通过上述技术方案,本公开首先获取待合成文本和指定情感类型,之后根据指定情感类型,确定对应的指定声学特征,最后将待合成文本和指定声学特征一起输入到预先训练好的语音合成模型中,语音合成模型输出的即为待合成文本对应的,具有指定情感类型的目标音频,其中,目标音频的声学特征与指定声学特征匹配,并且语音合成模型为根据不具有指定情感类型的语料训练得到的。本公开能够通过情感类型对应的声学特征来控制对文本的语音合成,使得语音合成模型输出的目标音频能够与声学特征对应,提高了目标音频的表现力。
本公开的其他特征和优点将在随后的具体实施方式部分予以详细说明。
附图说明
结合附图并参考以下具体实施方式,本公开各实施例的上述和其他特征、优点及方面将变得更加明显。贯穿附图中,相同或相似的附图标记表示相同或相似的元素。应当理解附图是示意性的,原件和元素不一定按照比例绘制。在附图中:
图1是根据一示例性实施例示出的一种语音合成方法的流程图;
图2是根据一示例性实施例示出的一种关联关系的示意图;
图3是根据一示例性实施例示出的一种语音合成模型的框图;
图4是根据一示例性实施例示出的另一种语音合成方法的流程图;
图5是根据一示例性实施例示出的一种训练语音合成模型的流程图;
图6是根据一示例性实施例示出的另一种训练语音合成模型的流程图;
图7是根据一示例性实施例示出的另一种训练语音合成模型的流程图;
图8是根据一示例性实施例示出的一种语音合成装置的框图;
图9是根据一示例性实施例示出的另一种语音合成装置的框图;
图10是根据一示例性实施例示出的一种电子设备的框图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的,本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“一个或多个”。
本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的,而并不是用于对这些消息或信息的范围进行限制。
图1是根据一示例性实施例示出的一种语音合成方法的流程图,如图1所示,该方法包括:
步骤101,获取待合成文本和指定情感类型。
举例来说,首先获取需要进行合成的待合成文本。待合成文本例如可以是用户指定的文本文件中的一个或多个语句,也可以是文本文件中的一个或多个段落,还可以是一个文本文件中的一个或多个章节。文本文件例如可以是一部电子书,也可以是其他类型的文件,例如新闻、公众号文章、博客等。此外,还可以获取指定情感类型,指定情感类型可以理 解为用户指定的,期望将待合成文本合成为符合指定情感类型的音频(即后文提及的目标音频)。指定情感类型例如可以为:开心、惊讶、憎恶、生气、害羞、恐惧、悲伤、不屑等。
步骤102,确定指定情感类型对应的指定声学特征。
示例的,人在不同情感状态下发出的声音,会有不同的声学特征,因此可以根据指定情感类型,确定符合指定情感类型的指定声学特征。其中,声学特征可以理解为声音在多个维度的属性,例如可以包括:音量(即energy)、基频(即pitch)、语速(即duration)等。例如,可以根据情感类型和声学特征之间对应的关系,确定指定情感类型对应的指定声学特征,该情感类型和声学特征之间对应的关系可预先建立,例如可根据历史统计数据来建立。还可以预先训练一个可以根据情感类型识别声学特征的识别模型,从而将指定情感类型输入该识别模型,该识别模型的输出即为指定声学特征。识别模型例如可以是RNN(英文:Recurrent Neural Network,中文:循环神经网络)、CNN(英文:Convolutional Neural Networks,中文:卷积神经网络)、LSTM(英文:Long Short-Term Memory,中文:长短期记忆网络)等神经网络,本公开对此不作具体限定。
步骤103,将待合成文本和指定声学特征输入预先训练的语音合成模型,以获取语音合成模型输出的,待合成文本对应的具有指定情感类型的目标音频,目标音频的声学特征与指定声学特征匹配,并且语音合成模型为根据不具有指定情感类型的语料训练得到的。
示例的,可以预先训练一个语音合成模型,语音合成模型可以理解成一种TTS(英文:Text To Speech,中文:从文本到语音)模型,能够根据待合成文本和指定声学特征,生成待合成文本对应的,具有指定情感类型(即与指定声学特征匹配)的目标音频。将待合成文本和指定声学特征作为语音合成模型的输入,语音合成模型的输出即为目标音频。具体的,语音合成模型可以是基于Tacotron模型、Deepvoice 3模型、Tacotron 2模型、Wavenet模型等训练得到的,本公开对此不作具体限定。其中,在对语音合成模型进行训练的过程中,不需要具有指定情感类型的语料(可以理解为语音库),可以直接利用现有的、不具有指定情感类型的语料进行训练得到。这样,在对待合成文本中进行语音合成的过程中,除了待合成文本中包括的语义,还考虑了指定情感类型对应的声学特征,能够使目标音频具有指定情感类型。可以利用现有的,不具有指定情感类型的语料,就能实现在语音合成的过程中,情感类型的显式控制,而无需花费大量的时间成本和人力成本预先创建具有情感的语料,提高了目标音频的表现力,同时也改善了用户的听觉体验。
综上所述,本公开首先获取待合成文本和指定情感类型,之后根据指定情感类型,确定对应的指定声学特征,最后将待合成文本和指定声学特征一起输入到预先训练好的语音 合成模型中,语音合成模型输出的即为待合成文本对应的,具有指定情感类型的目标音频,其中,目标音频的声学特征与指定声学特征匹配,并且语音合成模型为根据不具有指定情感类型的语料训练得到的。本公开能够通过情感类型对应的声学特征来控制对文本的语音合成,使得语音合成模型输出的目标音频能够与声学特征对应,提高了目标音频的表现力。
在一些实施例中,指定声学特征包括:基频、音量、语速中的至少一种。步骤102可以通过以下方式来实现:
根据指定情感类型,和情感类型与声学特征的关联关系,确定对应的指定声学特征。
情感类型与声学特征的关联关系可通过各种适当的方式被确定。作为示例的,可以先获取符合某种情感类型的音频,再利用信号处理、标注等处理方式确定这些音频中的声学特征,从而得到该种情感类型对应的声学特征。对多种情感类型重复执行上述步骤,即可得到情感类型与声学特征的关联关系。其中,声学特征可以包括:基频、音量、语速中的至少一种,还可以包括:音调、音色、响度等,本公开对此不作具体限定。关联关系例如可以如图2所示,从基频、音量、语速三个维度来表示情感类型,其中图2中的(a)所示的是在音量较低(即Low Energy)的场景中,对应的四种情感类型:害羞、恐惧、悲伤、不屑,图2中的(b)所示的是在音量较高(即High Energy)的场景中,对应的四种情感类型:惊讶、开心、生气、憎恶。进一步的,还可以将关联关系数值化,例如,图2中的(a)示出害羞位于音量较低的第二象限,可以将害羞对应的声学特征确定为(音量:-2,基频:+3,语速:-3)。
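
The association between emotion types and acoustic features described above can be illustrated with a minimal Python sketch such as the one below. Only the "shy" offsets (energy: -2, pitch: +3, speaking rate: -3) come from the example in the preceding paragraph; the remaining entries and the function name are illustrative assumptions rather than part of the disclosed method.

```python
# Minimal sketch of an emotion-type -> acoustic-feature association table.
# Only the "shy" entry (pitch +3, energy -2, speaking rate -3) is taken from the
# example above; every other offset is an assumed placeholder.

EMOTION_TO_ACOUSTIC = {
    #            pitch  energy  duration (relative offsets)
    "shy":       (+3,    -2,     -3),
    "fear":      (+1,    -3,     -1),   # assumed
    "sad":       (-2,    -2,     -2),   # assumed
    "disdain":   (-1,    -1,     -1),   # assumed
    "surprise":  (+3,    +2,     +1),   # assumed
    "happy":     (+2,    +2,     +1),   # assumed
    "angry":     (+2,    +3,     +2),   # assumed
    "disgust":   (-1,    +2,     +1),   # assumed
}

def get_specified_acoustic_feature(emotion_type: str):
    """Return (pitch, energy, duration) offsets for the specified emotion type."""
    if emotion_type not in EMOTION_TO_ACOUSTIC:
        raise KeyError(f"no acoustic feature registered for emotion '{emotion_type}'")
    return EMOTION_TO_ACOUSTIC[emotion_type]

print(get_specified_acoustic_feature("shy"))   # (3, -2, -3)
```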
在一些实施例中,可以通过语音合成模型按照以下操作来获得目标音频:
首先,通过待合成文本获得待合成文本对应的文本特征,和待合成文本对应的预测声学特征。
之后,通过指定声学特征、预测声学特征和文本特征,获得具有指定情感类型的目标音频。
示例的,通过语音合成模型合成目标音频的具体过程,可以先提取待合成文本对应的文本特征,并预测待合成文本对应的声学特征。其中,文本特征可以理解为能够表征待合成文本的文本向量。预测声学特征可以理解为语音合成模型根据待合成文本,预测出的符合待合成文本的声学特征,预测声学特征可以包括:基频、音量、语速中的至少一种,还可以包括:音调、音色、响度等。
在获得文本特征和预测声学特征后,可以再结合指定声学特征,生成具有指定情感类型的目标音频。一种实现方式,可以将指定声学特征与预测声学特征进行叠加,得到一个声学特征向量,然后根据声学特征向量与文本向量生成目标音频。另一种实现方式,还可 以将指定声学特征、预测声学特征和文本向量进行叠加,得到一个组合向量,然后根据组合向量生成目标音频,本公开对此不作具体限定。
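
As a rough illustration of the two combination strategies just described (superposing the specified and predicted acoustic features into one acoustic feature vector, or superposing all three inputs into a combined vector), a PyTorch-style sketch might look as follows; the tensor shapes and the use of concatenation are assumptions.

```python
# Sketch of the two ways of combining features described above (assumed shapes).
import torch

T, D_text, D_acoustic = 50, 256, 3            # text length / feature sizes (assumed)
text_feature       = torch.randn(T, D_text)   # from the first encoder
predicted_acoustic = torch.randn(D_acoustic)  # from the second encoder
specified_acoustic = torch.tensor([3.0, -2.0, -3.0])  # e.g. the "shy" offsets

# Option 1: superpose the specified and predicted acoustic features into one
# acoustic feature vector, then pair it with the text feature.
acoustic_vector = specified_acoustic + predicted_acoustic             # (D_acoustic,)
decoder_input_1 = torch.cat(
    [text_feature, acoustic_vector.expand(T, -1)], dim=-1)            # (T, D_text + D_acoustic)

# Option 2: superpose all three inputs into a single combined vector per position.
decoder_input_2 = torch.cat(
    [text_feature,
     predicted_acoustic.expand(T, -1),
     specified_acoustic.expand(T, -1)], dim=-1)                       # (T, D_text + 2 * D_acoustic)

print(decoder_input_1.shape, decoder_input_2.shape)
```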
图3是根据一示例性实施例示出的一种语音合成模型的框图,如图3所示,语音合成模型包括:第一编码器、第二编码器和合成器。其中,第一编码器的结构,可以和Tacotron模型中的编码器(即Encoder)的结构相同,合成器可以理解为Tacotron模型中的注意力网络(即Attention)、解码器(即Decoder)和后处理网络(即Post-processing)的组合。第二编码器(可以表示为Feature Extractor)可以理解为一个提取模型,能够根据输入的文本,预测该文本对应的声学特征(即后文提及的预测声学特征)。
图4是根据一示例性实施例示出的另一种语音合成方法的流程图,如图4所示,步骤103可以包括:
步骤1031,通过第一编码器,提取待合成文本对应的文本特征。
举例来说,第一编码器可以包括嵌入层(即Character Embedding层)、预处理网络(Pre-net)子模型和CBHG(英文:Convolution Bank+Highway network+bidirectional Gated Recurrent Unit,中文:卷积层+高速网络+双向递归神经网络)子模型。将待合成文本输入第一编码器,首先,通过嵌入层将待合成文本转换为词向量,然后将词向量输入至Pre-net子模型,以对词向量进行非线性变换,从而提升语音合成模型的收敛和泛化能力,最后,通过CBHG子模型根据非线性变换后的词向量,获得能够表征待合成文本的文本特征。
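
A simplified sketch of the first encoder described above (character embedding, pre-net, CBHG) is given below. For brevity the CBHG sub-model is replaced by a single bidirectional GRU, and all layer sizes are assumptions, so this is not the exact structure used in the disclosure.

```python
# Simplified sketch of the first encoder (embedding -> pre-net -> CBHG).
# The full CBHG (conv bank + highway network + bidirectional GRU) is abbreviated
# to a single bidirectional GRU; all sizes are assumed.
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=256, prenet_dim=128, out_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)        # character embedding
        self.prenet = nn.Sequential(                               # non-linear pre-net
            nn.Linear(emb_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        # Stand-in for the CBHG sub-model: a bidirectional GRU producing the text feature.
        self.cbhg = nn.GRU(prenet_dim, out_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                   # char_ids: (batch, T)
        x = self.prenet(self.embedding(char_ids))  # (batch, T, prenet_dim)
        text_feature, _ = self.cbhg(x)             # (batch, T, out_dim)
        return text_feature

encoder = FirstEncoder()
print(encoder(torch.randint(0, 100, (1, 20))).shape)  # torch.Size([1, 20, 256])
```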
步骤1032,通过第二编码器,提取待合成文本对应的预测声学特征。
示例的,可以将步骤1031中确定的文本特征输入到第二编码器,以使第二编码器根据文本向量预测待合成文本对应的预测声学特征。第二编码器例如可以是一个3层,256unit,8head的Transformer。
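
Based on the description above, a minimal sketch of the second encoder as a 3-layer, 256-unit, 8-head Transformer that predicts acoustic features from the text feature could look as follows; the mean pooling and the linear output head are assumptions.

```python
# Minimal sketch of the second encoder: a 3-layer, 256-unit, 8-head Transformer
# that predicts acoustic features (pitch, energy, duration) from the text feature.
import torch
import torch.nn as nn

class SecondEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=3, n_acoustic=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, n_acoustic)   # assumed pooling + projection head

    def forward(self, text_feature):                 # (batch, T, d_model)
        h = self.transformer(text_feature)           # (batch, T, d_model)
        return self.head(h.mean(dim=1))              # (batch, n_acoustic) predicted acoustic feature

model = SecondEncoder()
print(model(torch.randn(2, 20, 256)).shape)          # torch.Size([2, 3])
```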
步骤1033,通过合成器,根据指定声学特征、预测声学特征和文本特征,生成目标音频。
具体的,合成器可以包括注意力网络、解码器和后处理网络。可以先将文本特征输入注意力网络,注意力网络可以为文本向量中的每个元素增加一个注意力权重,从而使得长度固定的文本特征,变为长度可变的语义向量,其中语义向量能够表征待合成文本。具体的,注意力网络可以为位置敏感注意力(英文:Locative Sensitive Attention)网络,也可以为GMM(英文:Gaussian Mixture Model,缩写GMM)attention网络,还可以是Multi-Head Attention网络,本公开对此不作具体限定。
进一步的,可以将指定声学特征、预测声学特征和语义向量输入解码器,一种实现方式,可以将指定声学特征与预测声学特征进行叠加,得到一个声学特征向量,然后将声学 特征向量与语义向量作为解码器的输入。另一种实现方式,还可以将指定声学特征、预测声学特征和语义向量进行叠加,得到一个组合向量,然后将组合向量作为解码器的输入。解码器可以包括预处理网络子模型(可以与第一编码器中包括的预处理网络子模型的相同)、Attention-RNN、Decoder-RNN。预处理网络子模型用于对输入的指定声学特征、预测声学特征和语义向量进行非线性变换,Attention-RNN的结构为一层单向的、基于zoneout的LSTM(英文:Long Short-Term Memory,中文:长短期记忆网络),能够将预处理网络子模型的输出作为输入,经过LSTM单元后输出到Decoder-RNN中。Decode-RNN为两层单向的、基于zoneout的LSTM,经过LSTM单元输出梅尔频谱信息,梅尔频谱信息中可以包括一个或多个梅尔频谱特征。最后将梅尔频谱信息输入后处理网络,后处理网络可以包括声码器(例如,Wavenet声码器、Griffin-Lim声码器等),用于对梅尔频谱特征信息进行转换,以得到目标音频。
在一些实施例中,文本特征中可以包括多个文本元素,步骤1033的实现方式可以包括:
步骤1)通过合成器,根据当前文本元素、历史梅尔频谱特征、指定声学特征和预测声学特征,确定当前时刻的梅尔频谱特征,当前文本元素为文本特征中当前时刻输入到合成器的文本元素,历史梅尔频谱特征为合成器确定的上一时刻的梅尔频谱特征。
步骤2)通过合成器,根据每个时刻的梅尔频谱特征,生成目标音频。
举例来说,文本特征中可以包括第一数量个文本元素(第一数量大于1),那么相应的,合成器中的注意力网络输出的语义向量中可以包括第二数量个语义元素,合成器中的解码器输出的梅尔频谱信息可以包括第三数量个梅尔频谱特征。其中,第一数量、第二数量和第三数量可以相同,也可以不同,本公开对此不作具体限定。
具体的,第一数量个文本元素按照预设的timestep(时间步)输入合成器中的注意力网络,当前时刻输入注意力网络的文本元素为当前文本元素,同时还会将上一时刻解码器输出的历史梅尔频谱特征一起输入注意力网络,从而获得注意力网络输出的当前语义元素(当前语义元素可以为当前时刻注意力网络输出的一个或多个语义元素)。相应的,可以将指定声学特征、预测声学特征、历史梅尔频谱特征和当前语义元素,输入合成器中的解码器,以获取解码器输出的当前梅尔频谱特征。在文本特征全部输入注意力网络之后,解码器将会依次输出第三数量个梅尔频谱特征,即梅尔频谱信息。最后,将梅尔频谱信息(即每个时刻的梅尔频谱特征)输入到合成器中的后处理网络,从而获得后处理网络生成的目标音频。
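
The autoregressive synthesis loop described above can be sketched schematically as follows. The dot-product attention, the single GRU cell and all dimensions are simplifying assumptions (the disclosure describes location-sensitive, GMM or multi-head attention and zoneout-based LSTMs); the sketch only illustrates how the previous mel frame, the specified acoustic feature and the predicted acoustic feature feed into each decoding step before a vocoder turns the mel frames into audio.

```python
# Schematic sketch of the autoregressive synthesis loop (simplified assumptions).
import torch
import torch.nn as nn

D_TEXT, D_MEL, D_ACOUSTIC, D_HID = 256, 80, 3, 256

class Synthesizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.query = nn.Linear(D_MEL, D_TEXT)                       # previous mel -> attention query
        self.decoder_cell = nn.GRUCell(D_TEXT + D_MEL + 2 * D_ACOUSTIC, D_HID)
        self.mel_out = nn.Linear(D_HID, D_MEL)

    def forward(self, text_feature, specified, predicted, n_frames=100):
        # text_feature: (T, D_TEXT); specified/predicted: (D_ACOUSTIC,)
        prev_mel = torch.zeros(D_MEL)
        state = torch.zeros(1, D_HID)
        mel_frames = []
        for _ in range(n_frames):
            # attention over the text feature, conditioned on the previous mel frame
            scores = text_feature @ self.query(prev_mel)            # (T,)
            context = torch.softmax(scores, dim=0) @ text_feature   # current semantic element
            step_in = torch.cat([context, prev_mel, specified, predicted]).unsqueeze(0)
            state = self.decoder_cell(step_in, state)
            prev_mel = self.mel_out(state).squeeze(0)               # current mel-spectrum feature
            mel_frames.append(prev_mel)
        return torch.stack(mel_frames)                              # (n_frames, D_MEL) -> vocoder

synth = Synthesizer()
mel = synth(torch.randn(20, D_TEXT), torch.tensor([3., -2., -3.]), torch.randn(D_ACOUSTIC))
print(mel.shape)  # torch.Size([100, 80])
```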
图5是根据一示例性实施例示出的一种语音合成模型训练过程的流程图。该语音合成 模型训练过程可被包含在根据本公开的语音合成方法中,也可在根据本公开的语音合成方法之外,而可应用于根据本公开的语音合成方法。如图5所示,语音合成模型是通过如下方式训练获得的:
步骤A,通过训练文本对应的不具有指定情感类型的训练音频,提取训练音频对应的真实声学特征。
步骤B,将真实声学特征与训练文本输入语音合成模型,并根据语音合成模型的输出与训练音频,训练语音合成模型。
举例来说,对语音合成模型进行训练,需要先获取训练文本和训练文本对应的训练音频,训练文本可以有多个,相应的,训练音频也有多个。例如可以通过在互联网上抓取大量的文本作为训练文本,然后将训练文本对应的音频,作为训练音频,训练音频可以不具有任何情感类型。针对训练文本,可以提取不具有指定情感类型的训练音频对应的真实声学特征。例如,可以通过信号处理、标注等方式,得到训练音频对应的真实声学特征。最后,将训练文本和真实声学特征,作为语音合成模型的输入,并根据语音合成模型的输出与训练音频,训练语音合成模型。例如,可以根据语音合成模型的输出,与训练音频的差作为语音合成模型的损失函数,以降低损失函数为目标,利用反向传播算法来修正语音合成模型中的神经元的参数,神经元的参数例如可以是神经元的权重(英文:Weight)和偏置量(英文:Bias)。重复上述步骤,直至损失函数满足预设条件,例如损失函数小于预设的损失阈值。
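
A schematic training loop for the procedure above might look like the sketch below; the model interface, the mel-spectrogram target and the L1 loss are assumptions used only for illustration.

```python
# Schematic training loop for the procedure above. The speech_synthesis_model
# interface, the mel-spectrogram targets and the L1 loss are assumptions.
import torch

def train(speech_synthesis_model, dataset, num_epochs=10, loss_threshold=0.01):
    optimizer = torch.optim.Adam(speech_synthesis_model.parameters(), lr=1e-3)
    for epoch in range(num_epochs):
        for training_text, training_audio_mel, real_acoustic in dataset:
            # forward pass: real acoustic features + training text -> model output (mel)
            output_mel = speech_synthesis_model(training_text, real_acoustic)
            loss = torch.nn.functional.l1_loss(output_mel, training_audio_mel)
            optimizer.zero_grad()
            loss.backward()      # back-propagation adjusts neuron weights and biases
            optimizer.step()
        if loss.item() < loss_threshold:   # stop once the loss meets the preset condition
            break
```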
在一些实施例中,语音合成模型可以包括:第一编码器、第二编码器和合成器,第一编码器和第二编码器之间设置有阻止结构,阻止结构用于阻止第二编码器将梯度回传至第一编码器。
其中,阻止结构可以理解为stop_gradient(),可以截断第二编码器的第二损失,从而阻止第二编码器将梯度回传至第一编码器,也就是说,在第二编码器根据第二损失进行调整时,不会影响到第一编码器,从而避免了语音合成模型训练不稳定的问题。
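
In PyTorch terms, the blocking structure can be sketched with detach(), the analogue of stop_gradient(): the second encoder reads a detached copy of the text feature, so gradients of its loss cannot flow back into the first encoder. The sketch below is illustrative only, not the exact implementation.

```python
# Sketch of the blocking structure: the second encoder consumes a detached copy of
# the text feature (the PyTorch analogue of stop_gradient()), so gradients of the
# second loss cannot flow back into the first encoder.
import torch

def forward_with_blocking(first_encoder, second_encoder, synthesizer,
                          char_ids, real_acoustic):
    text_feature = first_encoder(char_ids)
    # blocking structure: the gradient is cut between the two encoders
    predicted_acoustic = second_encoder(text_feature.detach())
    output = synthesizer(text_feature, real_acoustic, predicted_acoustic)
    return output, predicted_acoustic
```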
图6是根据一示例性实施例示出的另一种训练语音合成模型的流程图,如图6所示,步骤B的实现方式可以包括:
步骤B1,通过第一编码器提取训练文本对应的训练文本特征。
步骤B2,通过第二编码器提取训练文本对应的预测训练声学特征。
步骤B3,通过合成器,根据真实声学特征、预测训练声学特征和训练文本特征,生成语音合成模型的输出。
举例来说,可以将训练文本输入第一编码器,以获取第一编码器输出的训练文本对应 的训练文本特征。之后,将训练文本特征输入第二编码器,以获取第二编码器输出的训练文本特征对应的预测训练声学特征。再将真实声学特征、预测训练声学特征和训练文本特征,输入合成器,以将合成器的输出作为语音合成模型的输出。
在一些实施例中,语音合成模型的损失函数由第一损失和第二损失确定,第一损失由语音合成模型的输出,与训练音频确定,第二损失由第二编码器的输出,与真实声学特征确定。
示例的,损失函数可以由第一损失和第二损失共同确定的,例如可以是第一损失和第二损失进行加权求和。其中,第一损失可以理解为,将训练文本和对应的真实声学特征输入语音合成模型,根据语音合成模型的输出,与训练文本对应的训练音频的差值(也可以是均方误差)来确定的损失函数。第二损失可以理解为,将训练文本输入第一编码器,得到对应的训练文本特征,再将训练文本特征输入第二编码器,根据第二编码器的输出,与训练文本对应的真实声学特征的差值(也可以是均方误差)来确定的损失函数。加权权重可以采用各种适当的方式被设定,例如可以在第二编码器的输出的特性而设定的,这样,在对语音合成模型进行训练的过程中,既可以从整体上调整语音合成模型中神经元的权重和连接关系,同时还可以对第二编码器中的神经元的权重和连接关系进行调整,保证了语音合成模型和其中第二编码器的准确度和有效性。
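
The weighted combination of the first loss (model output vs. training audio) and the second loss (second-encoder output vs. real acoustic features) described above might be computed as in the sketch below; the L1 losses and the 1.0/0.1 weights are assumptions.

```python
# Sketch of the overall loss: a weighted sum of the first loss (model output vs.
# training audio) and the second loss (second-encoder output vs. real acoustic
# features). The L1 losses and the 1.0 / 0.1 weights are assumptions.
import torch.nn.functional as F

def total_loss(output_mel, training_audio_mel,
               predicted_acoustic, real_acoustic,
               w1=1.0, w2=0.1):
    first_loss = F.l1_loss(output_mel, training_audio_mel)
    second_loss = F.l1_loss(predicted_acoustic, real_acoustic)
    return w1 * first_loss + w2 * second_loss
```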
图7是根据一示例性实施例示出的另一种训练语音合成模型的流程图,如图7所示,语音合成模型还可以通过如下方式训练获得:
步骤C,通过训练音频,提取训练音频对应的真实梅尔频谱信息。
相应的,步骤B可以为:
将真实声学特征、训练文本和真实梅尔频谱信息,作为语音合成模型的输入,并根据语音合成模型的输出与训练音频,训练语音合成模型。
示例的,在训练语音合成模型的过程中,还可以获取训练音频对应的真实梅尔频谱信息。例如,可以通过信号处理的方式,得到训练音频对应的真实梅尔频谱信息。相应的,可以将真实声学特征、训练文本和真实梅尔频谱信息,作为语音合成模型的输入,将并根据语音合成模型的输出与训练音频,训练语音合成模型。
具体的,可以先将训练文本输入第一编码器,以获取第一编码器输出的训练文本对应的训练文本特征。之后,将训练文本特征输入第二编码器,以获取第二编码器输出的训练文本特征对应的预测训练声学特征。然后将训练文本特征,和训练文本对应的真实梅尔频谱信息,输入注意力网络,以获取注意力网络输出的训练文本对应的训练语义向量。再将预测训练声学特征、训练语义向量、训练文本对应的真实声学特征,和训练文本对应的真 实梅尔频谱信息,输入解码器,以获取解码器输出的训练梅尔频谱信息。最后,将训练梅尔频谱信息输入后处理网络,将后处理网络的输出作为合成器的输出(即语音合成模型的输出)。
综上所述,本公开首先获取待合成文本和指定情感类型,之后根据指定情感类型,确定对应的指定声学特征,最后将待合成文本和指定声学特征一起输入到预先训练好的语音合成模型中,语音合成模型输出的即为待合成文本对应的,具有指定情感类型的目标音频,其中,目标音频的声学特征与指定声学特征匹配,并且语音合成模型为根据不具有指定情感类型的语料训练得到的。本公开能够通过情感类型对应的声学特征来控制对文本的语音合成,使得语音合成模型输出的目标音频能够与声学特征对应,提高了目标音频的表现力。
图8是根据一示例性实施例示出的一种语音合成装置的框图,如图8所示,该装置200包括:
获取模块201,用于获取待合成文本和指定情感类型。
确定模块202,用于确定指定情感类型对应的指定声学特征。
合成模块203,用于将待合成文本和指定声学特征输入预先训练的语音合成模型,以获取语音合成模型输出的,待合成文本对应的具有指定情感类型的目标音频,目标音频的声学特征与指定声学特征匹配,语音合成模型为根据不具有指定情感类型的语料训练得到的。
在一些实施例中,指定声学特征包括:基频、音量、语速中的至少一种,确定模块202可以用于:
根据指定情感类型,和预设的情感类型与声学特征的关联关系,确定对应的指定声学特征。
在一些实施例中,语音合成模型可以用于:
首先,通过待合成文本获得待合成文本对应的文本特征,和待合成文本对应的预测声学特征。
之后,通过指定声学特征、预测声学特征和文本特征,获得具有指定情感类型的目标音频。
图9是根据一示例性实施例示出的另一种语音合成装置的框图,如图9所示,语音合成模型包括:第一编码器、第二编码器和合成器。合成模块203可以包括:
第一处理子模块2031,用于通过第一编码器,提取待合成文本对应的文本特征。
第二处理子模块2032,用于通过第二编码器,提取待合成文本对应的预测声学特征。
第三处理子模块2033,用于通过合成器,根据指定声学特征、预测声学特征和文本 特征,生成目标音频。
在一些实施例中,文本特征中可以包括多个文本元素。第三处理子模块2033可以用于:
步骤1)通过合成器,根据当前文本元素、历史梅尔频谱特征、指定声学特征和预测声学特征,确定当前时刻的梅尔频谱特征,当前文本元素为文本特征中当前时刻输入到合成器的文本元素,历史梅尔频谱特征为合成器确定的上一时刻的梅尔频谱特征。
步骤2)通过合成器,根据每个时刻的梅尔频谱特征,生成目标音频。
需要说明的是,上述实施例中的语音合成模型是通过如下方式训练获得的:
步骤A,通过训练文本对应的不具有指定情感类型的训练音频,提取训练音频对应的真实声学特征。
步骤B,将真实声学特征与训练文本输入语音合成模型,并根据语音合成模型的输出与训练音频,训练语音合成模型。
在一些实施例中,语音合成模型可以包括:第一编码器、第二编码器和合成器,第一编码器和第二编码器之间设置有阻止结构,阻止结构用于阻止第二编码器将梯度回传至第一编码器。
在一些实施例中,步骤B的实现方式可以包括:
步骤B1,通过第一编码器提取训练文本对应的训练文本特征。
步骤B2,通过第二编码器提取训练文本对应的预测训练声学特征。
步骤B3,通过合成器,根据真实声学特征、预测训练声学特征和训练文本特征,生成语音合成模型的输出。
在一些实施例中,语音合成模型的损失函数由第一损失和第二损失确定,第一损失由语音合成模型的输出,与训练音频确定,第二损失由第二编码器的输出,与真实声学特征确定。
在一些实施例中,语音合成模型还可以通过如下方式训练获得:
步骤C,通过训练音频,提取训练音频对应的真实梅尔频谱信息。
相应的,步骤B可以为:
将真实声学特征、训练文本和真实梅尔频谱信息,作为语音合成模型的输入,并根据语音合成模型的输出与训练音频,训练语音合成模型。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。应注意,上述各个模块的划分并非限制具体的实现方式,上述各个模块例如可以以软件、硬件或者软硬件结合的方式来实现。 在实际实现时,上述各个模块可被实现为独立的物理实体,或者也可由单个实体(例如,处理器(CPU或DSP等)、集成电路等)来实现。需要注意的是,尽管图中将各个模块示为分立的模块,但是这些模块中的一个或多个也可以合并为一个模块,或者拆分为多个模块。此外,上述重音词确定模块和语音合成模型确定模块在附图中用虚线示出指示这些模块不必须被包含在语音合成装置中,其可以在语音合成装置之外实现或者由语音合成装置之外的其它设备实现并且将结果告知语音合成装置。或者,上述重音词确定模块和语音合成模型确定模块在附图中用虚线示出指示这些模块可以并不实际存在,而它们所实现的操作/功能可由语音合成装置本身来实现。
综上所述,本公开首先获取待合成文本和指定情感类型,之后根据指定情感类型,确定对应的指定声学特征,最后将待合成文本和指定声学特征一起输入到预先训练好的语音合成模型中,语音合成模型输出的即为待合成文本对应的,具有指定情感类型的目标音频,其中,目标音频的声学特征与指定声学特征匹配,并且语音合成模型为根据不具有指定情感类型的语料训练得到的。本公开能够通过情感类型对应的声学特征来控制对文本的语音合成,使得语音合成模型输出的目标音频能够与声学特征对应,提高了目标音频的表现力。
下面参考图10,其示出了适于用来实现本公开实施例的电子设备(即上述语音合成方法的执行主体)300的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图10示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图10所示,电子设备300可以包括处理装置(例如中央处理器、图形处理器等)301,其可以根据存储在只读存储器(ROM)302中的程序或者从存储装置308加载到随机访问存储器(RAM)303中的程序而执行各种适当的动作和处理。在RAM 303中,还存储有电子设备300操作所需的各种程序和数据。处理装置301、ROM 302以及RAM 303通过总线304彼此相连。输入/输出(I/O)接口305也连接至总线304。
通常,以下装置可以连接至I/O接口305:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置306;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置307;包括例如磁带、硬盘等的存储装置308;以及通信装置309。通信装置309可以允许电子设备300与其他设备进行无线或有线通信以交换数据。虽然图10示出了具有各种装置的电子设备300,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置309从网络上被下载和安装,或者从存储装置308被安装,或者从ROM 302被安装。在该计算机程序被处理装置301执行时,执行本公开实施例的方法中限定的上述功能。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,终端设备、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取待合成文本和指定情感类型;确定所述指定情感类型对应的指定声学特征;将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型, 以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频,所述目标音频的声学特征与所述指定声学特征匹配,所述语音合成模型为根据不具有所述指定情感类型的语料训练得到的。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)——连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定,例如,获取模块还可以被描述为“获取待合成文本和指定情感类型的模块”。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任 何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
根据本公开的一个或多个实施例,示例性实施例1提供了一种语音合成方法,包括:获取待合成文本和指定情感类型;确定所述指定情感类型对应的指定声学特征;将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频,所述目标音频的声学特征与所述指定声学特征匹配,所述语音合成模型为根据不具有所述指定情感类型的语料训练得到的。
根据本公开的一个或多个实施例,示例性实施例2提供了示例性实施例1的方法,所述指定声学特征包括:基频、音量、语速中的至少一种,所述确定所述指定情感类型对应的指定声学特征,包括:根据所述指定情感类型,和预设的情感类型与声学特征的关联关系,确定对应的所述指定声学特征。
根据本公开的一个或多个实施例,示例性实施例3提供了示例性实施例1或示例性实施例2的方法,所述语音合成模型用于:通过所述待合成文本获得所述待合成文本对应的文本特征,和所述待合成文本对应的预测声学特征;通过所述指定声学特征、所述预测声学特征和所述文本特征,获得具有所述指定情感类型的所述目标音频。
根据本公开的一个或多个实施例,示例性实施例4提供了示例性实施例3的方法,所述语音合成模型包括:第一编码器、第二编码器和合成器;所述将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频,包括:通过所述第一编码器,提取所述待合成文本对应的所述文本特征;通过所述第二编码器,提取所述待合成文本对应的所述预测声学特征;通过所述合成器,根据所述指定声学特征、所述预测声学特征和所述文本特征,生成所述目标音频。
根据本公开的一个或多个实施例,示例性实施例5提供了示例性实施例4的方法,所述文本特征包括多个文本元素,通过所述合成器,根据所述指定声学特征、所述预测声学特征和所述文本特征,生成所述目标音频,包括:通过所述合成器,根据当前文本元素、历史梅尔频谱特征、所述指定声学特征和所述预测声学特征,确定当前时刻的梅尔频谱特征,所述当前文本元素为所述文本特征中当前时刻输入到所述合成器的文本元素,所述历史梅尔频谱特征为所述合成器确定的上一时刻的梅尔频谱特征;通过所述合成器,根据每 个时刻的梅尔频谱特征,生成所述目标音频。
根据本公开的一个或多个实施例,示例性实施例6提供了示例性实施例3的方法,所述语音合成模型是通过如下方式训练获得的:通过训练文本对应的不具有所述指定情感类型的训练音频,提取所述训练音频对应的真实声学特征;将所述真实声学特征与所述训练文本输入所述语音合成模型,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型。
根据本公开的一个或多个实施例,示例性实施例7提供了示例性实施例6的方法,所述语音合成模型包括:第一编码器、第二编码器和合成器,所述第一编码器和所述第二编码器之间设置有阻止结构,所述阻止结构用于阻止所述第二编码器将梯度回传至所述第一编码器;所述将所述真实声学特征与所述训练文本输入所述语音合成模型,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型,包括:通过所述第一编码器提取所述训练文本对应的训练文本特征;通过所述第二编码器提取所述训练文本对应的预测训练声学特征;通过所述合成器,根据所述真实声学特征、所述预测训练声学特征和所述训练文本特征,生成所述语音合成模型的输出。
根据本公开的一个或多个实施例,示例性实施例8提供了示例性实施例6的方法,所述语音合成模型的损失函数由第一损失和第二损失确定,所述第一损失由所述语音合成模型的输出,与所述训练音频确定,所述第二损失由所述第二编码器的输出,与所述真实声学特征确定。
根据本公开的一个或多个实施例,示例性实施例9提供了示例性实施例6的方法,所述语音合成模型还通过如下方式训练获得的:通过所述训练音频,提取所述训练音频对应的真实梅尔频谱信息;所述将所述真实声学特征与所述训练文本输入所述语音合成模型,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型,包括:将所述真实声学特征、所述训练文本和所述真实梅尔频谱信息,作为所述语音合成模型的输入,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型。
根据本公开的一个或多个实施例,示例性实施例10提供了一种语音合成装置,包括:获取模块,用于获取待合成文本和指定情感类型;确定模块,用于确定所述指定情感类型对应的指定声学特征;合成模块,用于将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的有所述指定情感类型的目标音频,所述目标音频的声学特征与所述指定声学特征匹配,所述语音合成模型为根据不具有所述指定情感类型的语料训练得到的。
根据本公开的一个或多个实施例,示例性实施例11提供了一种计算机可读介质,其 上存储有计算机程序,该程序被处理装置执行时实现示例性实施例1至示例性实施例9中所述方法的步骤。
根据本公开的一个或多个实施例,示例性实施例12提供了一种电子设备,包括:存储装置,其上存储有计算机程序;处理装置,用于执行所述存储装置中的所述计算机程序,以实现示例性实施例1至示例性实施例9中所述方法的步骤。
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。

Claims (19)

  1. 一种语音合成方法,所述方法包括:
    获取待合成文本和指定情感类型;
    确定所述指定情感类型对应的指定声学特征;
    将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频。
  2. 根据权利要求1所述的方法,其中,所述目标音频的声学特征与所述指定声学特征匹配,所述语音合成模型为根据不具有所述指定情感类型的语料训练得到的。
  3. 根据权利要求1所述的方法,其中,所述指定声学特征包括:基频、音量、语速中的至少一种,所述确定所述指定情感类型对应的指定声学特征,包括:
    根据所述指定情感类型,和情感类型与声学特征的关联关系,确定对应的所述指定声学特征。
  4. 根据权利要求1-3中任一项所述的方法,其中,所述语音合成模型用于:
    通过所述待合成文本获得所述待合成文本对应的文本特征,和所述待合成文本对应的预测声学特征;
    通过所述指定声学特征、所述预测声学特征和所述文本特征,获得具有所述指定情感类型的所述目标音频。
  5. 根据权利要求4所述的方法,其中,通过将所述指定声学特征与所述预测声学特征进行叠加以得到声学特征向量,并且根据声学特征向量与所述文本特征生成所述目标音频。
  6. 根据权利要求4所述的方法,其中,通过将指定声学特征、预测声学特征和文本向量进行叠加以得到组合向量,然后根据组合向量生成所述目标音频。
  7. 根据权利要求1-6中任一项所述的方法,其中,所述语音合成模型包括:第一编码器、第二编码器和合成器;
    所述将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频,包括:
    通过所述第一编码器,提取所述待合成文本对应的所述文本特征;
    通过所述第二编码器,提取所述待合成文本对应的所述预测声学特征;
    通过所述合成器,根据所述指定声学特征、所述预测声学特征和所述文本特征,生成所述目标音频。
  8. 根据权利要求7所述的方法,其中,所述文本特征包括多个文本元素,并且通过所述合成器,根据所述指定声学特征、所述预测声学特征和所述文本特征,生成所述目标音频,包括:
    通过所述合成器,根据当前文本元素、历史梅尔频谱特征、所述指定声学特征和所述预测声学特征,确定当前时刻的梅尔频谱特征,所述当前文本元素为所述文本特征中当前时刻输入到所述合成器的文本元素,所述历史梅尔频谱特征为所述合成器确定的上一时刻的梅尔频谱特征;
    通过所述合成器,根据每个时刻的梅尔频谱特征,生成所述目标音频。
  9. 根据权利要求1-8中任一项所述的方法,其中,所述语音合成模型是通过如下方式训练获得的:
    通过训练文本对应的不具有所述指定情感类型的训练音频,提取所述训练音频对应的真实声学特征;
    将所述真实声学特征与所述训练文本输入所述语音合成模型,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型。
  10. 根据权利要求9所述的方法,其中,所述语音合成模型包括:第一编码器、第二编码器和合成器,所述第一编码器和所述第二编码器之间设置有阻止结构,所述阻止结构用于阻止所述第二编码器将梯度回传至所述第一编码器;
    所述将所述真实声学特征与所述训练文本输入所述语音合成模型,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型,包括:
    通过所述第一编码器提取所述训练文本对应的训练文本特征;
    通过所述第二编码器提取所述训练文本对应的预测训练声学特征;
    通过所述合成器,根据所述真实声学特征、所述预测训练声学特征和所述训练文本特 征,生成所述语音合成模型的输出。
  11. 根据权利要求10所述的方法,其中,所述语音合成模型的损失函数由第一损失和第二损失确定,所述第一损失由所述语音合成模型的输出,与所述训练音频确定,所述第二损失由所述第二编码器的输出,与所述真实声学特征确定。
  12. 根据权利要求11所述的方法,其中,所述语音合成模型的损失函数是通过第一损失和第二损失的加权求和而被确定的。
  13. 根据权利要求1-8中任一项所述的方法,其中,所述语音合成模型还通过如下方式训练获得的:
    通过所述训练音频,提取所述训练音频对应的真实梅尔频谱信息;
    所述将所述真实声学特征与所述训练文本输入所述语音合成模型,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型,包括:
    将所述真实声学特征、所述训练文本和所述真实梅尔频谱信息,作为所述语音合成模型的输入,并根据所述语音合成模型的输出与所述训练音频,训练所述语音合成模型。
  14. 一种语音合成装置,所述装置包括:
    获取模块,用于获取待合成文本和指定情感类型;
    确定模块,用于确定所述指定情感类型对应的指定声学特征;
    合成模块,用于将所述待合成文本和所述指定声学特征输入预先训练的语音合成模型,以获取所述语音合成模型输出的,所述待合成文本对应的具有所述指定情感类型的目标音频。
  15. 根据权利要求14所述的装置,其中,所述目标音频的声学特征与所述指定声学特征匹配,所述语音合成模型为根据不具有所述指定情感类型的语料训练得到的。
  16. 根据权利要求14或15所述的装置,其中,语音合成模型包括:第一编码器、第二编码器和合成器,并且所述合成模块包括:
    第一处理子模块,用于通过第一编码器,提取待合成文本对应的文本特征,
    第二处理子模块,用于通过第二编码器,提取待合成文本对应的预测声学特征,以及
    第三处理子模块,用于通过合成器,根据指定声学特征、预测声学特征和文本特征, 生成目标音频。
  17. 一种计算机可读介质,其上存储有计算机程序,其特征在于,该程序被处理装置执行时实现权利要求1-13中任一项所述方法的步骤。
  18. 一种电子设备,其特征在于,包括:
    存储装置,其上存储有计算机程序;
    处理装置,用于执行所述存储装置中的所述计算机程序,以实现权利要求1-13中任一项所述方法的步骤。
  19. 一种计算机程序产品,包括指令,所述指令在由计算机执行时使得计算机实现根据权利要求1-13中任一项所述方法的步骤。
PCT/CN2021/126431 2020-11-20 2021-10-26 语音合成方法、装置、可读介质及电子设备 WO2022105553A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/020,198 US20230306954A1 (en) 2020-11-20 2021-10-26 Speech synthesis method, apparatus, readable medium and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011315115.1 2020-11-20
CN202011315115.1A CN112489621B (zh) 2020-11-20 2020-11-20 语音合成方法、装置、可读介质及电子设备

Publications (1)

Publication Number Publication Date
WO2022105553A1 true WO2022105553A1 (zh) 2022-05-27

Family

ID=74933004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126431 WO2022105553A1 (zh) 2020-11-20 2021-10-26 语音合成方法、装置、可读介质及电子设备

Country Status (3)

Country Link
US (1) US20230306954A1 (zh)
CN (1) CN112489621B (zh)
WO (1) WO2022105553A1 (zh)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037760B (zh) * 2020-08-24 2022-01-07 北京百度网讯科技有限公司 语音频谱生成模型的训练方法、装置及电子设备
CN112489620B (zh) * 2020-11-20 2022-09-09 北京有竹居网络技术有限公司 语音合成方法、装置、可读介质及电子设备
CN112489621B (zh) * 2020-11-20 2022-07-12 北京有竹居网络技术有限公司 语音合成方法、装置、可读介质及电子设备
CN113178200B (zh) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 语音转换方法、装置、服务器及存储介质
CN113555027B (zh) * 2021-07-26 2024-02-13 平安科技(深圳)有限公司 语音情感转换方法、装置、计算机设备及存储介质


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104B (zh) * 2006-04-24 2011-02-02 中国科学院自动化研究所 基于语音转换的情感语音生成方法
CN102385858B (zh) * 2010-08-31 2013-06-05 国际商业机器公司 情感语音合成方法和系统
EP3376497B1 (en) * 2017-03-14 2023-12-06 Google LLC Text-to-speech synthesis using an autoencoder
CN107705783B (zh) * 2017-11-27 2022-04-26 北京搜狗科技发展有限公司 一种语音合成方法及装置
CN111192568B (zh) * 2018-11-15 2022-12-13 华为技术有限公司 一种语音合成方法及语音合成装置

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597492A (zh) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 语音合成方法和装置
CN110634466A (zh) * 2018-05-31 2019-12-31 微软技术许可有限责任公司 具有高感染力的tts处理技术
CN108962219A (zh) * 2018-06-29 2018-12-07 百度在线网络技术(北京)有限公司 用于处理文本的方法和装置
CN111048062A (zh) * 2018-10-10 2020-04-21 华为技术有限公司 语音合成方法及设备
WO2020190054A1 (ko) * 2019-03-19 2020-09-24 휴멜로 주식회사 음성 합성 장치 및 그 방법
CN110379409A (zh) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 语音合成方法、系统、终端设备和可读存储介质
US20200035215A1 (en) * 2019-08-22 2020-01-30 Lg Electronics Inc. Speech synthesis method and apparatus based on emotion information
CN111128118A (zh) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 语音合成方法、相关设备及可读存储介质
CN111653265A (zh) * 2020-04-26 2020-09-11 北京大米科技有限公司 语音合成方法、装置、存储介质和电子设备
CN112489621A (zh) * 2020-11-20 2021-03-12 北京有竹居网络技术有限公司 语音合成方法、装置、可读介质及电子设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424604A (zh) * 2022-07-20 2022-12-02 南京硅基智能科技有限公司 一种基于对抗生成网络的语音合成模型的训练方法
CN115424604B (zh) * 2022-07-20 2024-03-15 南京硅基智能科技有限公司 一种基于对抗生成网络的语音合成模型的训练方法

Also Published As

Publication number Publication date
CN112489621B (zh) 2022-07-12
US20230306954A1 (en) 2023-09-28
CN112489621A (zh) 2021-03-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21893701; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21893701; Country of ref document: EP; Kind code of ref document: A1)