WO2020019885A1 - Speech synthesis method, model training method, apparatus, and computer device - Google Patents

Speech synthesis method, model training method, apparatus, and computer device Download PDF

Info

Publication number
WO2020019885A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
data
speech
linguistic
speech data
Prior art date
Application number
PCT/CN2019/090493
Other languages
English (en)
French (fr)
Inventor
吴锡欣
王木
康世胤
苏丹
俞栋
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Priority to EP19840536.7A priority Critical patent/EP3742436A4/en
Publication of WO2020019885A1 publication Critical patent/WO2020019885A1/zh
Priority to US16/999,989 priority patent/US12014720B2/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16Vocoder architecture

Definitions

  • the present application relates to the technical field of speech synthesis, and in particular, to a speech synthesis method, a model training method, a device, and a computer device.
  • For synthesized speech, if the speech has human speech characteristics, it will undoubtedly improve the user experience.
  • the usual method is to use the log Mel spectrum obtained by processing the speech data as the input of a feature model to obtain the speaker's speech features, and then use an end-to-end model (Tacotron) to synthesize speech data from the obtained speech features and the corresponding text features, so that the synthesized speech data has the speaker's speech features.
  • the log Mel spectrum, however, contains both the speaker's speech features and semantic features, which interferes with extracting the speech features from the log Mel spectrum and in turn degrades the quality of the synthesized speech.
  • the application provides a speech synthesis method, a model training method, a device, and a computer device.
  • a speech synthesis method includes:
  • a speech synthesis device includes:
  • a linguistic data acquisition module for acquiring linguistic data to be processed
  • a linguistic data encoding module configured to encode the linguistic data to obtain linguistic encoded data
  • An embedding vector acquisition module configured to obtain an embedding vector for speech feature conversion; the embedding vector is generated according to a residual between reference synthetic speech data and reference speech data corresponding to the same reference linguistic data;
  • a linguistically encoded data decoding module configured to decode the linguistically encoded data according to the embedding vector to obtain target synthesized speech data converted from speech features.
  • a storage medium stores a computer program that, when executed by a processor, causes the processor to execute the steps of the speech synthesis method.
  • a computer device includes a memory and a processor.
  • the memory stores a computer program that, when executed by the processor, causes the processor to execute the steps of the speech synthesis method.
  • the linguistic data to be processed is obtained and the linguistic data is encoded to obtain the linguistic encoded data representing the pronunciation.
  • a model training method includes:
  • Obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data by a first encoder to obtain first training linguistic encoding data; obtaining a training embedding vector for speech feature conversion, the training embedding vector being generated according to the residual between training synthesized speech data and training speech data corresponding to the same training linguistic data; decoding, by a first decoder, the first training linguistic encoding data according to the training embedding vector to obtain predicted target synthesized speech data converted by speech features; and adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until a training stop condition is satisfied.
  • a model training device includes:
  • Training voice data acquisition module for acquiring training linguistic data and corresponding training voice data
  • a training linguistic data encoding module configured to encode the training linguistic data through a first encoder to obtain first training linguistic encoding data
  • a training embedding vector acquisition module configured to obtain a training embedding vector for speech feature conversion; the training embedding vector is generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data;
  • a training linguistic encoding data decoding module configured to decode the first training linguistic encoding data according to the training embedding vector to obtain predicted target synthesized speech data converted by speech features;
  • An adjustment module is configured to adjust the first encoder and the first decoder according to a difference between the predicted target synthesized speech data and the training speech data, and continue training until a training stop condition is satisfied.
  • a storage medium stores a computer program that, when executed by a processor, causes the processor to execute the steps of the model training method.
  • a computer device includes a memory and a processor.
  • the memory stores a computer program that, when executed by the processor, causes the processor to execute the steps of the model training method.
  • the first encoder and the first decoder process the training linguistic data, the training speech data, and the training embedding vector to obtain the predicted target synthesized speech data, and the first encoder and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained first encoder and first decoder.
  • because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, the training embedding vector contains only speech features, and the impact of semantic features on the training model does not need to be considered, thereby reducing the complexity of the first encoder and the first decoder and improving the accuracy of the training results.
  • FIG. 1 is a structural diagram of an application system of a speech synthesis method and a model training method according to an embodiment
  • FIG. 2 is a schematic flowchart of a speech synthesis method according to an embodiment
  • FIG. 3 is a schematic diagram of obtaining target synthesized speech data in a speech synthesis stage according to an embodiment
  • FIG. 4 is a schematic flowchart of a step of obtaining an embedding vector according to reference linguistic data and reference speech data in an embodiment
  • FIG. 5 is a schematic diagram of a data flow direction in obtaining an embedded vector in an embodiment
  • FIG. 6 is a schematic flowchart of a step of obtaining an embedding vector by using a residual model in an embodiment
  • FIG. 7 is a schematic diagram of a residual model structure and a residual process in a residual model in an embodiment
  • FIG. 8 is a schematic diagram of obtaining an embedded vector in an adaptive phase in an embodiment
  • FIG. 9 is a schematic flowchart of steps for training a target voice model in an embodiment
  • FIG. 10 is a schematic diagram of a data flow direction when a target voice model is trained in a model training phase according to an embodiment
  • FIG. 11 is a schematic flowchart of steps for training an average speech model, a residual model, and a target speech model in an embodiment
  • FIG. 12 is a schematic diagram of a data flow direction when an average speech model, a residual model, and a target speech model are trained in a model training phase according to an embodiment
  • FIG. 13 is a schematic flowchart of steps for training a target speech model in an embodiment
  • FIG. 14 is a schematic flowchart of steps for training an average speech model, a residual model, and a target speech model in an embodiment
  • FIG. 15 is a structural block diagram of a speech synthesis device according to an embodiment
  • FIG. 16 is a structural block diagram of a speech synthesis device in another embodiment
  • FIG. 17 is a structural block diagram of a model training device in an embodiment
  • FIG. 18 is a structural block diagram of a model training apparatus in another embodiment
  • FIG. 19 is a structural block diagram of a computer device in an embodiment
  • FIG. 20 is a structural block diagram of a computer device in another embodiment.
  • FIG. 1 is an application environment diagram of a speech synthesis method and a model training method in an embodiment.
  • the speech synthesis method and model training method are applied to a speech synthesis system.
  • the speech synthesis system includes a first encoder, a first decoder, a second encoder, a second decoder, an overlayer, a residual model, a projection layer, and the like.
  • the internal relationship and signal flow between the components of the speech synthesis system are shown in Figure 1.
  • the first encoder and the first decoder constitute a target speech model and are used to synthesize speech in the application phase.
  • the second encoder and the second decoder constitute an average speech model.
  • the composed average speech model is used in combination with an overlayer, a residual model, and a projection layer, and can be used to obtain an embedding vector used to characterize style features in the adaptive phase.
  • the speech synthesis system can run on a computer device as an application or a part of an application.
  • the computer device may be a terminal or a server.
  • the terminal can be a desktop terminal, a mobile terminal, or an intelligent robot.
  • the mobile terminal may be a smart phone, a tablet computer, a notebook computer, or a wearable device.
  • a speech synthesis method is provided. This embodiment is mainly described by using the method applied to the terminal running the speech synthesis system in FIG. 1 as an example.
  • the speech synthesis method may include the following steps:
  • the linguistic data may be a text or a feature or a feature item of the text.
  • the characteristics of the text may be the wording, pronunciation, prosody, and stress of the characters or words in the text.
  • the feature item may be a character, a word, a phrase, or the like. A feature item needs to have the following characteristics: it must be able to identify the text content, be able to distinguish the target text from other texts, and be easy to separate.
  • the terminal receives a voice interaction signal sent by the user, and searches a preset linguistic library for linguistic data corresponding to the voice interaction signal. For example, during the user's voice interaction with the terminal, if the terminal receives the user's voice interaction signal "Who is more beautiful, Xi Shi or Diao Chan?", the terminal searches the preset linguistic library for the linguistic data "Xi Shi and Diao Chan are both beautiful" corresponding to the voice interaction signal. In this example, the linguistic data is text.
  • the terminal encodes the linguistic data through a first encoder to obtain the linguistically encoded data.
  • the terminal obtains a piece of text and encodes the text through a first encoder to obtain a distributed representation, which is the linguistically encoded data.
  • the distributed representation may be a feature vector.
  • a feature vector corresponds to a word or words in the text.
  • the first encoder may be a linguistic data encoder or an attention-based recursive generator.
  • the first encoder may be composed of RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), gated convolutional neural network, or delay network.
  • the terminal inputs a vector representing linguistic data into a first encoder, and uses the last unit state of the first encoder as an output to obtain linguistically encoded data.
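  • To make this concrete, the following is a minimal sketch (not the patented implementation) of such a linguistic encoder: an embedding layer followed by an LSTM, whose last unit state is taken as the linguistically encoded data. The framework (PyTorch), layer sizes, and vocabulary size are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LinguisticEncoder(nn.Module):
    """Sketch of a first encoder: embeds a sequence of linguistic tokens
    (e.g. characters or phonemes) and returns the last LSTM unit state
    as the linguistically encoded data. All sizes are illustrative."""

    def __init__(self, vocab_size=100, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) vector representing the linguistic data
        x = self.embed(token_ids)
        _, (h_n, _) = self.rnn(x)
        return h_n[-1]  # last unit state, shape (batch, hidden_dim)

# usage with a toy sequence of 12 linguistic token ids
encoder = LinguisticEncoder()
encoded = encoder(torch.randint(0, 100, (1, 12)))
print(encoded.shape)  # torch.Size([1, 128])
```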
  • the embedding vector is generated according to the residual between the reference synthesized speech data and the reference speech data corresponding to the same reference linguistic data.
  • the embedding vector may be a vector with a reference object's speaking style characteristic, and the reference object may be a person having a special speaking style.
  • Style features include, but are not limited to, prosody duration features, fundamental frequency features, and energy features that are highly correlated with duration and prosody.
  • Prosody features include the length, pause, and stress of a word or word.
  • the terminal fuses the embedding vector with the corresponding linguistically encoded data and processes them to obtain synthesized speech data with the reference object's speaking style. When the synthesized speech data is processed and played through the speaker, the synthesized speech will no longer sound mechanical, but will have a human speaking style.
  • before the user performs voice interaction with the terminal, the terminal obtains reference linguistic data and reference speech data with stylistic features.
  • the source of the reference speech data may be a user who performs voice interaction with the terminal, or may be a specified reference user.
  • the terminal performs speech synthesis on the reference linguistic data to obtain reference synthesized speech data without stylistic features.
  • the terminal performs difference processing between the reference synthesized speech data and the reference speech data to obtain a residual characterizing the style feature.
  • the terminal processes the residuals to obtain embedding vectors that characterize style features.
  • the terminal stores the obtained embedding vector in a style feature vector library.
  • the style feature vector library can store embedded vectors corresponding to multiple reference objects.
  • the residual characterizing the style feature may be a residual sequence in essence.
  • the step of processing the residual by the terminal to obtain an embedding vector characterizing the style features may include: inputting the residual into the multiple fully connected layers of the residual model, inputting the outputs of the fully connected layers into the forward gated recurrent unit layer and the backward gated recurrent unit layer respectively, and adding the output of the last time step of the forward gated recurrent unit layer to the output of the first time step of the backward gated recurrent unit layer to obtain an embedding vector that is used for speech feature conversion and can characterize the style features.
  • the terminal obtains Zhang Manyu's speech data as the reference speech data and obtains the corresponding linguistic data (such as the spoken text content, for example, "Who is more beautiful, Xi Shi or Diao Chan?"), where the acquired reference speech data has Zhang Manyu's speaking style.
  • the terminal performs speech synthesis on the linguistic data, and obtains reference synthesized speech data without Zhang Manyu's speaking style.
  • the terminal compares the reference speech data with Zhang Manyu's speaking style with the reference synthesized speech data without speaking style, and obtains a residual characterizing the style features.
  • the terminal processes the obtained residuals to obtain an embedding vector that can represent Zhang Manyu's speaking style.
  • the terminal saves the obtained embedding vector for speech feature conversion and can represent style features in the embedding vector library.
  • when the terminal receives a style feature selection instruction, it displays a style selection interface corresponding to the embedding vectors.
  • the terminal receives a specified style feature instruction, and obtains the embedding vector corresponding to the style feature instruction from the style feature vector library. For example, if the user wants to hear the voice of a movie or sports star, the user selects the target movie or sports star among the reference objects in the style selection interface of the terminal. The terminal then receives the style feature instruction for that movie or sports star, and selects, according to the style feature instruction, the embedding vector representing that star's speaking style.
  • the terminal decodes the linguistically encoded data according to the embedding vector through the first decoder to obtain target synthesized speech data with speech feature conversion and having a reference object speaking style.
  • the terminal combines the embedding vector with the linguistically encoded data, and decodes the combined result to obtain target synthesized speech data with speech feature conversion and a reference object speaking style.
  • the first decoder may be a speech data decoder or an attention-based recursive generator.
  • the first decoder may be composed of RNN, or LSTM, or CNN (Convolutional Neural Network, Convolutional Neural Network), gated convolutional neural network, or delay network.
  • when receiving a voice interaction signal sent by a user, the terminal acquires linguistic data corresponding to the voice interaction signal; the linguistic data is, for example, "Who is more beautiful, Xi Shi or Diao Chan?"
  • the terminal inputs the acquired linguistic data to the first encoder, and obtains the linguistic encoded data through the encoding processing of the first encoder.
  • the terminal obtains an embedding vector that can represent the speaking style of a reference object (such as Zhang Manyu), and processes the embedding vector and the linguistically encoded data through a first decoder to obtain target synthesized speech data having the speaking style of the reference object.
  • the linguistic data to be processed is obtained and the linguistic data is encoded to obtain the linguistic encoding data representing the pronunciation.
  • the method may further include:
  • the reference voice data may be voice data collected from a reference object.
  • the reference linguistic data corresponds to the reference speech data.
  • the reference object may be a user who performs voice interaction with the terminal, or may be a designated reference user.
  • the reference speech data may be a speech signal sent by a reference object, and the reference linguistic data may be a text content to be expressed in the speech signal.
  • the terminal performs speech synthesis on the linguistic data to obtain reference synthesized speech data without the user's own speaking style.
  • the terminal makes a difference between the reference speech data with the user's own speaking style and the reference synthesized speech data without the speaking style to obtain a residual characterizing the style characteristics.
  • the terminal processes the obtained residuals to obtain an embedding vector that can represent the user's own speaking style.
  • the terminal collects the speech of the reference object, and performs framing, windowing, and Fourier transform on the collected speech to obtain frequency-domain speech data with the reference object's speaking style characteristics.
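  • As a hedged illustration of this preprocessing (framing, windowing, and Fourier transform), the sketch below uses librosa's short-time Fourier transform to turn a collected waveform into frequency-domain data; the file name, sample rate, frame length, and hop length are assumptions, not values taken from the patent.

```python
import numpy as np
import librosa

# Load the collected reference speech (the file name is a placeholder).
waveform, sr = librosa.load("reference_speech.wav", sr=16000)

# Framing + windowing + Fourier transform in one call: librosa splits the
# signal into frames, applies a Hann window, and takes the FFT per frame.
stft = librosa.stft(waveform, n_fft=1024, hop_length=256, window="hann")

# Frequency-domain speech data: the amplitude spectrum of each frame.
amplitude_spectrum = np.abs(stft)  # shape: (1 + n_fft // 2, n_frames)
print(amplitude_spectrum.shape)
```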
  • the terminal encodes the reference linguistic data by using the second encoder to obtain the reference linguistically encoded data.
  • for example, the terminal obtains a piece of text and encodes it through the second encoder to obtain a distributed representation, which is the reference linguistically encoded data.
  • the distributed representation may be a feature vector.
  • a feature vector corresponds to a word or words in the text.
  • the second encoder may be a linguistic data encoder or an attention-based recursive generator.
  • the second encoder may be composed of RNN, or LSTM, or gated convolutional neural network, or delay network.
  • the terminal inputs a vector representing the linguistic data into a second encoder, and uses the last unit state of the second encoder as an output to obtain the linguistically encoded data.
  • the terminal decodes the reference linguistically encoded data through a second decoder to obtain reference synthesized speech data without stylistic features.
  • the second decoder may be a speech data decoder or an attention-based recursive generator.
  • the second decoder may be composed of RNN, or LSTM, or CNN (Convolutional Neural Network, Convolutional Neural Network), gated convolutional neural network, or delay network.
  • S404 and S406 are steps for synthesizing reference synthesized speech data without stylistic features. As an example, it is shown in FIG. 5.
  • after the terminal obtains the reference linguistic data, the terminal inputs the obtained reference linguistic data into the second encoder, and processes the reference linguistic data through the second encoder to obtain a representation C of the context of the reference linguistic data.
  • the above method steps are only used to understand how to obtain the reference synthesized speech data, and are not limited to the embodiments of the present application.
  • the terminal performs a difference between the reference speech data and the reference synthesized speech data to obtain a residual characterizing the style feature.
  • the terminal processes the obtained residuals with stylistic features to obtain embedding vectors for speech feature conversion and for characterizing stylistic features.
  • in this way, the embedding vector used for speech feature conversion is determined, so as to obtain the embedding vector used for style control when speech synthesis is performed on the linguistic data, so that the synthesized target speech data has specific style characteristics, which improves the quality of the synthesized speech.
  • S408 may include:
  • the terminal performs a difference between the reference speech data and the reference synthesized speech data to obtain a residual characterizing the style feature.
  • the residual model can be constructed by RNN.
  • the residual model can include four layers from bottom to top: two fully connected (Dense) layers, one forward GRU (Gated Recurrent Unit) layer, and one backward GRU layer.
  • each Dense layer contains 128 units activated by an activation function (such as the ReLU function) with a dropout rate of 0.5, and each gated recurrent unit layer contains 32 memory modules.
  • S604 may include: inputting the residual into the residual model, and processing the residual through the fully connected layers, the forward gated recurrent unit layer, and the backward gated recurrent unit layer in the residual model.
  • S606. Generate an embedding vector for speech feature conversion according to the result of the forward operation and the result of the backward operation in the residual model.
  • the embedding vector may also be called an adaptive embedding vector.
  • the style features of the embedded vector are related to the reference speech data. For example, assuming that the reference speech data is obtained by collecting the speech of Maggie Cheung, the style features of the embedding vector are consistent with those of Maggie Cheung. For another example, assuming that the reference speech data is obtained by collecting the user's own voice, the style feature of the embedded vector is consistent with the user's own speaking style feature.
  • the terminal performs a forward operation on the residual through the forward gated recurrent unit layer in the residual model to obtain the result of the forward operation.
  • the terminal performs a backward operation on the residual through the backward gated recurrent unit layer in the residual model to obtain the result of the backward operation.
  • S606 may include: obtaining a first vector output at the last time step when the forward gated recurrent unit layer in the residual model performs the forward operation; obtaining a second vector output at the first time step when the backward gated recurrent unit layer in the residual model performs the backward operation; and superimposing the first vector and the second vector to obtain the embedding vector for speech feature conversion.
  • the state of the hidden layer at the last time step of the forward GRU layer is added to the state of the hidden layer at the first time step of the backward GRU layer to obtain an embedding vector e used to characterize the style features.
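  • A minimal sketch of such a residual model is shown below, assuming PyTorch and a residual shaped (batch, time, features): two 128-unit ReLU Dense layers with dropout 0.5 feed a forward GRU and a backward GRU of 32 units each, and the embedding vector e is the sum of the forward GRU's state at the last time step and the backward GRU's state at the first time step. The feature dimension and other specifics are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualModel(nn.Module):
    """Sketch of the residual model: two Dense layers, then a forward GRU
    and a backward GRU; e = last forward state + first backward state."""

    def __init__(self, feat_dim=80, dense_units=128, gru_units=32):
        super().__init__()
        self.dense = nn.Sequential(
            nn.Linear(feat_dim, dense_units), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(dense_units, dense_units), nn.ReLU(), nn.Dropout(0.5),
        )
        self.forward_gru = nn.GRU(dense_units, gru_units, batch_first=True)
        self.backward_gru = nn.GRU(dense_units, gru_units, batch_first=True)

    def forward(self, residual):
        # residual: (batch, time, feat_dim), e.g. reference speech frames
        # minus reference synthesized speech frames
        h = self.dense(residual)
        fwd_out, _ = self.forward_gru(h)
        # run the backward GRU over the time-reversed sequence
        bwd_out, _ = self.backward_gru(torch.flip(h, dims=[1]))
        first_vector = fwd_out[:, -1, :]   # forward GRU, last time step
        second_vector = bwd_out[:, -1, :]  # corresponds to the original first time step
        return first_vector + second_vector  # embedding vector e

# usage: 50 residual frames with 80 features each
embedding = ResidualModel()(torch.randn(1, 50, 80))
print(embedding.shape)  # torch.Size([1, 32])
```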
  • the embedding vector can be obtained by the following method: the terminal obtains the reference linguistic data and the reference speech data with a stylistic feature (such as Zhang Manyu's speaking style feature), where the linguistic data is, for example, "Who is more beautiful, Xi Shi or Diao Chan?". The terminal inputs the acquired linguistic data into the second encoder, and obtains the reference linguistically encoded data through the encoding processing of the second encoder.
  • the terminal decodes the reference linguistic encoding data to obtain reference synthesized speech data, and makes a difference between the reference synthesized speech data and the reference speech data to obtain a residual characterizing the style feature.
  • the terminal processes the residual through the residual model to obtain an embedding vector that can represent the speaking style.
  • the residual between the reference speech data and the reference synthesized speech data is processed by the residual model to obtain the embedding vector for speech feature conversion, so that the embedding vector has the same style features as the reference speech data, achieving an adaptive effect.
  • the embedding vector is used to perform style control when speech synthesis is performed on the linguistic data, so that the synthesized target speech data has specific style characteristics, and the quality of the synthesized speech is improved.
  • the linguistically encoded data is obtained by encoding by a first encoder; the target synthesized speech data is obtained by decoding by a first decoder; the method further includes:
  • the linguistic data may be a text or a feature or a feature item of the text.
  • the training linguistic data refers to the linguistic data used in the training phase for training the first encoder and the first decoder.
  • the terminal obtains training linguistic data and training speech data with stylistic features.
  • training linguistic data can be "I like to eat, sleep and beat Doudou".
  • the terminal outputs "I like to eat, sleep and beat Doudou" in response.
  • the training linguistic data is encoded by the first encoder to obtain the first training linguistic encoding data.
  • the terminal encodes the training linguistic data by using a first encoder to obtain the first training linguistic encoding data. For example, the terminal obtains a piece of training text, and encodes the training text through a first encoder to obtain a distributed representation, where the distributed representation is the first training linguistically encoded data.
  • the training embedding vector is generated according to the residuals between the training synthesized speech data and the training speech data corresponding to the same training linguistic data.
  • the training embedding vector refers to a vector used to train a first encoder and a first decoder.
  • the terminal fuses and processes the training embedding vector with the corresponding first training linguistic encoding data, and obtains the training synthesized speech data with the reference object's speaking style.
  • when the synthesized speech data is processed and played through the speaker, the synthesized speech will no longer sound mechanical, but will have a human speaking style.
  • before the user performs voice interaction with the terminal, the terminal obtains training linguistic data and training speech data with stylistic features.
  • the source of the training speech data can be selected by the developer, may be the developer's own speech, or can be obtained from other speech with a specific speaking style.
  • the terminal performs speech synthesis on the training linguistic data to obtain training synthesized speech data without stylistic features.
  • the terminal performs difference processing between the training synthesized speech data and the training speech data to obtain a residual characterizing the style feature.
  • the terminal processes the residuals to obtain a training embedding vector representing the style features.
  • the terminal stores the obtained training embedding vector in a style feature vector library.
  • the step of processing the residuals by the terminal to obtain a training embedding vector representing the style features may include: processing the residuals through the multiple fully connected layers of the residual model, inputting the outputs of the fully connected layers into the forward gated recurrent unit layer and the backward gated recurrent unit layer respectively, and adding the output of the last time step of the forward gated recurrent unit layer to the output of the first time step of the backward gated recurrent unit layer to obtain a training embedding vector that is used for speech feature conversion and can represent the style features.
  • for example, if a developer wants to use Zhang Manyu's speech data as the training speech data, the developer obtains Zhang Manyu's speech and processes it to obtain the training speech data, and obtains the corresponding linguistic data (such as the spoken text content, for example, "I like to eat, sleep and beat Doudou"), where the acquired training speech data has Zhang Manyu's speaking style.
  • the terminal performs speech synthesis on the linguistic data to obtain training synthesized speech data without a speaking style.
  • the terminal compares the training speech data with Zhang Manyu's speaking style with the training synthesized speech data without speaking style, and obtains a residual characterizing the style features.
  • the terminal processes the obtained residuals to obtain a training embedding vector that can represent Zhang Manyu's speaking style.
  • the terminal receives a specified style feature selection instruction, and obtains the training embedding vector corresponding to the style feature instruction from the style feature vector library. For example, the developer selects the target movie or sports star among the reference objects in the style selection interface of the terminal. The terminal then receives a style feature instruction for that movie or sports star, and selects, according to the style feature instruction, the training embedding vector representing that star's speaking style.
  • the first decoder decodes the first training linguistically encoded data according to the training embedding vector to obtain the predicted target synthesized speech data after the speech feature conversion.
  • the terminal decodes the first training linguistically encoded data according to the training embedding vector through the first decoder to obtain the predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style. Alternatively, the terminal combines the training embedding vector with the first training linguistically encoded data, and decodes the combined result to obtain the predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style.
  • S910 Adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the terminal adjusts parameters in the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continues training until the speech style corresponding to the predicted target synthesized speech data is consistent with the speech style corresponding to the training speech data, and then stops training.
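  • The adjustment described above amounts to a standard gradient-based training loop. The sketch below assumes PyTorch, a mean-squared-error loss between predicted and training speech frames, and placeholder encoder/decoder modules; none of these specifics are mandated by the patent.

```python
import torch
import torch.nn as nn

def train_target_speech_model(first_encoder, first_decoder, dataset,
                              epochs=10, lr=1e-3):
    """Sketch: adjust the first encoder and first decoder so the predicted
    target synthesized speech approaches the training speech data."""
    params = list(first_encoder.parameters()) + list(first_decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.MSELoss()  # difference between prediction and training speech

    for _ in range(epochs):
        for linguistic_ids, training_speech, training_embedding in dataset:
            encoded = first_encoder(linguistic_ids)                  # first training linguistic encoding
            predicted = first_decoder(encoded, training_embedding)   # predicted target synthesized speech
            loss = criterion(predicted, training_speech)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # adjust encoder and decoder parameters
        # a full system would also check a training stop condition here,
        # e.g. the loss or style mismatch falling below a threshold
    return first_encoder, first_decoder
```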
  • S902-S910 are steps for training the first encoder and the first decoder.
  • the first encoder and the first decoder can be trained by the following method: obtain training linguistic data and training speech data with a style feature (such as Zhang Manyu's or the developer's own speaking style feature); encode the training linguistic data by the first encoder to obtain the first training linguistic encoding data; obtain the training embedding vector used to characterize the style feature; decode, by the first decoder, the first training linguistically encoded data according to the training embedding vector to obtain the predicted target synthesized speech data after speech feature conversion; and adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the first encoder and the first decoder process the training linguistic data, the training speech data, and the training embedding vector to obtain the predicted target synthesized speech data. The first encoder and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained first encoder and first decoder.
  • because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, the training embedding vector contains only speech features, and the impact of semantic features on the training model does not need to be considered, thereby reducing the complexity of the first encoder and the first decoder and improving the accuracy of the training results.
  • the linguistically encoded data is obtained by encoding with a first encoder; the target synthesized speech data is obtained by decoding with a first decoder; the reference linguistically encoded data is obtained by encoding with a second encoder; the reference synthesized speech data is obtained by decoding with a second decoder; and the embedding vector is obtained by a residual model. As shown in FIG. 11, the method may further include:
  • the training linguistic data refers to the linguistic data used in the training phase for training the first encoder and the first decoder.
  • the terminal obtains training linguistic data and training speech data with stylistic features.
  • training linguistic data can be "I like to eat, sleep and beat Doudou".
  • the terminal encodes the training linguistic data through a second encoder to obtain the second training linguistic encoding data.
  • the terminal obtains a piece of text, and encodes the piece of text through the second encoder to obtain a distributed representation, where the distributed representation is the second training linguistically encoded data.
  • the distributed representation may be a feature vector.
  • a feature vector corresponds to a word or words in the text.
  • S1106 Decode the second training linguistically encoded data by a second decoder to obtain training synthesized speech data.
  • S1108 Generate a training embedding vector by using a residual model and according to a residual between training synthetic speech data and training speech data.
  • the terminal performs a difference between the training synthesized speech data and the training speech data by using a residual model to obtain a residual characterizing a style feature.
  • the terminal processes the obtained residuals with stylistic features to obtain a training embedding vector for speech feature conversion and for characterizing stylistic features.
  • S1110 Decode the first training linguistically encoded data according to the training embedding vector by the first decoder to obtain the predicted target synthesized speech data after the speech feature conversion.
  • the first training linguistic encoding data is obtained by encoding the training linguistic data by the first encoder.
  • the terminal decodes the first training linguistically encoded data according to the training embedding vector through the first decoder to obtain the predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style. Alternatively, the terminal combines the training embedding vector with the first training linguistically encoded data, and decodes the combined result to obtain the predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style.
  • S1112 Adjust the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied .
  • the terminal adjusts parameters in the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continues training until the speech style corresponding to the predicted target synthesized speech data is consistent with the speech style corresponding to the training speech data, and then stops training.
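  • Extending the earlier training sketch, the following hedged example jointly updates all five components with one optimizer; the module interfaces, the use of mean-squared error, and the way the residual is formed (training speech minus training synthesized speech) are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

def train_jointly(second_encoder, second_decoder, residual_model,
                  first_encoder, first_decoder, dataset, epochs=10, lr=1e-3):
    """Sketch: adjust all five components from the difference between the
    predicted target synthesized speech data and the training speech data."""
    modules = [second_encoder, second_decoder, residual_model,
               first_encoder, first_decoder]
    optimizer = torch.optim.Adam([p for m in modules for p in m.parameters()], lr=lr)
    criterion = nn.MSELoss()

    for _ in range(epochs):
        for linguistic_ids, training_speech in dataset:
            # average speech model: style-free training synthesized speech
            training_synth = second_decoder(second_encoder(linguistic_ids))
            # residual between training speech and synthesized speech -> embedding
            training_embedding = residual_model(training_speech - training_synth)
            # target speech model: predict speech with the style embedding
            predicted = first_decoder(first_encoder(linguistic_ids), training_embedding)

            loss = criterion(predicted, training_speech)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return modules
```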
  • S1102-S1112 are steps for training the second encoder, the second decoder, the residual model, the first encoder, and the first decoder.
  • the second encoder, the second decoder, the residual model, the first encoder, and the first decoder can be trained by the following method: obtain training linguistic data and training speech data with stylistic features (such as Zhang Manyu's or the developer's own stylistic features); encode the training linguistic data through the second encoder to obtain the second training linguistic encoding data, and decode the second training linguistic encoding data through the second decoder to obtain the training synthesized speech data.
  • the terminal processes the residuals between the training synthesized speech data and the training speech data through a residual model to obtain a training embedding vector for characterizing the style features.
  • the first decoder is used to decode the first training linguistic encoding data according to the training embedding vector to obtain the predicted target synthesized speech data converted by the speech features.
  • according to the difference between the predicted target synthesized speech data and the training speech data, the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are adjusted, and the training is continued until the training stop condition is satisfied.
  • the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are trained by training linguistic data and corresponding training speech data. Adjusting the second encoder, the second decoder, the residual model, the first encoder and the first decoder according to the difference between the prediction target synthesized speech data and the training speech data, so that the prediction target synthesized speech data continuously approaches the training speech data, Thus, a trained second encoder, a second decoder, a residual model, a first encoder, and a first decoder are obtained.
  • the training embedding vector generated from the residuals between the training synthesized speech data and the training speech data is used in the training process, the training embedding vector only contains speech features, and it is not necessary to consider the impact of semantic features on the training model, thereby reducing The complexity of the second encoder, the second decoder, the residual model, the first encoder and the first decoder is improved, and the accuracy of the training result is improved.
  • the second encoder, the second decoder, and the residual model used to obtain the embedding vector that characterizes the style features are combined with the first encoder and the first decoder used to synthesize speech, which reduces the need for data in the speech synthesis system and improves the accuracy of building the speech synthesis system.
  • S208 may include: stitching the linguistically encoded data and the embedding vector to obtain a stitching vector; and decoding the stitching vector to obtain target synthesized speech data converted by the speech feature.
  • the embedding vector includes: prosody duration feature, fundamental frequency feature, and energy feature.
  • the step of stitching the linguistically encoded data and the embedding vector to obtain the stitching vector may include: determining the target duration corresponding to the prosody in the target speech data according to the prosody duration feature; and combining the phoneme sequence with the target duration, the fundamental frequency feature, and the energy feature to obtain combined features.
  • the linguistically encoded data and the embedding vector are spliced, and the vector obtained after splicing is decoded to obtain the target synthesized speech data after speech feature conversion. Because the spliced vector has no semantic features, interference from semantic features in processing the linguistically encoded data is avoided, thereby improving the quality of the synthesized speech.
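  • A hedged sketch of the stitching step is given below: the linguistically encoded data and the embedding vector are concatenated along the feature axis to form the stitching vector that is then decoded. The tensor shapes are assumptions for illustration.

```python
import torch

# Toy tensors standing in for real model outputs (dimensions are assumed).
linguistic_encoding = torch.randn(1, 128)  # linguistically encoded data
embedding_vector = torch.randn(1, 32)      # embedding vector for speech feature conversion

# Stitching: concatenate along the feature axis to obtain the stitching vector.
# For sequence-shaped encodings, the embedding would first be repeated across
# the time steps before concatenation.
stitching_vector = torch.cat([linguistic_encoding, embedding_vector], dim=-1)
print(stitching_vector.shape)  # torch.Size([1, 160])
# the stitching vector is then fed to the first decoder to obtain the
# target synthesized speech data converted by the speech features
```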
  • the method may further include: determining a speech amplitude spectrum corresponding to the target synthesized speech data; converting the speech amplitude spectrum into a time-domain speech waveform signal; and generating a speech based on the speech waveform.
  • the target synthesized speech data may be frequency-domain speech data.
  • the terminal obtains the corresponding speech amplitude spectrum from the target synthesized speech data in the frequency domain, converts the speech amplitude spectrum into a time-domain speech waveform signal through the Griffin-Lim algorithm, and then converts the speech waveform signal through the WORLD vocoder into a styled synthetic sound.
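  • The waveform reconstruction can be illustrated with librosa's Griffin-Lim implementation, as sketched below; the random placeholder spectrum, FFT size, hop length, sample rate, and output file name are assumptions, and a vocoder such as WORLD would be applied separately to obtain the final styled sound.

```python
import numpy as np
import librosa
import soundfile as sf

n_fft, hop_length, sr = 1024, 256, 16000

# Placeholder amplitude spectrum standing in for the target synthesized
# speech data in the frequency domain (random values for illustration).
amplitude_spectrum = np.abs(np.random.randn(1 + n_fft // 2, 200)).astype(np.float32)

# Griffin-Lim iteratively estimates the phase and converts the amplitude
# spectrum into a time-domain speech waveform signal.
waveform = librosa.griffinlim(amplitude_spectrum, n_iter=60,
                              hop_length=hop_length, win_length=n_fft)

# Write the waveform to disk; further vocoder processing would produce the
# final styled synthetic speech.
sf.write("synthesized_speech.wav", waveform, sr)
```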
  • the target synthesized speech data with speech characteristics is converted into a speech signal, thereby obtaining a styled speech, so that the quality of the synthesized speech can be improved.
  • a model training method is provided. This embodiment is mainly described by using the method applied to the terminal running the speech synthesis system in FIG. 1 as an example.
  • the model training method may include the following steps:
  • the linguistic data may be a text or a feature or a feature item of the text.
  • the training linguistic data refers to the linguistic data used in the training phase for training the first encoder and the first decoder.
  • the terminal obtains training linguistic data and training speech data with stylistic features.
  • training linguistic data can be "I like to eat, sleep and beat Doudou".
  • the terminal outputs "I like to eat, sleep and beat Doudou" in response.
  • S1304 Encode the training linguistic data by using a first encoder to obtain first training linguistic encoded data.
  • the terminal encodes the training linguistic data by using a first encoder to obtain the first training linguistic encoding data. For example, the terminal obtains a piece of training text, and encodes the training text through a first encoder to obtain a distributed representation, where the distributed representation is the first training linguistically encoded data.
  • the distributed representation may be a feature vector. A feature vector corresponds to a word or words in the text.
  • S1306 Acquire a training embedding vector for speech feature conversion; the training embedding vector is generated according to the residual between the training synthesized speech data and the training speech data corresponding to the same training linguistic data.
  • the embedding vector may be a vector with a reference object's speaking style characteristics.
  • the reference object may be a person who speaks with a specific style selected by the developer during the training process.
  • the training embedding vector refers to a vector for training a first encoder and a first decoder.
  • the terminal fuses and processes the training embedding vector with the corresponding first training linguistic encoding data, and obtains the training synthesized speech data with the reference object's speaking style.
  • when the synthesized speech data is processed and played through the speaker, the synthesized speech will no longer sound mechanical, but will have a human speaking style.
  • before the user performs voice interaction with the terminal, the terminal obtains training linguistic data and training speech data with stylistic features.
  • the source of the training speech data can be selected by the developer, may be the developer's own speech, or can be obtained from other speech with a specific speaking style.
  • the terminal performs speech synthesis on the training linguistic data to obtain training synthesized speech data without stylistic features.
  • the terminal performs difference processing between the training synthesized speech data and the training speech data to obtain a residual characterizing the style feature.
  • the terminal processes the residuals to obtain a training embedding vector representing the style features.
  • the terminal stores the obtained training embedding vector in a style feature vector library.
  • the style feature vector library can store training embedding vectors corresponding to multiple reference objects, and the reference object can be a person who speaks with a special style.
  • the residual characterizing the style feature may be a residual sequence in essence.
  • the step of processing the residuals by the terminal to obtain a training embedding vector representing the style features may include: processing the residuals through the multiple fully connected layers of the residual model, inputting the outputs of the fully connected layers into the forward gated recurrent unit layer and the backward gated recurrent unit layer respectively, and adding the output of the last time step of the forward gated recurrent unit layer to the output of the first time step of the backward gated recurrent unit layer to obtain a training embedding vector that is used for speech feature conversion and can represent the style features.
  • for example, if a developer wants to use Zhang Manyu's speech data as the training speech data, the developer obtains Zhang Manyu's speech and processes it to obtain the training speech data, and obtains the corresponding linguistic data (such as the spoken text content, for example, "I like to eat, sleep and beat Doudou"), where the acquired training speech data has Zhang Manyu's speaking style.
  • the terminal performs speech synthesis on the linguistic data to obtain training synthesized speech data without a speaking style.
  • the terminal compares the training speech data with Zhang Manyu's speaking style with the training synthesized speech data without speaking style, and obtains a residual characterizing the style features.
  • the terminal processes the obtained residuals to obtain a training embedding vector that can represent Zhang Manyu's speaking style.
  • the terminal receives a specified style feature selection instruction, and obtains the training embedding vector corresponding to the style feature instruction from the style feature vector library. For example, a developer wants to hear the voice of a movie or sports star; the developer then selects the target movie or sports star among the reference objects in the style selection interface of the terminal. The terminal receives the style feature instruction for that movie or sports star and, according to the style feature instruction, selects the training embedding vector representing that star's speaking style.
  • S1308 Decode the first training linguistically encoded data according to the training embedding vector by the first decoder to obtain the predicted target synthesized speech data after the speech feature conversion.
  • the terminal decodes the first training linguistically encoded data according to the training embedding vector through the first decoder to obtain the predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style. Alternatively, the terminal combines the training embedding vector with the first training linguistically encoded data, and decodes the combined result to obtain the predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style.
  • S1310 Adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the terminal adjusts parameters in the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continues training until the speech style corresponding to the predicted target synthesized speech data is consistent with the speech style corresponding to the training speech data, and then stops training.
  • S1302-S1310 are steps for training the first encoder and the first decoder.
  • the first encoder and the first decoder can be trained by the following method: obtain training linguistic data and training speech data with a style feature (such as Zhang Manyu's or the developer's own speaking style feature); encode the training linguistic data by the first encoder to obtain the first training linguistic encoding data; obtain the training embedding vector used to characterize the style feature; decode, by the first decoder, the first training linguistically encoded data according to the training embedding vector to obtain the predicted target synthesized speech data after speech feature conversion; and adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the first encoder and the first decoder process the training linguistic data, the training speech data, and the training embedding vector to obtain the predicted target synthesized speech data. The first encoder and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained first encoder and first decoder.
  • Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the first encoder and the first decoder and improves the accuracy of the training results.
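  • The following is a minimal, non-authoritative sketch of the training loop described in S1302-S1310, written in PyTorch for illustration only; the module interfaces, the use of mean squared error as the measure of difference, and the Adam optimizer are assumptions rather than requirements of this application.

    import torch
    import torch.nn.functional as F

    # Assumed setup: first_encoder and first_decoder are torch.nn.Module instances,
    # and a single optimizer covers the parameters of both, e.g.
    # optimizer = torch.optim.Adam(list(first_encoder.parameters()) +
    #                              list(first_decoder.parameters()), lr=1e-3)

    def train_step(first_encoder, first_decoder, optimizer,
                   train_linguistic, train_speech, train_embedding):
        # S1304: encode the training linguistic data.
        first_ling_encoded = first_encoder(train_linguistic)
        # S1308: decode according to the training embedding vector.
        predicted_speech = first_decoder(first_ling_encoded, train_embedding)
        # S1310: the difference between the predicted target synthesized speech
        # data and the training speech data drives the parameter adjustment.
        loss = F.mse_loss(predicted_speech, train_speech)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()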
  • the method may further include:
  • the linguistic data may be a text or a feature or a feature item of the text.
  • the training linguistic data refers to the linguistic data used in the training phase for training the first encoder and the first decoder.
  • the terminal obtains training linguistic data and training speech data with style features.
  • for example, the training linguistic data can be "I like to eat, sleep and beat Doudou".
  • the terminal encodes the training linguistic data through a second encoder to obtain the second training linguistic encoding data.
  • the terminal obtains a piece of text, and encodes the piece of text through the second encoder to obtain a distributed representation, where the distributed representation is the second training linguistic encoded data.
  • the distributed representation may be a feature vector.
  • a feature vector corresponds to a word or words in the text.
  • S1408 Generate a training embedding vector by using a residual model and according to a residual between the training synthesized speech data and the training speech data.
  • the terminal computes the difference between the training synthesized speech data and the training speech data by using a residual model, to obtain a residual characterizing a style feature.
  • the terminal processes the obtained residuals with stylistic features to obtain a training embedding vector for speech feature conversion and for characterizing stylistic features.
  • the first decoder decodes the first training linguistically encoded data according to the training embedding vector to obtain the prediction target synthesized speech data after the speech feature conversion.
  • the terminal decodes the second training linguistic encoded data according to the training embedding vector through the second decoder, to obtain predicted target synthesized speech data that has undergone speech feature conversion and has the reference object's speaking style. Alternatively, the terminal combines the training embedding vector with the second training linguistic encoded data, and decodes the combined result to obtain the predicted target synthesized speech data with the reference object's speaking style after speech feature conversion.
  • S1412 Adjust the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the terminal adjusts parameters in the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continues training until the speech style corresponding to the predicted target synthesized speech data is consistent with the speech style corresponding to the training speech data, at which point training is stopped.
  • S1402-S1412 are steps for training the second encoder, the second decoder, the residual model, the first encoder, and the first decoder.
  • the second encoder, the second decoder, the residual model, the first encoder, and the first decoder can be trained as follows: obtain training linguistic data and training speech data with style features (such as Zhang Manyu's or the developer's own style features); encode the training linguistic data through the second encoder to obtain the second training linguistic encoded data; and decode the second training linguistic encoded data through the second decoder to obtain the training synthesized speech data.
  • the terminal processes the residuals between the training synthesized speech data and the training speech data through a residual model to obtain a training embedding vector for characterizing the style features.
  • the first decoder is used to decode the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data converted by the speech features. According to the difference between the predicted target synthesized speech data and the training speech data, the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are adjusted, and training continues until the training stop condition is satisfied.
  • the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are trained by using the training linguistic data and the corresponding training speech data. The second encoder, the second decoder, the residual model, the first encoder, and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data; a trained second encoder, second decoder, residual model, first encoder, and first decoder are thereby obtained.
  • Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the second encoder, the second decoder, the residual model, the first encoder, and the first decoder and improves the accuracy of the training results.
  • the second encoder, the second decoder, and the residual model used to obtain the embedding vector that characterizes the style features are combined with the first encoder and the first decoder used to synthesize speech, which reduces the amount of data required by the speech synthesis system and improves the accuracy of building the speech synthesis system.
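  • As a hedged illustration of how these five components might be adjusted together, the sketch below (PyTorch; the frame-aligned spectrogram tensors of equal shape, the mean squared error as the measure of difference, and the single optimizer over all five modules are assumptions, not requirements of this application) shows the training synthesized speech data, the residual, and the training embedding vector flowing through one training step.

    import torch.nn.functional as F

    def joint_train_step(second_encoder, second_decoder, residual_model,
                         first_encoder, first_decoder, optimizer,
                         train_linguistic, train_speech):
        # Average speech model: predict style-free training synthesized speech data.
        train_synth = second_decoder(second_encoder(train_linguistic))
        # Residual between the training synthesized speech data and the training
        # speech data (assumed to be frame-aligned tensors of the same shape).
        residual = train_speech - train_synth
        # Residual model maps the residual to the training embedding vector.
        train_embedding = residual_model(residual)
        # Target speech model decodes according to the training embedding vector.
        predicted = first_decoder(first_encoder(train_linguistic), train_embedding)
        loss = F.mse_loss(predicted, train_speech)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()   # adjusts all five components together
        return loss.item()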
  • In a related approach, a trained encoder obtains a style embedding vector from the log Mel spectrum of reference audio, and this embedding vector is then used to guide Tacotron to model the style data.
  • Given a log Mel spectrum of the reference audio, an embedding vector representing the style is obtained through the trained encoder, and the embedding vector is then used to guide Tacotron to generate speech in the corresponding style.
  • the speech synthesis system includes: an average speech model, a residual model, a projection layer, and a target speech model.
  • the target speech model includes a first encoder and a first decoder.
  • the first encoder and the first decoder may be a linguistic data encoder and a speech data decoder, respectively.
  • the first encoder and the first decoder may also be attention-based recursive generators.
  • the average speech model includes a second encoder and a second decoder.
  • the second encoder and the second decoder may be a linguistic data encoder and a speech data decoder, respectively.
  • the second encoder and the second decoder may also be attention-based recursive generators.
  • Both the average speech model and the target speech model can be based on Tacotron models, including decoders and encoders.
  • the average speech model is trained on the training linguistic data to obtain average-style speech data.
  • the residual model encodes the difference between the predicted average synthesized speech data and the target speech data to obtain an embedded vector of style features.
  • the projection layer projects the embedded vector into the first decoder space of the target speech model.
  • Before synthesized speech is obtained, there are three phases: a training phase, an adaptive phase, and a test phase. In the training phase:
  • the input training linguistic data is first passed through the average speech model to predict the training synthesized speech data.
  • the average speech model includes a second encoder (such as a linguistic data encoder) and a second decoder (such as a speech data decoder).
  • the second encoder is used to encode training linguistic data to obtain a hidden layer representation.
  • the second decoder is used to decode the hidden layer representation to obtain training synthesized speech data.
  • the hidden layer representation refers to the linguistic encoding data described in the embodiments of the present application.
  • difference processing is performed between the obtained training synthesized speech data and the training speech data with style features to obtain a residual between the two. The residual is input into the residual model to obtain a training embedding vector for characterizing the style features, and the training embedding vector is mapped to the first decoder of the target speech model through the projection layer.
  • the input is training linguistic data, which is encoded by the first encoder to obtain the hidden layer representation.
  • the first decoder decodes, according to the hidden layer representation and the training embedding vector mapped through the projection layer, the predicted target synthesized speech data with the style.
  • the training embedding vectors are data-driven and learned automatically.
  • the average speech model, the residual model, and the target speech model are adjusted according to the training speech data, and training continues until the predicted target synthesized speech data is as close as possible to the training speech data, so that the style of the final output synthesized speech is consistent with the style of the speech data used in training; the trained average speech model, residual model, and target speech model are thereby obtained.
  • the adaptive phase mainly obtains the target style embedding vector through the trained average speech model, residual model and target speech model. For example, as shown in FIG. 8, when the user performs voice interaction with the terminal, if he wants to hear Zhang Manyu's speaking style, then the user can use Zhang Manyu's voice data as reference voice data and obtain corresponding reference linguistic data.
  • the obtained reference linguistic data is input into a trained average speech model to obtain reference synthesized speech data.
  • the reference synthesized speech data and the reference speech data are subjected to difference processing to obtain residuals representing style characteristics. Residuals are input into the residual model to obtain the embedding vector used to characterize the style features.
  • an adaptive style embedding vector can be quickly obtained. Because this process does not require training, it greatly improves the speed of adaptation and reduces the time for adaptation.
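  • As an illustrative sketch only (the module interfaces below are assumptions and are not part of this application), the adaptive embedding can be obtained with a single forward pass through the trained models, without any gradient updates:

    import torch

    @torch.no_grad()
    def adapt_embedding(second_encoder, second_decoder, residual_model,
                        reference_linguistic, reference_speech):
        # Trained average speech model predicts style-free reference synthesized speech.
        reference_synth = second_decoder(second_encoder(reference_linguistic))
        # The residual carries only the reference speaker's style characteristics.
        residual = reference_speech - reference_synth
        # Trained residual model maps the residual to the style embedding vector.
        return residual_model(residual)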
  • the given linguistic data is first input to the first encoder of the target speech model for encoding to obtain a hidden layer representation.
  • the first decoder is controlled by using the embedding vector obtained in the adaptive stage to obtain target synthesized speech data in a style similar to the adaptive reference sample. For example, when the source of the reference speech data used in the adaptation phase is Maggie Cheung, the style of the target synthesized speech data obtained is Maggie Cheung's speaking style.
  • the output target synthesized speech data is restored to a speech waveform signal by the Griffin-Lim algorithm.
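  • One possible way to perform this restoration is sketched below with librosa's Griffin-Lim implementation; the sampling rate, FFT size, hop length, iteration count, and output file name are assumptions for illustration and are not specified by this application.

    import numpy as np
    import librosa
    import soundfile as sf

    def spectrogram_to_wav(amplitude_spec, sr=16000, n_fft=1024, hop_length=256):
        # amplitude_spec: linear-frequency magnitude spectrogram,
        # shape (1 + n_fft // 2, frames).
        waveform = librosa.griffinlim(amplitude_spec, n_iter=60,
                                      hop_length=hop_length, win_length=n_fft)
        sf.write("synthesized.wav", waveform.astype(np.float32), sr)
        return waveform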
  • Because the style features do not need to be manually labeled, the cost of constructing the speech synthesis system is reduced.
  • Because the residual is used as the control condition, the use of a log Mel spectrum is avoided, which reduces the modeling complexity and improves the accuracy of style feature extraction.
  • Because the style vector module (that is, the residual model) and the speech synthesis model can be modeled and trained simultaneously, no additional style vector module is needed, which reduces training time and also makes it possible to quickly and adaptively obtain the embedding vector needed for synthesized speech.
  • FIG. 2 is a schematic flowchart of a speech synthesis method in an embodiment
  • FIG. 13 is a schematic flowchart of a model training method in an embodiment. It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 13 are displayed sequentially according to the directions of the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated in this document, the execution order of these steps is not strictly limited, and these steps may be performed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 13 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily performed at the same time, but may be performed at different times, and they are not necessarily performed sequentially; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
  • a speech synthesis apparatus may include: a linguistic data acquisition module 1502, a linguistic data encoding module 1504, an embedding vector acquisition module 1506, and a linguistic encoded data decoding module 1508, where:
  • the linguistic data acquisition module 1502 is configured to acquire linguistic data to be processed.
  • the linguistic data encoding module 1504 is configured to encode the linguistic data to obtain the linguistic encoded data.
  • An embedding vector acquisition module 1506 is configured to obtain an embedding vector used for speech feature conversion.
  • the embedding vector is generated according to a residual between the reference synthesized speech data and the reference speech data corresponding to the same reference linguistic data.
  • the linguistic coded data decoding module 1508 is configured to decode the linguistic coded data according to the embedding vector to obtain the target synthesized speech data after the speech feature conversion.
  • the linguistic data to be processed is obtained and the linguistic data is encoded to obtain the linguistic encoding data representing the pronunciation.
  • the apparatus may further include an embedding vector determination module 1510, where:
  • the linguistic data acquisition module 1502 is further configured to acquire reference linguistic data and corresponding reference speech data.
  • the linguistic data encoding module 1504 is further configured to encode the reference linguistic data to obtain the reference linguistic encoded data.
  • the linguistic coded data decoding module 1508 is further configured to decode the reference linguistic coded data to obtain reference synthesized speech data.
  • the embedding vector determination module 1510 is configured to determine an embedding vector for speech feature conversion according to a residual between the reference speech data and the reference synthesized speech data.
  • the embedding vector used for speech feature conversion is determined, so that the embedding vector used for style control in speech synthesis of the linguistic data is obtained; the synthesized target synthesized speech data therefore has specific style characteristics, which improves the quality of the synthesized speech.
  • the embedding vector determination module 1510 is further configured to determine a residual between the reference speech data and the reference synthesized speech data; process the residual through a residual model; and generate, according to the result of the forward operation and the result of the backward operation in the residual model, the embedding vector for speech feature conversion.
  • the embedding vector determination module 1510 is further configured to process the residual through a fully connected layer, a forward gated recurrent unit layer, and a backward gated recurrent unit layer in the residual model.
  • the embedding vector determination module 1510 is further configured to obtain a first vector output at the last time step when the forward gated recurrent unit layer in the residual model performs a forward operation; obtain a second vector output at the first time step when the backward gated recurrent unit layer in the residual model performs a backward operation; and superimpose the first vector and the second vector to obtain the embedding vector for speech feature conversion.
  • the residual between the reference speech data and the reference synthesized speech data is processed by the residual model to obtain the embedding vector for speech feature conversion, so that the embedding vector has the same style features as the reference speech data, thereby achieving an adaptive effect.
  • an embedded vector is used to perform style control when speech synthesis is performed on the linguistic data, so that the synthesized target synthesized speech data has specific style characteristics, and the quality of the synthesized speech is improved.
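  • A minimal PyTorch sketch of a residual model of the kind described above follows; the layer sizes, the number of fully connected layers, and the use of GRU cells for the gated recurrent unit layers are illustrative assumptions rather than values fixed by this application.

    import torch
    import torch.nn as nn

    class ResidualModel(nn.Module):
        def __init__(self, spec_dim=80, hidden_dim=256, embed_dim=128):
            super().__init__()
            self.fc = nn.Sequential(nn.Linear(spec_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
            self.forward_gru = nn.GRU(hidden_dim, embed_dim, batch_first=True)
            self.backward_gru = nn.GRU(hidden_dim, embed_dim, batch_first=True)

        def forward(self, residual):               # residual: (batch, frames, spec_dim)
            h = self.fc(residual)
            fwd_out, _ = self.forward_gru(h)                         # forward operation
            bwd_out, _ = self.backward_gru(torch.flip(h, dims=[1]))  # backward operation
            first_vec = fwd_out[:, -1, :]    # output at the last forward time step
            second_vec = bwd_out[:, -1, :]   # output at the first original time step
            return first_vec + second_vec    # superimpose to obtain the embedding vector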
  • the linguistic encoded data is obtained through encoding by a first encoder, and the target synthesized speech data is obtained through decoding by a first decoder; as shown in FIG. 16, the apparatus further includes a first adjustment module 1512, where:
  • the linguistic data acquisition module 1502 is further configured to acquire training linguistic data and corresponding training speech data.
  • the linguistic data encoding module 1504 is further configured to encode the training linguistic data through a first encoder to obtain the first training linguistic encoded data.
  • the embedding vector acquisition module 1506 is further configured to obtain a training embedding vector for speech feature conversion; the training embedding vector is generated according to a residual between training synthetic speech data and training speech data corresponding to the same training linguistic data.
  • the linguistic encoded data decoding module 1508 is further configured to decode the first training linguistic encoded data according to the training embedding vector by the first decoder to obtain the predicted target synthesized speech data after the speech feature conversion.
  • the first adjustment module 1512 is configured to adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the first encoder and the first decoder process the training linguistic data, the training speech data, and the training embedding vector to obtain the predicted target synthesized speech data, and the first encoder and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained first encoder and a trained first decoder.
  • Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the first encoder and the first decoder and improves the accuracy of the training results.
  • the linguistic encoded data is obtained through encoding by a first encoder; the target synthesized speech data is obtained through decoding by a first decoder; the reference linguistic encoded data is obtained through encoding by a second encoder; the reference synthesized speech data is obtained through decoding by a second decoder; and the embedding vector is obtained through a residual model.
  • the apparatus further includes: an embedded vector generation module 1514 and a second adjustment module 1516; wherein:
  • the linguistic data acquisition module 1502 is further configured to acquire training linguistic data and corresponding training speech data.
  • the linguistic data encoding module 1504 is further configured to encode the training linguistic data through a second encoder to obtain second training linguistic encoded data.
  • the linguistic encoded data decoding module 1508 is further configured to decode the second training linguistic encoded data through a second decoder to obtain training synthesized speech data.
  • the embedding vector generating module 1514 is configured to generate a training embedding vector by using a residual model and according to a residual between the training synthesized speech data and the training speech data.
  • the linguistic encoded data decoding module 1508 is further configured to decode, through the first decoder, the first training linguistic encoded data according to the training embedding vector to obtain the predicted target synthesized speech data converted by the speech features; the first training linguistic encoded data is obtained by encoding the training linguistic data through the first encoder.
  • a second adjustment module 1516 is configured to adjust the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is met.
  • the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are trained by using the training linguistic data and the corresponding training speech data, and the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained second encoder, second decoder, residual model, first encoder, and first decoder.
  • Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the second encoder, the second decoder, the residual model, the first encoder, and the first decoder and improves the accuracy of the training results.
  • the second encoder, the second decoder, and the residual model used to obtain the embedding vector that characterizes the style features are combined with the first encoder and the first decoder used to synthesize speech, which reduces the amount of data required by the speech synthesis system and improves the accuracy of building the speech synthesis system.
  • the linguistic encoded data decoding module 1508 is further configured to splice the linguistic encoded data and the embedding vector to obtain a spliced vector, and decode the spliced vector to obtain the target synthesized speech data transformed by the speech features.
  • the linguistic encoded data and the embedding vector are spliced, and the vector obtained after splicing is decoded to obtain the target synthesized speech data after speech feature conversion. Because the spliced vector has no semantic features, the influence of semantic features on the processing of the linguistic encoded data is avoided, thereby improving the quality of the synthesized speech.
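  • A brief sketch of one way to perform this splicing is shown below, assuming the embedding vector is broadcast across the time steps of the linguistic encoded data before concatenation; this broadcasting choice is an assumption, since the application does not prescribe how the two are combined.

    import torch

    def splice(linguistic_encoded, embedding):
        # linguistic_encoded: (batch, steps, enc_dim); embedding: (batch, embed_dim).
        expanded = embedding.unsqueeze(1).expand(-1, linguistic_encoded.size(1), -1)
        return torch.cat([linguistic_encoded, expanded], dim=-1)   # spliced vector

  • The spliced vector is then fed to the first decoder in place of the bare linguistic encoded data.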
  • the apparatus further includes a synthesis module 1518, a conversion module 1520, and a speech generation module 1522, where:
  • a synthesis module 1518 is configured to determine a speech amplitude spectrum corresponding to the target synthesized speech data.
  • the conversion module 1520 is configured to convert a speech amplitude spectrum into a time-domain speech waveform signal.
  • the voice generation module 1522 is configured to generate a voice according to a voice waveform.
  • the target synthesized speech data with speech characteristics is converted into a speech signal, thereby obtaining a styled speech, so that the quality of the synthesized speech can be improved.
  • a model training apparatus may include a speech data acquisition module 1702, a linguistic data encoding module 1704, an embedding vector acquisition module 1706, a linguistic encoded data decoding module 1708, and an adjustment module 1710, where:
  • the voice data acquisition module 1702 is configured to acquire training linguistic data and corresponding training voice data.
  • the linguistic data encoding module 1704 is configured to encode the training linguistic data through a first encoder to obtain the first training linguistic encoded data.
  • An embedding vector acquisition module 1706 is configured to obtain a training embedding vector used for speech feature conversion.
  • the training embedding vector is generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data.
  • the linguistic encoded data decoding module 1708 is configured to decode the first training linguistic encoded data according to the training embedding vector by the first decoder to obtain the predicted target synthesized speech data after the speech feature conversion.
  • An adjustment module 1710 is configured to adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is satisfied.
  • the first encoder and the first decoder process the training linguistic data, the training speech data, and the training embedding vector to obtain the predicted target synthesized speech data, and the first encoder and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained first encoder and a trained first decoder.
  • Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the first encoder and the first decoder and improves the accuracy of the training results.
  • the apparatus further includes an embedding vector generating module 1712, where:
  • the linguistic data encoding module 1704 is further configured to encode the training linguistic data through a second encoder to obtain second training linguistic encoded data.
  • the linguistic encoded data decoding module 1708 is further configured to decode the second training linguistic encoded data through a second decoder to obtain training synthesized speech data.
  • An embedding vector generating module 1712 is configured to generate a training embedding vector by using a residual model and according to a residual between training synthetic speech data and training speech data.
  • the adjustment module 1710 is further configured to adjust the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until the training stop condition is met.
  • the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are trained by using the training linguistic data and the corresponding training speech data, and the second encoder, the second decoder, the residual model, the first encoder, and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained second encoder, second decoder, residual model, first encoder, and first decoder.
  • Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the second encoder, the second decoder, the residual model, the first encoder, and the first decoder and improves the accuracy of the training results.
  • the second encoder, the second decoder, and the residual model used to obtain the embedding vector that characterizes the style features are combined with the first encoder and the first decoder used to synthesize speech, which reduces the amount of data required by the speech synthesis system and improves the accuracy of building the speech synthesis system.
  • FIG. 19 shows an internal structure diagram of a computer device in one embodiment.
  • the computer device may be a terminal running the speech synthesis system in FIG. 1.
  • the computer device includes a processor 1901, a memory 1902, a network interface 1903, an input device 1904, and a display screen 1905 connected through a system bus.
  • the memory 1902 includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and a computer program.
  • the processor 1901 can implement a speech synthesis method.
  • a computer program may also be stored in the internal memory.
  • the processor 1901 may execute a speech synthesis method.
  • the display screen 1905 of the computer device may be a liquid crystal display or an electronic ink display screen.
  • the input device 1904 of the computer device may be a touch layer covering the display screen, or may be a button, a trackball, or a touchpad provided on the computer device casing, or an external keyboard, trackpad, or mouse.
  • FIG. 19 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied.
  • the computer device may include more or fewer components than shown in the figure, or combine some components, or have a different component arrangement.
  • the speech synthesis apparatus provided in this application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 19.
  • the memory 1902 of the computer device may store the program modules constituting the speech synthesis apparatus, for example, the linguistic data acquisition module 1502, the linguistic data encoding module 1504, the embedding vector acquisition module 1506, and the linguistic encoded data decoding module 1508 shown in FIG. 15.
  • the computer program constituted by each program module causes the processor 1901 to execute the steps in the speech synthesis method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 19 may execute S202 through the linguistic data acquisition module 1502 in the speech synthesis device shown in FIG. 15.
  • the computer device may execute S204 through the linguistic data encoding module 1504.
  • the computer device may execute S206 through the embedded vector acquisition module 1506.
  • the computer device may execute S208 through the linguistically encoded data decoding module 1508.
  • a computer device which includes a memory and a processor.
  • the memory stores a computer program.
  • when the computer program is executed by the processor, the processor is caused to perform the following steps: obtaining linguistic data to be processed; encoding the linguistic data to obtain linguistic encoded data; obtaining an embedding vector for speech feature conversion, the embedding vector being generated according to a residual between reference synthesized speech data and reference speech data corresponding to the same reference linguistic data; and decoding the linguistic encoded data according to the embedding vector to obtain target synthesized speech data after speech feature conversion.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: obtaining reference linguistic data and corresponding reference speech data; encoding the reference linguistic data to obtain reference linguistic encoded data; decoding the reference linguistic encoded data to obtain the reference synthesized speech data; and determining, according to the residual between the reference speech data and the reference synthesized speech data, the embedding vector for speech feature conversion.
  • when the computer program is executed by the processor to perform the step of determining the embedding vector for speech feature conversion according to the residual between the reference speech data and the reference synthesized speech data, the processor is caused to perform the following steps: determining the residual between the reference speech data and the reference synthesized speech data; processing the residual through a residual model; and generating, according to the result of the forward operation and the result of the backward operation in the residual model, the embedding vector for speech feature conversion.
  • when the computer program is executed by the processor to perform the step of generating the embedding vector for speech feature conversion according to the result of the forward operation and the result of the backward operation in the residual model, the processor is caused to perform the following steps: obtaining a first vector output at the last time step when the forward gated recurrent unit layer in the residual model performs a forward operation; obtaining a second vector output at the first time step when the backward gated recurrent unit layer in the residual model performs a backward operation; and superimposing the first vector and the second vector to obtain the embedding vector for speech feature conversion.
  • when the computer program is executed by the processor to perform the step of processing the residual through the residual model, the processor is caused to perform the following step: processing the residual through the fully connected layer, the forward gated recurrent unit layer, and the backward gated recurrent unit layer in the residual model.
  • the linguistic encoded data is obtained through encoding by a first encoder, and the target synthesized speech data is obtained through decoding by a first decoder; when the computer program is executed by the processor, the processor is further caused to perform the following steps: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through the first encoder to obtain first training linguistic encoded data; obtaining a training embedding vector for speech feature conversion, the training embedding vector being generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data; decoding, through the first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • the linguistic encoded data is obtained through encoding by a first encoder; the target synthesized speech data is obtained through decoding by a first decoder; the reference linguistic encoded data is obtained through encoding by a second encoder; the reference synthesized speech data is obtained through decoding by a second decoder; and the embedding vector is obtained through a residual model.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through the second encoder to obtain second training linguistic encoded data; decoding the second training linguistic encoded data through the second decoder to obtain training synthesized speech data; generating, through the residual model and according to the residual between the training synthesized speech data and the training speech data, a training embedding vector; decoding, through the first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • when the computer program is executed by the processor to perform the step of decoding the linguistic encoded data according to the embedding vector to obtain the target synthesized speech data after speech feature conversion, the processor is caused to perform the following steps: splicing the linguistic encoded data and the embedding vector to obtain a spliced vector; and decoding the spliced vector to obtain the target synthesized speech data after speech feature conversion.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: determining a speech amplitude spectrum corresponding to the target synthesized speech data; converting the speech amplitude spectrum into a time-domain speech waveform signal; and generating speech according to the speech waveform.
  • a computer-readable storage medium storing a computer program.
  • when the computer program is executed by the processor, the processor is caused to perform the following steps: obtaining linguistic data to be processed; encoding the linguistic data to obtain linguistic encoded data; obtaining an embedding vector for speech feature conversion, the embedding vector being generated according to a residual between reference synthesized speech data and reference speech data corresponding to the same reference linguistic data; and decoding the linguistic encoded data according to the embedding vector to obtain target synthesized speech data after speech feature conversion.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: obtaining reference linguistic data and corresponding reference speech data; encoding the reference linguistic data to obtain reference linguistic encoded data; decoding the reference linguistic encoded data to obtain the reference synthesized speech data; and determining, according to the residual between the reference speech data and the reference synthesized speech data, the embedding vector for speech feature conversion.
  • when the computer program is executed by the processor to perform the step of determining the embedding vector for speech feature conversion according to the residual between the reference speech data and the reference synthesized speech data, the processor is caused to perform the following steps: determining the residual between the reference speech data and the reference synthesized speech data; processing the residual through a residual model; and generating, according to the result of the forward operation and the result of the backward operation in the residual model, the embedding vector for speech feature conversion.
  • when the computer program is executed by the processor to perform the step of generating the embedding vector for speech feature conversion according to the result of the forward operation and the result of the backward operation in the residual model, the processor is caused to perform the following steps: obtaining a first vector output at the last time step when the forward gated recurrent unit layer in the residual model performs a forward operation; obtaining a second vector output at the first time step when the backward gated recurrent unit layer in the residual model performs a backward operation; and superimposing the first vector and the second vector to obtain the embedding vector for speech feature conversion.
  • when the computer program is executed by the processor to perform the step of processing the residual through the residual model, the processor is caused to perform the following step: processing the residual through the fully connected layer, the forward gated recurrent unit layer, and the backward gated recurrent unit layer in the residual model.
  • the linguistic encoded data is obtained through encoding by a first encoder, and the target synthesized speech data is obtained through decoding by a first decoder; when the computer program is executed by the processor, the processor is further caused to perform the following steps: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through the first encoder to obtain first training linguistic encoded data; obtaining a training embedding vector for speech feature conversion, the training embedding vector being generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data; decoding, through the first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • the linguistic encoded data is obtained through encoding by a first encoder; the target synthesized speech data is obtained through decoding by a first decoder; the reference linguistic encoded data is obtained through encoding by a second encoder; the reference synthesized speech data is obtained through decoding by a second decoder; and the embedding vector is obtained through a residual model.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through the second encoder to obtain second training linguistic encoded data; decoding the second training linguistic encoded data through the second decoder to obtain training synthesized speech data; generating, through the residual model and according to the residual between the training synthesized speech data and the training speech data, a training embedding vector; decoding, through the first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • when the computer program is executed by the processor to perform the step of decoding the linguistic encoded data according to the embedding vector to obtain the target synthesized speech data after speech feature conversion, the processor is caused to perform the following steps: splicing the linguistic encoded data and the embedding vector to obtain a spliced vector; and decoding the spliced vector to obtain the target synthesized speech data after speech feature conversion.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: determining a speech amplitude spectrum corresponding to the target synthesized speech data; converting the speech amplitude spectrum into a time-domain speech waveform signal; and generating speech according to the speech waveform.
  • FIG. 20 shows an internal structure diagram of a computer device in one embodiment.
  • the computer device may be a terminal running the model training system in FIG. 1.
  • the computer device includes the computer device including a processor 2001, a memory 2002, a network interface 2003, an input device 2004, and a display screen 2005 connected through a system bus.
  • the memory 2002 includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and a computer program.
  • the processor 2001 can implement a model training method.
  • a computer program can also be stored in the internal memory.
  • the processor 2001 can execute the model training method.
  • the display screen 2005 of the computer device may be a liquid crystal display or an electronic ink display screen.
  • the input device 2004 of the computer device may be a touch layer covering the display screen, or a button, a trackball, or a touchpad provided on the computer device housing, or an external keyboard, trackpad, or mouse.
  • FIG. 20 is only a block diagram of a part of the structure related to the scheme of the present application, and does not constitute a limitation on the computer equipment to which the scheme of the present application is applied.
  • the computer device may include more or fewer components than shown in the figure, or combine some components, or have a different component arrangement.
  • the model training apparatus provided in this application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 20.
  • the memory 2002 of the computer device may store the program modules constituting the model training apparatus, for example, the speech data acquisition module 1702, the linguistic data encoding module 1704, the embedding vector acquisition module 1706, the linguistic encoded data decoding module 1708, and the adjustment module 1710 shown in FIG. 17.
  • the computer program constituted by each program module causes the processor 2001 to execute the steps in the model training method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 20 may execute S1302 through the voice data acquisition module 1702 in the model training apparatus shown in FIG. 17.
  • the computer device may execute S1304 through the linguistic data encoding module 1704.
  • the computer device may execute S1306 through the embedded vector acquisition module 1706.
  • the computer device may execute S1308 through the linguistically encoded data decoding module 1708.
  • the computer device may execute S1310 through the adjustment module 1710.
  • a computer device which includes a memory and a processor.
  • the memory stores a computer program.
  • when the computer program is executed by the processor, the processor is caused to perform the following steps: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through a first encoder to obtain first training linguistic encoded data; obtaining a training embedding vector for speech feature conversion, the training embedding vector being generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data; decoding, through a first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: encoding the training linguistic data through a second encoder to obtain second training linguistic encoded data; decoding the second training linguistic encoded data through a second decoder to obtain the training synthesized speech data; and generating, through a residual model and according to the residual between the training synthesized speech data and the training speech data, the training embedding vector. When the computer program is executed by the processor to perform the step of adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data and continuing training until the training stop condition is satisfied, the processor is caused to perform the following steps: adjusting the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • a computer-readable storage medium storing a computer program.
  • when the computer program is executed by the processor, the processor is caused to perform the following steps: obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through a first encoder to obtain first training linguistic encoded data; obtaining a training embedding vector for speech feature conversion, the training embedding vector being generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data; decoding, through a first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • when the computer program is executed by the processor, the processor is further caused to perform the following steps: encoding the training linguistic data through a second encoder to obtain second training linguistic encoded data; decoding the second training linguistic encoded data through a second decoder to obtain the training synthesized speech data; and generating, through a residual model and according to the residual between the training synthesized speech data and the training speech data, the training embedding vector. When the computer program is executed by the processor to perform the step of adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data and continuing training until the training stop condition is satisfied, the processor is caused to perform the following steps: adjusting the second encoder, the second decoder, the residual model, the first encoder, and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until the training stop condition is satisfied.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

This application relates to a speech synthesis method, a model training method, an apparatus, and a computer device. The method includes: obtaining linguistic data to be processed; encoding the linguistic data to obtain linguistic encoded data; obtaining an embedding vector for speech feature conversion, the embedding vector being generated according to a residual between reference synthesized speech data and reference speech data corresponding to the same reference linguistic data; and decoding the linguistic encoded data according to the embedding vector to obtain target synthesized speech data after speech feature conversion. The solution provided in this application can avoid the problem that semantic features in a log Mel spectrum affect the quality of synthesized speech.

Description

Speech synthesis method, model training method, apparatus, and computer device
This application claims priority to Chinese Patent Application No. 201810828220.1, filed on July 25, 2018 and entitled "Speech synthesis method, model training method, apparatus, and computer device", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the technical field of speech synthesis, and in particular, to a speech synthesis method, a model training method, an apparatus, and a computer device.
Background
With the continuous development of speech synthesis technology and computer technology, application scenarios of voice interaction have become increasingly widespread, and users can conveniently obtain various speech-related services through digital products, for example, performing voice navigation through an electronic map on a mobile phone, or listening to audiobooks through reading software.
For synthesized speech, the user experience is undoubtedly improved if the speech has human speech characteristics. To make synthesized speech carry human speech characteristics, a common approach is to use the log Mel spectrum obtained by processing speech data as the input variable of a feature model to obtain the speech features of a speaker, and then an end-to-end model (Tacotron) synthesizes speech data according to the obtained speech features and corresponding text features, so that the synthesized speech data has the speech features of the speaker. However, in the above solution, because the log Mel spectrum contains both the speaker's speech features and semantic features, the extraction of speech features from the log Mel spectrum is affected, which in turn affects the quality of the synthesized speech.
Summary
This application provides a speech synthesis method, a model training method, an apparatus, and a computer device.
A speech synthesis method includes:
obtaining linguistic data to be processed; encoding the linguistic data to obtain linguistic encoded data; obtaining an embedding vector for speech feature conversion, the embedding vector being generated according to a residual between reference synthesized speech data and reference speech data corresponding to the same reference linguistic data; and decoding the linguistic encoded data according to the embedding vector to obtain target synthesized speech data after speech feature conversion.
A speech synthesis apparatus includes:
a linguistic data acquisition module, configured to obtain linguistic data to be processed;
a linguistic data encoding module, configured to encode the linguistic data to obtain linguistic encoded data;
an embedding vector acquisition module, configured to obtain an embedding vector for speech feature conversion, the embedding vector being generated according to a residual between reference synthesized speech data and reference speech data corresponding to the same reference linguistic data; and
a linguistic encoded data decoding module, configured to decode the linguistic encoded data according to the embedding vector to obtain target synthesized speech data after speech feature conversion.
A storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to perform the steps of the speech synthesis method.
A computer device includes a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the steps of the speech synthesis method.
In the foregoing speech synthesis method, model training method, apparatus, and computer device, linguistic data to be processed is obtained and encoded, so that linguistic encoded data representing pronunciation is obtained. An embedding vector for speech feature conversion is obtained; because the embedding vector is generated from the residual between the reference synthesized speech data and the reference speech data corresponding to the same reference linguistic data, the obtained embedding vector is a style feature vector that contains no semantic features. The linguistic encoded data is decoded according to the embedding vector, which avoids the influence of semantic features on the processing of the linguistic encoded data; therefore, the obtained target synthesized speech data is of high quality, and the quality of the synthesized speech is improved.
A model training method includes:
obtaining training linguistic data and corresponding training speech data; encoding the training linguistic data through a first encoder to obtain first training linguistic encoded data; obtaining a training embedding vector for speech feature conversion, the training embedding vector being generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data; decoding, through a first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and adjusting the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continuing training until a training stop condition is satisfied.
A model training apparatus includes:
a training speech data acquisition module, configured to obtain training linguistic data and corresponding training speech data;
a training linguistic data encoding module, configured to encode the training linguistic data through a first encoder to obtain first training linguistic encoded data;
a training embedding vector acquisition module, configured to obtain a training embedding vector for speech feature conversion, the training embedding vector being generated according to a residual between training synthesized speech data and training speech data corresponding to the same training linguistic data;
a training linguistic encoded data decoding module, configured to decode, through a first decoder, the first training linguistic encoded data according to the training embedding vector to obtain predicted target synthesized speech data after speech feature conversion; and
an adjustment module, configured to adjust the first encoder and the first decoder according to the difference between the predicted target synthesized speech data and the training speech data, and continue training until a training stop condition is satisfied.
A storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to perform the steps of the model training method.
A computer device includes a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to perform the steps of the model training method.
In the foregoing model training method, apparatus, storage medium, and computer device, the training linguistic data, the training speech data, and the training embedding vector are processed through the first encoder and the first decoder to obtain the predicted target synthesized speech data, and the first encoder and the first decoder are adjusted according to the difference between the predicted target synthesized speech data and the training speech data, so that the predicted target synthesized speech data continuously approaches the training speech data, thereby obtaining a trained first encoder and a trained first decoder. Because the training embedding vector generated from the residual between the training synthesized speech data and the training speech data is used in the training process, and the training embedding vector contains only speech features, the influence of semantic features on the training model does not need to be considered, which reduces the complexity of the first encoder and the first decoder and improves the accuracy of the training results.
Brief Description of the Drawings
FIG. 1 is a structural diagram of an application system of a speech synthesis method and a model training method in an embodiment;
FIG. 2 is a schematic flowchart of a speech synthesis method in an embodiment;
FIG. 3 is a schematic diagram of obtaining target synthesized speech data in a speech synthesis stage in an embodiment;
FIG. 4 is a schematic flowchart of steps of obtaining an embedding vector according to reference linguistic data and reference speech data in an embodiment;
FIG. 5 is a schematic diagram of data flow in the process of obtaining an embedding vector in an embodiment;
FIG. 6 is a schematic flowchart of steps of obtaining an embedding vector through a residual model in an embodiment;
FIG. 7 is a schematic diagram of the structure of a residual model and the processing of a residual in the residual model in an embodiment;
FIG. 8 is a schematic diagram of obtaining an embedding vector in an adaptive stage in an embodiment;
FIG. 9 is a schematic flowchart of steps of training a target speech model in an embodiment;
FIG. 10 is a schematic diagram of data flow when training a target speech model in a model training stage in an embodiment;
FIG. 11 is a schematic flowchart of steps of training an average speech model, a residual model, and a target speech model in an embodiment;
FIG. 12 is a schematic diagram of data flow when training an average speech model, a residual model, and a target speech model in a model training stage in an embodiment;
FIG. 13 is a schematic flowchart of steps of training a target speech model in an embodiment;
FIG. 14 is a schematic flowchart of steps of training an average speech model, a residual model, and a target speech model in an embodiment;
FIG. 15 is a structural block diagram of a speech synthesis apparatus in an embodiment;
FIG. 16 is a structural block diagram of a speech synthesis apparatus in another embodiment;
FIG. 17 is a structural block diagram of a model training apparatus in an embodiment;
FIG. 18 is a structural block diagram of a model training apparatus in another embodiment;
FIG. 19 is a structural block diagram of a computer device in an embodiment;
FIG. 20 is a structural block diagram of a computer device in another embodiment.
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
图1为一个实施例中语音合成方法和模型训练方法的应用环境图。参照图1,该语音合成方法、模型训练方法应用于语音合成系统。该语音合成系统包括第一编码器、第一解码器、第二编码器、第二解码器、叠加器、残差模型和投影层等。语音合成系统中的各部分组成元素之间的内在关系及信号流向如图1所示。其中,第一编码器和第一解码器构成目标语音模型,在应用阶段用于合成语音。第二编码器和第二解码器构成平均语音模型,所构成的平均语音模型与叠加器、残差模型和投影层组合使用,在自适应阶段可用于获得用于表征风格特征的嵌入向量。该语音合成系统可以以应用程序或应用程序的组成部分运行在计算机设备上。该计算机设备可以为终端或服务器。终端可以是台式终端、移动终端、智能机器人。其中, 移动终端可以是智能手机、平板电脑、笔记本电脑或可穿戴式设备等。
如图2所示,在一个实施例中,提供了一种语音合成方法。本实施例主要以该方法应用于上述图1中运行语音合成系统的终端来举例说明。参照图2,该语音合成方法可以包括如下步骤:
S202,获取待处理的语言学数据。
其中,语言学数据可以是文本或文本的特征或特征项。文本的特征可以是文本中的字、发音、字或词的韵律和重音等特征。特征项可以是字、词或短语等。特征项需要具备以下特性:能够确实标识文本内容,具有将目标文本与其他文本相区分的能力,特征项分离易实现。
在一个实施例中,在应用过程中,终端接收用户发出的语音交互信号,从预设的语言学库中查找与语音交互信号对应的语言学数据。例如,用户在与终端进行语音交互过程中,若终端接收到用户发出“西施与貂蝉谁更漂亮”的语音交互信号时,终端从预设的语言学库中查找与该语音交互信号对应的“西施与貂蝉都一样漂亮”的语言学数据。在该实例中,语言学数据为文本。
S204,对语言学数据编码,得到语言学编码数据。
在一个实施例中,终端通过第一编码器对语言学数据编码,得到语言学编码数据。例如,终端获取一段文本,通过第一编码器对文本进行编码,获得分布式表示,该分布式表示即为语言学编码数据。其中,该分布式表示可以是特征向量。一个特征向量与文本中的一个字或词相对应。
其中,第一编码器可以是语言学数据编码器或基于注意力的递归生成器。第一编码器可以由RNN(Recurrent Neural Network,递归神经网络),或LSTM(Long Short-Term Memory,长短期记忆网络),或闸控卷积神经网络,或时延网络所构成。
示例地,终端将表征语言学数据的向量输入第一编码器,将第一编码器最后一个单元状态作为输出,得到语言学编码数据。
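作为示意,下面给出一段假设性的 Python(PyTorch)代码,展示"由 LSTM 构成的第一编码器、取最后一个单元状态作为语言学编码数据"的一种可能写法;其中输入维度、隐层维度等均为示例值,并非本申请实施例的限定实现:

```python
# 假设性示意:基于 LSTM 的第一编码器,输入为表征语言学数据的向量序列,
# 取最后一个单元状态作为语言学编码数据(各维度均为示例值)。
import torch
import torch.nn as nn

class LinguisticEncoder(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: [batch, seq_len, input_dim],每个时间步对应文本中一个字或词的特征向量
        outputs, (h_n, c_n) = self.lstm(x)
        # 将最后一个单元状态作为输出,得到语言学编码数据
        return outputs, c_n[-1]

encoder = LinguisticEncoder()
dummy = torch.randn(2, 10, 128)        # 假设批大小为 2、序列长度为 10
outputs, encoded = encoder(dummy)      # encoded 形状为 [2, 256]
```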
S206,获取用于语音特征转换的嵌入向量;嵌入向量是根据对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成。
其中,嵌入向量可以是具有参考对象说话风格特征的向量,而参考对象可以是说话具有特殊风格的人。风格特征包括但不限于:与时长和韵律起伏相关性高的韵律时长特征、基频特征和能量特征。韵律时长特征包括一个字或词的时长、停顿和重音等特征。终端将该嵌入向量与对应的语言学编码数据进行融合和处理,将得到具有参考对象说话风格的合成语音数据。当合成语音数据经过处理后通过扬声器播放出来,播放出来的合成语音将不再是机械化的语音,而是具有人的说话风格。
在一个实施例中,当用户在与终端进行语音交互之前,终端获取参考语言学数据和具有风格特征的参考语音数据,其中,参考语音数据的来源可以是与终端进行语音交互的用户,也可以是指定的参考用户。终端对参考语言学数据进行语音合成,得到不具有风格特征的参考合成语音数据。终端将参考合成语音数据与参考语音数据进行作差处理,得到表征风格特征的残差。终端对残差进行处理得到表征风格特征的嵌入向量。终端将得到的嵌入向量保存于风格特征向量库中。其中,风格特征向量库可以保存多个参考对象对应的嵌入向量。该表征风格特征的残差实质上可以是残差序列。
在一个实施例中,终端对残差进行处理得到表征风格特征的嵌入向量的步骤,可以包括:将残差输入残差模型的多个全连接层,将全连接层输出的结果分别输入前向门循环单元层和后向门循环单元层,将前向门循环单元层最后一个时间步的输出与后向门循环单元层第一个时间步的输出相加,得到用于语音特征转换的、能表征风格特征的嵌入向量。
例如,若用户在与终端进行语音交互时想要听到张曼玉的说话风格,那么在与终端进行语音交互之前,终端获取张曼玉的语音数据作为参考语音数据,并获取对应的语言学数据(例如说话的文字内容,文字内容如“西施与貂蝉谁更漂亮”),其中,获取的参考语音数据具有张曼玉的说话风格。终端对语言学数据进行语音合成,得到不具有张曼玉说话风格的参考合成语音数据。终端将具有张曼玉说话风格的参考语音数据与不具有说话风格的参考合成语音数据作差,得到表征风格特征的残差。终端对得到的残差进行处理,获得能够表征张曼玉说话风格的嵌入向量。
在一个实施例中,终端将得到的用于语音特征转换的、能表征风格特征的嵌入向量,保存于嵌入向量库中。当终端接收到风格特征选择指令时,展示与嵌入向量对应的风格选择界面。
在一个实施例中,终端接收指定的风格特征指令,从风格特征向量库中获取与风格特征指令对应的嵌入向量。例如,用户想要听到某个电影或体育明星的声音,那么,用户在终端的风格选择界面中的各参考对象中选择目标的电影或体育明星,此时终端接收到对于该电影或体育明星的风格特征指令,根据风格特征指令选择表征该电影或体育明星说话风格的嵌入向量。
S208,根据嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据。
在一个实施例中,终端通过第一解码器,按照嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的、具有参考对象说话风格的目标合成语音数据。或者,终端将嵌入向量与语言学编码数据进行组合,对组合后的结果进行解码,获得经过语音特征转换的、具有参考对象说话风格的目标合成语音数据。
其中,第一解码器可以是语音数据解码器或基于注意力的递归生成器。第一解码器可以由RNN,或LSTM,或CNN(Convolutional Neural Network,卷积神经网络),或闸控卷积神经网络,或时延网络所构成。
作为一个示例,如图3所示,当接收到用户发出的语音交互信号时,终端获取与语音交互信号对应的语言学数据,该语言学数据例如是“西施与貂蝉谁更漂亮”。终端将获取的语言学数据输入第一编码器,通过第一编码器的编码处理,得到语言学编码数据。终端获取可以表征参考对象(如张曼玉)说话风格的嵌入向量,通过第一解码器对嵌入向量和语言学编码数据进行处理,得到具有参考对象说话风格的目标合成语音数据。
上述实施例中,获取待处理的语言学数据,对语言学数据进行编码,便可得到表征发音的语言学编码数据。获取用于语音特征转换的嵌入向量,由于嵌入向量是对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成,因而所得到的嵌入向量为不包含语义特征的风格特征向量。根据嵌入向量对语言学编码数据进行解码,避免了语义特征对语言学编码数据处理的影响,因此所获得的目标合成语音数据的质量高,从而提高了合成语音的质量。
在一个实施例中,如图4所示,该方法还可以包括:
S402,获取参考语言学数据和相应的参考语音数据。
其中,参考语音数据可以是采自于参考对象的语音数据。参考语言学数据与参考语音数据相对应。参考对象可以是与终端进行语音交互的用户,也可以是指定的参考用户。对应的,参考语音数据可以是参考对象发出的语音信号,而参考语言学数据可以是语音信号中所要表达的文字内容。
例如,若用户在与终端进行语音交互时想要听到用户本人的说话风格,那么在与终端进行语音交互之前,获取用户本人的语音数据作为参考语音数据,并获取对应的语言学数据,其中,获取的参考语音数据具有用户本人的说话风格。终端对语言学数据进行语音合成,得到不具有用户本人说话风格的参考合成语音数据。终端将具有用户本人说话风格的参考语音数据与不具有说话风格的参考合成语音数据作差,得到表征风格特征的残差。终端对得到的残差进行处理,获得能够表征用户本人说话风格的嵌入向量。
在一个实施例中,终端采集参考对象的语音,将采集的语音进行分帧、加窗和傅里叶变换,得到具有参考对象说话风格特征的、且为频域的语音数据。
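下面的示意代码(假设使用 librosa,采样率、帧长等参数均为示例值)给出分帧、加窗和傅里叶变换得到频域语音数据的一种可能实现:

```python
# 假设性示意:librosa.stft 内部完成分帧与加窗(默认汉宁窗),此处取幅度谱作为频域语音数据。
import numpy as np
import librosa

def speech_to_spectrum(wav_path, sr=16000, n_fft=1024, hop_length=256):
    y, _ = librosa.load(wav_path, sr=sr)                        # 读取并重采样参考语音
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)  # 分帧、加窗、傅里叶变换
    return np.abs(spec)                                         # [n_fft//2+1, 帧数] 的幅度谱
```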
S404,对参考语言学数据编码,得到参考语言学编码数据。
在一个实施例中,终端通过第二编码器对参考语言学数据编码,得到参考语言学编码数据。例如,终端获取一段文本,通过第二编码器对该段文本进行编码,获得分布式表示,该分布式表示即为参考语言学编码数据。其中,该分布式表示可以是特征向量。一个特征向量与文本中的一个字或词相对应。
其中,第二编码器可以是语言学数据编码器或基于注意力的递归生成器。第二编码器可以由RNN,或LSTM,或闸控卷积神经网络,或时延网络所构成。示例地,终端将表征语言学数据的向量输入第二编码器,将第二编码器最后一个单元状态作为输出,得到语言学编码数据。
S406,解码参考语言学编码数据,得到参考合成语音数据。
在一个实施例中,终端通过第二解码器对参考语言学数据进行解码,得到不具有风格特征的参考合成语音数据。
其中,第二解码器可以是语音数据解码器或基于注意力的递归生成器。第二解码器可以由RNN,或LSTM,或CNN(Convolutional Neural Network,卷积神经网络),或闸控卷积神经网络,或时延网络所构成。
其中,S404和S406为合成不具有风格特征的参考合成语音数据的步骤。作为一个示例,如图5所示。终端获得参考语言学数据后,将获得的参考语言学数据输入第二编码器中,通过第二编码器对参考语言学数据进行处理,得到表示参考语言学数据的上下文的表示C。其中,上下文的表示C可以是概括了输入序列X={x(1),x(2)...x(n)}的向量,其中n为大于1的整数。终端将上下文的表示C输入第二解码器,以固定长度的向量作为条件,产生输出序列Y={y(1),y(2)...y(n)},进而得到参考合成语音数据。需要说明的是,上述方法步骤只是用于理解如何得到参考合成语音数据,不作为本申请实施例的限定。
S408,根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量。
在一个实施例中,终端对参考语音数据和参考合成语音数据进行作差,得到表征风格特征的残差。终端对所得的具有风格特征的残差进行处理,得到用于语音特征转换的、且用于表征风格特征的嵌入向量。
上述实施例中,根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量,从而得到用于对语言学数据进行语音合成时进行风格控制的嵌入向量,以使合成的目标合成语音数据具有特定的风格特征,提高合成语音的质量。
在一个实施例中,如图6所示,S408可以包括:
S602,确定参考语音数据和参考合成语音数据间的残差。
在一个实施例中,终端对参考语音数据和参考合成语音数据进行作差,得到表征风格特征的残差。
S604,通过残差模型处理残差。
其中,残差模型可以由RNN所构建。残差模型可以包括4层:从下至上分别为两个全连接(Dense)层、一个前向GRU(Gated Recurrent Unit,门循环单元)层和一个后向GRU层。其中,每个Dense层包含128个以激活函数(如ReLU函数)激发的单元,丢失(Dropout)率为0.5,每个门循环单元层包含了32个记忆模块。
在一个实施例中,S604可以包括:将残差输入至残差模型,并通过残差模型中的全连接层、前向门循环单元层和后向门循环单元层对残差进行处理。
S606,根据残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量。
其中,该嵌入向量也可以称为自适应嵌入向量。嵌入向量所具有的风格特征与参考语音数据相关。例如,假设参考语音数据是通过采集张曼玉的语音所得,则该嵌入向量所具有的风格特征与张曼玉的说话风格特征一致。又例如,假设参考语音数据是通过采集用户本人的语音所得,则该嵌入向量所具有的风格特征与用户本人的说话风格特征一致。
在一个实施例中,终端通过残差模型中前向门循环单元层对残差进行前向运算,得到前向运算的结果。终端通过残差模型中后向门循环单元层对残差进行后向运算,得到后向运算的结果。
在一个实施例中,S606可以包括:获取残差模型中前向门循环单元层进行前向运算时在最后一个时间步输出的第一向量;获取残差模型中后向门循环单元层进行后向运算时在第一个时间步输出的第二向量;将第一向量和第二向量叠加,获得用于语音特征转换的嵌入向量。
作为一个示例,如图7所示,假设所得到的残差为R={r(1),r(2),…,r(t)},其中,t为大于1的整数。将所得到的残差R={r(1),r(2),…,r(t)}依次输入Dense层和GRU层。最后,将前向GRU层最后一个时间步的隐层状态与后向GRU层第一个时间步的隐层状态相加,得到用于表征风格特征的嵌入向量e。
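结合图7的描述,残差模型的结构可以用如下假设性的 PyTorch 代码示意(全连接层单元数 128、Dropout 率 0.5、门循环单元层 32 个记忆模块与上文一致;残差特征维度 80 与帧数 50 为示例值):

```python
# 假设性示意:两个含 128 个 ReLU 单元的全连接层(Dropout 率 0.5)、
# 一个前向 GRU 层与一个后向 GRU 层(各 32 个记忆模块);
# 将前向 GRU 最后一个时间步的隐层状态与后向 GRU 第一个时间步的隐层状态相加,得到嵌入向量 e。
import torch
import torch.nn as nn

class ResidualModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, gru_units=32):
        super().__init__()
        self.dense = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.5),
        )
        self.fwd_gru = nn.GRU(hidden, gru_units, batch_first=True)
        self.bwd_gru = nn.GRU(hidden, gru_units, batch_first=True)

    def forward(self, residual):
        # residual: [batch, t, feat_dim],即残差序列 R={r(1),...,r(t)}
        h = self.dense(residual)
        fwd_out, _ = self.fwd_gru(h)                        # 前向运算
        bwd_out, _ = self.bwd_gru(torch.flip(h, dims=[1]))  # 后向运算(时间轴翻转)
        first_vec = fwd_out[:, -1, :]    # 前向 GRU 最后一个时间步输出的第一向量
        second_vec = bwd_out[:, -1, :]   # 对应原序列第一个时间步的后向输出,即第二向量
        return first_vec + second_vec    # 叠加得到嵌入向量 e

residual = torch.randn(1, 50, 80)        # 假设残差为 50 帧、80 维
e = ResidualModel()(residual)            # e 形状为 [1, 32]
```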
S402-S408以及S602-S606为获取嵌入向量的步骤,作为一个示例,如图8所示,可以通过如下方法获得嵌入向量:终端获取参考语言学数据和具有风格特征(如张曼玉说话的风格特征)的参考语音数据,其中,该语言学数据例如是"西施与貂蝉谁更漂亮"。终端将获取的语言学数据输入第二编码器,通过第二编码器的编码处理,得到参考语言学编码数据。然后,终端对参考语言学编码数据进行解码获得参考合成语音数据,将参考合成语音数据与参考语音数据进行作差,得到表征风格特征的残差。终端通过残差模型对残差进行处理,得到可以表征说话风格的嵌入向量。
上述实施例中,通过残差模型处理参考语音数据和参考合成语音数据之间的残差,获得用于语音特征转换的嵌入向量,使得嵌入向量具有与参考语音数据相同的风格特征,具有自适应的效果。此外,得到用于对语言学数据进行语音合成时进行风格控制的嵌入向量,以使合成的目标合成语音数据具有特定的风格特征,提高合成语音的质量。
在一个实施例中,如图9所示,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;该方法还包括:
S902,获取训练语言学数据和相应的训练语音数据。
其中,语言学数据可以是文本或文本的特征或特征项。训练语言学数据指的是在训练阶段所采用的语言学数据,用于对第一编码器和第一解码器进行训练。
在一个实施例中,在训练过程中,终端获取训练语言学数据和具有风格特征的训练语音数据。例如,在训练过程中,开发人员输入用于训练的训练语言学数据和具有风格特征的训练语音数据。其中,训练语言学数据可以是“我喜欢吃饭睡觉打豆豆”。其中,当训练“我喜欢吃饭睡觉打豆豆”这个语言学数据后,若用户在与终端进行语音交互时发出“小机器人,你平时喜欢干嘛呀?”的语音交互信号时,终端则输出“我喜欢吃饭睡觉打豆豆”作为回应。
S904,通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据。
在一个实施例中,终端通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据。例如,终端获取一段训练文本,通过第一编码器对训练文本进行编码,获得分布式表示,该分布式表示即为第一训练语言学编码数据。
S906,获取用于语音特征转换的训练嵌入向量;训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成。
其中,训练嵌入向量指的是用于训练第一编码器和第一解码器的向量。终端将该训练嵌入向量与对应的第一训练语言学编码数据进行融合和处理,将得到具有参考对象说话风格的训练合成语音数据。当训练合成语音数据经过处理后通过扬声器播放出来,播放出来的合成语音将不再是机械化的语音,而是具有人的说话风格。
在一个实施例中,当用户在与终端进行语音交互之前,终端获取训练语言学数据和具有风格特征的训练语音数据,其中,训练语音数据的来源可以由开发人员选取,可以是由开发人员自己的语音所得,也可以是由其它具有特定说话风格的语音所得。终端对训练语言学数据进行语音合成,得到不具有风格特征的训练合成语音数据。终端将训练合成语音数据与训练语音数据进行作差处理,得到表征风格特征的残差。终端对残差进行处理得到表征风格特征的训练嵌入向量。终端将得到的训练嵌入向量保存于风格特征向量库中。
在一个实施例中,终端对残差进行处理得到表征风格特征的训练嵌入向量的步骤,可以包括:通过残差模型的多个全连接层处理残差,将全连接层输出的结果分别输入前向门循环单元层和后向门循环单元层,将前向门循环单元层最后一个时间步的输出与后向门循环单元层第一个时间步的输出相加,得到用于语音特征转换的、能表征风格特征的训练嵌入向量。
例如,若开发人员想以张曼玉的语音数据作为训练语音数据,则获取张曼玉的语音进行处理得到训练语音数据,并获取对应的语言学数据(例如说话的文字内容,文字内容如"我喜欢吃饭睡觉打豆豆"),其中,获取的训练语音数据具有张曼玉的说话风格。终端对语言学数据进行语音合成,得到不具有说话风格的训练合成语音数据。终端将具有张曼玉说话风格的训练语音数据与不具有说话风格的训练合成语音数据作差,得到表征风格特征的残差。终端对得到的残差进行处理,获得能够表征张曼玉说话风格的训练嵌入向量。
在一个实施例中,终端接收指定的风格特征选择指令,从风格特征向量库中获取与风格特征指令对应的训练嵌入向量。例如,开发人员在终端的风格选择界面中的各参考对象中选择目标的电影或体育明星,此时终端接收到对于该电影或体育明星的风格特征指令,根据风格特征指令选择表征该电影或体育明星说话风格的训练嵌入向量。
S908,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。
在一个实施例中,终端通过第一解码器,按照训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。或者,终端将训练嵌入向量与第一训练语言学编码数据进行组合,对组合后的结果进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。
S910,根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,终端根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器中的参数,并继续训练,直至预测目标合成语音数据对应的语音风格与训练语音数据对应的语音风格一致,则停止训练。
S902-S910为训练第一编码器和第一解码器的步骤,作为一个示例,如图10所示,可以通过如下方法训练第一编码器和第一解码器:获取训练语言学数据和具有风格特征(如张曼玉或开发者本人说话的风格特征)的训练语音数据,通过第一编码器对训练语言学数据编码得到第一训练语言学编码数据;获取用于表征风格特征的训练嵌入向量,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
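作为理解上述训练过程的示意,下面给出一个极简的单步训练函数(first_encoder、first_decoder、optimizer 均假设为已构建好的对象,损失以 L1 距离为例,实际使用的损失函数与优化方式不受此限):

```python
# 假设性示意:单步训练,按预测目标合成语音数据与训练语音数据间的差异调整第一编码器与第一解码器。
import torch
import torch.nn.functional as F

def train_step(first_encoder, first_decoder, optimizer,
               train_linguistic, train_embedding, train_speech):
    optimizer.zero_grad()
    _, encoded = first_encoder(train_linguistic)          # 第一训练语言学编码数据
    predicted = first_decoder(encoded, train_embedding)   # 预测目标合成语音数据
    loss = F.l1_loss(predicted, train_speech)             # 与训练语音数据间的差异
    loss.backward()
    optimizer.step()                                      # 调整第一编码器和第一解码器的参数
    return loss.item()                                    # 可据此判断是否满足训练停止条件
```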
上述实施例中,通过第一编码器和第一解码器处理训练语言学数据、训练语音数据和训练嵌入向量,得到预测目标合成语音数据。根据预测目标合成语音数据和训练语音数据间的差异调整第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第一编码器和第一解码器。由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;参考语言学编码数据通过第二编码器进行编码得到;参考合成语音数据通过第二解码器进行解码得到;嵌入向量通过残差模型得到。如图11所示,该方法还可以包括:
S1102,获取训练语言学数据和相应的训练语音数据。
其中,训练语言学数据指的是在训练阶段所采用的语言学数据,用于对第一编码器和第一解码器进行训练。
在一个实施例中,在训练过程中,终端获取训练语言学数据和具有风格特征的训练语音数据。例如,在训练过程中,开发人员输入用于训练的训练语言学数据和具有风格特征的训练语音数据。其中,训练语言学数据可以是“我喜欢吃饭睡觉打豆豆”。
S1104,通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据。
在一个实施例中,终端通过第二编码器对训练语言学数据编码,得到第二训练语言学编码数据。例如,终端获取一段文本,通过第二编码器对该段文本进行编码,获得分布式表示,该分布式表示即为第二训练语言学编码数据。其中,该分布式表示可以是特征向量。一个特征向量与文本中的一个字或词相对应。
S1106,通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据。
S1108,通过残差模型,并根据训练合成语音数据和训练语音数据之间的残差生成训练嵌入向量。
在一个实施例中,终端通过残差模型,对训练合成语音数据和训练语音数据进行作差,得到表征风格特征的残差。终端对所得的具有风格特征的残差进行处理,得到用于语音特征转换的、且用于表征风格特征的训练嵌入向量。
对于获得训练嵌入向量的详细过程,可参考S402-S408和S602-S606,这里不再进行赘述。
S1110,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。
其中,第一训练语言学编码数据由第一编码器编码训练语言学数据所得。
在一个实施例中,终端通过第一解码器,按照训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。或者,终端将训练嵌入向量与第一训练语言学编码数据进行组合,对组合后的结果进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。
S1112,根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,终端根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器中的参数,并继续训练,直至预测目标合成语音数据对应的语音风格与训练语音数据对应的语音风格一致,则停止训练。
S1102-S1112为训练第二编码器、第二解码器、残差模型、第一编码器和第一解码器的步骤,作为一个示例,如图12所示,可以通过如下方法训练第二编码器、第二解码器、残差模型、第一编码器和第一解码器:获取训练语言学数据和具有风格特征(如张曼玉或开发者本人说话的风格特征)的训练语音数据,通过第二编码器将训练语言学数据编码得到第二训练语言学编码数据,通过第二解码器对第二训练语言学编码数据进行解码得到训练合成语音数据。终端通过残差模型对训练合成语音数据与训练语音数据之间的残差进行处理,获得用于表征风格特征的训练嵌入向量。通过第一编码器对训练语言学数据编码得到第一训练语言学编码数据后,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获 得经过语音特征转换的预测目标合成语音数据。根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过训练语言学数据和相应的训练语音数据,对第二编码器、第二解码器、残差模型、第一编码器和第一解码器进行训练。根据预测目标合成语音数据和训练语音数据间的差异调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第二编码器、第二解码器、残差模型、第一编码器和第一解码器。
此外,由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第二编码器、第二解码器、残差模型、第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
最后,将用于获取用于表征风格特征的嵌入向量的第二编码器、第二解码器、残差模型,与用于合成语音的第一编码器和第一解码器结合在一起,降低了语音合成系统对数据的需求,提高建立语音合成系统的准确性。
在一个实施例中,S208可以包括:将语言学编码数据和嵌入向量拼接,得到拼接向量;对拼接向量进行解码,得到经过语音特征转换的目标合成语音数据。
在一个实施例中,嵌入向量包括:韵律时长特征、基频特征和能量特征。将语言学编码数据和嵌入向量拼接,得到拼接向量的步骤,可以包括:根据韵律时长特征确定与目标语音数据中韵律对应的目标时长;将音素序列与目标时长、基频特征和能量特征进行组合,获得组合特征。
上述实施例中,将语言学编码数据和嵌入向量拼接,对拼接后所得的向量进行解码,得到经过语音特征转换的目标合成语音数据。由于拼接后的向量没有语义特征,避免了语义特征对语言学编码数据处理的影响,从而提高了合成语音的质量。
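下面用一段假设性的 PyTorch 代码示意"将嵌入向量沿时间维扩展后与语言学编码数据拼接,再解码得到声学特征"的做法;其中各维度均为示例值,解码器结构仅为最简写法,并非本申请实施例的限定实现:

```python
# 假设性示意:把嵌入向量在时间维上扩展后与语言学编码数据拼接,再解码为声学特征。
import torch
import torch.nn as nn

class SimpleDecoder(nn.Module):
    def __init__(self, enc_dim=256, emb_dim=32, feat_dim=80):
        super().__init__()
        self.lstm = nn.LSTM(enc_dim + emb_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, feat_dim)

    def forward(self, enc_outputs, embedding):
        # enc_outputs: [batch, T, enc_dim] 的语言学编码数据;embedding: [batch, emb_dim] 的嵌入向量
        emb = embedding.unsqueeze(1).expand(-1, enc_outputs.size(1), -1)
        spliced = torch.cat([enc_outputs, emb], dim=-1)   # 拼接向量
        h, _ = self.lstm(spliced)
        return self.proj(h)                               # 经过语音特征转换的声学特征序列

decoder = SimpleDecoder()
out = decoder(torch.randn(2, 10, 256), torch.randn(2, 32))   # out 形状为 [2, 10, 80]
```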
在一个实施例中,该方法还可以包括:确定与目标合成语音数据对应的语音幅度谱;将语音幅度谱转换为时域的语音波形信号;根据语音波形生成语音。
在一个实施例中,目标合成语音数据可以是频域的语音数据,终端从频域的目标合成语音数据中获取对应的语音幅度谱,通过Griffin-Lim算法将语音幅度谱转换为时域的语音波形信号。终端将语音波形信号通过world声码器,转换成带有风格的合成声音。
上述实施例中,将具有语音特征的目标合成语音数据转换为语音信号,从而获得具有风格的语音,从而可以提高合成语音的质量。
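以下示意代码(假设使用 librosa 与 soundfile,参数为示例值)给出通过 Griffin-Lim 算法把语音幅度谱恢复为时域语音波形并保存的一种可能实现;文中提到的 world 声码器环节此处未示出:

```python
# 假设性示意:用 librosa 的 Griffin-Lim 实现把语音幅度谱恢复为时域波形并写成音频文件。
import librosa
import soundfile as sf

def spectrum_to_wav(magnitude, out_path="synth.wav", sr=16000, n_fft=1024, hop_length=256):
    # magnitude: [n_fft//2+1, 帧数] 的语音幅度谱
    wav = librosa.griffinlim(magnitude, hop_length=hop_length, win_length=n_fft)
    sf.write(out_path, wav, sr)      # 保存为时域语音波形信号
    return wav
```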
如图13所示,在一个实施例中,提供了一种模型训练方法。本实施例主要以该方法应用于上述图1中运行语音合成系统的终端来举例说明。参照图13,该模型训练方法可以包括如下步骤:
S1302,获取训练语言学数据和相应的训练语音数据。
其中,语言学数据可以是文本或文本的特征或特征项。训练语言学数据指的是在训练阶段所采用的语言学数据,用于对第一编码器和第一解码器进行训练。
在一个实施例中,在训练过程中,终端获取训练语言学数据和具有风格特征的训练语音数据。例如,在训练过程中,开发人员输入用于训练的训练语言学数据和具有风格特征的训练语音数据。其中,训练语言学数据可以是“我喜欢吃饭睡觉打豆豆”。其中,当训练“我喜欢吃饭睡觉打豆豆”这个语言学数据后,若用户在与终端进行语音交互时发出“小机器人,你平时喜欢干嘛呀?”的语音交互信号时,终端则输出“我喜欢吃饭睡觉打豆豆”作为回应。
S1304,通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据。
在一个实施例中,终端通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据。例如,终端获取一段训练文本,通过第一编码器对训练文本进行编码,获得分布式表示,该分布式表示即为第一训练语言学编码数据。其中,该分布式表示可以是特征向量。一个特征向量与文本中的一个字或词相对应。
S1306,获取用于语音特征转换的训练嵌入向量;训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成。
其中,嵌入向量可以是具有参考对象说话风格特征的向量,参考对象可以在训练过程中,由开发人员选择的说话具有特定风格的人。训练嵌入向量指的是用于训练第一编码器和第一解码器的向量。终端将该训练嵌入向量与对应的第一训练语言学编码数据进行融合和处理,将得到具有参考对象说话风格的训练合成语音数据。当训练合成语音数据经过处理后通过扬声器播放出来,播放出来的合成语音将不再是机械化的语音,而是具有人的说话风格。
在一个实施例中,当用户在与终端进行语音交互之前,终端获取训练语言学数据和具有风格特征的训练语音数据,其中,训练语音数据的来源可以由开发人员选取,可以是由开发人员自己的语音所得,也可以是由其它具有特定说话风格的语音所得。终端对训练语言学数据进行语音合成,得到不具有风格特征的训练合成语音数据。终端将训练合成语音数据与训练语音数据进行作差处理,得到表征风格特征的残差。终端对残差进行处理得到表征风格特征的训练嵌入向量。终端将得到的训练嵌入向量保存于风格特征向量库中。其中,风格特征向量库可以保存多个参考对象对应的训练嵌入向量,而参考对象可以是说话具有特殊风格的人。该表征风格特征的残差实质上可以是残差序列。
在一个实施例中,终端对残差进行处理得到表征风格特征的训练嵌入向量的步骤,可以包括:通过残差模型的多个全连接层处理残差,将全连接层输出的结果分别输入前向门循环单元层和后向门循环单元层,将前向门循环单元层最后一个时间步的输出与后向门循环单元层第一个时间步的输出相加,得到用于语音特征转换的、能表征风格特征的训练嵌入向量。
例如,若开发人员想以张曼玉的语音数据作为训练语音数据,则获取张曼玉的语音进行处理得到训练语音数据,并获取对应的语言学数据(例如说话的文字内容,文字内容如“我喜欢吃饭睡觉打豆豆”),其中,获取的训练语音数据具有张曼玉的说话风格。终端对语言学数据进行语音合成,得到不具有说话风格的训练合成语音数据。终端将具有张曼玉说话风格的训练语音数据与不具有说话风格的训练合成语音数据作差,得到表征风格特征的残差。终端对得到的残差进行处理,获得能够表征张曼玉说话风格的训练嵌入向量。
在一个实施例中,终端接收指定的风格特征选择指令,从风格特征向量库中获取与风格特征指令对应的训练嵌入向量。例如,开发人员想要听到某个电影或体育明星的声音,那么,用户在终端的风格选择界面中的各参考对象中选择目标的电影或体育明星,此时终端接收到对于该电影或体育明星的风格特征指令,根据风格特征指令选择表征该电影或体育明星说话风格的训练嵌入向量。
S1308,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。
在一个实施例中,终端通过第一解码器,按照训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。或者,终端将训练嵌入向量与第一训练语言学编码数据进行组合,对组合后的结果进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。
S1310,根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,终端根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器中的参数,并继续训练,直至预测目标合成语音数据对应的语音风格与训练语音数据对应的语音风格一致,则停止训练。
S1302-S1310为训练第一编码器和第一解码器的步骤,作为一个示例,如图10所示,可以通过如下方法训练第一编码器和第一解码器:获取训练语言学数据和具有风格特征(如张曼玉或开发者本人说话的风格特征)的训练语音数据,通过第一编码器对训练语言学数据编码得到第一训练语言学编码数据;获取用于表征风格特征的训练嵌入向量,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过第一编码器和第一解码器处理训练语言学数据、训练语音数据和训练嵌入向量,得到预测目标合成语音数据。根据预测目标合成语音数据和训练语音数据间的差异调整第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第一编码器和第一解码器。由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
在一个实施例中,如图14所示,该方法还可以包括:
S1402,获取训练语言学数据和相应的训练语音数据。
其中,语言学数据可以是文本或文本的特征或特征项。训练语言学数据指的是在训练阶段所采用的语言学数据,用于对第一编码器和第一解码器进行训练。
在一个实施例中,在训练过程中,终端获取训练语言学数据和具有风格特征的训练语音数据。例如,在训练过程中,开发人员输入用于训练的训练语言学数据和具有风格特征的训练语音数据。其中,训练语言学数据可以是“我喜欢吃饭睡觉打豆豆”。
S1404,通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据。
在一个实施例中,终端通过第二编码器对训练语言学数据编码,得到第二训练语言学编码数据。例如,终端获取一段文本,通过第二编码器对该段文本进行编码,获得分布式表示,该分布式表示即为第二训练语言学编码数据。其中,该分布式表示可以是特征向量。一个特征向量与文本中的一个字或词相对应。
S1406,通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据。
S1408,通过残差模型,并根据训练合成语音数据和训练语音数据之间的残差生成训练嵌入向量。
在一个实施例中,终端通过残差模型,对训练合成语音数据和训练语音数据进行作差,得到表征风格特征的残差。终端对所得的具有风格特征的残差进行处理,得到用于语音特征转换的、且用于表征风格特征的训练嵌入向量。
对于获得训练嵌入向量的详细过程,可参考S402-S408和S602-S606,这里不再进行赘述。
S1410,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。
在一个实施例中,终端通过第一解码器,按照训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。或者,终端将训练嵌入向量与第一训练语言学编码数据进行组合,对组合后的结果进行解码,获得经过语音特征转换的、具有参考对象说话风格的预测目标合成语音数据。
S1412,根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,终端根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器中的参数,并继续训练,直至预测目标合成语音数据对应的语音风格与训练语音数据对应的语音风格一致,则停止训练。
S1402-S1412为训练第二编码器、第二解码器、残差模型、第一编码器和第一解码器的步骤,作为一个示例,如图12所示,可以通过如下方法训练第二编码器、第二解码器、残差模型、第一编码器和第一解码器:获取训练语言学数据和具有风格特征(如张曼玉或开发者本人说话的风格特征)的训练语音数据,通过第二编码器将训练语言学数据编码得到第二训练语言学编码数据,通过第二解码器对第二训练语言学编码数据进行解码得到训练合成语音数据。终端通过残差模型对训练合成语音数据与训练语音数据之间的残差进行处理,获得用于表征风格特征的训练嵌入向量。通过第一编码器对训练语言学数据编码得到第一训练语言学编码数据后,通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过训练语言学数据和相应的训练语音数据,对第二编码器、第二解码器、残差模型、第一编码器和第一解码器进行训练。根据预测目标合成语音数据和训练语音数据间的差异调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第二编码器、第二解码器、残差模型、第一编码器和第一解码器。
此外,由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第二编码器、第二解码器、残差模型、第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
最后,将用于获取用于表征风格特征的嵌入向量的第二编码器、第二解码器、残差模型,与用于合成语音的第一编码器和第一解码器结合在一起,降低了语音合成系统对数据的需求,提高建立语音合成系统的准确性。
对于传统的语音合成方案中,其整体的思路是:在训练阶段,训练编码器从参考音频的对数梅尔频谱中得到风格的嵌入向量,再利用这个嵌入向量指导Tacotron对风格数据进行建模。在语音合成阶段,给定一个参考音频的对数梅尔频谱,首先通过训练好的编码器获得表征风格的嵌入向量,然后利用该嵌入向量指导Tacotron生成对应风格的语音。
上述方案中,存在以下问题:1)依赖人工标注的风格特征,耗时耗力,同时不便于拓展到不同的风格特征;2)在语音合成阶段,需要有额外的风格向量模块预测风格特征,以将预测所得的风格特征输入语音合成模型合成具有风格的语音,增加了训练耗时;3)在获取风格特征时输入是对数梅尔频谱,而对数梅尔频谱包含风格特征和语义特征,因此语音合成模型建模复杂度较高;4)对数梅尔频谱中不仅包含了风格特征,还包含了语义特征,这些语义特征对风格特征的提取会产生一定的影响,从而影响了提取风格特征的准确率。
本申请实施例提供了一种解决方案,可以解决上述问题。其中,如图1所示,语音合成系统包括:平均语音模型,残差模型,投影层与目标语音模型。其中,目标语音模型包括第一编码器和第一解码器。第一编码器和第一解码器分别可以是语言学数据编码器和语音数据解码器。此外,第一编码器和第一解码器还可以是基于注意力的递归生成器。平均语音模型包括第二编码器和第二解码器。第二编码器和第二解码器分别可以是语言学数据编码器和语音数据解码器。此外,第二编码器和第二解码器还可以是基于注意力的递归生成器。
平均语音模型和目标语音模型都可以是基于Tacotron模型,包括解码器与编码器。平均语音模型对训练语言学数据进行训练,得到平均风格的语音数据。残差模型对预测的平均合成语音数据与目标语音数据之间的差进行编码得到风格特征的嵌入向量。投影层将嵌入向量投影到目标语音模型的第一解码器空间中。
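作为示意,投影层可以用一个线性变换实现,把残差模型输出的嵌入向量映射到第一解码器的特征空间(下例中 32 维、256 维均为与前文示意代码一致的假设值):

```python
# 假设性示意:投影层将嵌入向量映射到目标语音模型第一解码器的特征空间。
import torch
import torch.nn as nn

projection = nn.Linear(32, 256)                  # 32 维嵌入向量 -> 256 维解码器空间
e_projected = projection(torch.randn(1, 32))     # 投影后的嵌入向量
```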
获得合成的语音之前,需通过以下三个阶段:训练阶段,自适应阶段与测试阶段;其中:
1)在训练阶段。
如图12所示,输入的训练语言学数据先通过平均语音模型预测出平均的训练合成语音数据。平均语音模型包括:第二编码器(如语言学数据编码器)与第二解码器(如语音数据解码器)。第二编码器用于对训练语言学数据进行编码,获得隐层表示。第二解码器用于对隐层表示进行解码,获得训练合成语音数据。其中,隐层表示指的是本申请实施例所述的语言学编码数据。
所获得的训练合成语音数据与目标带风格特征的训练语音数据进行作差处理,获得两者之间的残差。将残差输入残差模型,得到用于表征风格特征的训练嵌入向量,该训练嵌入向量通过投影层映射到目标语音模型的第一解码器中。
在目标语音模型中,类似于平均语音模型,输入的是训练语言学数据,经过第一编码器编码得到隐层表示。第一解码器根据隐层表示与投影层映射过来的训练嵌入向量,解码出具有风格的预测目标合成语音数据。
整个训练过程中,训练嵌入向量是由数据驱动,自动学习得到的。
根据预测目标合成语音数据和训练语音数据间的差异,调整平均语音模型、残差模型和目标语音模型,并继续训练,直至预测目标合成语音数据尽可能逼近训练语音数据,使最终输出的合成语音的风格与训练所采用的语音数据的风格一致,从而得到训练好的平均语音模型、残差模型和目标语音模型。
2)自适应阶段。
自适应阶段主要是通过训练好的平均语音模型、残差模型和目标语音模型,获得目标风格的嵌入向量。例如,如图8所示,用户在与终端进行语音交互时,若想要听到张曼玉的说话风格,那么,用户可以使用张曼玉的语音数据作为参考语音数据,并获取对应的参考语言学数据。将获得的参考语言学数据输入训练好的平均语音模型,从而得到参考合成语音数据。将参考合成语音数据与参考语音数据进行作差处理,得到表示风格特征的残差。将残差输入残差模型,便可得到用于表征风格特征的嵌入向量。
利用训练阶段训练得到的平均语音模型和残差模型,可以快速得到自适应的风格嵌入向量。这个过程由于不需要训练,因而极大提高自适应的速度,减少自适应的时间。
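自适应阶段的计算可以概括为下面的示意函数(avg_model、residual_model 假设为训练好的平均语音模型与残差模型,且参考合成语音数据与参考语音数据在帧数上对齐,仅为理解流程之用):

```python
# 假设性示意:自适应阶段无需再训练,只需前向计算一次即可得到目标风格的嵌入向量。
import torch

def adapt_embedding(avg_model, residual_model, ref_linguistic, ref_speech):
    # ref_linguistic: 参考语言学数据的向量表示;ref_speech: [batch, 帧数, 特征维] 的参考语音数据
    with torch.no_grad():
        ref_synth = avg_model(ref_linguistic)      # 平均语音模型输出的参考合成语音数据
        residual = ref_speech - ref_synth          # 作差得到表征风格特征的残差
        return residual_model(residual)            # 残差模型输出嵌入向量 e
```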
3)测试阶段。
在测试阶段,如图3所示,用户与终端进行语音交互时,首先将给定的语言学数据输入到目标语音模型的第一编码器中进行编码,得到隐层表示。利用自适应阶段得到的嵌入向量对第一解码器进行控制,得到与自适应参考样本相似的风格的目标合成语音数据。例如,自适应阶段所采用的参考语音数据的来源为张曼玉时,所得到的目标合成语音数据的风格即为张曼玉的说话风格。
输出的目标合成语音数据再经过Griffin-Lim算法恢复为语音波形信号。
通过实施本申请实施例,可以具有以下有益效果:不需要人工标注的风格特征,降低了构建语音合成系统的成本;以残差为控制条件,避免了使用对数梅尔频谱,降低模型建模复杂度,提高了风格特征提取的准确性;风格向量模块(即残差模型)和语音合成模型可以同时建模同时训练,避免了额外的风格向量模块,降低了训练耗时,而且还可以实现快速自适应得到合成语音所需的嵌入向量。
图2为一个实施例中语音合成方法的流程示意图,图13为一个实施例中模型训练方法的流程示意图。应该理解的是,虽然图2和图13的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图2和图13中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
如图15所示,在一个实施例中,提供了一种语音合成装置,该语音合成装置可以包括:语言学数据获取模块1502、语言学数据编码模块1504、嵌入向量获取模块1506和语言学编码数据解码模块1508;其中:
语言学数据获取模块1502,用于获取待处理的语言学数据。
语言学数据编码模块1504,用于对语言学数据编码,得到语言学编码数据。
嵌入向量获取模块1506,用于获取用于语音特征转换的嵌入向量;嵌入向量,根据对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成。
语言学编码数据解码模块1508,用于根据嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据。
上述实施例中,获取待处理的语言学数据,对语言学数据进行编码,便可得到表征发音的语言学编码数据。获取用于语音特征转换的嵌入向量,由于嵌入向量是对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成,因而所得到的嵌入向量为不包含语义特征的风格特征向量。根据嵌入向量对语言学编码数据进行解码,避免了语义特征对语言学编码数据处理的影响,因此所获得的目标合成语音数据的质量高,从而提高了合成语音的质量。
在一个实施例中,如图16所示,该装置还可以包括:嵌入向量确定模块1510。其中:
语言学数据获取模块1502还用于获取参考语言学数据和相应的参考语音数据。
语言学数据编码模块1504还用于对参考语言学数据编码,得到参考语言学编码数据。
语言学编码数据解码模块1508还用于解码参考语言学编码数据,得到参考合成语音数据。
嵌入向量确定模块1510,用于根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量。
上述实施例中,根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量,从而得到用于对语言学数据进行语音合成时进行风格控制的嵌入向量,以使合成的目标合成语音数据具有特定的风格特征,提高合成语音的质量。
在一个实施例中,嵌入向量确定模块1510还用于确定参考语音数据和参考合成语音数据间的残差;通过残差模型处理残差;根据残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量。
在一个实施例中,嵌入向量确定模块1510还用于通过残差模型中的全连接层、前向门循环单元层和后向门循环单元层处理残差。
在一个实施例中,嵌入向量确定模块1510还用于获取残差模型中前向门循环单元层进行前向运算时在最后一个时间步输出的第一向量;获取残差模型中后向门循环单元层进行后向运算时在第一个时间步输出的第二向量;将第一向量和第二向量叠加,获得用于语音特征转换的嵌入向量。
上述实施例中,通过残差模型处理参考语音数据和参考合成语音数据之间的残差,获得用于语音特征转换的嵌入向量,使得嵌入向量具有与参考语音数据相同的风格特征,具有自适应的效果。此外,得到用于对语言学数据进行语音合成时进行风格控制的嵌入向量,以使合成的目标合成语音数据具有特定的风格特征,提高合成语音的质量。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;如图16所示,该装置还包括:第一调整模块1512。其中:
语言学数据获取模块1502还用于获取训练语言学数据和相应的训练语音数据。
语言学数据编码模块1504还用于通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据。
嵌入向量获取模块1506还用于获取用于语音特征转换的训练嵌入向量;训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成。
语言学编码数据解码模块1508还用于通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。
第一调整模块1512,用于根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过第一编码器和第一解码器处理训练语言学数据、训练语音数据和训练嵌入向量,得到预测目标合成语音数据,根据预测目标合成语音数据和训练语音数据间的差异调整第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第一编码器和第一解码器。由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;参考语言学编码数据通过第二编码器进行编码得到;参考合成语音数据通过第二解码器进行解码得到;嵌入向量通过残差模型得到。
在一个实施例中,如图16所示,该装置还包括:嵌入向量生成模块1514和第二调整模块1516;其中:
语言学数据获取模块1502还用于获取训练语言学数据和相应的训练语音数据。
语言学数据编码模块1504还用于通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据。
语言学编码数据解码模块1508还用于通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据。
嵌入向量生成模块1514,用于通过残差模型,并根据训练合成语音数据和训练语音数据之间的残差生成训练嵌入向量。
语言学编码数据解码模块1508还用于通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;其中,第一训练语言学编码数据由第一编码器编码训练语言学数据所得。
第二调整模块1516,用于根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过训练语言学数据和相应的训练语音数据,对第二编码器、第二解码器、残差模型、第一编码器和第一解码器进行训练,根据预测目标合成语音数据和训练语音数据间的差异调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第二编码器、第二解码器、残差模型、第一编码器和第一解码器。
此外,由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第二编码器、第二解码器、残差模型、第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
最后,将用于获取用于表征风格特征的嵌入向量的第二编码器、第二解码器、残差模型,与用于合成语音的第一编码器和第一解码器结合在一起,降低了语音合成系统对数据的需求,提高建立语音合成系统的准确性。
在一个实施例中,语言学编码数据解码模块1508还用于将语言学编码数据和嵌入向量拼接,得到拼接向量;对拼接向量进行解码,得到经过语音特征转换的目标合成语音数据。
上述实施例中,将语言学编码数据和嵌入向量拼接,对拼接后所得的向量进行解码,得到经过语音特征转换的目标合成语音数据。由于拼接后的向量没有语义特征,避免了语义特征对语言学编码数据处理的影响,从而提高了合成语音的质量。
在一个实施例中,如图16所示,该装置还包括:合成模块1518、转换模块1520和语音生成模块1522。其中:
合成模块1518,用于确定与目标合成语音数据对应的语音幅度谱。
转换模块1520,用于将语音幅度谱转换为时域的语音波形信号。
语音生成模块1522,用于根据语音波形生成语音。
上述实施例中,将具有语音特征的目标合成语音数据转换为语音信号,从而获得具有风格的语音,从而可以提高合成语音的质量。
如图17所示,在一个实施例中,提供了一种模型训练装置,该模型训练装置可以包括:语音数据获取模块1702、语言学数据编码模块1704、嵌入向量获取模块1706、语言学编码数据解码模块1708和调整模块1710。其中:
语音数据获取模块1702,用于获取训练语言学数据和相应的训练语音数据。
语言学数据编码模块1704,用于通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据。
嵌入向量获取模块1706,用于获取用于语音特征转换的训练嵌入向量;训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成。
语言学编码数据解码模块1708,用于通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据。
调整模块1710,用于根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过第一编码器和第一解码器处理训练语言学数据、训练语音数据和训练嵌入向量,得到预测目标合成语音数据,根据预测目标合成语音数据和训练语音数据间的差异调整第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第一编码器和第一解码器。由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
在一个实施例中,如图18所示,该装置还包括:嵌入向量生成模块1712。其中:
语言学数据编码模块1704还用于通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据。
语言学编码数据解码模块1708还用于通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据。
嵌入向量生成模块1712,用于通过残差模型,并根据训练合成语音数据和训练语音数据之间的残差生成训练嵌入向量。
调整模块1710还用于根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
上述实施例中,通过训练语言学数据和相应的训练语音数据,对第二编码器、第二解码器、残差模型、第一编码器和第一解码器进行训练,根据预测目标合成语音数据和训练语音数据间的差异调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,使预测目标合成语音数据不断逼近训练语音数据,从而得到训练好的第二编码器、第二解码器、残差模型、第一编码器和第一解码器。
此外,由于训练过程中采用了由训练合成语音数据和训练语音数据之间的残差生成的训练嵌入向量,该训练嵌入向量只包含语音特征,无需考虑语义特征对训练模型的影响,从而降低了第二编码器、第二解码器、残差模型、第一编码器和第一解码器的复杂度,提高了训练结果的准确性。
最后,将用于获取用于表征风格特征的嵌入向量的第二编码器、第二解码器、残差模型,与用于合成语音的第一编码器和第一解码器结合在一起,降低了语音合成系统对数据的需求,提高建立语音合成系统的准确性。
图19示出了一个实施例中计算机设备的内部结构图。该计算机设备可以是图1中运行语音合成系统的终端。如图19所示,该计算机设备包括通过系统总线连接的处理器1901、存储器1902、网络接口1903、输入装置1904和显示屏1905。其中,存储器1902包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机程序,该计算机程序被处理器1901执行时,可使得处理器1901实现语音合成方法。该内存储器中也可储存有计算机程序,该计算机程序被处理器1901执行时,可使得处理器1901执行语音合成方法。计算机设备的显示屏1905可以是液晶显示屏或者电子墨水显示屏,计算机设备的输入装置1904可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图19中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,本申请提供的语音合成装置可以实现为一种计算机程序的形式,计算机程序可在如图19所示的计算机设备上运行。计算机设备的存储器1902中可存储组成该语音合成装置的各个程序模块,比如,图15所示的语言学数据获取模块1502、语言学数据编码模块1504、嵌入向量获取模块1506和语言学编码数据解码模块1508。各个程序模块构成的计算机程序使得处理器1901执行本说明书中描述的本申请各个实施例的语音合成方法中的步骤。
例如,图19所示的计算机设备可以通过如图15所示的语音合成装置中的语言学数据获取模块1502执行S202。计算机设备可通过语言学数据编码模块1504执行S204。计算机设备可通过嵌入向量获取模块1506执行S206。计算机设备可通过语言学编码数据解码模块1508执行S208。
在一个实施例中,提供了一种计算机设备,包括存储器和处理器,存储器存储有计算机程序,计算机程序被处理器执行时,使得处理器执行以下步骤:获取待处理的语言学数据;对语言学数据编码,得到语言学编码数据;获取用于语音特征转换的嵌入向量;嵌入向量,根据对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成;根据嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:获取参考语言学数据和相应的参考语音数据;对参考语言学数据编码,得到参考语言学编码数据;解码参考语言学编码数据,得到参考合成语音数据;根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量。
在一个实施例中,计算机程序被处理器执行根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量的步骤时,使得处理器可以执行以下步骤:确定参考语音数据和参考合成语音数据间的残差;通过残差模型处理残差;根据残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量。
在一个实施例中,计算机程序被处理器执行根据残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量的步骤时,使得处理器可以执行以下步骤:获取残差模型中前向门循环单元层进行前向运算时在最后一个时间步输出的第一向量;获取残差模型中后向门循环单元层进行后向运算时在第一个时间步输出的第二向量;将第一向量和第二向量叠加,获得用于语音特征转换的嵌入向量。
在一个实施例中,计算机程序被处理器执行通过残差模型处理残差的步骤时,使得处理器可以执行以下步骤:通过残差模型中的全连接层、前向门循环单元层和后向门循环单元层处理残差。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;计算机程序被处理器执行时,使得处理器还执行以下步骤:获取训练语言学数据和相应的训练语音数据;通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;获取用于语音特征转换的训练嵌入向量;训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;参考语言学编码数据通过第二编码器进行编码得到;参考合成语音数据通过第二解码器进行解码得到;嵌入向量通过残差模型得到。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:获取训练语言学数据和相应的训练语音数据;通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据;通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据;通过残差模型,并根据训练合成语音数据和训练语音数据之间的残差生成训练嵌入向量;通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,计算机程序被处理器执行根据嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据的步骤时,使得处理器可以执行以下步骤:将语言学编码数据和嵌入向量拼接,得到拼接向量;对拼接向量进行解码,得到经过语音特征转换的目标合成语音数据。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:确定与目标合成语音数据对应的语音幅度谱;将语音幅度谱转换为时域的语音波形信号;根据语音波形生成语音。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时,使得处理器执行以下步骤:获取待处理的语言学数据;对语言学数据编码,得到语言学编码数据;获取用于语音特征转换的嵌入向量;嵌入向量,根据对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成;根据嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:获取参考语言学数据和相应的参考语音数据;对参考语言学数据编码,得到参考语言学编码数据;解码参考语言学编码数据,得到参考合成语音数据;根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量。
在一个实施例中,计算机程序被处理器执行根据参考语音数据和参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量的步骤时,使得处理器可以执行以下步骤:确定参考语音数据和参考合成语音数据间的残差;通过残差模型处理残差;根据残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量。
在一个实施例中,计算机程序被处理器执行根据残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量的步骤时,使得处理器可以执行以下步骤:获取残差模型中前向门循环单元层进行前向运算时在最后一个时间步输出的第一向量;获取残差模型中后向门循环单元层进行后向运算时在第一个时间步输出的第二向量;将第一向量和第二向量叠加,获得用于语音特征转换的嵌入向量。
在一个实施例中,计算机程序被处理器执行通过残差模型处理残差的步骤时,使得处理器可以执行以下步骤:通过残差模型中的全连接层、前向门循环单元层和后向门循环单元层处理残差。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;计算机程序被处理器执行时,使得处理器还执行以下步骤:获取训练语言学数据和相应的训练语音数据;通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;获取用于语音特征转换的训练嵌入向量;训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据预测目标合成语音数据和训练语音数据间的差异,调整第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,语言学编码数据通过第一编码器进行编码得到;目标合成语音数据通过第一解码器进行解码得到;参考语言学编码数据通过第二编码器进行编码得到;参考合成语音数据通过第二解码器进行解码得到;嵌入向量通过残差模型得到。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:获取训练语言学数据和相应的训练语音数据;通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据;通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据;通过残差模型,并根据训练合成语音数据和训练语音数据之间的残差生成训练嵌入向量;通过第一解码器,根据训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据预测目标合成语音数据和训练语音数据间的差异,调整第二编码器、第二解码器、残差模型、第一编码器和第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,计算机程序被处理器执行根据嵌入向量对语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据的步骤时,使得处理器可以执行以下步骤:将语言学编码数据和嵌入向量拼接,得到拼接向量;对拼接向量进行解码,得到经过语音特征转换的目标合成语音数据。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:确定与目标合成语音数据对应的语音幅度谱;将语音幅度谱转换为时域的语音波形信号;根据语音波形生成语音。
图20示出了一个实施例中计算机设备的内部结构图。该计算机设备可以是图1中运行模型训练系统的终端。如图20所示,该计算机设备包括通过系统总线连接的处理器2001、存储器2002、网络接口2003、输入装置2004和显示屏2005。其中,存储器2002包括非易失性存储介质和内存储器。该计算机设备的非易失性存储介质存储有操作系统,还可存储有计算机程序,该计算机程序被处理器2001执行时,可使得处理器2001实现模型训练方法。该内存储器中也可储存有计算机程序,该计算机程序被处理器2001执行时,可使得处理器2001执行模型训练方法。计算机设备的显示屏2005可以是液晶显示屏或者电子墨水显示屏,计算机设备的输入装置2004可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。
本领域技术人员可以理解,图20中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
在一个实施例中,本申请提供的模型训练装置可以实现为一种计算机程序的形式,计算机程序可在如图20所示的计算机设备上运行。计算机设备的存储器2002中可存储组成该模型训练装置的各个程序模块,比如,图17所示的语音数据获取模块1702、语言学数据编码模块1704、嵌入向量获取模块1706、语言学编码数据解码模块1708和调整模块1710。各个程序模块构成的计算机程序使得处理器2001执行本说明书中描述的本申请各个实施例的模型训练方法中的步骤。
例如,图20所示的计算机设备可以通过如图17所示的模型训练装置中的语音数据获取模块1702执行S1302。计算机设备可通过语言学数据编码模块1704执行S1304。计算机设备可通过嵌入向量获取模块1706执行S1306。计算机设备可通过语言学编码数据解码模块1708执行S1308。计算机设备可通过调整模块1710执行S1310。
在一个实施例中,提供了一种计算机设备,包括存储器和处理器,存储器存储有计算机程序,计算机程序被处理器执行时,使得处理器执行以下步骤:获取训练语言学数据和相应的训练语音数据;通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;获取用于语音特征转换的训练嵌入向量;所述训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;通过第一解码器,根据所述训练嵌入向量对所述第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据;通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据;通过残差模型,并根据训练合成语音数据和所述训练语音数据之间的残差生成训练嵌入向量;计算机程序被处理器执行根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件的步骤时,使得处理器可以执行以下步骤:根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第二编码器、所述第二解码器、所述残差模型、所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,提供了一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时,使得处理器执行以下步骤:获取训练语言学数据和相应的训练语音数据;通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;获取用于语音特征转换的训练嵌入向量;所述训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;通过第一解码器,根据所述训练嵌入向量对所述第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
在一个实施例中,计算机程序被处理器执行时,使得处理器还执行以下步骤:通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据;通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据;通过残差模型,并根据训练合成语音数据和 所述训练语音数据之间的残差生成训练嵌入向量;计算机程序被处理器执行根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件的步骤时,使得处理器可以执行以下步骤:根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第二编码器、所述第二解码器、所述残差模型、所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,所述的程序可存储于一非易失性计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
以上实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
以上所述实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (16)

  1. 一种语音合成方法,包括:
    获取待处理的语言学数据;
    对所述语言学数据编码,得到语言学编码数据;
    获取用于语音特征转换的嵌入向量;所述嵌入向量,根据对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成;
    根据所述嵌入向量对所述语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据。
  2. 根据权利要求1所述的方法,还包括:
    获取参考语言学数据和相应的参考语音数据;
    对所述参考语言学数据编码,得到参考语言学编码数据;
    解码所述参考语言学编码数据,得到参考合成语音数据;
    根据所述参考语音数据和所述参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量。
  3. 根据权利要求2所述的方法,所述根据所述参考语音数据和所述参考合成语音数据间的残差,确定用于语音特征转换的嵌入向量包括:
    确定所述参考语音数据和所述参考合成语音数据间的残差;
    通过残差模型处理所述残差;
    根据所述残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量。
  4. 根据权利要求3所述的方法,所述根据所述残差模型中前向运算的结果和后向运算的结果,生成用于语音特征转换的嵌入向量包括:
    获取所述残差模型中前向门循环单元层进行前向运算时在最后一个时间步输出的第一向量;
    获取所述残差模型中后向门循环单元层进行后向运算时在第一个时间步输出的第二向量;
    将所述第一向量和所述第二向量叠加,获得用于语音特征转换的嵌入向量。
  5. 根据权利要求3所述的方法,所述通过残差模型处理所述残差包括:
    通过所述残差模型中的全连接层、前向门循环单元层和后向门循环单元层处理所述残差。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述语言学编码数据通过第一编码器进行编码得到;所述目标合成语音数据通过第一解码器进行解码得到;还包括:
    获取训练语言学数据和相应的训练语音数据;
    通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;
    获取用于语音特征转换的训练嵌入向量;所述训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;
    通过第一解码器,根据所述训练嵌入向量对所述第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;
    根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
  7. 根据权利要求2至5任一项所述的方法,所述语言学编码数据通过第一编码器进行编码得到;所述目标合成语音数据通过第一解码器进行解码得到;所述参考语言学编码数据通过第二编码器进行编码得到;所述参考合成语音数据通过第二解码器进行解码得到;所述嵌入向量通过残差模型得到。
  8. 根据权利要求7所述的方法,还包括:
    获取训练语言学数据和相应的训练语音数据;
    通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据;
    通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据;
    通过残差模型,并根据训练合成语音数据和所述训练语音数据之间的残差生成训练嵌入向量;
    根据所述训练嵌入向量对第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;
    根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第二编码器、所述第二解码器、所述残差模型、所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
  9. 根据权利要求1至5任一项所述的方法,所述根据所述嵌入向量对所述语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据包括:
    将所述语言学编码数据和所述嵌入向量拼接,得到拼接向量;
    对所述拼接向量进行解码,得到经过语音特征转换的目标合成语音数据。
  10. 根据权利要求1至5任一项所述的方法,还包括:
    确定与所述目标合成语音数据对应的语音幅度谱;
    将语音幅度谱转换为时域的语音波形信号;
    根据所述语音波形生成语音。
  11. 一种模型训练方法,包括:
    获取训练语言学数据和相应的训练语音数据;
    通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;
    获取用于语音特征转换的训练嵌入向量;所述训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;
    通过第一解码器,根据所述训练嵌入向量对所述第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;
    根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
  12. 根据权利要求11所述的方法,还包括:
    通过第二编码器将训练语言学数据编码,得到第二训练语言学编码数据;
    通过第二解码器对第二训练语言学编码数据解码,得到训练合成语音数据;
    通过残差模型,并根据训练合成语音数据和所述训练语音数据之间的残差生成训练嵌入向量;
    所述根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件包括:
    根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第二编码器、所述第二解码器、所述残差模型、所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
  13. 一种语音合成装置,包括:
    语言学数据获取模块,用于获取待处理的语言学数据;
    语言学数据编码模块,用于对所述语言学数据编码,得到语言学编码数据;
    嵌入向量获取模块,用于获取用于语音特征转换的嵌入向量;所述嵌入向量,根据对应相同参考语言学数据的参考合成语音数据和参考语音数据之间的残差生成;
    语言学编码数据解码模块,用于根据所述嵌入向量对所述语言学编码数据进行解码,获得经过语音特征转换的目标合成语音数据。
  14. 一种模型训练装置,包括:
    训练语音数据获取模块,用于获取训练语言学数据和相应的训练语音数据;
    训练语言学数据编码模块,用于通过第一编码器对训练语言学数据编码,得到第一训练语言学编码数据;
    训练嵌入向量获取模块,用于获取用于语音特征转换的训练嵌入向量;所述训练嵌入向量,根据对应相同训练语言学数据的训练合成语音数据和训练语音数据之间的残差生成;
    训练语言学编码数据解码模块,用于通过第一解码器,根据所述训练嵌入向量对所述第一训练语言学编码数据进行解码,获得经过语音特征转换的预测目标合成语音数据;
    调整模块,用于根据所述预测目标合成语音数据和训练语音数据间的差异,调整所述第一编码器和所述第一解码器,并继续训练,直至满足训练停止条件。
  15. 一种计算机设备,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行如权利要求1至12中任一项所述方法的步骤。
  16. 一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时,使得处理器执行如权利要求1至12中任一项所述方法的步骤。
PCT/CN2019/090493 2018-07-25 2019-06-10 语音合成方法、模型训练方法、装置和计算机设备 WO2020019885A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19840536.7A EP3742436A4 (en) 2018-07-25 2019-06-10 SPEECH SYNTHESIS METHOD, MODEL TRAINING METHOD, DEVICE AND COMPUTER DEVICE
US16/999,989 US12014720B2 (en) 2018-07-25 2020-08-21 Voice synthesis method, model training method, device and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810828220.1 2018-07-25
CN201810828220.1A CN109036375B (zh) 2018-07-25 2018-07-25 语音合成方法、模型训练方法、装置和计算机设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/999,989 Continuation US12014720B2 (en) 2018-07-25 2020-08-21 Voice synthesis method, model training method, device and computer device

Publications (1)

Publication Number Publication Date
WO2020019885A1 true WO2020019885A1 (zh) 2020-01-30

Family

ID=64645210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090493 WO2020019885A1 (zh) 2018-07-25 2019-06-10 语音合成方法、模型训练方法、装置和计算机设备

Country Status (5)

Country Link
US (1) US12014720B2 (zh)
EP (1) EP3742436A4 (zh)
CN (1) CN109036375B (zh)
TW (1) TWI732225B (zh)
WO (1) WO2020019885A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785248A (zh) * 2020-03-12 2020-10-16 北京京东尚科信息技术有限公司 文本信息处理方法及装置
CN112802450A (zh) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 一种韵律可控的中英文混合的语音合成方法及其系统
CN112951200A (zh) * 2021-01-28 2021-06-11 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质
CN113707125A (zh) * 2021-08-30 2021-11-26 中国科学院声学研究所 一种多语言语音合成模型的训练方法及装置

Families Citing this family (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036375B (zh) 2018-07-25 2023-03-24 腾讯科技(深圳)有限公司 语音合成方法、模型训练方法、装置和计算机设备
KR20200015418A (ko) * 2018-08-02 2020-02-12 네오사피엔스 주식회사 순차적 운율 특징을 기초로 기계학습을 이용한 텍스트-음성 합성 방법, 장치 및 컴퓨터 판독가능한 저장매체
CN109754779A (zh) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 可控情感语音合成方法、装置、电子设备及可读存储介质
CN109754778B (zh) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 文本的语音合成方法、装置和计算机设备
CN109767755A (zh) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 一种语音合成方法和系统
CN110070852B (zh) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 合成中文语音的方法、装置、设备及存储介质
CN110288973B (zh) * 2019-05-20 2024-03-29 平安科技(深圳)有限公司 语音合成方法、装置、设备及计算机可读存储介质
CN110264991B (zh) * 2019-05-20 2023-12-22 平安科技(深圳)有限公司 语音合成模型的训练方法、语音合成方法、装置、设备及存储介质
US11410684B1 (en) * 2019-06-04 2022-08-09 Amazon Technologies, Inc. Text-to-speech (TTS) processing with transfer of vocal characteristics
CN110335587B (zh) * 2019-06-14 2023-11-10 平安科技(深圳)有限公司 语音合成方法、系统、终端设备和可读存储介质
CN112289297A (zh) * 2019-07-25 2021-01-29 阿里巴巴集团控股有限公司 语音合成方法、装置和系统
CN110299131B (zh) * 2019-08-01 2021-12-10 苏州奇梦者网络科技有限公司 一种可控制韵律情感的语音合成方法、装置、存储介质
CN110534084B (zh) * 2019-08-06 2022-05-13 广州探迹科技有限公司 一种基于FreeSWITCH的智能语音控制方法及系统
CN110288972B (zh) * 2019-08-07 2021-08-13 北京新唐思创教育科技有限公司 语音合成模型训练方法、语音合成方法及装置
CN110457661B (zh) * 2019-08-16 2023-06-20 腾讯科技(深圳)有限公司 自然语言生成方法、装置、设备及存储介质
CN114303186A (zh) * 2019-08-21 2022-04-08 杜比实验室特许公司 用于在语音合成中适配人类说话者嵌入的系统和方法
CN111816158B (zh) * 2019-09-17 2023-08-04 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN110808027B (zh) * 2019-11-05 2020-12-08 腾讯科技(深圳)有限公司 语音合成方法、装置以及新闻播报方法、系统
CN112786001B (zh) * 2019-11-11 2024-04-09 北京地平线机器人技术研发有限公司 语音合成模型训练方法、语音合成方法和装置
CN112885326A (zh) * 2019-11-29 2021-06-01 阿里巴巴集团控股有限公司 个性化语音合成模型创建、语音合成和测试方法及装置
CN111161702B (zh) * 2019-12-23 2022-08-26 爱驰汽车有限公司 个性化语音合成方法、装置、电子设备、存储介质
CN110992926B (zh) * 2019-12-26 2022-06-10 标贝(北京)科技有限公司 语音合成方法、装置、系统和存储介质
CN111259148B (zh) * 2020-01-19 2024-03-26 北京小米松果电子有限公司 信息处理方法、装置及存储介质
CN111145720B (zh) * 2020-02-04 2022-06-21 清华珠三角研究院 一种将文本转换成语音的方法、系统、装置和存储介质
CN111325817B (zh) * 2020-02-04 2023-07-18 清华珠三角研究院 一种虚拟人物场景视频的生成方法、终端设备及介质
CN113450756A (zh) * 2020-03-13 2021-09-28 Tcl科技集团股份有限公司 一种语音合成模型的训练方法及一种语音合成方法
CN111508509A (zh) * 2020-04-02 2020-08-07 广东九联科技股份有限公司 基于深度学习的声音质量处理系统及其方法
CN111583900B (zh) * 2020-04-27 2022-01-07 北京字节跳动网络技术有限公司 歌曲合成方法、装置、可读介质及电子设备
CN111862931A (zh) * 2020-05-08 2020-10-30 北京嘀嘀无限科技发展有限公司 一种语音生成方法及装置
CN111710326B (zh) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 英文语音的合成方法及系统、电子设备及存储介质
CN111899716B (zh) * 2020-08-03 2021-03-12 北京帝派智能科技有限公司 一种语音合成方法和系统
US11514888B2 (en) * 2020-08-13 2022-11-29 Google Llc Two-level speech prosody transfer
CN112365880B (zh) * 2020-11-05 2024-03-26 北京百度网讯科技有限公司 语音合成方法、装置、电子设备及存储介质
CN112614479B (zh) * 2020-11-26 2022-03-25 北京百度网讯科技有限公司 训练数据的处理方法、装置及电子设备
CN112634856B (zh) * 2020-12-10 2022-09-02 思必驰科技股份有限公司 语音合成模型训练方法和语音合成方法
CN112382272B (zh) * 2020-12-11 2023-05-23 平安科技(深圳)有限公司 可控制语音速度的语音合成方法、装置、设备及存储介质
CN112712788A (zh) * 2020-12-24 2021-04-27 北京达佳互联信息技术有限公司 语音合成方法、语音合成模型的训练方法及装置
CN112992177B (zh) * 2021-02-20 2023-10-17 平安科技(深圳)有限公司 语音风格迁移模型的训练方法、装置、设备及存储介质
CN113053353B (zh) * 2021-03-10 2022-10-04 度小满科技(北京)有限公司 一种语音合成模型的训练方法及装置
CN115410585A (zh) * 2021-05-29 2022-11-29 华为技术有限公司 音频数据编解码方法和相关装置及计算机可读存储介质
CN113345412A (zh) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 语音合成方法、装置、设备以及存储介质
CN115272537A (zh) 2021-08-06 2022-11-01 宿迁硅基智能科技有限公司 基于因果卷积的音频驱动表情方法及装置
CN113838453B (zh) * 2021-08-17 2022-06-28 北京百度网讯科技有限公司 语音处理方法、装置、设备和计算机存储介质
CN114120973B (zh) * 2022-01-29 2022-04-08 成都启英泰伦科技有限公司 一种语音语料生成系统训练方法
CN116741149B (zh) * 2023-06-08 2024-05-14 北京家瑞科技有限公司 跨语言语音转换方法、训练方法及相关装置
CN117765926B (zh) * 2024-02-19 2024-05-14 上海蜜度科技股份有限公司 语音合成方法、系统、电子设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080177546A1 (en) * 2007-01-19 2008-07-24 Microsoft Corporation Hidden trajectory modeling with differential cepstra for speech recognition
CN106157948A (zh) * 2015-04-22 2016-11-23 科大讯飞股份有限公司 一种基频建模方法及系统
CN108091321A (zh) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 一种语音合成方法
CN109036375A (zh) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 语音合成方法、模型训练方法、装置和计算机设备

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5673362A (en) * 1991-11-12 1997-09-30 Fujitsu Limited Speech synthesis system in which a plurality of clients and at least one voice synthesizing server are connected to a local area network
JP3404016B2 (ja) * 2000-12-26 2003-05-06 三菱電機株式会社 音声符号化装置及び音声符号化方法
US8886537B2 (en) * 2007-03-20 2014-11-11 Nuance Communications, Inc. Method and system for text-to-speech synthesis with personalized voice
CN101359473A (zh) * 2007-07-30 2009-02-04 国际商业机器公司 自动进行语音转换的方法和装置
DK2242045T3 (da) * 2009-04-16 2012-09-24 Univ Mons Talesyntese og kodningsfremgangsmåder
US8731931B2 (en) * 2010-06-18 2014-05-20 At&T Intellectual Property I, L.P. System and method for unit selection text-to-speech using a modified Viterbi approach
TWI503813B (zh) * 2012-09-10 2015-10-11 Univ Nat Chiao Tung 可控制語速的韻律訊息產生裝置及語速相依之階層式韻律模組
TWI573129B (zh) * 2013-02-05 2017-03-01 國立交通大學 編碼串流產生裝置、韻律訊息編碼裝置、韻律結構分析裝置與語音合成之裝置及方法
US9881631B2 (en) * 2014-10-21 2018-01-30 Mitsubishi Electric Research Laboratories, Inc. Method for enhancing audio signal using phase information
JP6523893B2 (ja) * 2015-09-16 2019-06-05 株式会社東芝 学習装置、音声合成装置、学習方法、音声合成方法、学習プログラム及び音声合成プログラム
RU2632424C2 (ru) * 2015-09-29 2017-10-04 Общество С Ограниченной Ответственностью "Яндекс" Способ и сервер для синтеза речи по тексту
JP6784022B2 (ja) 2015-12-18 2020-11-11 ヤマハ株式会社 音声合成方法、音声合成制御方法、音声合成装置、音声合成制御装置およびプログラム
CN105529023B (zh) * 2016-01-25 2019-09-03 百度在线网络技术(北京)有限公司 语音合成方法和装置
US10249289B2 (en) * 2017-03-14 2019-04-02 Google Llc Text-to-speech synthesis using an autoencoder
WO2018167522A1 (en) * 2017-03-14 2018-09-20 Google Llc Speech synthesis unit selection
CN107293288B (zh) * 2017-06-09 2020-04-21 清华大学 一种残差长短期记忆循环神经网络的声学模型建模方法
WO2019000170A1 (en) * 2017-06-26 2019-01-03 Microsoft Technology Licensing, Llc GENERATION OF ANSWERS IN AN AUTOMATED ONLINE CONVERSATION

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080177546A1 (en) * 2007-01-19 2008-07-24 Microsoft Corporation Hidden trajectory modeling with differential cepstra for speech recognition
CN106157948A (zh) * 2015-04-22 2016-11-23 科大讯飞股份有限公司 一种基频建模方法及系统
CN108091321A (zh) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 一种语音合成方法
CN109036375A (zh) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 语音合成方法、模型训练方法、装置和计算机设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3742436A4

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785248A (zh) * 2020-03-12 2020-10-16 北京京东尚科信息技术有限公司 文本信息处理方法及装置
CN111785248B (zh) * 2020-03-12 2023-06-23 北京汇钧科技有限公司 文本信息处理方法及装置
CN112802450A (zh) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 一种韵律可控的中英文混合的语音合成方法及其系统
CN112802450B (zh) * 2021-01-05 2022-11-18 杭州一知智能科技有限公司 一种韵律可控的中英文混合的语音合成方法及其系统
CN112951200A (zh) * 2021-01-28 2021-06-11 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质
CN112951200B (zh) * 2021-01-28 2024-03-12 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质
CN113707125A (zh) * 2021-08-30 2021-11-26 中国科学院声学研究所 一种多语言语音合成模型的训练方法及装置
CN113707125B (zh) * 2021-08-30 2024-02-27 中国科学院声学研究所 一种多语言语音合成模型的训练方法及装置

Also Published As

Publication number Publication date
TWI732225B (zh) 2021-07-01
US12014720B2 (en) 2024-06-18
EP3742436A4 (en) 2021-05-19
CN109036375B (zh) 2023-03-24
US20200380949A1 (en) 2020-12-03
CN109036375A (zh) 2018-12-18
TW202008348A (zh) 2020-02-16
EP3742436A1 (en) 2020-11-25

Similar Documents

Publication Publication Date Title
WO2020019885A1 (zh) 语音合成方法、模型训练方法、装置和计算机设备
JP7106680B2 (ja) ニューラルネットワークを使用したターゲット話者の声でのテキストからの音声合成
JP7395792B2 (ja) 2レベル音声韻律転写
CN112735373B (zh) 语音合成方法、装置、设备及存储介质
JP2022169714A (ja) 多言語テキスト音声合成モデルを利用した音声翻訳方法およびシステム
JP7238204B2 (ja) 音声合成方法及び装置、記憶媒体
CN114175143A (zh) 控制端到端语音合成系统中的表达性
US20220277728A1 (en) Paragraph synthesis with cross utterance features for neural TTS
CN112005298A (zh) 时钟式层次变分编码器
EP4191586A1 (en) Method and system for applying synthetic speech to speaker image
US20240105160A1 (en) Method and system for generating synthesis voice using style tag represented by natural language
CN111627420A (zh) 极低资源下的特定发音人情感语音合成方法及装置
Chen et al. Speech bert embedding for improving prosody in neural tts
EP4343755A1 (en) Method and system for generating composite speech by using style tag expressed in natural language
KR102449223B1 (ko) 음성의 속도 및 피치를 변경하는 방법 및 음성 합성 시스템
KR20240024960A (ko) 견고한 다이렉트 스피치-투-스피치 번역
CN116312476A (zh) 语音合成方法和装置、存储介质、电子设备
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
Bae et al. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech
CN113314097B (zh) 语音合成方法、语音合成模型处理方法、装置和电子设备
CN114495896A (zh) 一种语音播放方法及计算机设备
Ding A Systematic Review on the Development of Speech Synthesis
KR102677459B1 (ko) 2-레벨 스피치 운율 전송
KR102584481B1 (ko) 인공 신경망을 이용한 다화자 음성 합성 방법 및 장치
Matoušek et al. VITS: quality vs. speed analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19840536

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019840536

Country of ref document: EP

Effective date: 20200820

NENP Non-entry into the national phase

Ref country code: DE