WO2021134581A1 - Speech synthesis method, apparatus, terminal, and medium based on prosodic feature prediction (基于韵律特征预测的语音合成方法、装置、终端及介质) - Google Patents

Speech synthesis method, apparatus, terminal, and medium based on prosodic feature prediction (基于韵律特征预测的语音合成方法、装置、终端及介质)

Info

Publication number
WO2021134581A1
WO2021134581A1 (PCT/CN2019/130741)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
prosodic
prosody
phrase
text
Prior art date
Application number
PCT/CN2019/130741
Other languages
English (en)
French (fr)
Inventor
李贤
黄东延
丁万
张皓
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to CN201980003386.2A priority Critical patent/CN111226275A/zh
Priority to PCT/CN2019/130741 priority patent/WO2021134581A1/zh
Publication of WO2021134581A1 publication Critical patent/WO2021134581A1/zh

Classifications

    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L2013/083 Special characters, e.g. punctuation marks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, device, intelligent terminal, and computer-readable storage medium based on prosodic feature prediction.
  • Speech synthesis can convert text and other written content into natural speech output.
  • Prosody affects the naturalness and fluency of pronunciation.
  • a good prosody prediction result makes the pauses in the synthesized speech closer to those of human speech, thus making the synthesized speech more natural.
  • In existing prosody prediction schemes, the training and prediction of the neural network model are mainly based on acoustic features such as Chinese phonemes. The prosodic features predicted this way deviate from the true prosodic features, so the accuracy of prosody prediction is insufficient and the quality of the synthesized speech suffers.
  • a speech synthesis method based on prosodic feature prediction including:
  • the text to be synthesized is input into a preset prosody prediction model, the prosody feature of the text to be synthesized is acquired as the first prosody feature, and the target prosody feature is determined according to the first prosody feature.
  • the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
  • the step of inputting the text to be synthesized into a preset prosody prediction model, and obtaining the prosody feature of the text to be synthesized as the first prosody feature further includes:
  • the first prosodic word feature, the first prosodic phrase feature, and the first prosodic intonation phrase feature are used as the first prosodic feature.
  • a speech synthesis device based on prosody feature prediction including:
  • the text acquisition module is used to acquire the text to be synthesized
  • the prosody feature acquisition module is configured to input the text to be synthesized into a preset prosody prediction model, acquire the prosodic features of the text to be synthesized as first prosodic features, and determine target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
  • the speech synthesis module is configured to perform speech synthesis according to the target prosody feature to generate a target speech corresponding to the text to be synthesized.
  • an intelligent terminal is proposed.
  • An intelligent terminal includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the text to be synthesized is input into a preset prosody prediction model, the prosody feature of the text to be synthesized is acquired as the first prosody feature, and the target prosody feature is determined according to the first prosody feature.
  • the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the text to be synthesized is input into a preset prosody prediction model, the prosody feature of the text to be synthesized is acquired as the first prosody feature, and the target prosody feature is determined according to the first prosody feature.
  • the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
  • In the process of speech synthesis, the prosodic features of the text to be synthesized are predicted through the prosody prediction model, where the predicted prosodic features include prosody-level features such as prosodic word features, prosodic phrase features, and prosodic intonation phrase features. These prosodic features then serve as the basis of speech synthesis, and the target speech corresponding to the text to be synthesized is determined according to them, completing the speech synthesis process.
  • The prosody prediction model can accurately predict prosody-level features such as prosodic word features, prosodic phrase features, and prosodic intonation phrase features, which improves the accuracy of prosodic feature prediction and thereby the quality of the synthesized speech.
  • FIG. 1 is an application environment diagram of a speech synthesis method based on prosodic feature prediction according to an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a speech synthesis method based on prosodic feature prediction according to an embodiment of the application
  • FIG. 3 is a schematic diagram of a prosodic feature structure in an embodiment of this application;
  • FIG. 4 is a schematic diagram of the process of acquiring the first prosody feature in an embodiment of this application.
  • FIG. 5 is a schematic diagram of the process of acquiring the first prosodic features in an embodiment of this application;
  • FIG. 6 is a schematic flowchart of a speech synthesis method based on prosodic feature prediction according to an embodiment of the application
  • FIG. 7 is a schematic diagram of the process of acquiring the second prosody feature in an embodiment of this application.
  • FIG. 8 is a schematic diagram of the process of acquiring target prosody features in an embodiment of this application.
  • FIG. 9 is a schematic diagram of a target prosody feature acquisition process in an embodiment of this application.
  • FIG. 10 is a schematic diagram of a process of training a prosody prediction model in an embodiment of this application.
  • FIG. 11 is a schematic diagram of a process of training a prosody prediction model in an embodiment of this application.
  • FIG. 12 is a schematic structural diagram of a speech synthesis device based on prosodic feature prediction in an embodiment of the application.
  • FIG. 13 is a schematic structural diagram of a speech synthesis device based on prosodic feature prediction in an embodiment of the application.
  • FIG. 14 is a schematic structural diagram of a speech synthesis device based on prosodic feature prediction in an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of a computer device running the above-mentioned speech synthesis method based on prosodic feature prediction according to an embodiment of the application.
  • FIG. 1 is an application environment diagram of a speech synthesis method based on prosodic feature prediction in an embodiment.
  • the speech synthesis system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to analyze and process the text to be synthesized, and the server 120 is used for model training and prediction.
  • the speech synthesis system applied to the aforementioned speech synthesis method based on prosodic feature prediction may also be implemented based on the terminal 110.
  • the terminal is used for model training and prediction, and converts the text to be synthesized into speech.
  • a speech synthesis method based on prosodic feature prediction is provided.
  • the method can be applied to a terminal or a server, and this embodiment is applied to a terminal as an example.
  • the speech synthesis method based on prosody feature prediction specifically includes the following steps:
  • Step S102 Obtain the text to be synthesized.
  • the text to be synthesized is the text information on which speech synthesis needs to be performed, for example, text that needs to be converted into speech in scenarios such as voice chat robots and voice news reading.
  • the text to be synthesized could be "自从那一刻起，她便不再妄自菲薄。" ("From that moment on, she no longer belittled herself.").
  • Step S104 Input the text to be synthesized into a preset prosody prediction model, obtain the prosody feature of the text to be synthesized as a first prosody feature, and determine a target prosody feature according to the first prosody feature.
  • the prosody prediction model predicts the prosodic features of the text to be synthesized based on deep learning or a neural network model, so that the predicted prosodic features can be used by an acoustic encoder to obtain a better speech synthesis effect.
  • The preset prosody model is a pre-trained neural network model. During training, training texts and the annotated prosodic feature results corresponding to each training text are used to train the model, so that it can predict the prosodic features of the text to be synthesized; the prosodic features obtained by prediction are the first prosodic features.
  • From the first prosodic features, the final target prosodic features for speech synthesis can be determined; for example, the first prosodic features may be used directly as the target prosodic features.
  • the prosodic features include prosodic word features (abbreviated as PW), prosodic phrase features (abbreviated as PPH), and prosodic intonation phrase features (abbreviated as IPH).
  • As shown in FIG. 3, the prosodic hierarchy formed by the prosodic word features, prosodic phrase features, and prosodic intonation phrase features included in the prosodic features is given: the prosodic intonation phrase features are built on the prosodic phrase features, and the prosodic phrase features are built on the prosodic word features.
  • the process of obtaining the prosodic features of the text to be synthesized through the preset prosody prediction model also covers the prosodic features at each level of the corresponding prosodic hierarchy.
  • the input of the preset prosody prediction model is the character vectors corresponding to the text to be synthesized; training the prosody prediction model and predicting the prosodic structure at character granularity can improve the accuracy of prosody prediction and speech synthesis.
  • After obtaining the text to be synthesized, the method further includes: determining multiple character vectors corresponding to the text to be synthesized. That is to say, the text to be synthesized is processed and divided into multiple character vectors, which are then used as the input of the prosody prediction model.
  • In a specific embodiment, the character vectors may be 200-dimensional (see the sketch below).
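  • As an illustration of this preprocessing step, the following is a minimal Python sketch. The vocabulary, the embedding table, and the function name are illustrative assumptions; the embodiment specifies only the character granularity and the 200-dimensional character vectors, not a concrete implementation.

```python
import numpy as np

EMBED_DIM = 200  # character-vector dimension named in the embodiment

# Hypothetical character vocabulary and embedding table; in practice these
# would be learned together with the prosody prediction model.
char_to_id = {ch: i for i, ch in enumerate("自从那一刻起她便不再妄菲薄")}
embedding_table = np.random.randn(len(char_to_id), EMBED_DIM)

def text_to_char_vectors(text: str) -> np.ndarray:
    """Split the text to be synthesized into characters and look up one
    200-dim vector per character (unknown characters map to zeros)."""
    rows = [embedding_table[char_to_id[ch]] if ch in char_to_id
            else np.zeros(EMBED_DIM) for ch in text]
    return np.stack(rows)  # shape: (number of characters, 200)

print(text_to_char_vectors("自从那一刻起，她便不再妄自菲薄。").shape)  # (16, 200)
```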
  • the prediction process of the first prosody feature including the prosodic word feature, the prosodic phrase feature, and the prosodic intonation phrase feature is described in detail:
  • the calculation process of the first prosody feature includes steps S1041-S1044 as shown in Fig. 4:
  • Step S1041 Input the text to be synthesized into a preset prosody word prediction model to obtain the first prosody word feature;
  • Step S1042 Input the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain the first prosodic phrase feature;
  • Step S1043 Input the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain the first prosodic intonation phrase feature;
  • Step S1044 Use the first prosodic word feature, the first prosodic phrase feature, and the first prosodic intonation phrase feature as the first prosodic feature.
  • As mentioned above, the prosodic features include prosodic word features, prosodic phrase features, and prosodic intonation phrase features; when predicting them through the prosody prediction model, the modules of the model corresponding to the prosodic word features, prosodic phrase features, and prosodic intonation phrase features predict each of these feature types respectively.
  • the above-mentioned prosody prediction model includes a prosodic word prediction model, a prosodic phrase prediction model, and a prosodic intonation phrase prediction model, which are respectively used to predict the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prosodic structure.
  • After the text to be synthesized is obtained in step S102, it is first input into the prosodic word prediction model to obtain an output result, which is the first prosodic word feature.
  • the text to be synthesized and the first prosodic word feature are input into a preset prosodic phrase prediction model to obtain an output result, and the output result is the first prosodic phrase feature.
  • In the process of predicting the prosodic intonation phrase feature, the text to be synthesized together with the above first prosodic word feature and first prosodic phrase feature are input into the preset prosodic intonation phrase prediction model to obtain an output result, which is the first prosodic intonation phrase feature.
  • the first prosodic word feature, the first prosodic phrase feature, and the first prosodic intonation phrase feature constitute the first prosodic feature.
  • Moreover, the text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model may be the character vectors corresponding to the text to be synthesized, obtained by processing the text to be synthesized as described above.
  • The prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model are respectively used to predict the prosodic features at each level of the prosodic hierarchy, namely the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prosodic structure. This improves the accuracy of prosodic feature prediction, and the predicted features are used as input in the subsequent speech synthesis process to improve the accuracy of speech synthesis (a sketch of this cascade follows below).
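  • As a concrete reading of steps S1041-S1044, the cascade can be sketched as follows. The `predict` interface, the `concat` helper, and the model objects are assumptions made for illustration; the patent specifies the information flow between the three models, not their APIs.

```python
import numpy as np

def concat(*seqs):
    """Join per-character feature sequences along the feature axis."""
    return np.concatenate([s[:, None] if s.ndim == 1 else s for s in seqs],
                          axis=-1)

def predict_first_prosody_features(char_vectors, pw_model, pph_model, iph_model):
    # Prosodic-word boundaries are predicted from the text alone (S1041).
    pw = pw_model.predict(char_vectors)
    # Prosodic-phrase boundaries also condition on the PW result (S1042).
    pph = pph_model.predict(concat(char_vectors, pw))
    # Intonation-phrase boundaries condition on both lower levels (S1043).
    iph = iph_model.predict(concat(char_vectors, pw, pph))
    # Together the three label sequences form the first prosodic features (S1044).
    return pw, pph, iph
```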
  • Step S106 Perform speech synthesis according to the target prosody feature, and generate a target speech corresponding to the text to be synthesized.
  • the prosodic features are used as input, speech synthesis is performed on the prosodic features corresponding to the text to be synthesized through a preset acoustic encoder, and the corresponding target speech is output.
  • the first prosody feature may be directly used as the input of the acoustic encoder to determine the corresponding target speech. In other embodiments, further calculation processing may be performed on the first prosody feature to determine the corresponding target prosody feature, and then the target prosody feature is used as the input of the acoustic encoder to synthesize the target speech.
  • In order to further improve the accuracy of prosodic feature prediction, the prosodic features may also be further refined through an optimization algorithm.
  • the above-mentioned speech synthesis method based on prosodic feature prediction further includes:
  • Step S105 Process the first prosodic features through a preset optimization algorithm to obtain second prosodic features corresponding to the first prosodic features, and splice the first prosodic features and the second prosodic features to obtain the target prosodic features.
  • the first prosody feature is optimized through a preset optimization algorithm, and the corresponding second prosody feature is obtained.
  • the process of optimizing the first prosody feature through the optimization algorithm is a process of optimizing each feature parameter included in the first prosody feature.
  • After the optimization, the first prosodic features and the second prosodic features are spliced, and the spliced prosodic features are used as the target prosodic features. Specifically, the second prosodic features are spliced behind the first prosodic features, and the spliced feature vector is used as the target prosodic features.
  • Using the target prosodic features obtained through optimization and splicing as the input of the subsequent speech synthesis step yields a more accurate speech synthesis result.
  • During speech synthesis, the prosodic features of the text to be synthesized are obtained through the prosody prediction model, optimized through the optimization algorithm, and spliced behind the prosodic features output by the prosody prediction model to obtain the spliced target prosodic features; speech synthesis is then performed according to the target prosodic features through the preset acoustic encoder, so as to obtain the speech synthesis result corresponding to the text to be synthesized (that is, the target speech). The splicing step is sketched below.
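  • The splicing operation itself amounts to concatenation along the feature axis; a minimal sketch, with the array shapes given as assumptions:

```python
import numpy as np

def splice(first: np.ndarray, second: np.ndarray) -> np.ndarray:
    """Append the optimized (second) prosodic features behind the first
    prosodic features to form the target prosodic features."""
    return np.concatenate([first, second], axis=-1)

# e.g. per-character model outputs of shape (num_chars, d) spliced with
# Viterbi-optimized labels of shape (num_chars, 1) -> (num_chars, d + 1).
```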
  • the calculation process of the second prosodic feature may be as shown in Fig. 7:
  • Step S1051 Process the first prosody word feature through the preset optimization algorithm, and obtain a second prosody word feature corresponding to the first prosody word feature;
  • Step S1052 Process the first prosodic phrase feature through the preset optimization algorithm, and obtain a second prosodic phrase feature corresponding to the first prosodic phrase feature;
  • Step S1053 Process the first prosodic intonation phrase feature through the preset optimization algorithm to obtain a second prosodic intonation phrase feature corresponding to the first prosodic intonation phrase feature;
  • Step S1054 Use the second prosodic word feature, the second prosodic phrase feature, and the second prosodic intonation phrase feature as the second prosodic feature.
  • the text to be synthesized is first input into the prosodic word prediction model to obtain an output result, where the output result is the first prosodic word feature. Then, in order to optimize the features of the prosody word, it is also necessary to optimize the features of the first prosody word through an optimization algorithm to obtain the corresponding feature of the second prosody word. Finally, the second prosody word feature is spliced to the back of the first prosody word feature to form a new prosody word feature vector as the target prosody word feature.
  • the text to be synthesized and the first prosodic word feature are input into a preset prosodic phrase prediction model to obtain an output result, and the output result is the first prosodic phrase feature.
  • an optimization algorithm is used to optimize the features of the first prosodic phrases to obtain the corresponding features of the second prosodic phrases.
  • the second prosodic phrase feature is spliced to the back of the first prosodic phrase feature to form a new prosodic phrase feature vector as the target prosodic phrase feature.
  • In the process of predicting the prosodic intonation phrase feature, the text to be synthesized together with the above first prosodic word feature and first prosodic phrase feature are input into the preset prosodic intonation phrase prediction model to obtain an output result, which is the first prosodic intonation phrase feature.
  • the optimization algorithm is used to optimize the first prosodic intonation phrase feature to obtain the corresponding second prosodic intonation phrase feature.
  • the second prosodic intonation phrase feature is spliced to the back of the first prosodic intonation phrase feature to form a new prosodic intonation phrase feature vector as the target prosodic intonation phrase feature.
  • the second prosody word feature, the second prosody phrase feature, and the second prosodic intonation phrase feature constitute the second prosody feature;
  • the target prosody word feature, the target prosody phrase feature, and the target prosodic intonation phrase feature constitute the target prosody feature.
  • Moreover, the text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model may be the character vectors corresponding to the text to be synthesized, obtained by processing the text to be synthesized as described above.
  • the above-mentioned algorithm for processing the first prosody feature is the Viterbi algorithm.
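  • For reference, a standard Viterbi decoder over per-character label scores is sketched below. The patent names only the Viterbi algorithm; the emission/transition parameterization shown here is an assumption (for example, transition scores estimated from the annotated training data).

```python
import numpy as np

def viterbi(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Return the globally best label sequence.

    emissions:   (num_chars, num_labels) per-character label scores,
                 e.g. log-probabilities from the prediction model.
    transitions: (num_labels, num_labels) log-scores for moving from
                 one label to the next.
    """
    n, k = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, k), dtype=int)   # backpointers
    for t in range(1, n):
        cand = score[:, None] + transitions   # cand[i, j]: prev i -> cur j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```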
  • the generation of the above-mentioned target prosodic features may also be a combined process based on steps S1041-S1044 and the optimization algorithm of step S105 (taking the Viterbi algorithm as an example).
  • the process of generating target prosodic features also includes:
  • Step S211 Input the text to be synthesized into the preset prosodic word prediction model to obtain the first prosodic word feature
  • Step S212 processing the first prosody word feature through the Viterbi algorithm to obtain the second prosody word feature corresponding to the first prosody word feature;
  • Step S213 splicing the first prosody word feature and the second prosody word feature to obtain the target prosody word feature;
  • Step S221 Input the features of the text to be synthesized and/or the target prosodic word into the preset prosodic phrase prediction model to obtain the first prosodic phrase feature;
  • Step S222 Process the first prosodic phrase feature through the Viterbi algorithm to obtain the second prosodic phrase feature corresponding to the first prosodic phrase feature;
  • Step S223 splicing the first prosodic phrase feature and the second prosodic phrase feature to obtain the target prosodic phrase feature;
  • Step S231 Input the text to be synthesized, the target prosodic word feature, and/or the target prosodic phrase feature into the preset prosodic intonation phrase prediction model to obtain the first prosodic intonation phrase feature;
  • Step S232 Process the first prosodic intonation phrase feature through the Viterbi algorithm to obtain the second prosodic intonation phrase feature corresponding to the first prosodic intonation phrase feature;
  • Step S233 splicing the first prosodic intonation phrase feature and the second prosodic intonation phrase feature to obtain the target prosodic intonation phrase feature;
  • Step S240 Use the target prosodic word feature, the target prosodic phrase feature, and the target prosodic intonation phrase feature as the target prosody feature.
  • the text to be synthesized is first input into the prosodic word prediction model to obtain an output result, where the output result is the first prosodic word feature. Then, in order to optimize the features of the prosodic words, it is also necessary to optimize the features of the first prosodic words through the Viterbi algorithm to obtain the corresponding features of the second prosodic words. Finally, the second prosody word feature is spliced to the back of the first prosody word feature to form a new prosody word feature vector as the target prosody word feature.
  • the text to be synthesized and the target prosodic word feature are input into a preset prosodic phrase prediction model to obtain an output result, which is the first prosodic phrase feature.
  • the Viterbi algorithm is used to optimize the features of the first prosodic phrases to obtain the corresponding features of the second prosodic phrases.
  • the second prosodic phrase feature is spliced to the back of the first prosodic phrase feature to form a new prosodic phrase feature vector as the target prosodic phrase feature.
  • In the process of predicting the prosodic intonation phrase feature, the text to be synthesized together with the above target prosodic word feature and target prosodic phrase feature are input into the preset prosodic intonation phrase prediction model to obtain an output result, which is the first prosodic intonation phrase feature. Then, in order to optimize the prosodic intonation phrase feature, the Viterbi algorithm is used to optimize the first prosodic intonation phrase feature to obtain the corresponding second prosodic intonation phrase feature. Finally, the second prosodic intonation phrase feature is spliced behind the first prosodic intonation phrase feature to form a new prosodic intonation phrase feature vector, which is used as the target prosodic intonation phrase feature.
  • the second prosody word feature, the second prosody phrase feature, and the second prosodic intonation phrase feature constitute the second prosody feature;
  • the target prosody word feature, the target prosody phrase feature, and the target prosodic intonation phrase feature constitute the target prosody feature.
  • Moreover, the text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model may be the character vectors corresponding to the text to be synthesized, obtained by processing the text to be synthesized as described above.
  • a schematic flow chart of the process of generating the target prosody feature in the above steps S211-S240 is given.
  • The prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model are respectively used to predict the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prosodic structure; the Viterbi algorithm is used to optimize the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prediction results, and the optimized features are then spliced behind the model outputs.
  • The spliced prosodic features are used as the target prosodic features, which serve as input in the subsequent speech synthesis process to improve the accuracy of speech synthesis.
  • The above prosody prediction model, prosodic word prediction model, prosodic phrase prediction model, and prosodic intonation phrase prediction model can predict the prosodic features of the text to be synthesized well; before a corresponding model is used for prediction, it also needs to be trained on training data.
  • As shown in FIG. 10, a schematic flowchart of the training process of a prosody prediction model is given.
  • The above prosody prediction model training process includes steps S302 to S304 as shown in FIG. 10:
  • Step S302 Obtain a training data set, the training data set including a plurality of training texts and corresponding reference values of prosodic features;
  • Step S304 Using the training text as input and the prosody feature reference value as output, training the prosody prediction model.
  • In a specific embodiment, the data format corresponding to the prosodic feature reference values may be as follows: the annotated text 自从#1那#1一刻起#3，她便#1不再#2妄自菲薄#3 is processed into prosodic word labels (treating #1, #2, and #3 all as prosodic word boundaries): 01100101010001; prosodic phrase labels (#2, #3): 00000100010001; and intonation phrase labels (#3): 00000100000001 (where the corresponding training text is: 自从那一刻起，她便不再妄自菲薄。). The sketch below reproduces these values.
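  • The following sketch converts such a manually annotated sentence into the three 0/1 label strings and reproduces the reference values above. The function name and the punctuation-dropping rule are assumptions consistent with the example (the comma and the full stop carry no label).

```python
import re

LEVELS = {"PW": {"#1", "#2", "#3"}, "PPH": {"#2", "#3"}, "IPH": {"#3"}}

def annotation_to_labels(annotated: str) -> dict:
    """Turn e.g. '自从#1那#1一刻起#3，她便#1不再#2妄自菲薄#3' into one 0/1
    string per prosodic level, with 1 marking a character that is followed
    by a boundary of that level."""
    labels = {level: [] for level in LEVELS}
    for tok in re.findall(r"#\d|.", annotated):
        if tok.startswith("#"):
            for level, marks in LEVELS.items():
                if tok in marks and labels[level]:
                    labels[level][-1] = "1"   # boundary after previous char
        elif tok.isalnum():                    # drop punctuation
            for seq in labels.values():
                seq.append("0")
    return {level: "".join(seq) for level, seq in labels.items()}

print(annotation_to_labels("自从#1那#1一刻起#3，她便#1不再#2妄自菲薄#3"))
# {'PW': '01100101010001', 'PPH': '00000100010001', 'IPH': '00000100000001'}
```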
  • a large number of training texts are manually annotated, corresponding reference values of prosodic features are obtained, and a training data set is determined.
  • the training data set includes a plurality of training texts and the reference value of the prosody feature corresponding to each training text.
  • the training text is used as input and the corresponding prosodic feature reference value as output to train the preset prosody prediction model, so that the trained prosody prediction model has the function of prosodic feature prediction.
  • the process of training the prosody prediction model also includes separately training the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model.
  • the aforementioned prosodic feature reference values determined by manually labeling the training text include prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values.
  • the process of separately training the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model includes steps S3041-S3043 as shown in FIG. 11:
  • Step S3041 Taking the training text as input and the reference value of the prosody word feature as the output, training the prosody word prediction model;
  • Step S3042 Taking the training text and/or the prosodic word feature reference value as input, and the prosodic phrase feature reference value as output, to train the prosodic phrase prediction model;
  • Step S3043 Taking the training text and the prosodic phrase feature reference value as input, and the prosodic intonation phrase feature reference value as output, to train the prosodic intonation phrase prediction model.
  • That is to say, in the process of training the prosodic word prediction model, the training text is used as the input and the prosodic word feature reference value as the output, so that the prosodic word prediction model acquires the ability to predict prosodic word features.
  • In the process of training the prosodic phrase prediction model, the training text and the corresponding prosodic word feature reference value are used as input and the prosodic phrase feature reference value as output, so that the prosodic phrase prediction model acquires the ability to predict prosodic phrase features.
  • In the process of training the prosodic intonation phrase prediction model, the training text, the prosodic word feature reference value, and the prosodic phrase feature reference value are used as input and the prosodic intonation phrase feature reference value as output, so that the prosodic intonation phrase prediction model acquires the ability to predict prosodic intonation phrase features (a training-order sketch follows below).
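  • The same cascade appears at training time (steps S3041-S3043). A hedged sketch, reusing the `concat` helper from the earlier prediction sketch and assuming `fit(inputs, targets)` trainers; none of these names come from the patent.

```python
def train_cascade(char_vectors, pw_ref, pph_ref, iph_ref,
                  pw_model, pph_model, iph_model):
    # pw_ref/pph_ref/iph_ref: per-character 0/1 label arrays derived
    # from the manual annotation (see annotation_to_labels above).
    # S3041: text -> prosodic word reference labels.
    pw_model.fit(char_vectors, pw_ref)
    # S3042: text (and/or the PW reference labels) -> prosodic phrase labels.
    pph_model.fit(concat(char_vectors, pw_ref), pph_ref)
    # S3043: text + prosodic phrase reference labels -> intonation phrase labels.
    iph_model.fit(concat(char_vectors, pph_ref), iph_ref)
```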
  • In the above training of the prosody prediction model or of the prosodic word, prosodic phrase, and prosodic intonation phrase prediction models, the training text input to the model may also be the character vectors corresponding to the training text. That is to say, before training a model, the multiple character vectors corresponding to the training text also need to be determined.
  • The prosody prediction model, the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model are neural network models; in a specific embodiment, they are bidirectional long short-term memory neural network models (BiLSTM models). The BiLSTM model operates on sequential data (with temporal dependence) and processes the data globally: it can make predictions using the data both before and after each position in the sequence, obtaining more accurate prediction results.
  • In this embodiment, predicting the prosodic features through the BiLSTM model captures context features more effectively and can improve the accuracy of prosodic feature prediction (a minimal sketch follows).
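  • A minimal BiLSTM sequence tagger of the kind described can be sketched as follows. PyTorch is used for illustration; the hidden size and the two-label output are assumptions, while the 200-dimensional character embedding matches the embodiment.

```python
import torch
import torch.nn as nn

class ProsodyBiLSTMTagger(nn.Module):
    """BiLSTM sequence tagger: character ids in, one boundary /
    non-boundary score per character out."""

    def __init__(self, vocab_size: int, embed_dim: int = 200,
                 hidden: int = 256, num_labels: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                              bidirectional=True)   # forward + backward context
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(char_ids)      # (batch, chars, 200)
        h, _ = self.bilstm(x)         # (batch, chars, 2 * hidden)
        return self.classifier(h)     # per-character label scores

# Training would pair each character with its 0/1 boundary reference value,
# e.g. via nn.CrossEntropyLoss over the flattened character positions.
```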
  • a speech synthesis device based on prosodic feature prediction is provided.
  • the above-mentioned speech synthesis device based on prosodic feature prediction includes:
  • the text obtaining module 402 is used to obtain the text to be synthesized
  • the prosodic feature acquisition module 404 is configured to acquire the prosodic feature of the text to be synthesized as a first prosodic feature, and determine a target prosody feature according to the first prosody feature, and the prosodic feature of the text to be synthesized includes a prosodic word feature and a prosodic phrase Features, prosodic intonation phrase features;
  • the speech synthesis module 406 is configured to perform speech synthesis according to the target prosody feature to generate a target speech corresponding to the text to be synthesized.
  • the prosodic feature acquisition module 404 is further configured to: input the text to be synthesized into a preset prosodic word prediction model to obtain the first prosodic word feature; input the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain the first prosodic phrase feature; input the text to be synthesized, the first prosodic word feature, and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain the first prosodic intonation phrase feature; and use the first prosodic word feature, the first prosodic phrase feature, and the first prosodic intonation phrase feature as the first prosodic feature.
  • the prosody feature acquisition module 404 is further configured to process the first prosody feature through a preset optimization algorithm, and obtain a second prosody feature corresponding to the first prosody feature; The first prosody feature and the second prosody feature are spliced to obtain the target prosody feature.
  • the prosody feature acquisition module 404 is further configured to process the first prosody feature by using a preset Viterbi algorithm to obtain a second prosody feature corresponding to the first prosody feature.
  • the prosody feature acquisition module 404 is further configured to: process the first prosodic word feature through the preset optimization algorithm to obtain a second prosodic word feature corresponding to the first prosodic word feature; process the first prosodic phrase feature through the preset optimization algorithm to obtain a second prosodic phrase feature corresponding to the first prosodic phrase feature; process the first prosodic intonation phrase feature through the preset optimization algorithm to obtain a second prosodic intonation phrase feature corresponding to the first prosodic intonation phrase feature; and use the second prosodic word feature, the second prosodic phrase feature, and the second prosodic intonation phrase feature as the second prosodic feature.
  • the prosody feature acquisition module 404 is further configured to optimize the feature parameters included in the first prosody feature by using a preset Viterbi algorithm.
  • the prosody feature acquisition module 404 is further configured to: splice the first prosodic word feature and the second prosodic word feature to obtain the target prosodic word feature; splice the first prosodic phrase feature and the second prosodic phrase feature to obtain the target prosodic phrase feature; splice the first prosodic intonation phrase feature and the second prosodic intonation phrase feature to obtain the target prosodic intonation phrase feature; and use the target prosodic word feature, the target prosodic phrase feature, and the target prosodic intonation phrase feature as the target prosodic feature.
  • the above-mentioned speech synthesis device further includes a text processing module 403, which is used to determine multiple character vectors corresponding to the text to be synthesized.
  • the prosody prediction model is a BiLSTM model.
  • the aforementioned speech synthesis device based on prosody feature prediction further includes a training sample acquisition module 412 and a model training module 414, wherein the training sample acquisition module 412 is used to acquire a training data set,
  • the training data set includes multiple training texts and corresponding reference values of prosodic features;
  • the model training module 414 is configured to train the prosody prediction model using the training text as an input and the prosody feature reference value as an output.
  • the training sample acquisition module 412 is further configured to determine multiple character vectors corresponding to the training text;
  • the model training module 414 is further configured to take the multiple character vectors corresponding to the training text as input and the prosody feature reference value as output to train the prosody prediction model.
  • the prosodic feature reference values include prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values;
  • the model training module 414 is further configured to: use the training text as input and the prosodic word feature reference value as output to train the prosodic word prediction model; use the training text and/or the prosodic word feature reference value as input and the prosodic phrase feature reference value as output to train the prosodic phrase prediction model; and use the training text and the prosodic phrase feature reference value as input and the prosodic intonation phrase feature reference value as output to train the prosodic intonation phrase prediction model.
  • Fig. 15 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may specifically be a terminal or a server.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium of the computer device stores an operating system and may also store a computer program; when the computer program is executed by the processor, it enables the processor to implement the speech synthesis method based on prosodic feature prediction.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute a speech synthesis method based on prosodic feature prediction.
  • The structure shown in FIG. 15 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • an intelligent terminal which includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the text to be synthesized is input into a preset prosody prediction model, the prosody feature of the text to be synthesized is acquired as the first prosody feature, and the target prosody feature is determined according to the first prosody feature.
  • the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the text to be synthesized is input into a preset prosody prediction model, the prosody feature of the text to be synthesized is acquired as the first prosody feature, and the target prosody feature is determined according to the first prosody feature.
  • the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
  • In the process of speech synthesis, the prosodic features of the text to be synthesized are predicted through the prosody prediction model, where the predicted prosodic features include prosody-level features such as prosodic word features, prosodic phrase features, and prosodic intonation phrase features. These prosodic features then serve as the basis of speech synthesis, and the target speech corresponding to the text to be synthesized is determined according to them, completing the speech synthesis process.
  • The prosody prediction model can accurately predict prosody-level features such as prosodic word features, prosodic phrase features, and prosodic intonation phrase features, which improves the accuracy of prosodic feature prediction and thereby the quality of the synthesized speech.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A speech synthesis method based on prosodic feature prediction, a speech synthesis apparatus, an intelligent terminal, and a computer-readable storage medium. The method comprises: obtaining text to be synthesized (S102); inputting the text to be synthesized into a preset prosody prediction model, obtaining prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features (S104), the prosodic features of the text to be synthesized comprising prosodic word features, prosodic phrase features, and prosodic intonation phrase features; and performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized (S106). The method can improve the accuracy of prosodic feature prediction for text and improve the effect of speech synthesis.

Description

Speech synthesis method, apparatus, terminal, and medium based on prosodic feature prediction
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, apparatus, intelligent terminal, and computer-readable storage medium based on prosodic feature prediction.
Background
With the rapid development of the mobile Internet and artificial intelligence technology, speech synthesis scenarios such as voice broadcasting, audiobook listening, news reading, and intelligent interaction have become increasingly common. Speech synthesis can convert text and other written content into natural speech output.
In the process of speech synthesis, prosody prediction needs to be performed on the text. Prosody affects the naturalness and fluency of pronunciation; a good prosody prediction result makes the pauses of the synthesized speech closer to those of human speech, and thus makes the synthesized speech more natural.
Technical Problem
However, existing prosody prediction schemes mainly train and run neural network models on acoustic features such as Chinese phonemes. There is a gap between the prosodic feature prediction results obtained by such schemes and the true prosodic features, so the accuracy of prosody prediction is insufficient, which degrades the effect of speech synthesis.
That is to say, in the above speech synthesis schemes, the insufficient accuracy of prosody prediction leads to an unsatisfactory synthesized speech.
Technical Solution
On this basis, it is necessary to address the above problems by providing a speech synthesis method, apparatus, intelligent terminal, and computer-readable storage medium based on prosodic feature prediction.
In a first aspect of this application, a speech synthesis method based on prosodic feature prediction is provided.
A speech synthesis method based on prosodic feature prediction includes:
obtaining text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, obtaining prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
The step of inputting the text to be synthesized into the preset prosody prediction model and obtaining the prosodic features of the text to be synthesized as the first prosodic features further includes:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain first prosodic word features;
inputting the text to be synthesized and/or the first prosodic word features into a preset prosodic phrase prediction model to obtain first prosodic phrase features;
inputting the text to be synthesized, the first prosodic word features, and/or the first prosodic phrase features into a preset prosodic intonation phrase prediction model to obtain first prosodic intonation phrase features;
using the first prosodic word features, the first prosodic phrase features, and the first prosodic intonation phrase features as the first prosodic features.
In a second aspect of this application, a speech synthesis apparatus based on prosodic feature prediction is provided.
A speech synthesis apparatus based on prosodic feature prediction includes:
a text obtaining module, configured to obtain text to be synthesized;
a prosodic feature obtaining module, configured to input the text to be synthesized into a preset prosody prediction model, obtain prosodic features of the text to be synthesized as first prosodic features, and determine target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
a speech synthesis module, configured to perform speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
In a third aspect of this application, an intelligent terminal is provided.
An intelligent terminal includes a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:
obtaining text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, obtaining prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
In a fourth aspect of this application, a computer-readable storage medium is provided.
A computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor performs the following steps:
obtaining text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, obtaining prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
Beneficial Effects
Implementing the embodiments of this application has the following beneficial effects:
After the above speech synthesis method, apparatus, intelligent terminal, and computer-readable storage medium based on prosodic feature prediction are adopted, the prosodic features of the text to be synthesized are predicted by the prosody prediction model during speech synthesis, where the predicted prosodic features include prosody-level features such as prosodic word features, prosodic phrase features, and prosodic intonation phrase features. These prosodic features then serve as the basis of speech synthesis, and the target speech corresponding to the text to be synthesized is determined according to them, completing the speech synthesis process. That is to say, in this embodiment, the prosody prediction model can accurately predict prosody-level features such as prosodic word features, prosodic phrase features, and prosodic intonation phrase features, which improves the accuracy of prosodic feature prediction, thereby improving the effect of speech synthesis and the user experience.
Description of Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
In the drawings:
FIG. 1 is an application environment diagram of a speech synthesis method based on prosodic feature prediction according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a speech synthesis method based on prosodic feature prediction according to an embodiment of this application;
FIG. 3 is a schematic diagram of a prosodic feature structure in an embodiment of this application;
FIG. 4 is a schematic flowchart of obtaining the first prosodic features in an embodiment of this application;
FIG. 5 is a schematic diagram of the process of obtaining the first prosodic features in an embodiment of this application;
FIG. 6 is a schematic flowchart of a speech synthesis method based on prosodic feature prediction according to an embodiment of this application;
FIG. 7 is a schematic flowchart of obtaining the second prosodic features in an embodiment of this application;
FIG. 8 is a schematic flowchart of obtaining the target prosodic features in an embodiment of this application;
FIG. 9 is a schematic diagram of the process of obtaining the target prosodic features in an embodiment of this application;
FIG. 10 is a schematic flowchart of training a prosody prediction model in an embodiment of this application;
FIG. 11 is a schematic flowchart of training a prosody prediction model in an embodiment of this application;
FIG. 12 is a schematic structural diagram of a speech synthesis apparatus based on prosodic feature prediction in an embodiment of this application;
FIG. 13 is a schematic structural diagram of a speech synthesis apparatus based on prosodic feature prediction in an embodiment of this application;
FIG. 14 is a schematic structural diagram of a speech synthesis apparatus based on prosodic feature prediction in an embodiment of this application;
FIG. 15 is a schematic structural diagram of a computer device running the above speech synthesis method based on prosodic feature prediction according to an embodiment of this application.
Embodiments of the Present Invention
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is an application environment diagram of a speech synthesis method based on prosodic feature prediction in an embodiment. Referring to FIG. 1, the speech synthesis method based on prosodic feature prediction can be applied to a speech synthesis system. The speech synthesis system includes a terminal 110 and a server 120 connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, and a notebook computer. The server 120 may be implemented as an independent server or a server cluster composed of multiple servers. The terminal 110 is used to analyze and process the text to be synthesized, and the server 120 is used for model training and prediction.
In another embodiment, the speech synthesis system to which the above speech synthesis method based on prosodic feature prediction is applied may also be implemented on the terminal 110 alone. In that case the terminal is used for model training and prediction, and converts the text to be synthesized into speech.
As shown in FIG. 2, in an embodiment, a speech synthesis method based on prosodic feature prediction is provided. The method can be applied to a terminal or a server; this embodiment takes application to a terminal as an example. The speech synthesis method based on prosodic feature prediction specifically includes the following steps:
Step S102: Obtain text to be synthesized.
The text to be synthesized is the text information on which speech synthesis needs to be performed, for example, text that needs to be converted into speech in scenarios such as voice chat robots and voice news reading.
For example, the text to be synthesized may be "自从那一刻起，她便不再妄自菲薄。" ("From that moment on, she no longer belittled herself.").
Step S104: Input the text to be synthesized into a preset prosody prediction model, obtain prosodic features of the text to be synthesized as first prosodic features, and determine target prosodic features according to the first prosodic features.
Text analysis is performed on the text to be synthesized to predict the duration, continuation, pauses, pause lengths, energy, and other properties of a human utterance of the corresponding text; this is the effect that prosody prediction needs to achieve in speech synthesis. In this embodiment, the prosody prediction model predicts the prosodic features of the text to be synthesized based on deep learning or a neural network model, so that the predicted prosodic features can be used by an acoustic encoder to obtain a better speech synthesis effect.
The preset prosody model is a pre-trained neural network model. During training, training texts and the annotated prosodic feature results corresponding to each training text are used to train the model, so that it can predict the prosodic features of the text to be synthesized; the prosodic features obtained by prediction are the first prosodic features. The target prosodic features finally used for speech synthesis can be determined according to the first prosodic features, for example by directly using the first prosodic features as the target prosodic features.
In this embodiment, the prosodic features include prosodic word features (PW), prosodic phrase features (PPH), and prosodic intonation phrase features (IPH).
As shown in FIG. 3, the prosodic hierarchy formed by the prosodic word features, prosodic phrase features, and prosodic intonation phrase features included in the prosodic features is given: the prosodic intonation phrase features are built on the prosodic phrase features, and the prosodic phrase features are built on the prosodic word features.
That is to say, in this embodiment, the process of obtaining the prosodic features of the text to be synthesized through the preset prosody prediction model also covers the prosodic features at each level of the corresponding prosodic hierarchy.
To predict the prosodic features of the text to be synthesized accurately, in this embodiment the input of the preset prosody prediction model is the character vectors corresponding to the text to be synthesized; training the prosody prediction model and predicting the prosodic structure at character granularity can improve the accuracy of prosody prediction and speech synthesis.
In a specific embodiment, after the step of obtaining the text to be synthesized, the method further includes: determining multiple character vectors corresponding to the text to be synthesized. That is to say, the text to be synthesized is processed and divided into multiple character vectors, and these character vectors are then used as the input of the prosody prediction model. In a specific embodiment, the character vectors may be 200-dimensional.
In a specific embodiment, the process of predicting the first prosodic features, which contain the prosodic word features, prosodic phrase features, and prosodic intonation phrase features, is described in detail below.
As shown in FIG. 4, the computation of the first prosodic features includes steps S1041 to S1044:
Step S1041: Input the text to be synthesized into a preset prosodic word prediction model to obtain first prosodic word features;
Step S1042: Input the text to be synthesized and/or the first prosodic word features into a preset prosodic phrase prediction model to obtain first prosodic phrase features;
Step S1043: Input the text to be synthesized, the first prosodic word features, and/or the first prosodic phrase features into a preset prosodic intonation phrase prediction model to obtain first prosodic intonation phrase features;
Step S1044: Use the first prosodic word features, the first prosodic phrase features, and the first prosodic intonation phrase features as the first prosodic features.
As mentioned above, the prosodic features include prosodic word features, prosodic phrase features, and prosodic intonation phrase features. In the process of predicting the prosodic features through the prosody prediction model, the modules of the prosody prediction model corresponding to the prosodic word features, prosodic phrase features, and prosodic intonation phrase features need to predict these feature types respectively.
The above prosody prediction model includes a prosodic word prediction model, a prosodic phrase prediction model, and a prosodic intonation phrase prediction model, which are respectively used to predict the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prosodic structure.
After the text to be synthesized is obtained in step S102, it is first input into the prosodic word prediction model to obtain an output result, which is the first prosodic word features.
In the process of predicting the prosodic phrase features, the text to be synthesized and the above first prosodic word features are input into the preset prosodic phrase prediction model to obtain an output result, which is the first prosodic phrase features.
In the process of predicting the prosodic intonation phrase features, the text to be synthesized together with the above first prosodic word features and first prosodic phrase features are input into the preset prosodic intonation phrase prediction model to obtain an output result, which is the first prosodic intonation phrase features.
The first prosodic word features, first prosodic phrase features, and first prosodic intonation phrase features constitute the first prosodic features.
Moreover, the text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model may be the character vectors corresponding to the text to be synthesized, obtained by processing the text to be synthesized as described above.
As shown in FIG. 5, a schematic flowchart of the generation of the first prosodic features in the above steps S1041 to S1044 is given.
The prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model are respectively used to predict the prosodic features at each level of the prosodic hierarchy, namely the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prosodic structure. This improves the accuracy of prosodic feature prediction; the predicted features are then used as input in the subsequent speech synthesis process to improve the accuracy of speech synthesis.
Step S106: Perform speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
In the speech synthesis step, the prosodic features are used as input, speech synthesis is performed on the prosodic features corresponding to the text to be synthesized through a preset acoustic encoder, and the corresponding target speech is output.
In one embodiment, the first prosodic features may be used directly as the input of the acoustic encoder to determine the corresponding target speech. In other embodiments, further processing may be performed on the first prosodic features to determine the corresponding target prosodic features, and the target prosodic features are then used as the input of the acoustic encoder to synthesize the target speech.
In another optional embodiment, to further improve the accuracy of prosodic feature prediction, the prosodic features may also be further refined through an optimization algorithm.
Specifically, as shown in FIG. 6, the above speech synthesis method based on prosodic feature prediction further includes:
Step S105: Process the first prosodic features through a preset optimization algorithm to obtain second prosodic features corresponding to the first prosodic features, and splice the first prosodic features and the second prosodic features to obtain the target prosodic features.
In this embodiment, after the first prosodic features corresponding to the text to be synthesized are obtained through the preset prosody prediction model, the first prosodic features need to be processed further to improve the accuracy of prosody prediction and of the subsequent speech synthesis.
After the first prosodic features are obtained, they are optimized through the preset optimization algorithm to obtain the corresponding second prosodic features. Optimizing the first prosodic features through the optimization algorithm is a process of optimizing each feature parameter contained in the first prosodic features.
After the optimization, the first prosodic features and the second prosodic features are spliced, and the spliced prosodic features are used as the target prosodic features. Specifically, the second prosodic features are spliced behind the first prosodic features, and the spliced feature vector is used as the target prosodic features.
In the subsequent speech synthesis process, using the target prosodic features obtained through optimization and splicing as the input of the speech synthesis step yields a more accurate speech synthesis result.
In this embodiment, during speech synthesis, the prosodic features of the text to be synthesized are obtained through the prosody prediction model, optimized through the optimization algorithm, and spliced behind the prosodic features output by the prosody prediction model to obtain the spliced target prosodic features; speech synthesis is then performed according to the target prosodic features through the preset acoustic encoder, so as to obtain the speech synthesis result corresponding to the text to be synthesized (that is, the target speech).
In a specific embodiment, in the above step S105, the computation of the second prosodic features may be as shown in FIG. 7:
Step S1051: Process the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
Step S1052: Process the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
Step S1053: Process the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
Step S1054: Use the second prosodic word features, the second prosodic phrase features, and the second prosodic intonation phrase features as the second prosodic features.
That is to say, after the text to be synthesized is obtained in step S102, it is first input into the prosodic word prediction model to obtain an output result, which is the first prosodic word features. Then, to optimize the prosodic word features, the first prosodic word features are optimized through the optimization algorithm to obtain the corresponding second prosodic word features. Finally, the second prosodic word features are spliced behind the first prosodic word features to form a new prosodic word feature vector, which is used as the target prosodic word features.
In the process of predicting the prosodic phrase features, the text to be synthesized and the above first prosodic word features are input into the preset prosodic phrase prediction model to obtain an output result, which is the first prosodic phrase features. Then, to optimize the prosodic phrase features, the first prosodic phrase features are optimized through the optimization algorithm to obtain the corresponding second prosodic phrase features. Finally, the second prosodic phrase features are spliced behind the first prosodic phrase features to form a new prosodic phrase feature vector, which is used as the target prosodic phrase features.
In the process of predicting the prosodic intonation phrase features, the text to be synthesized together with the above first prosodic word features and first prosodic phrase features are input into the preset prosodic intonation phrase prediction model to obtain an output result, which is the first prosodic intonation phrase features. Then, to optimize the prosodic intonation phrase features, the first prosodic intonation phrase features are optimized through the optimization algorithm to obtain the corresponding second prosodic intonation phrase features. Finally, the second prosodic intonation phrase features are spliced behind the first prosodic intonation phrase features to form a new prosodic intonation phrase feature vector, which is used as the target prosodic intonation phrase features.
The second prosodic word features, second prosodic phrase features, and second prosodic intonation phrase features constitute the second prosodic features; the target prosodic word features, target prosodic phrase features, and target prosodic intonation phrase features constitute the target prosodic features.
Moreover, the text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model may be the character vectors corresponding to the text to be synthesized, obtained by processing the text to be synthesized as described above.
In a specific embodiment, the algorithm used above to process the first prosodic features is the Viterbi algorithm.
Further, in a specific embodiment, as shown in FIG. 8, the target prosodic features may also be generated through a combined process based on steps S1041 to S1044 and the optimization algorithm of step S105 (taking the Viterbi algorithm as an example).
Specifically, the process of generating the target prosodic features further includes:
Step S211: Input the text to be synthesized into the preset prosodic word prediction model to obtain the first prosodic word features;
Step S212: Process the first prosodic word features through the Viterbi algorithm to obtain the second prosodic word features corresponding to the first prosodic word features;
Step S213: Splice the first prosodic word features and the second prosodic word features to obtain the target prosodic word features;
Step S221: Input the text to be synthesized and/or the target prosodic word features into the preset prosodic phrase prediction model to obtain the first prosodic phrase features;
Step S222: Process the first prosodic phrase features through the Viterbi algorithm to obtain the second prosodic phrase features corresponding to the first prosodic phrase features;
Step S223: Splice the first prosodic phrase features and the second prosodic phrase features to obtain the target prosodic phrase features;
Step S231: Input the text to be synthesized, the target prosodic word features, and/or the target prosodic phrase features into the preset prosodic intonation phrase prediction model to obtain the first prosodic intonation phrase features;
Step S232: Process the first prosodic intonation phrase features through the Viterbi algorithm to obtain the second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
Step S233: Splice the first prosodic intonation phrase features and the second prosodic intonation phrase features to obtain the target prosodic intonation phrase features;
Step S240: Use the target prosodic word features, the target prosodic phrase features, and the target prosodic intonation phrase features as the target prosodic features.
After the text to be synthesized is obtained in step S102, it is first input into the prosodic word prediction model to obtain an output result, which is the first prosodic word features. Then, to optimize the prosodic word features, the first prosodic word features are optimized through the Viterbi algorithm to obtain the corresponding second prosodic word features. Finally, the second prosodic word features are spliced behind the first prosodic word features to form a new prosodic word feature vector, which is used as the target prosodic word features.
In the process of predicting the prosodic phrase features, the text to be synthesized and the above target prosodic word features are input into the preset prosodic phrase prediction model to obtain an output result, which is the first prosodic phrase features. Then, to optimize the prosodic phrase features, the first prosodic phrase features are optimized through the Viterbi algorithm to obtain the corresponding second prosodic phrase features. Finally, the second prosodic phrase features are spliced behind the first prosodic phrase features to form a new prosodic phrase feature vector, which is used as the target prosodic phrase features.
In the process of predicting the prosodic intonation phrase features, the text to be synthesized together with the above target prosodic word features and target prosodic phrase features are input into the preset prosodic intonation phrase prediction model to obtain an output result, which is the first prosodic intonation phrase features. Then, to optimize the prosodic intonation phrase features, the first prosodic intonation phrase features are optimized through the Viterbi algorithm to obtain the corresponding second prosodic intonation phrase features. Finally, the second prosodic intonation phrase features are spliced behind the first prosodic intonation phrase features to form a new prosodic intonation phrase feature vector, which is used as the target prosodic intonation phrase features.
The second prosodic word features, second prosodic phrase features, and second prosodic intonation phrase features constitute the second prosodic features; the target prosodic word features, target prosodic phrase features, and target prosodic intonation phrase features constitute the target prosodic features.
Moreover, the text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model may be the character vectors corresponding to the text to be synthesized, obtained by processing the text to be synthesized as described above.
As shown in FIG. 9, a schematic flowchart of the generation of the target prosodic features in the above steps S211 to S240 is given.
The prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model are respectively used to predict the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prosodic structure; the Viterbi algorithm is used to optimize the prosodic word features, prosodic phrase features, and prosodic intonation phrase features in the prediction results, and the optimized features are then spliced behind the model outputs. The spliced prosodic features are used as the target prosodic features, which serve as input in the subsequent speech synthesis process to improve the accuracy of speech synthesis.
Further, the above prosody prediction model, prosodic word prediction model, prosodic phrase prediction model, and prosodic intonation phrase prediction model can predict the prosodic features of the text to be synthesized well; before a corresponding model is used for prediction, it also needs to be trained on training data.
Specifically, FIG. 10 gives a schematic flowchart of the training process of a prosody prediction model.
As shown in FIG. 10, the above training process of the prosody prediction model includes steps S302 to S304:
Step S302: Obtain a training data set, where the training data set includes multiple training texts and corresponding prosodic feature reference values;
Step S304: Use the training texts as input and the prosodic feature reference values as output to train the prosody prediction model.
Before model training, the data first needs to be annotated to determine the prosodic features corresponding to each text. For example, a training text needs to be manually annotated into the form of prosodic word, prosodic phrase, and prosodic intonation phrase ground-truth values, that is, the prosodic feature reference values corresponding to that training text are determined.
In a specific embodiment, the data format corresponding to the prosodic feature reference values may be as follows: the annotated text 自从#1那#1一刻起#3，她便#1不再#2妄自菲薄#3 is processed into prosodic word labels (treating #1, #2, and #3 all as prosodic word boundaries): 01100101010001; prosodic phrase labels (#2, #3): 00000100010001; and intonation phrase labels (#3): 00000100000001 (where the corresponding training text is: 自从那一刻起，她便不再妄自菲薄。).
In a specific embodiment, a large number of training texts are manually annotated to obtain the corresponding prosodic feature reference values and determine the training data set. That is to say, the training data set includes multiple training texts and the prosodic feature reference values corresponding to each training text.
For each training text in the training data set, the training text is used as input and the corresponding prosodic feature reference values as output to train the preset prosody prediction model, so that the trained prosody prediction model has the function of prosodic feature prediction.
Further, in this embodiment, the process of training the prosody prediction model also includes separately training the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model.
Specifically, the aforementioned prosodic feature reference values determined by manually annotating the training texts include prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values. The process of separately training the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model includes steps S3041 to S3043 shown in FIG. 11:
Step S3041: Use the training text as input and the prosodic word feature reference values as output to train the prosodic word prediction model;
Step S3042: Use the training text and/or the prosodic word feature reference values as input and the prosodic phrase feature reference values as output to train the prosodic phrase prediction model;
Step S3043: Use the training text and the prosodic phrase feature reference values as input and the prosodic intonation phrase feature reference values as output to train the prosodic intonation phrase prediction model.
That is to say, in the process of training the prosodic word prediction model, the training text is used as input and the prosodic word feature reference values as output, so that the prosodic word prediction model acquires the ability to predict prosodic word features.
In the process of training the prosodic phrase prediction model, the training text and the corresponding prosodic word feature reference values are used as input and the prosodic phrase feature reference values as output, so that the prosodic phrase prediction model acquires the ability to predict prosodic phrase features.
In the process of training the prosodic intonation phrase prediction model, the training text, the prosodic word feature reference values, and the prosodic phrase feature reference values are used as input and the prosodic intonation phrase feature reference values as output, so that the prosodic intonation phrase prediction model acquires the ability to predict prosodic intonation phrase features.
In the above training of the prosody prediction model or of the prosodic word, prosodic phrase, and prosodic intonation phrase prediction models, the training text used as model input may also be the character vectors corresponding to that training text. That is to say, before training a model, the multiple character vectors corresponding to the training text also need to be determined. Then, during training, these character vectors are used as input and the corresponding prosodic feature reference values as output, so that the prosody prediction model or the prosodic word, prosodic phrase, and prosodic intonation phrase prediction models acquire the ability to predict prosodic features.
In the above model training processes, the prosody prediction model and the prosodic word, prosodic phrase, and prosodic intonation phrase prediction models are neural network models; in a specific embodiment, they are bidirectional long short-term memory neural network models (BiLSTM models). The BiLSTM model operates on sequential data (with temporal dependence) and processes the data globally: it can make predictions using the data both before and after each position in the sequence, obtaining more accurate prediction results.
In this embodiment, predicting the prosodic features through the BiLSTM model captures context features more effectively and can improve the accuracy of prosodic feature prediction.
在另一个可选的实施例中,如图12所示,提供了一种基于韵律特征预测的语音合成装置。
如图12所示,上述基于韵律特征预测的语音合成装置包括:
文本获取模块402,用于获取待合成文本;
韵律特征获取模块404,用于获取所述待合成文本的韵律特征作为第一韵律特征,根据所述第一韵律特征确定目标韵律特征,所述待合成文本的韵律特征包括韵律词特征、韵律短语特征、韵律语调短语特征;
语音合成模块406,用于根据所述目标韵律特征进行语音合成,生成与所述待合成文本对应的目标语音。
在一个实施例中,所述韵律特征获取模块404还用于将所述待合成文本输入预设的韵律词预测模型,获取第一韵律词特征;将所述待合成文本和/或所述第一韵律词特征和预设的韵律短语预测模型,获取第一韵律短语特征;将所述待合成文本、第一韵律词特征和/或所述第一韵律短语特征输入预设的韵律语调短语预测模型,获取第一韵律语调短语特征;将所述第一韵律词特征、第一韵律短语特征、第一韵律语调短语特征作为所述第一韵律特征。
在一个实施例中,所述韵律特征获取模块404还用于通过预设的优化算法对所述第一韵律特征进行处理,获取与所述第一韵律特征对应的第二韵律特征;对所述第一韵律特征和所述第二韵律特征进行拼接处理,获取目标韵律特征。
在一个实施例中,所述韵律特征获取模块404还用于通过预设的Viterbi算法对所述第一韵律特征进行处理,获取与所述第一韵律特征对应的第二韵律特征。
在一个实施例中,所述韵律特征获取模块404还用于通过所述预设的优化算法对所述第一韵律词特征进行处理,获取与所述第一韵律词特征对应的第二韵律词特征;通过所述预设的优化算法对所述第一韵律短语特征进行处理,获取与所述第一韵律短语特征对应的第二韵律短语特征;通过所述预设的优化算法对所述第一韵律语调短语特征进行处理,获取与所述韵律语调短语特征对应的第二韵律语调短语特征;将所述第二韵律词特征、第二韵律短语特征、第二韵律语调短语特征作为所述第二韵律特征。
在一个实施例中,所述韵律特征获取模块404还用于通过预设的Viterbi算法,对所述第一韵律特征中包含的特征参数进行优化处理。
在一个实施例中,所述韵律特征获取模块404还用于对所述第一韵律词特征与第二韵律词特征进行拼接,获取目标韵律词特征;对所述第一韵律短语特征与第二韵律短语特征进行拼接,获取目标韵律短语特征;对所述第一韵律语调短语特征与第二韵律语调短语特征进行拼接,获取目标韵律语调短语特征;将所述目标韵律词特征、目标韵律短语特征、目标韵律语调短语特征作为所述目标韵律特征。
In one embodiment, as shown in FIG. 13, the speech synthesis apparatus further includes a text processing module 403, configured to determine multiple character vectors corresponding to the text to be synthesized.
In one embodiment, the prosody prediction model is a BiLSTM model.
In one embodiment, as shown in FIG. 14, the speech synthesis apparatus based on prosodic feature prediction further includes a training sample obtaining module 412 and a model training module 414, where the training sample obtaining module 412 is configured to obtain a training data set, the training data set including multiple training texts and corresponding prosodic feature reference values;
the model training module 414 is configured to train the prosody prediction model with the training texts as input and the prosodic feature reference values as output.
In one embodiment, the training sample obtaining module 412 is further configured to determine multiple character vectors corresponding to the training texts;
the model training module 414 is further configured to train the prosody prediction model with the multiple character vectors corresponding to the training texts as input and the prosodic feature reference values as output.
In one embodiment, the prosodic feature reference values include prosodic word feature reference values, prosodic phrase feature reference values and intonational phrase feature reference values;
the model training module 414 is further configured to: train the prosodic word prediction model with the training texts as input and the prosodic word feature reference values as output; train the prosodic phrase prediction model with the training texts and/or the prosodic word feature reference values as input and the prosodic phrase feature reference values as output; and train the intonational phrase prediction model with the training texts and the prosodic phrase feature reference values as input and the intonational phrase feature reference values as output.
FIG. 15 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be a terminal or a server. As shown in FIG. 15, the computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech synthesis method based on prosodic feature prediction. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech synthesis method based on prosodic feature prediction. Those skilled in the art can understand that the structure shown in FIG. 15 is only a block diagram of the part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a smart terminal is provided, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
obtaining a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model to obtain the prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features and intonational phrase features;
performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
obtaining a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model to obtain the prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, where the prosodic features of the text to be synthesized include prosodic word features, prosodic phrase features and intonational phrase features;
performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
With the above speech synthesis method, apparatus, smart terminal and computer-readable storage medium based on prosodic feature prediction, the prosodic features of the text to be synthesized are predicted by the prosody prediction model during speech synthesis, where the predicted prosodic features include hierarchical prosodic features such as prosodic word features, prosodic phrase features and intonational phrase features. These prosodic features then serve as the basis for speech synthesis, and the target speech corresponding to the text to be synthesized is determined from them, completing the speech synthesis process. That is, in this embodiment, the prosody prediction model can accurately predict hierarchical prosodic features such as prosodic word, prosodic phrase and intonational phrase features, improving the accuracy of prosodic feature prediction, thereby improving the speech synthesis effect and the user experience.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments only express several implementations of the present invention, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the patent scope of the present invention. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. Therefore, the protection scope of the present invention patent shall be subject to the appended claims.

Claims (26)

  1. A speech synthesis method based on prosodic feature prediction, characterized by comprising:
    obtaining a text to be synthesized;
    inputting the text to be synthesized into a preset prosody prediction model to obtain prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features and intonational phrase features;
    performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
  2. The method according to claim 1, characterized in that the step of inputting the text to be synthesized into the preset prosody prediction model and obtaining the prosodic features of the text to be synthesized as the first prosodic features further comprises:
    inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature;
    inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
    inputting the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset intonational phrase prediction model to obtain a first intonational phrase feature;
    taking the first prosodic word feature, the first prosodic phrase feature and the first intonational phrase feature as the first prosodic features.
  3. The method according to claim 2, characterized in that the step of determining the target prosodic features according to the first prosodic features further comprises:
    processing the first prosodic features with a preset optimization algorithm to obtain second prosodic features corresponding to the first prosodic features;
    concatenating the first prosodic features with the second prosodic features to obtain the target prosodic features.
  4. The method according to claim 3, characterized in that the step of processing the first prosodic features with the preset optimization algorithm to obtain the second prosodic features corresponding to the first prosodic features further comprises:
    processing the first prosodic features with a preset Viterbi algorithm to obtain the second prosodic features corresponding to the first prosodic features.
  5. The method according to claim 3, characterized in that the step of processing the first prosodic features with the preset optimization algorithm to obtain the second prosodic features corresponding to the first prosodic features further comprises:
    processing the first prosodic word feature with the preset optimization algorithm to obtain a second prosodic word feature corresponding to the first prosodic word feature;
    processing the first prosodic phrase feature with the preset optimization algorithm to obtain a second prosodic phrase feature corresponding to the first prosodic phrase feature;
    processing the first intonational phrase feature with the preset optimization algorithm to obtain a second intonational phrase feature corresponding to the first intonational phrase feature;
    taking the second prosodic word feature, the second prosodic phrase feature and the second intonational phrase feature as the second prosodic features.
  6. The method according to claim 4, characterized in that the step of processing the first prosodic features with the preset Viterbi algorithm to obtain the second prosodic features corresponding to the first prosodic features further comprises:
    optimizing, with the preset Viterbi algorithm, the feature parameters contained in the first prosodic features.
  7. The method according to claim 5, characterized in that the step of concatenating the first prosodic features with the second prosodic features to obtain the target prosodic features further comprises:
    concatenating the first prosodic word feature with the second prosodic word feature to obtain a target prosodic word feature;
    concatenating the first prosodic phrase feature with the second prosodic phrase feature to obtain a target prosodic phrase feature;
    concatenating the first intonational phrase feature with the second intonational phrase feature to obtain a target intonational phrase feature;
    taking the target prosodic word feature, the target prosodic phrase feature and the target intonational phrase feature as the target prosodic features.
  8. The method according to claim 1, characterized in that, after the step of obtaining the text to be synthesized, the method further comprises:
    determining multiple character vectors corresponding to the text to be synthesized.
  9. The method according to claim 1, characterized in that the prosody prediction model is a BiLSTM model.
  10. The method according to claim 2, characterized in that the method further comprises:
    obtaining a training data set, the training data set comprising multiple training texts and corresponding prosodic feature reference values;
    training the prosody prediction model with the training texts as input and the prosodic feature reference values as output.
  11. The method according to claim 10, characterized in that the step of training the prosody prediction model with the training texts as input and the prosodic feature reference values as output further comprises:
    determining multiple character vectors corresponding to the training texts;
    training the prosody prediction model with the multiple character vectors corresponding to the training texts as input and the prosodic feature reference values as output.
  12. The method according to claim 10, characterized in that the prosodic feature reference values comprise prosodic word feature reference values, prosodic phrase feature reference values and intonational phrase feature reference values;
    the step of training the prosody prediction model with the training texts as input and the prosodic feature reference values as output further comprises:
    training the prosodic word prediction model with the training texts as input and the prosodic word feature reference values as output;
    training the prosodic phrase prediction model with the training texts and/or the prosodic word feature reference values as input and the prosodic phrase feature reference values as output;
    training the intonational phrase prediction model with the training texts and the prosodic phrase feature reference values as input and the intonational phrase feature reference values as output.
  13. A speech synthesis apparatus based on prosodic feature prediction, characterized by comprising:
    a text obtaining module, configured to obtain a text to be synthesized;
    a prosodic feature obtaining module, configured to obtain prosodic features of the text to be synthesized as first prosodic features and determine target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features and intonational phrase features;
    a speech synthesis module, configured to perform speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
  14. The apparatus according to claim 13, characterized in that the prosodic feature obtaining module is further configured to:
    input the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature;
    input the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
    input the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset intonational phrase prediction model to obtain a first intonational phrase feature;
    take the first prosodic word feature, the first prosodic phrase feature and the first intonational phrase feature as the first prosodic features.
  15. The apparatus according to claim 14, characterized in that the prosodic feature obtaining module is further configured to:
    process the first prosodic features with a preset optimization algorithm to obtain second prosodic features corresponding to the first prosodic features;
    concatenate the first prosodic features with the second prosodic features to obtain the target prosodic features.
  16. The apparatus according to claim 15, characterized in that the prosodic feature obtaining module is further configured to:
    process the first prosodic features with a preset Viterbi algorithm to obtain the second prosodic features corresponding to the first prosodic features.
  17. The apparatus according to claim 15, characterized in that the prosodic feature obtaining module is further configured to:
    process the first prosodic word feature with the preset optimization algorithm to obtain a second prosodic word feature corresponding to the first prosodic word feature;
    process the first prosodic phrase feature with the preset optimization algorithm to obtain a second prosodic phrase feature corresponding to the first prosodic phrase feature;
    process the first intonational phrase feature with the preset optimization algorithm to obtain a second intonational phrase feature corresponding to the first intonational phrase feature;
    take the second prosodic word feature, the second prosodic phrase feature and the second intonational phrase feature as the second prosodic features.
  18. The apparatus according to claim 16, characterized in that the prosodic feature obtaining module is further configured to:
    optimize, with the preset Viterbi algorithm, the feature parameters contained in the first prosodic features.
  19. The apparatus according to claim 17, characterized in that the prosodic feature obtaining module is further configured to:
    concatenate the first prosodic word feature with the second prosodic word feature to obtain a target prosodic word feature;
    concatenate the first prosodic phrase feature with the second prosodic phrase feature to obtain a target prosodic phrase feature;
    concatenate the first intonational phrase feature with the second intonational phrase feature to obtain a target intonational phrase feature;
    take the target prosodic word feature, the target prosodic phrase feature and the target intonational phrase feature as the target prosodic features.
  20. The apparatus according to claim 13, characterized in that the apparatus further comprises a text processing module, configured to determine multiple character vectors corresponding to the text to be synthesized.
  21. The apparatus according to claim 13, characterized in that the prosody prediction model is a BiLSTM model.
  22. The apparatus according to claim 14, characterized in that the apparatus further comprises a training sample obtaining module and a model training module, wherein the training sample obtaining module is configured to obtain a training data set, the training data set comprising multiple training texts and corresponding prosodic feature reference values;
    the model training module is configured to train the prosody prediction model with the training texts as input and the prosodic feature reference values as output.
  23. The apparatus according to claim 22, characterized in that the training sample obtaining module is further configured to determine multiple character vectors corresponding to the training texts;
    the model training module is further configured to train the prosody prediction model with the multiple character vectors corresponding to the training texts as input and the prosodic feature reference values as output.
  24. The apparatus according to claim 22, characterized in that the prosodic feature reference values comprise prosodic word feature reference values, prosodic phrase feature reference values and intonational phrase feature reference values;
    the model training module is further configured to: train the prosodic word prediction model with the training texts as input and the prosodic word feature reference values as output; train the prosodic phrase prediction model with the training texts and/or the prosodic word feature reference values as input and the prosodic phrase feature reference values as output; and train the intonational phrase prediction model with the training texts and the prosodic phrase feature reference values as input and the intonational phrase feature reference values as output.
  25. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.
  26. A smart terminal, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.
PCT/CN2019/130741 2019-12-31 2019-12-31 Speech synthesis method and apparatus based on prosodic feature prediction, and terminal and medium WO2021134581A1 (zh)
