CN111226275A - Speech synthesis method, apparatus, terminal and medium based on prosodic feature prediction

Info

Publication number: CN111226275A
Application number: CN201980003386.2A
Authority: CN (China)
Prior art keywords: prosodic, feature, features, phrase, text
Other languages: Chinese (zh)
Inventors: 李贤, 黄东延, 丁万, 张皓, 熊友军
Applicant and current assignee: Ubtech Robotics Corp
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • G10L2013/083 - Special characters, e.g. punctuation marks

Abstract

The application discloses a speech synthesis method based on prosodic feature prediction, comprising the following steps: acquiring a text to be synthesized; inputting the text to be synthesized into a preset prosody prediction model, obtaining the prosodic features of the text as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features and prosodic intonation phrase features; and performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized. The application further discloses a speech synthesis apparatus based on prosodic feature prediction, an intelligent terminal and a computer-readable storage medium. By the method and the apparatus, the accuracy of text prosodic feature prediction can be improved, and the speech synthesis effect improved.

Description

Speech synthesis method, apparatus, terminal and medium based on prosodic feature prediction
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for speech synthesis based on prosodic feature prediction, an intelligent terminal, and a computer-readable storage medium.
Background
With the rapid development of the mobile internet and artificial intelligence technology, speech synthesis scenarios such as voice broadcasting, audiobook reading, news reading and intelligent interaction are becoming increasingly common. Speech synthesis converts text into natural speech output.
During speech synthesis, prosody prediction needs to be performed on the text. Prosody affects the naturalness and fluency of pronunciation; a good prosody prediction result gives the synthesized speech pauses closer to those of natural human speech, making it sound more natural.
Conventional prosody prediction schemes, however, mainly train and apply neural network models on acoustic features such as Chinese phonemes. The prosodic features predicted in this way deviate from the real prosodic features, so the accuracy of prosody prediction is limited.
That is, in such speech synthesis schemes, the synthesized speech falls short because the prosody prediction is not accurate enough.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method and an apparatus for speech synthesis based on prosodic feature prediction, an intelligent terminal, and a computer-readable storage medium.
In a first aspect of the present application, a method for speech synthesis based on prosodic feature prediction is provided.
A speech synthesis method based on prosodic feature prediction comprises the following steps:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
The step of inputting the text to be synthesized into a preset prosody prediction model and acquiring prosody features of the text to be synthesized as first prosody features further includes:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word characteristic;
inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
inputting the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
In a second aspect of the present application, a speech synthesis apparatus based on prosodic feature prediction is presented.
A speech synthesis apparatus based on prosodic feature prediction, comprising:
the text acquisition module is used for acquiring a text to be synthesized;
the prosodic feature acquisition module is used for inputting the text to be synthesized into a preset prosodic prediction model, acquiring prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features and prosodic intonation phrase features;
and the voice synthesis module is used for carrying out voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In a third aspect of the present application, a smart terminal is provided.
A smart terminal comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In a fourth aspect of the present application, a computer-readable storage medium is presented.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
The embodiment of the application has the following beneficial effects:
After adopting the above speech synthesis method and apparatus based on prosodic feature prediction, the intelligent terminal and the computer-readable storage medium, the prosodic features of a text to be synthesized are predicted by a prosody prediction model during speech synthesis. The predicted prosodic features cover the prosodic hierarchy: prosodic word features, prosodic phrase features and prosodic intonation phrase features. These prosodic features serve as the basis of speech synthesis; the target speech corresponding to the text to be synthesized is then generated from them, completing the synthesis process. In other words, in this embodiment the prosody prediction model can accurately predict the prosodic word features, prosodic phrase features, prosodic intonation phrase features and other features at each level of the prosodic hierarchy, which improves the accuracy of prosodic feature prediction, the speech synthesis effect and the user experience.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Wherein:
FIG. 1 is a diagram of an application environment of a prosodic feature prediction-based speech synthesis method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for speech synthesis based on prosodic feature prediction according to an embodiment of the present application;
FIG. 3 is a schematic diagram of prosodic features in an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating the first prosodic feature acquisition according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a first prosodic feature acquisition process according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for speech synthesis based on prosodic feature prediction according to an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating the second prosodic feature acquisition according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a process of obtaining a target prosodic feature according to an embodiment of the present application;
FIG. 9 is a diagram illustrating a process of obtaining a target prosodic feature according to an embodiment of the present application;
FIG. 10 is a schematic flow chart illustrating prosody prediction model training in one embodiment of the present application;
FIG. 11 is a schematic flow chart illustrating prosody prediction model training in one embodiment of the present application;
FIG. 12 is a block diagram of a speech synthesis apparatus based on prosodic feature prediction according to an embodiment of the present application;
FIG. 13 is a block diagram of a speech synthesis apparatus based on prosodic feature prediction according to an embodiment of the present application;
FIG. 14 is a block diagram of a speech synthesis apparatus based on prosodic feature prediction according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a computer device for executing the prosodic feature prediction-based speech synthesis method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present invention.
FIG. 1 shows an application environment of the prosodic feature prediction-based speech synthesis method in one embodiment. Referring to fig. 1, the method may be applied in a speech synthesis system comprising a terminal 110 and a server 120 connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, the mobile terminal being, for example, a mobile phone, a tablet computer or a notebook computer. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. The terminal 110 analyzes and processes the text that needs to be synthesized, while the server 120 trains the models and performs prediction.
In another embodiment, the speech synthesis system applied by the above-mentioned speech synthesis method based on prosodic feature prediction may also be implemented based on the terminal 110. The terminal is used for training and predicting the model and converting the text needing to be synthesized into voice.
In one embodiment, as shown in FIG. 2, a method for speech synthesis based on prosodic feature prediction is provided. The method can be applied to either the terminal or the server; this embodiment takes application at the terminal as an example. The method specifically comprises the following steps:
step S102: and acquiring a text to be synthesized.
The text to be synthesized is the text information that needs to be converted to speech, for example in scenarios such as a voice chat robot or voice news broadcasting.
Illustratively, the text to be synthesized may be "She is no longer delphine since that moment."
Step S104: inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features.
Text analysis is performed on the text to be synthesized to determine the duration, continuation, pause length, energy and similar properties of the corresponding speech, which is what prosody prediction must provide during speech synthesis. In this embodiment, the prosody prediction model predicts the prosodic features of the text to be synthesized based on a deep learning or neural network model, so that the predicted prosodic features can be used by an acoustic encoder to achieve a better synthesis result.
The preset prosody prediction model is a neural network model trained in advance. During model training, it is trained on training texts and the labeled prosodic feature results corresponding to each training text, so that it can predict the prosodic features of a text to be synthesized; the features so obtained are the first prosodic features. The target prosodic features finally used for speech synthesis are determined from the first prosodic features, e.g. by using the first prosodic features directly as the target prosodic features.
In the present embodiment, the prosodic features include prosodic word features (PW), prosodic phrase features (PPH), and prosodic intonation phrase features (IPH).
As shown in fig. 3, a prosodic hierarchy corresponding to prosodic word features, prosodic phrase features, and prosodic intonation phrase features included in the prosodic features is given. The prosodic intonation phrase features are based on the prosodic phrase features, and the prosodic phrase features are based on the prosodic word features.
That is, the prosodic features obtained for the text to be synthesized through the preset prosody prediction model include the features at each level of this prosodic hierarchy.
In order to predict the prosodic features of the text to be synthesized accurately, in this embodiment the input of the preset prosody prediction model is the word vectors corresponding to the text to be synthesized; training the model and predicting the prosodic structure at word granularity improves the accuracy of prosody prediction and speech synthesis.
In a specific embodiment, after the step of obtaining the text to be synthesized, the method further includes: determining a plurality of word vectors corresponding to the text to be synthesized. That is, the text to be synthesized is segmented into a plurality of word vectors, which are then used as the input of the prosody prediction model. In a specific embodiment, the word vectors may be 200-dimensional.
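As an illustration only, the following minimal sketch shows how such input might be prepared; the word segmenter segment and the embedding table embeddings are assumptions, since the application does not specify how the word vectors are produced.

    import numpy as np

    EMBED_DIM = 200  # the embodiment mentions 200-dimensional word vectors

    def text_to_word_vectors(text, segment, embeddings):
        """Segment the text to be synthesized and look up one vector per word,
        producing the input matrix of the prosody prediction model."""
        words = segment(text)  # e.g. a Chinese word segmenter
        vectors = [embeddings.get(w, np.zeros(EMBED_DIM, dtype=np.float32))
                   for w in words]  # unknown words fall back to a zero vector
        return np.stack(vectors)   # shape: (num_words, EMBED_DIM)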
In a specific embodiment, a detailed description is given of a prediction process of a first prosodic feature including a prosodic word feature, a prosodic phrase feature, and a prosodic intonation phrase feature:
as shown in fig. 4, the calculation process of the first prosodic feature includes steps S1041-S1044 as shown in fig. 4:
step S1041: inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word characteristic;
step S1042: inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
step S1043: inputting the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
step S1044: and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
As mentioned above, the prosodic features include prosodic word features, prosodic phrase features and prosodic intonation phrase features; in predicting them with the prosody prediction model, each of these features is predicted by its corresponding module within the model.
The prosodic prediction model comprises a prosodic word prediction model, a prosodic phrase prediction model and a prosodic intonation phrase prediction model, and is used for predicting prosodic word features, prosodic phrase features and prosodic intonation phrase features in prosodic structure composition respectively.
After the text to be synthesized is obtained in step S102, the text to be synthesized is input into the prosodic word prediction model, and an output result is obtained, where the output result is the first prosodic word feature.
In the process of predicting the prosodic phrase features, inputting a text to be synthesized and the first prosodic word features into a preset prosodic phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic phrase features.
In the process of predicting prosodic intonation phrase characteristics, inputting a text to be synthesized, the first prosodic word characteristics and the first prosodic phrase characteristics into a preset prosodic intonation phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic intonation phrase characteristics.
The first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature form a first prosodic feature.
The text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model may take the form of the word vectors obtained by processing the text to be synthesized as described above.
As shown in fig. 5, a schematic flow chart of the generation process of the first prosodic feature in the above steps S1041-S1044 is given.
The prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model respectively predict the features at the prosodic word, prosodic phrase and prosodic intonation phrase levels of the prosodic structure. This improves the accuracy of prosodic feature prediction and, since the predicted features serve as input to the subsequent speech synthesis, the accuracy of speech synthesis as well.
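For illustration, a minimal sketch of the cascade of steps S1041-S1044, assuming three already trained callables pw_model, pph_model and iph_model that map an input matrix to a per-word feature array; the names, and the choice of feeding every earlier output to every later model (one reading of the "and/or" above), are assumptions.

    import numpy as np

    def predict_first_prosodic_features(word_vectors, pw_model, pph_model, iph_model):
        # Prosodic word (PW) features are predicted from the text alone.
        pw = pw_model(word_vectors)
        # Prosodic phrase (PPH) features use the text plus the PW result.
        pph = pph_model(np.concatenate([word_vectors, pw], axis=-1))
        # Intonation phrase (IPH) features use the text, PW and PPH results.
        iph = iph_model(np.concatenate([word_vectors, pw, pph], axis=-1))
        return pw, pph, iph  # together these form the first prosodic feature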
Step S106: and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In the speech synthesis step, the prosodic features corresponding to the text to be synthesized are taken as input, speech synthesis is performed by a preset acoustic encoder, and the corresponding target speech is output.
In one embodiment, the first prosodic feature may be directly used as an input to the acoustic encoder to determine the corresponding target speech. In other embodiments, the first prosodic feature may be further processed to determine a corresponding target prosodic feature, and then the target prosodic feature may be used as an input to an acoustic encoder to synthesize the target speech.
In another alternative embodiment, in order to further improve the accuracy of the prosodic feature prediction, the prosodic feature may be further optimized through an optimization algorithm.
Specifically, as shown in fig. 6, the method for synthesizing speech based on prosodic feature prediction further includes:
step S105: processing the first prosodic feature through a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature, and splicing the first prosodic feature and the second prosodic feature to obtain the target prosodic feature.
In this embodiment, after the first prosody feature corresponding to the text to be synthesized is obtained through the preset prosody prediction model, the first prosody feature needs to be further processed to improve the accuracy of prosody prediction and subsequent speech synthesis.
After the first prosodic feature is obtained, the first prosodic feature is optimized through a preset optimization algorithm, and a corresponding second prosodic feature is obtained. The process of optimizing the first prosodic feature by the optimization algorithm is a process of optimizing each feature parameter included in the first prosodic feature.
After the optimization algorithm is used for optimizing the first prosodic feature, the first prosodic feature and the second prosodic feature are spliced, and the spliced prosodic feature is obtained and used as a target prosodic feature. Specifically, the second prosodic feature is spliced behind the first prosodic feature, and the spliced feature vector is obtained and used as the target prosodic feature.
In the subsequent voice synthesis process, the target prosodic features processed by the optimization algorithm and the splicing are used as input in the subsequent voice synthesis step, so that a voice synthesis result with better accuracy can be obtained.
In this embodiment, during speech synthesis the prosodic features of the text to be synthesized are obtained through the prosody prediction model; the obtained features are then refined by the optimization algorithm and spliced behind the features output by the model, yielding the spliced target prosodic features. Speech synthesis is then performed by a preset acoustic encoder according to the target prosodic features, producing the speech synthesis result (i.e. the target speech) corresponding to the text to be synthesized.
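A minimal sketch of the splicing itself, assuming both features are arrays whose last axis is the feature dimension:

    import numpy as np

    def splice(first, second):
        # The second (optimized) feature is concatenated behind the first,
        # so the acoustic encoder receives a vector of doubled width.
        return np.concatenate([first, second], axis=-1)

    # e.g. first: (num_words, d), second: (num_words, d) -> (num_words, 2*d)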
In a specific embodiment, in step S105, the second prosodic feature may be calculated as shown in fig. 7:
step S1051: processing the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
step S1052: processing the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
step S1053: processing the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
step S1054: and taking the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
That is to say, after the text to be synthesized is obtained in step S102, the text to be synthesized is input into the prosodic word prediction model, and an output result is obtained, where the output result is the first prosodic word feature. Then, in order to optimize the prosodic word features, the first prosodic word features are further optimized through an optimization algorithm to obtain corresponding second prosodic word features. And finally, splicing the second prosodic word features behind the first prosodic word features to form a new prosodic word feature vector as the target prosodic word features.
In the process of predicting the prosodic phrase features, inputting a text to be synthesized and the first prosodic word features into a preset prosodic phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic phrase features. Then, in order to optimize the prosodic phrase features, the first prosodic phrase features are optimized through an optimization algorithm to obtain corresponding second prosodic phrase features. And finally, splicing the second prosodic phrase features behind the first prosodic phrase features to form a new prosodic phrase feature vector as the target prosodic phrase features.
In the process of predicting prosodic intonation phrase characteristics, inputting a text to be synthesized, the first prosodic word characteristics and the first prosodic phrase characteristics into a preset prosodic intonation phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic intonation phrase characteristics. Then, in order to optimize the prosodic intonation phrase characteristics, the first prosodic intonation phrase characteristics are optimized through an optimization algorithm to obtain corresponding second prosodic intonation phrase characteristics. And finally, splicing the second prosodic intonation phrase characteristics to the back of the first prosodic intonation phrase characteristics to form a new prosodic intonation phrase characteristic vector serving as the target prosodic intonation phrase characteristics.
The second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature form a second prosodic feature; and the target prosodic word characteristics, the target prosodic phrase characteristics and the target prosodic intonation phrase characteristics form target prosodic characteristics.
The text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model may take the form of the word vectors obtained by processing the text to be synthesized as described above.
In a specific embodiment, the algorithm for processing the first prosodic feature is a Viterbi algorithm.
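The application does not detail how the Viterbi algorithm is applied. The sketch below assumes the prediction model supplies per-word label probabilities emit (num_words x num_labels) and that a label transition matrix trans is available, for example estimated from the training labels:

    import numpy as np

    def viterbi(emit, trans):
        """Return the globally best label sequence for per-word probabilities."""
        n, k = emit.shape
        score = np.log(emit[0] + 1e-12)       # best log-score per ending label
        back = np.zeros((n, k), dtype=int)    # backpointers
        for t in range(1, n):
            cand = score[:, None] + np.log(trans + 1e-12)  # (prev, cur) scores
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + np.log(emit[t] + 1e-12)
        path = [int(score.argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]                     # e.g. 0/1 break labels per word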
Further, in a specific embodiment, as shown in fig. 8, the target prosodic features may be generated by a combined process based on steps S1041 to S1044 and the optimization algorithm (e.g. the Viterbi algorithm) of step S105.
Specifically, the process of generating the target prosodic feature further includes:
step S211: inputting a text to be synthesized into a preset prosodic word prediction model to obtain first prosodic word characteristics;
step S212: processing the first prosodic word features through a Viterbi algorithm to acquire second prosodic word features corresponding to the first prosodic word features;
step S213: splicing the first prosodic word features and the second prosodic word features to obtain target prosodic word features;
step S221: inputting a text to be synthesized and/or target prosodic word characteristics into a preset prosodic phrase prediction model to obtain first prosodic phrase characteristics;
step S222: processing the first prosodic phrase features through a Viterbi algorithm to acquire second prosodic phrase features corresponding to the first prosodic phrase features;
step S223: splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features;
step S231: inputting a text to be synthesized, target prosodic word characteristics and/or target prosodic phrase characteristics into a preset prosodic intonation phrase prediction model to obtain first prosodic intonation phrase characteristics;
step S232: processing the first prosodic intonation phrase features through a Viterbi algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
step S233: splicing the first prosodic intonation phrase characteristics and the second prosodic intonation phrase characteristics to obtain target prosodic intonation phrase characteristics;
step S240: and taking the target prosodic word characteristics, the target prosodic phrase characteristics and the target prosodic intonation phrase characteristics as target prosodic characteristics.
After the text to be synthesized is obtained in step S102, the text to be synthesized is input into the prosodic word prediction model, and an output result is obtained, where the output result is the first prosodic word feature. Then, in order to optimize the prosodic word features, the first prosodic word features are further optimized through a Viterbi algorithm to obtain corresponding second prosodic word features. And finally, splicing the second prosodic word features behind the first prosodic word features to form a new prosodic word feature vector as the target prosodic word features.
In the process of predicting the prosodic phrase characteristics, inputting a text to be synthesized and the target prosodic word characteristics into a preset prosodic phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic phrase characteristics. Then, in order to optimize the prosodic phrase characteristics, the first prosodic phrase characteristics are optimized through a Viterbi algorithm, and corresponding second prosodic phrase characteristics are obtained. And finally, splicing the second prosodic phrase features behind the first prosodic phrase features to form a new prosodic phrase feature vector as the target prosodic phrase features.
In the process of predicting prosodic intonation phrase characteristics, inputting a text to be synthesized, the target prosodic word characteristics and the target prosodic phrase characteristics into a preset prosodic intonation phrase prediction model, and obtaining an output result, wherein the output result is the first prosodic intonation phrase characteristics. Then, in order to optimize the prosodic intonation phrase characteristics, the first prosodic intonation phrase characteristics are optimized through a Viterbi algorithm to obtain corresponding second prosodic intonation phrase characteristics. And finally, splicing the second prosodic intonation phrase characteristics to the back of the first prosodic intonation phrase characteristics to form a new prosodic intonation phrase characteristic vector serving as the target prosodic intonation phrase characteristics.
The second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature form a second prosodic feature; and the target prosodic word characteristics, the target prosodic phrase characteristics and the target prosodic intonation phrase characteristics form target prosodic characteristics.
The text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model may take the form of the word vectors obtained by processing the text to be synthesized as described above.
As shown in fig. 9, a flow chart of the generation process of the target prosody feature in the above steps S211 to S240 is given.
The prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model respectively predict the prosodic word features, prosodic phrase features and prosodic intonation phrase features of the prosodic structure. The features in each prediction result are optimized by the Viterbi algorithm and spliced behind the corresponding model output, and the spliced prosodic features are taken as the target prosodic features, which serve as input to the subsequent speech synthesis and improve its accuracy.
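Putting steps S211 to S240 together for a single prosodic level, reusing the viterbi sketch above; the softmax conversion and the representation of the second feature as the one-hot decoded label sequence are assumptions made for illustration:

    import numpy as np

    def target_feature_for_level(inputs, model, trans):
        first = model(inputs)                 # first feature, (n, k) raw scores
        e = np.exp(first - first.max(axis=-1, keepdims=True))
        probs = e / e.sum(axis=-1, keepdims=True)        # softmax probabilities
        path = viterbi(probs, trans)                     # Viterbi-optimized labels
        second = np.eye(first.shape[-1])[path]           # one-hot second feature
        return np.concatenate([first, second], axis=-1)  # spliced target feature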
Furthermore, in order for the prosody prediction model, or the prosodic word, prosodic phrase and prosodic intonation phrase prediction models, to predict the prosodic features of a text to be synthesized well, the corresponding models must first be trained on training data.
Specifically, as shown in fig. 10, a flow chart of a prosody prediction model training process is shown.
As shown in FIG. 10, the prosody prediction model training process includes steps S302-S304:
step S302: acquiring a training data set, wherein the training data set comprises a plurality of training texts and corresponding prosody feature reference values;
step S304: and taking the training text as input and the prosody feature reference value as output, and training the prosody prediction model.
Before model training, the data must first be annotated to determine the prosodic features corresponding to each text. For example, a training text is processed by manual labeling into ground-truth prosodic word, prosodic phrase and prosodic intonation phrase values; that is, the prosodic feature reference values corresponding to that training text are determined.
In a specific embodiment, the prosodic feature reference values may take the following form. The manually marked text "Since #1 that #1 moment #3, she is no longer #2 delphine #3" is processed into binary label sequences: prosodic words (treating #1, #2 and #3 all as prosodic word boundaries): 01100101010001; prosodic phrases (#2, #3): 00000100010001; intonation phrases (#3): 00000100000001 (the corresponding training text being "She is no longer delphine since that moment").
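A sketch of converting such marked-up text into the three binary reference sequences; the parsing conventions (one label per character, a marker closing the character before it) are assumptions consistent with the example above:

    import re

    def parse_labels(marked_text):
        """Convert text with #1/#2/#3 break markers into PW/PPH/IPH label lists."""
        pw, pph, iph = [], [], []
        for tok in re.split(r'(#[123])', marked_text):
            if not tok:
                continue
            if tok in ('#1', '#2', '#3'):
                level = int(tok[1])   # markers are assumed to follow a character
                pw[-1] = 1            # #1, #2 and #3 all end a prosodic word
                if level >= 2:
                    pph[-1] = 1       # #2 and #3 also end a prosodic phrase
                if level == 3:
                    iph[-1] = 1       # #3 also ends an intonation phrase
            else:
                for _ in tok:         # one 0 per character by default
                    pw.append(0); pph.append(0); iph.append(0)
        return pw, pph, iph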
In a specific embodiment, a large number of training texts are labeled manually, corresponding prosodic feature reference values are obtained, and a training data set is determined. That is, the training data set includes a plurality of training texts and prosody feature reference values corresponding to each of the training texts.
For each training text contained in the training data set, the training text is used as input and the corresponding prosodic feature reference values as output to train the preset prosody prediction model, so that the model acquires the function of prosodic feature prediction.
Further, in this embodiment, the process of training the prosodic prediction model further includes a process of training the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model, respectively.
Specifically, the prosodic feature reference values determined by manually labeling the training texts include prosodic word feature reference values, prosodic phrase feature reference values and prosodic intonation phrase feature reference values. Training the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model respectively includes steps S3041 to S3043 shown in fig. 11:
step S3041: training a prosodic word prediction model by taking the training text as input and taking a prosodic word feature reference value as output;
step S3042: training a prosodic phrase prediction model by taking the training text and/or prosodic word feature reference value as input and taking the prosodic phrase feature reference value as output;
step S3043: and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
That is, in the process of training the prosodic word prediction model, the training text is used as input, the prosodic word feature reference value is used as output, and the prosodic word prediction model is trained so that the prosodic word prediction model has the capability of predicting prosodic word features.
In the process of training the prosodic phrase prediction model, the training text and the corresponding prosodic word feature reference value are used as input, the prosodic phrase feature reference value is used as output, and the prosodic phrase prediction model is trained so that the prosodic phrase prediction model has the capability of predicting prosodic phrase features.
In the process of training the prosodic intonation phrase prediction model, the training text, the prosodic word feature reference values and the prosodic phrase feature reference values are used as input and the prosodic intonation phrase feature reference values as output, so that the prosodic intonation phrase prediction model acquires the capability of predicting prosodic intonation phrase features.
In the above training of the prosody prediction model, or of the prosodic word, prosodic phrase and prosodic intonation phrase prediction models, the training text fed to a model may also be represented as the word vectors corresponding to that text. That is, before training, the word vectors corresponding to each training text must be determined; these word vectors are then used as input, with the corresponding prosodic feature reference values as output, to train the models so that they acquire the ability to predict prosodic features.
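For illustration, a sketch of the cascaded training regime of steps S3041-S3043, assuming a generic fit(model, inputs, targets) training routine and per-word reference label arrays produced by the labeling step; all names are illustrative:

    import numpy as np

    def train_cascade(word_vecs, pw_ref, pph_ref, iph_ref,
                      pw_model, pph_model, iph_model, fit):
        pw_ref, pph_ref = np.asarray(pw_ref), np.asarray(pph_ref)
        # 1. Prosodic word model: word vectors -> PW reference labels.
        fit(pw_model, word_vecs, pw_ref)
        # 2. Prosodic phrase model: word vectors + PW references -> PPH labels.
        x2 = np.concatenate([word_vecs, pw_ref[:, None]], axis=-1)
        fit(pph_model, x2, pph_ref)
        # 3. Intonation phrase model: word vectors + PW + PPH references -> IPH.
        x3 = np.concatenate([word_vecs, pw_ref[:, None], pph_ref[:, None]], axis=-1)
        fit(iph_model, x3, iph_ref)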
In the above training processes, the prosody prediction model and the prosodic word, prosodic phrase and prosodic intonation phrase prediction models are neural network models; in a specific embodiment they are bidirectional long short-term memory networks (BiLSTM models). A BiLSTM operates on sequence data (which has temporal dependency) and processes it globally: a prediction at any position can draw on both the preceding and the following data in the sequence, which yields more accurate predictions.
In this embodiment, predicting the prosodic features with a BiLSTM model captures the context features more effectively and thus improves the accuracy of prosodic feature prediction.
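A minimal BiLSTM tagger of the kind described, written in PyTorch as an illustration; the application does not disclose layer sizes or other hyperparameters:

    import torch
    import torch.nn as nn

    class BiLSTMProsodyTagger(nn.Module):
        def __init__(self, input_dim=200, hidden=128, num_labels=2):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden, batch_first=True,
                                bidirectional=True)   # reads context both ways
            self.out = nn.Linear(2 * hidden, num_labels)

        def forward(self, word_vectors):              # (batch, num_words, input_dim)
            h, _ = self.lstm(word_vectors)            # (batch, num_words, 2*hidden)
            return self.out(h)                        # per-word break/no-break logits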
In another alternative embodiment, as shown in fig. 12, a speech synthesis apparatus based on prosodic feature prediction is provided.
As shown in fig. 12, the speech synthesis apparatus based on prosodic feature prediction includes:
a text obtaining module 402, configured to obtain a text to be synthesized;
a prosodic feature obtaining module 404, configured to input the text to be synthesized into a preset prosody prediction model, obtain a prosodic feature of the text to be synthesized as a first prosodic feature, and determine a target prosodic feature according to the first prosodic feature, where the prosodic feature of the text to be synthesized includes a prosodic word feature, a prosodic phrase feature, and a prosodic intonation phrase feature;
and a speech synthesis module 406, configured to perform speech synthesis according to the target prosody feature, and generate a target speech corresponding to the text to be synthesized.
In an embodiment, the prosodic feature obtaining module 404 is further configured to input the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature; input the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature; input the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature; and take the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to process the first prosodic feature through a preset optimization algorithm, and obtain a second prosodic feature corresponding to the first prosodic feature; and splicing the first prosodic feature and the second prosodic feature to obtain a target prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to process the first prosodic feature through a preset Viterbi algorithm to obtain a second prosodic feature corresponding to the first prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to process the first prosodic word feature through the preset optimization algorithm to obtain a second prosodic word feature corresponding to the first prosodic word feature; process the first prosodic phrase feature through the preset optimization algorithm to obtain a second prosodic phrase feature corresponding to the first prosodic phrase feature; process the first prosodic intonation phrase feature through the preset optimization algorithm to obtain a second prosodic intonation phrase feature corresponding to the first prosodic intonation phrase feature; and take the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to perform an optimization process on the feature parameters included in the first prosodic feature through a preset Viterbi algorithm.
In one embodiment, the prosodic feature obtaining module 404 is further configured to splice the first prosodic word feature and the second prosodic word feature to obtain a target prosodic word feature; splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features; splicing the first prosodic intonation phrase characteristics and the second prosodic intonation phrase characteristics to obtain target prosodic intonation phrase characteristics; and taking the target prosodic word features, the target prosodic phrase features and the target prosodic intonation phrase features as the target prosodic features.
In one embodiment, as shown in fig. 13, the speech synthesis apparatus further includes a text processing module 403, configured to determine a plurality of word vectors corresponding to the text to be synthesized.
In one embodiment, the prosody prediction model is a BiLSTM model.
In an embodiment, as shown in fig. 14, the apparatus for synthesizing speech based on prosodic feature prediction further includes a training sample obtaining module 412 and a model training module 414, where the training sample obtaining module 412 is configured to obtain a training data set, and the training data set includes a plurality of training texts and corresponding prosodic feature reference values;
the model training module 414 is configured to train the prosody prediction model by using the training text as input and the prosody feature reference value as output.
In one embodiment, the training sample obtaining module 412 is further configured to determine a plurality of word vectors corresponding to the training text;
the model training module 414 is further configured to train the prosody prediction model by taking the plurality of word vectors corresponding to the training text as input and the prosody feature reference value as output.
In one embodiment, the prosodic feature reference values include prosodic word feature reference values, prosodic phrase feature reference values, prosodic intonation phrase feature reference values;
the model training module 414 is further configured to train the prosodic word prediction model by using the training text as input and the prosodic word feature reference value as output; taking the training text and/or the prosodic word feature reference value as input, taking the prosodic phrase feature reference value as output, and training the prosodic phrase prediction model; and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
FIG. 15 is a diagram showing the internal structure of a computer device in one embodiment. The computer device may specifically be a terminal or a server. As shown in fig. 15, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the prosodic feature prediction-based speech synthesis method; the internal memory may likewise store such a computer program. Those skilled in the art will appreciate that the architecture shown in fig. 15 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or arrange them differently.
In one embodiment, a smart terminal is proposed, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In one embodiment, a computer-readable storage medium is proposed, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
After adopting the above speech synthesis method and apparatus based on prosodic feature prediction, the intelligent terminal and the computer-readable storage medium, the prosodic features of a text to be synthesized are predicted by a prosody prediction model during speech synthesis. The predicted prosodic features cover the prosodic hierarchy: prosodic word features, prosodic phrase features and prosodic intonation phrase features. These prosodic features serve as the basis of speech synthesis; the target speech corresponding to the text to be synthesized is then generated from them, completing the synthesis process. In other words, in this embodiment the prosody prediction model can accurately predict the prosodic word features, prosodic phrase features, prosodic intonation phrase features and other features at each level of the prosodic hierarchy, which improves the accuracy of prosodic feature prediction, the speech synthesis effect and the user experience.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (26)

1. A speech synthesis method based on prosodic feature prediction, comprising:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
and performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
2. The method according to claim 1, wherein the step of inputting the text to be synthesized into a preset prosody prediction model and acquiring prosodic features of the text to be synthesized as first prosodic features further comprises:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature;
inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
inputting the text to be synthesized, the first prosodic word feature, and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
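Claim 2 describes a cascade in which each prosodic level is predicted conditioned on the levels below it. The following is a minimal sketch of that wiring, assuming each level is a per-token boundary tagger (1 = prosodic boundary after the token, 0 = none); the tagger bodies here return fixed dummy labels purely for illustration and are not the disclosed models.

```python
# Dummy stand-ins for the three trained predictors of claim 2; each maps a
# token sequence (plus lower-level labels) to one boundary label per token.
def predict_word_boundaries(tokens):
    return [0, 1, 0, 1]

def predict_phrase_boundaries(tokens, word_labels):
    return [0, 0, 0, 1]

def predict_intonation_boundaries(tokens, word_labels, phrase_labels):
    return [0, 0, 0, 1]

tokens = ["今天", "天气", "真", "好"]
w = predict_word_boundaries(tokens)              # first prosodic word features
p = predict_phrase_boundaries(tokens, w)         # conditioned on the word level
i = predict_intonation_boundaries(tokens, w, p)  # conditioned on both lower levels
first_prosodic_features = {"word": w, "phrase": p, "intonation_phrase": i}
```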
3. The method of claim 2, wherein the step of determining a target prosodic feature based on the first prosodic feature further comprises:
processing the first prosodic feature through a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature;
and splicing the first prosodic feature and the second prosodic feature to obtain a target prosodic feature.
4. The method according to claim 3, wherein the step of processing the first prosodic feature by a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature further comprises:
and processing the first prosodic feature through a preset Viterbi algorithm to acquire a second prosodic feature corresponding to the first prosodic feature.
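Claim 4 names the Viterbi algorithm as the preset optimization. One standard formulation is sketched below, under the assumption that the first prosodic features are per-token label scores and that a label-transition matrix is available; the patent does not specify where the transition scores come from, so the toy values here are illustrative only.

```python
import numpy as np

def viterbi(emission_log_probs, transition_log_probs):
    """Return the globally best label sequence.

    emission_log_probs:   (T, K) per-token label scores from the prediction model
    transition_log_probs: (K, K) label-to-label transition scores
    """
    T, K = emission_log_probs.shape
    score = np.zeros((T, K))            # best score of any path ending in each label
    back = np.zeros((T, K), dtype=int)  # backpointers for path recovery
    score[0] = emission_log_probs[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition_log_probs  # (K, K)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + emission_log_probs[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emissions = np.log([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
transitions = np.log([[0.7, 0.3], [0.6, 0.4]])
print(viterbi(emissions, transitions))  # -> [0, 0, 0] for these toy scores
```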
5. The method according to claim 3, wherein the step of processing the first prosodic feature by a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature further comprises:
processing the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
processing the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
processing the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
and taking the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
6. The method of claim 4, wherein the step of processing the first prosodic feature by a preset Viterbi algorithm to obtain a second prosodic feature corresponding to the first prosodic feature further comprises:
and optimizing the characteristic parameters contained in the first prosodic characteristics through a preset Viterbi algorithm.
7. The method according to claim 5, wherein the step of performing splicing processing on the first prosodic feature and the second prosodic feature to obtain the target prosodic feature further comprises:
splicing the first prosodic word features and the second prosodic word features to obtain target prosodic word features;
splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features;
splicing the first prosodic intonation phrase features and the second prosodic intonation phrase features to obtain target prosodic intonation phrase features;
and taking the target prosodic word features, the target prosodic phrase features and the target prosodic intonation phrase features as the target prosodic features.
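The splicing step of claims 3 and 7 plausibly amounts to concatenating the predicted (first) and optimized (second) features level by level along the feature axis; the sketch below assumes that reading and uses toy tensors, since the claims do not pin the operator down.

```python
import torch

def splice(first, second):
    # Concatenate the two feature sets along the feature axis, level by level --
    # one plausible reading of "splicing", assumed here for illustration.
    return {level: torch.cat([first[level], second[level]], dim=-1)
            for level in ("word", "phrase", "intonation_phrase")}

# Toy (seq_len=4, feat_dim=2) tensors standing in for the per-level features.
first = {k: torch.randn(4, 2) for k in ("word", "phrase", "intonation_phrase")}
second = {k: torch.randn(4, 2) for k in ("word", "phrase", "intonation_phrase")}
target = splice(first, second)  # each level now has shape (4, 4)
```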
8. The method according to claim 1, further comprising, after the step of acquiring the text to be synthesized:
determining a plurality of word vectors corresponding to the text to be synthesized.
9. The method of claim 1, wherein the prosody prediction model is a BiLSTM (bidirectional long short-term memory) model.
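Claim 9 fixes the model family: a bidirectional LSTM over the word vectors of claim 8. The following is a minimal per-token tagger of that kind in PyTorch; the hyperparameters are illustrative assumptions, not values disclosed by the patent.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence tagger; sizes are assumptions, not patent values."""
    def __init__(self, vocab_size=5000, emb_dim=128, hidden=256, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # word vectors (cf. claim 8)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)    # forward + backward states

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))       # (batch, seq, 2*hidden)
        return self.out(h)                              # per-token boundary logits

model = BiLSTMTagger()
logits = model(torch.randint(0, 5000, (1, 6)))          # dummy 6-token input
```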
10. The method of claim 2, further comprising:
acquiring a training data set, wherein the training data set comprises a plurality of training texts and corresponding prosody feature reference values;
and taking the training text as input and the prosody feature reference value as output, and training the prosody prediction model.
11. The method of claim 10, wherein the step of training the prosodic prediction model using the training text as input and the prosodic feature reference value as output further comprises:
determining a plurality of word vectors corresponding to the training text;
and taking the plurality of word vectors corresponding to the training texts as input and the prosody feature reference value as output, and training the prosody prediction model.
12. The method of claim 10, wherein the prosodic feature reference values comprise prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values;
the step of training the prosody prediction model by using the training text as input and the prosody feature reference value as output further includes:
taking the training text as input and the prosodic word feature reference value as output, and training the prosodic word prediction model;
taking the training text and/or the prosodic word feature reference value as input, taking the prosodic phrase feature reference value as output, and training the prosodic phrase prediction model;
and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
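Claim 12 trains the three levels as a cascade, feeding the lower level's reference labels into the next level's input. The sketch below illustrates that regime with simple feed-forward taggers standing in for the BiLSTM of claim 9 and toy tensors in place of a real training set; all sizes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def train_level(model, inputs, labels, epochs=10, lr=1e-3):
    # Generic per-level loop: inputs are the features the claim names for that
    # level, labels are the prosodic reference values from the training set.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(inputs)  # (batch, seq, K)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        loss.backward()
        opt.step()

B, T, D, K = 2, 8, 16, 2
text = torch.randn(B, T, D)                # embedded training text (toy stand-in)
word_ref = torch.randint(0, K, (B, T))     # prosodic word feature reference values
phrase_ref = torch.randint(0, K, (B, T))   # prosodic phrase feature reference values
iphrase_ref = torch.randint(0, K, (B, T))  # intonation phrase feature reference values

tagger = lambda d_in: nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, K))

word_model = tagger(D)                     # training text in, word labels out
train_level(word_model, text, word_ref)

word_feat = nn.functional.one_hot(word_ref, K).float()  # lower level as extra input
phrase_model = tagger(D + K)
train_level(phrase_model, torch.cat([text, word_feat], -1), phrase_ref)

phrase_feat = nn.functional.one_hot(phrase_ref, K).float()
iphrase_model = tagger(D + K)              # text + phrase references, per claim 12
train_level(iphrase_model, torch.cat([text, phrase_feat], -1), iphrase_ref)
```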
13. A speech synthesis apparatus based on prosodic feature prediction, comprising:
the text acquisition module is used for acquiring a text to be synthesized;
the prosodic feature acquisition module is used for inputting the text to be synthesized into a preset prosody prediction model, acquiring prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
and the speech synthesis module is used for performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
14. The apparatus of claim 13, wherein the prosodic feature obtaining module is further configured to:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature;
inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
inputting the text to be synthesized, the first prosodic word feature, and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
15. The apparatus of claim 14, wherein the prosodic feature obtaining module is further configured to:
processing the first prosodic feature through a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature;
and splicing the first prosodic feature and the second prosodic feature to obtain a target prosodic feature.
16. The apparatus of claim 15, wherein the prosodic feature obtaining module is further configured to:
and processing the first prosodic feature through a preset Viterbi algorithm to acquire a second prosodic feature corresponding to the first prosodic feature.
17. The apparatus of claim 15, wherein the prosodic feature obtaining module is further configured to:
processing the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
processing the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
processing the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
and taking the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
18. The apparatus of claim 16, wherein the prosodic feature obtaining module is further configured to:
and optimizing the characteristic parameters contained in the first prosodic characteristics through a preset Viterbi algorithm.
19. The apparatus of claim 17, wherein the prosodic feature obtaining module is further configured to:
splicing the first prosodic word features and the second prosodic word features to obtain target prosodic word features;
splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features;
splicing the first prosodic intonation phrase features and the second prosodic intonation phrase features to obtain target prosodic intonation phrase features;
and taking the target prosodic word features, the target prosodic phrase features and the target prosodic intonation phrase features as the target prosodic features.
20. The apparatus of claim 13, further comprising a text processing module configured to determine a plurality of word vectors corresponding to the text to be synthesized.
21. The apparatus of claim 13, wherein the prosody prediction model is a BiLSTM model.
22. The apparatus according to claim 14, further comprising a training sample obtaining module and a model training module, wherein the training sample obtaining module is configured to obtain a training data set, and the training data set includes a plurality of training texts and corresponding prosodic feature reference values;
and the model training module is used for training the prosody prediction model by taking the training text as input and the prosody feature reference value as output.
23. The apparatus of claim 22, wherein the training sample acquisition module is further configured to determine a plurality of word vectors corresponding to the training text;
the model training module is further configured to train the prosody prediction model by using the plurality of word vectors corresponding to the training text as input and the prosody feature reference value as output.
24. The apparatus according to claim 22, wherein the prosodic feature reference values comprise prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values;
the model training module is also used for taking the training text as input and the prosodic word feature reference value as output to train the prosodic word prediction model; taking the training text and/or the prosodic word feature reference value as input, taking the prosodic phrase feature reference value as output, and training the prosodic phrase prediction model; and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
25. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 12.
26. An intelligent terminal comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.
CN201980003386.2A 2019-12-31 2019-12-31 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction Pending CN111226275A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130741 WO2021134581A1 (en) 2019-12-31 2019-12-31 Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium

Publications (1)

Publication Number Publication Date
CN111226275A true CN111226275A (en) 2020-06-02

Family

ID=70832798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980003386.2A Pending CN111226275A (en) 2019-12-31 2019-12-31 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction

Country Status (2)

Country Link
CN (1) CN111226275A (en)
WO (1) WO2021134581A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
CN101000764B (en) * 2006-12-18 2011-05-18 黑龙江大学 Speech synthetic text processing method based on rhythm structure
CN101000765B (en) * 2007-01-09 2011-03-30 黑龙江大学 Speech synthetic method based on rhythm character
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021784A (en) * 2014-06-19 2014-09-03 百度在线网络技术(北京)有限公司 Voice synthesis method and device based on large corpus
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
CN105185373A (en) * 2015-08-06 2015-12-23 百度在线网络技术(北京)有限公司 Rhythm-level prediction model generation method and apparatus, and rhythm-level prediction method and apparatus
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN106227721A (en) * 2016-08-08 2016-12-14 中国科学院自动化研究所 Chinese Prosodic Hierarchy prognoses system
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN110223671A (en) * 2019-06-06 2019-09-10 标贝(深圳)科技有限公司 Language rhythm Boundary Prediction method, apparatus, system and storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112542167A (en) * 2020-12-02 2021-03-23 上海卓繁信息技术股份有限公司 Non-contact new crown consultation method and system
WO2023045433A1 (en) * 2021-09-24 2023-03-30 华为云计算技术有限公司 Prosodic information labeling method and related device
WO2023085584A1 (en) * 2021-11-09 2023-05-19 Lg Electronics Inc. Speech synthesis device and speech synthesis method
WO2023179506A1 (en) * 2022-03-21 2023-09-28 北京有竹居网络技术有限公司 Prosody prediction method and apparatus, and readable medium and electronic device
WO2023184874A1 (en) * 2022-03-31 2023-10-05 美的集团(上海)有限公司 Speech synthesis method and apparatus

Also Published As

Publication number Publication date
WO2021134581A1 (en) 2021-07-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination