CN111226275A - Speech synthesis method, apparatus, terminal and medium based on prosodic feature prediction

Info

Publication number: CN111226275A
Application number: CN201980003386.2A
Authority: CN (China)
Prior art keywords: prosodic, feature, features, phrase, text
Other languages: Chinese (zh)
Inventors: 李贤, 黄东延, 丁万, 张皓, 熊友军
Applicant and current assignee: Ubtech Robotics Corp
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks
    • G10L2013/083 - Special characters, e.g. punctuation marks

Abstract

The application discloses a speech synthesis method based on prosodic feature prediction, comprising the following steps: acquiring a text to be synthesized; inputting the text to be synthesized into a preset prosody prediction model, obtaining the prosodic features of the text as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features and prosodic intonation phrase features; and performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized. The application further discloses a speech synthesis apparatus based on prosodic feature prediction, an intelligent terminal and a computer-readable storage medium. By the method and the apparatus, the accuracy of text prosodic feature prediction can be improved, and the speech synthesis effect improved.

Description

Speech synthesis method, apparatus, terminal and medium based on prosodic feature prediction
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for speech synthesis based on prosodic feature prediction, an intelligent terminal, and a computer-readable storage medium.
Background
With the rapid development of the mobile internet and artificial intelligence technology, speech synthesis scenarios such as voice broadcasting, audiobook reading, news reading and intelligent interaction are becoming increasingly common. Speech synthesis converts text into natural speech output.
During speech synthesis, prosody prediction needs to be performed on the text. Prosody affects the naturalness and fluency of pronunciation; a good prosody prediction result gives the synthesized speech pauses closer to those of natural human speech, making it sound more natural.
Conventional prosody prediction schemes, however, mainly train and apply neural network models on acoustic features such as Chinese phonemes. The prosodic features predicted in this way deviate from the real prosodic features, so the accuracy of prosody prediction is limited.
That is, in such speech synthesis schemes, the synthesized speech falls short because the prosody prediction is not accurate enough.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method and an apparatus for speech synthesis based on prosodic feature prediction, an intelligent terminal, and a computer-readable storage medium.
In a first aspect of the present application, a method for speech synthesis based on prosodic feature prediction is provided.
A speech synthesis method based on prosodic feature prediction comprises the following steps:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
The step of inputting the text to be synthesized into a preset prosody prediction model and acquiring prosody features of the text to be synthesized as first prosody features further includes:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word characteristic;
inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
inputting the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
In a second aspect of the present application, a speech synthesis apparatus based on prosodic feature prediction is presented.
A speech synthesis apparatus based on prosodic feature prediction, comprising:
the text acquisition module is used for acquiring a text to be synthesized;
the prosodic feature acquisition module is used for inputting the text to be synthesized into a preset prosodic prediction model, acquiring prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features and prosodic intonation phrase features;
and the voice synthesis module is used for carrying out voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In a third aspect of the present application, a smart terminal is provided.
A smart terminal comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In a fourth aspect of the present application, a computer-readable storage medium is presented.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
The embodiment of the application has the following beneficial effects:
After adopting the above speech synthesis method and apparatus based on prosodic feature prediction, the intelligent terminal and the computer-readable storage medium, the prosodic features of a text to be synthesized are predicted by a prosody prediction model during speech synthesis. The predicted prosodic features cover the prosodic hierarchy: prosodic word features, prosodic phrase features and prosodic intonation phrase features. These prosodic features serve as the basis of speech synthesis; the target speech corresponding to the text to be synthesized is then generated from them, completing the synthesis process. In other words, in this embodiment the prosody prediction model can accurately predict the prosodic word features, prosodic phrase features, prosodic intonation phrase features and other features at each level of the prosodic hierarchy, which improves the accuracy of prosodic feature prediction, the speech synthesis effect and the user experience.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Wherein:
FIG. 1 is a diagram of an application environment of a prosodic feature prediction-based speech synthesis method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for speech synthesis based on prosodic feature prediction according to an embodiment of the present application;
FIG. 3 is a schematic diagram of prosodic features in an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating the first prosodic feature acquisition according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a first prosodic feature acquisition process according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for speech synthesis based on prosodic feature prediction according to an embodiment of the present application;
FIG. 7 is a schematic flow chart illustrating the second prosodic feature acquisition according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a process of obtaining a target prosodic feature according to an embodiment of the present application;
FIG. 9 is a diagram illustrating a process of obtaining a target prosodic feature according to an embodiment of the present application;
FIG. 10 is a schematic flow chart illustrating prosody prediction model training in one embodiment of the present application;
FIG. 11 is a schematic flow chart illustrating prosody prediction model training in one embodiment of the present application;
FIG. 12 is a block diagram of a speech synthesis apparatus based on prosodic feature prediction according to an embodiment of the present application;
FIG. 13 is a block diagram of a speech synthesis apparatus based on prosodic feature prediction according to an embodiment of the present application;
FIG. 14 is a block diagram of a speech synthesis apparatus based on prosodic feature prediction according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a computer device for executing the prosodic feature prediction-based speech synthesis method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present invention.
FIG. 1 shows an application environment of the prosodic feature prediction-based speech synthesis method in one embodiment. Referring to fig. 1, the method may be applied in a speech synthesis system comprising a terminal 110 and a server 120 connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, the mobile terminal being, for example, a mobile phone, a tablet computer or a notebook computer. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers. The terminal 110 analyzes and processes the text that needs to be synthesized, while the server 120 trains the models and performs prediction.
In another embodiment, the speech synthesis system applied by the above-mentioned speech synthesis method based on prosodic feature prediction may also be implemented based on the terminal 110. The terminal is used for training and predicting the model and converting the text needing to be synthesized into voice.
In one embodiment, as shown in FIG. 2, a method for speech synthesis based on prosodic feature prediction is provided. The method can be applied to either the terminal or the server; this embodiment takes application at the terminal as an example. The method specifically comprises the following steps:
step S102: and acquiring a text to be synthesized.
The text to be synthesized is the text information that needs to be converted to speech, for example in scenarios such as a voice chat robot or voice news broadcasting.
Illustratively, the text to be synthesized may be "She is no longer delphine since that moment."
Step S104: inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features.
Text analysis is performed on the text to be synthesized to determine the duration, continuation, pause length, energy and similar properties of the corresponding speech, which is what prosody prediction must provide during speech synthesis. In this embodiment, the prosody prediction model predicts the prosodic features of the text to be synthesized based on a deep learning or neural network model, so that the predicted prosodic features can be used by an acoustic encoder to achieve a better synthesis result.
The preset prosody prediction model is a neural network model trained in advance. During model training, it is trained on training texts and the labeled prosodic feature results corresponding to each training text, so that it can predict the prosodic features of a text to be synthesized; the features so obtained are the first prosodic features. The target prosodic features finally used for speech synthesis are determined from the first prosodic features, e.g. by using the first prosodic features directly as the target prosodic features.
In the present embodiment, the prosodic features include prosodic word features (PW), prosodic phrase features (PPH), and prosodic intonation phrase features (IPH).
As shown in fig. 3, a prosodic hierarchy corresponding to prosodic word features, prosodic phrase features, and prosodic intonation phrase features included in the prosodic features is given. The prosodic intonation phrase features are based on the prosodic phrase features, and the prosodic phrase features are based on the prosodic word features.
That is, the prosodic features obtained for the text to be synthesized through the preset prosody prediction model include the features at each level of this prosodic hierarchy.
In order to predict the prosodic features of the text to be synthesized accurately, in this embodiment the input of the preset prosody prediction model is the word vectors corresponding to the text to be synthesized; training the model and predicting the prosodic structure at word granularity improves the accuracy of prosody prediction and speech synthesis.
In a specific embodiment, after the step of obtaining the text to be synthesized, the method further includes: determining a plurality of word vectors corresponding to the text to be synthesized. That is, the text to be synthesized is segmented into a plurality of word vectors, which are then used as the input of the prosody prediction model. In a specific embodiment, the word vectors may be 200-dimensional.
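As an illustration only, the following minimal sketch shows how such input might be prepared; the word segmenter segment and the embedding table embeddings are assumptions, since the application does not specify how the word vectors are produced.

    import numpy as np

    EMBED_DIM = 200  # the embodiment mentions 200-dimensional word vectors

    def text_to_word_vectors(text, segment, embeddings):
        """Segment the text to be synthesized and look up one vector per word,
        producing the input matrix of the prosody prediction model."""
        words = segment(text)  # e.g. a Chinese word segmenter
        vectors = [embeddings.get(w, np.zeros(EMBED_DIM, dtype=np.float32))
                   for w in words]  # unknown words fall back to a zero vector
        return np.stack(vectors)   # shape: (num_words, EMBED_DIM)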
In a specific embodiment, a detailed description is given of a prediction process of a first prosodic feature including a prosodic word feature, a prosodic phrase feature, and a prosodic intonation phrase feature:
as shown in fig. 4, the calculation process of the first prosodic feature includes steps S1041-S1044 as shown in fig. 4:
step S1041: inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word characteristic;
step S1042: inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
step S1043: inputting the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
step S1044: and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
As mentioned above, the prosodic features include prosodic word features, prosodic phrase features and prosodic intonation phrase features; in predicting them with the prosody prediction model, each of these features is predicted by its corresponding module within the model.
The prosodic prediction model comprises a prosodic word prediction model, a prosodic phrase prediction model and a prosodic intonation phrase prediction model, and is used for predicting prosodic word features, prosodic phrase features and prosodic intonation phrase features in prosodic structure composition respectively.
After the text to be synthesized is obtained in step S102, the text to be synthesized is input into the prosodic word prediction model, and an output result is obtained, where the output result is the first prosodic word feature.
In the process of predicting the prosodic phrase features, inputting a text to be synthesized and the first prosodic word features into a preset prosodic phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic phrase features.
In the process of predicting prosodic intonation phrase characteristics, inputting a text to be synthesized, the first prosodic word characteristics and the first prosodic phrase characteristics into a preset prosodic intonation phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic intonation phrase characteristics.
The first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature form a first prosodic feature.
The text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model may take the form of the word vectors obtained by processing the text to be synthesized as described above.
As shown in fig. 5, a schematic flow chart of the generation process of the first prosodic feature in the above steps S1041-S1044 is given.
The prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model respectively predict the features at the prosodic word, prosodic phrase and prosodic intonation phrase levels of the prosodic structure. This improves the accuracy of prosodic feature prediction and, since the predicted features serve as input to the subsequent speech synthesis, the accuracy of speech synthesis as well.
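For illustration, a minimal sketch of the cascade of steps S1041-S1044, assuming three already trained callables pw_model, pph_model and iph_model that map an input matrix to a per-word feature array; the names, and the choice of feeding every earlier output to every later model (one reading of the "and/or" above), are assumptions.

    import numpy as np

    def predict_first_prosodic_features(word_vectors, pw_model, pph_model, iph_model):
        # Prosodic word (PW) features are predicted from the text alone.
        pw = pw_model(word_vectors)
        # Prosodic phrase (PPH) features use the text plus the PW result.
        pph = pph_model(np.concatenate([word_vectors, pw], axis=-1))
        # Intonation phrase (IPH) features use the text, PW and PPH results.
        iph = iph_model(np.concatenate([word_vectors, pw, pph], axis=-1))
        return pw, pph, iph  # together these form the first prosodic feature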
Step S106: and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In the speech synthesis step, the prosodic features corresponding to the text to be synthesized are taken as input, speech synthesis is performed by a preset acoustic encoder, and the corresponding target speech is output.
In one embodiment, the first prosodic feature may be directly used as an input to the acoustic encoder to determine the corresponding target speech. In other embodiments, the first prosodic feature may be further processed to determine a corresponding target prosodic feature, and then the target prosodic feature may be used as an input to an acoustic encoder to synthesize the target speech.
In another alternative embodiment, in order to further improve the accuracy of the prosodic feature prediction, the prosodic feature may be further optimized through an optimization algorithm.
Specifically, as shown in fig. 6, the method for synthesizing speech based on prosodic feature prediction further includes:
step S105: processing the first prosodic feature through a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature, and splicing the first prosodic feature and the second prosodic feature to obtain the target prosodic feature.
In this embodiment, after the first prosody feature corresponding to the text to be synthesized is obtained through the preset prosody prediction model, the first prosody feature needs to be further processed to improve the accuracy of prosody prediction and subsequent speech synthesis.
After the first prosodic feature is obtained, the first prosodic feature is optimized through a preset optimization algorithm, and a corresponding second prosodic feature is obtained. The process of optimizing the first prosodic feature by the optimization algorithm is a process of optimizing each feature parameter included in the first prosodic feature.
After the optimization algorithm is used for optimizing the first prosodic feature, the first prosodic feature and the second prosodic feature are spliced, and the spliced prosodic feature is obtained and used as a target prosodic feature. Specifically, the second prosodic feature is spliced behind the first prosodic feature, and the spliced feature vector is obtained and used as the target prosodic feature.
In the subsequent voice synthesis process, the target prosodic features processed by the optimization algorithm and the splicing are used as input in the subsequent voice synthesis step, so that a voice synthesis result with better accuracy can be obtained.
In this embodiment, during speech synthesis the prosodic features of the text to be synthesized are obtained through the prosody prediction model; the obtained features are then refined by the optimization algorithm and spliced behind the features output by the model, yielding the spliced target prosodic features. Speech synthesis is then performed by a preset acoustic encoder according to the target prosodic features, producing the speech synthesis result (i.e. the target speech) corresponding to the text to be synthesized.
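A minimal sketch of the splicing itself, assuming both features are arrays whose last axis is the feature dimension:

    import numpy as np

    def splice(first, second):
        # The second (optimized) feature is concatenated behind the first,
        # so the acoustic encoder receives a vector of doubled width.
        return np.concatenate([first, second], axis=-1)

    # e.g. first: (num_words, d), second: (num_words, d) -> (num_words, 2*d)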
In a specific embodiment, in step S105, the second prosodic feature may be calculated as shown in fig. 7:
step S1051: processing the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
step S1052: processing the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
step S1053: processing the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
step S1054: and taking the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
That is to say, after the text to be synthesized is obtained in step S102, the text to be synthesized is input into the prosodic word prediction model, and an output result is obtained, where the output result is the first prosodic word feature. Then, in order to optimize the prosodic word features, the first prosodic word features are further optimized through an optimization algorithm to obtain corresponding second prosodic word features. And finally, splicing the second prosodic word features behind the first prosodic word features to form a new prosodic word feature vector as the target prosodic word features.
In the process of predicting the prosodic phrase features, inputting a text to be synthesized and the first prosodic word features into a preset prosodic phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic phrase features. Then, in order to optimize the prosodic phrase features, the first prosodic phrase features are optimized through an optimization algorithm to obtain corresponding second prosodic phrase features. And finally, splicing the second prosodic phrase features behind the first prosodic phrase features to form a new prosodic phrase feature vector as the target prosodic phrase features.
In the process of predicting prosodic intonation phrase characteristics, inputting a text to be synthesized, the first prosodic word characteristics and the first prosodic phrase characteristics into a preset prosodic intonation phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic intonation phrase characteristics. Then, in order to optimize the prosodic intonation phrase characteristics, the first prosodic intonation phrase characteristics are optimized through an optimization algorithm to obtain corresponding second prosodic intonation phrase characteristics. And finally, splicing the second prosodic intonation phrase characteristics to the back of the first prosodic intonation phrase characteristics to form a new prosodic intonation phrase characteristic vector serving as the target prosodic intonation phrase characteristics.
The second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature form a second prosodic feature; and the target prosodic word characteristics, the target prosodic phrase characteristics and the target prosodic intonation phrase characteristics form target prosodic characteristics.
The text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model may take the form of the word vectors obtained by processing the text to be synthesized as described above.
In a specific embodiment, the algorithm for processing the first prosodic feature is a Viterbi algorithm.
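The application does not detail how the Viterbi algorithm is applied. The sketch below assumes the prediction model supplies per-word label probabilities emit (num_words x num_labels) and that a label transition matrix trans is available, for example estimated from the training labels:

    import numpy as np

    def viterbi(emit, trans):
        """Return the globally best label sequence for per-word probabilities."""
        n, k = emit.shape
        score = np.log(emit[0] + 1e-12)       # best log-score per ending label
        back = np.zeros((n, k), dtype=int)    # backpointers
        for t in range(1, n):
            cand = score[:, None] + np.log(trans + 1e-12)  # (prev, cur) scores
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0) + np.log(emit[t] + 1e-12)
        path = [int(score.argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]                     # e.g. 0/1 break labels per word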
Further, in a specific embodiment, as shown in fig. 8, the target prosodic features may be generated by a combined process based on steps S1041 to S1044 and the optimization algorithm (e.g. the Viterbi algorithm) of step S105.
Specifically, the process of generating the target prosodic feature further includes:
step S211: inputting a text to be synthesized into a preset prosodic word prediction model to obtain first prosodic word characteristics;
step S212: processing the first prosodic word features through a Viterbi algorithm to acquire second prosodic word features corresponding to the first prosodic word features;
step S213: splicing the first prosodic word features and the second prosodic word features to obtain target prosodic word features;
step S221: inputting a text to be synthesized and/or target prosodic word characteristics into a preset prosodic phrase prediction model to obtain first prosodic phrase characteristics;
step S222: processing the first prosodic phrase features through a Viterbi algorithm to acquire second prosodic phrase features corresponding to the first prosodic phrase features;
step S223: splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features;
step S231: inputting a text to be synthesized, target prosodic word characteristics and/or target prosodic phrase characteristics into a preset prosodic intonation phrase prediction model to obtain first prosodic intonation phrase characteristics;
step S232: processing the first prosodic intonation phrase features through a Viterbi algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
step S233: splicing the first prosodic intonation phrase characteristics and the second prosodic intonation phrase characteristics to obtain target prosodic intonation phrase characteristics;
step S240: and taking the target prosodic word characteristics, the target prosodic phrase characteristics and the target prosodic intonation phrase characteristics as target prosodic characteristics.
After the text to be synthesized is obtained in step S102, the text to be synthesized is input into the prosodic word prediction model, and an output result is obtained, where the output result is the first prosodic word feature. Then, in order to optimize the prosodic word features, the first prosodic word features are further optimized through a Viterbi algorithm to obtain corresponding second prosodic word features. And finally, splicing the second prosodic word features behind the first prosodic word features to form a new prosodic word feature vector as the target prosodic word features.
In the process of predicting the prosodic phrase characteristics, inputting a text to be synthesized and the target prosodic word characteristics into a preset prosodic phrase prediction model, and acquiring an output result, wherein the output result is the first prosodic phrase characteristics. Then, in order to optimize the prosodic phrase characteristics, the first prosodic phrase characteristics are optimized through a Viterbi algorithm, and corresponding second prosodic phrase characteristics are obtained. And finally, splicing the second prosodic phrase features behind the first prosodic phrase features to form a new prosodic phrase feature vector as the target prosodic phrase features.
In the process of predicting prosodic intonation phrase characteristics, inputting a text to be synthesized, the target prosodic word characteristics and the target prosodic phrase characteristics into a preset prosodic intonation phrase prediction model, and obtaining an output result, wherein the output result is the first prosodic intonation phrase characteristics. Then, in order to optimize the prosodic intonation phrase characteristics, the first prosodic intonation phrase characteristics are optimized through a Viterbi algorithm to obtain corresponding second prosodic intonation phrase characteristics. And finally, splicing the second prosodic intonation phrase characteristics to the back of the first prosodic intonation phrase characteristics to form a new prosodic intonation phrase characteristic vector serving as the target prosodic intonation phrase characteristics.
The second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature form a second prosodic feature; and the target prosodic word characteristics, the target prosodic phrase characteristics and the target prosodic intonation phrase characteristics form target prosodic characteristics.
The text to be synthesized that is input into the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model may take the form of the word vectors obtained by processing the text to be synthesized as described above.
As shown in fig. 9, a flow chart of the generation process of the target prosody feature in the above steps S211 to S240 is given.
The prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model respectively predict the prosodic word features, prosodic phrase features and prosodic intonation phrase features of the prosodic structure. The features in each prediction result are optimized by the Viterbi algorithm and spliced behind the corresponding model output, and the spliced prosodic features are taken as the target prosodic features, which serve as input to the subsequent speech synthesis and improve its accuracy.
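Putting steps S211 to S240 together for a single prosodic level, reusing the viterbi sketch above; the softmax conversion and the representation of the second feature as the one-hot decoded label sequence are assumptions made for illustration:

    import numpy as np

    def target_feature_for_level(inputs, model, trans):
        first = model(inputs)                 # first feature, (n, k) raw scores
        e = np.exp(first - first.max(axis=-1, keepdims=True))
        probs = e / e.sum(axis=-1, keepdims=True)        # softmax probabilities
        path = viterbi(probs, trans)                     # Viterbi-optimized labels
        second = np.eye(first.shape[-1])[path]           # one-hot second feature
        return np.concatenate([first, second], axis=-1)  # spliced target feature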
Furthermore, in order for the prosody prediction model, or the prosodic word, prosodic phrase and prosodic intonation phrase prediction models, to predict the prosodic features of a text to be synthesized well, the corresponding models must first be trained on training data.
Specifically, as shown in fig. 10, a flow chart of a prosody prediction model training process is shown.
As shown in FIG. 10, the prosody prediction model training process includes steps S302-S304:
step S302: acquiring a training data set, wherein the training data set comprises a plurality of training texts and corresponding prosody feature reference values;
step S304: and taking the training text as input and the prosody feature reference value as output, and training the prosody prediction model.
Before model training, the data must first be annotated to determine the prosodic features corresponding to each text. For example, a training text is processed by manual labeling into ground-truth prosodic word, prosodic phrase and prosodic intonation phrase values; that is, the prosodic feature reference values corresponding to that training text are determined.
In a specific embodiment, the prosodic feature reference values may take the following form. The manually marked text "Since #1 that #1 moment #3, she is no longer #2 delphine #3" is processed into binary label sequences: prosodic words (treating #1, #2 and #3 all as prosodic word boundaries): 01100101010001; prosodic phrases (#2, #3): 00000100010001; intonation phrases (#3): 00000100000001 (the corresponding training text being "She is no longer delphine since that moment").
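A sketch of converting such marked-up text into the three binary reference sequences; the parsing conventions (one label per character, a marker closing the character before it) are assumptions consistent with the example above:

    import re

    def parse_labels(marked_text):
        """Convert text with #1/#2/#3 break markers into PW/PPH/IPH label lists."""
        pw, pph, iph = [], [], []
        for tok in re.split(r'(#[123])', marked_text):
            if not tok:
                continue
            if tok in ('#1', '#2', '#3'):
                level = int(tok[1])   # markers are assumed to follow a character
                pw[-1] = 1            # #1, #2 and #3 all end a prosodic word
                if level >= 2:
                    pph[-1] = 1       # #2 and #3 also end a prosodic phrase
                if level == 3:
                    iph[-1] = 1       # #3 also ends an intonation phrase
            else:
                for _ in tok:         # one 0 per character by default
                    pw.append(0); pph.append(0); iph.append(0)
        return pw, pph, iph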
In a specific embodiment, a large number of training texts are labeled manually, corresponding prosodic feature reference values are obtained, and a training data set is determined. That is, the training data set includes a plurality of training texts and prosody feature reference values corresponding to each of the training texts.
For each training text contained in the training data set, the training text is used as input and the corresponding prosodic feature reference values as output to train the preset prosody prediction model, so that the model acquires the function of prosodic feature prediction.
Further, in this embodiment, the process of training the prosodic prediction model further includes a process of training the prosodic word prediction model, the prosodic phrase prediction model, and the prosodic intonation phrase prediction model, respectively.
Specifically, the prosodic feature reference values determined by manually labeling the training texts include prosodic word feature reference values, prosodic phrase feature reference values and prosodic intonation phrase feature reference values. Training the prosodic word prediction model, the prosodic phrase prediction model and the prosodic intonation phrase prediction model respectively includes steps S3041 to S3043 shown in fig. 11:
step S3041: training a prosodic word prediction model by taking the training text as input and taking a prosodic word feature reference value as output;
step S3042: training a prosodic phrase prediction model by taking the training text and/or prosodic word feature reference value as input and taking the prosodic phrase feature reference value as output;
step S3043: and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
That is, in the process of training the prosodic word prediction model, the training text is used as input, the prosodic word feature reference value is used as output, and the prosodic word prediction model is trained so that the prosodic word prediction model has the capability of predicting prosodic word features.
In the process of training the prosodic phrase prediction model, the training text and the corresponding prosodic word feature reference value are used as input, the prosodic phrase feature reference value is used as output, and the prosodic phrase prediction model is trained so that the prosodic phrase prediction model has the capability of predicting prosodic phrase features.
In the process of training the prosodic intonation phrase prediction model, the training text, the prosodic word feature reference values and the prosodic phrase feature reference values are used as input and the prosodic intonation phrase feature reference values as output, so that the prosodic intonation phrase prediction model acquires the capability of predicting prosodic intonation phrase features.
In the above training of the prosody prediction model, or of the prosodic word, prosodic phrase and prosodic intonation phrase prediction models, the training text fed to a model may also be represented as the word vectors corresponding to that text. That is, before training, the word vectors corresponding to each training text must be determined; these word vectors are then used as input, with the corresponding prosodic feature reference values as output, to train the models so that they acquire the ability to predict prosodic features.
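For illustration, a sketch of the cascaded training regime of steps S3041-S3043, assuming a generic fit(model, inputs, targets) training routine and per-word reference label arrays produced by the labeling step; all names are illustrative:

    import numpy as np

    def train_cascade(word_vecs, pw_ref, pph_ref, iph_ref,
                      pw_model, pph_model, iph_model, fit):
        pw_ref, pph_ref = np.asarray(pw_ref), np.asarray(pph_ref)
        # 1. Prosodic word model: word vectors -> PW reference labels.
        fit(pw_model, word_vecs, pw_ref)
        # 2. Prosodic phrase model: word vectors + PW references -> PPH labels.
        x2 = np.concatenate([word_vecs, pw_ref[:, None]], axis=-1)
        fit(pph_model, x2, pph_ref)
        # 3. Intonation phrase model: word vectors + PW + PPH references -> IPH.
        x3 = np.concatenate([word_vecs, pw_ref[:, None], pph_ref[:, None]], axis=-1)
        fit(iph_model, x3, iph_ref)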
In the above training processes, the prosody prediction model and the prosodic word, prosodic phrase and prosodic intonation phrase prediction models are neural network models; in a specific embodiment they are bidirectional long short-term memory networks (BiLSTM models). A BiLSTM operates on sequence data (which has temporal dependency) and processes it globally: a prediction at any position can draw on both the preceding and the following data in the sequence, which yields more accurate predictions.
In this embodiment, predicting the prosodic features with a BiLSTM model captures the context features more effectively and thus improves the accuracy of prosodic feature prediction.
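A minimal BiLSTM tagger of the kind described, written in PyTorch as an illustration; the application does not disclose layer sizes or other hyperparameters:

    import torch
    import torch.nn as nn

    class BiLSTMProsodyTagger(nn.Module):
        def __init__(self, input_dim=200, hidden=128, num_labels=2):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden, batch_first=True,
                                bidirectional=True)   # reads context both ways
            self.out = nn.Linear(2 * hidden, num_labels)

        def forward(self, word_vectors):              # (batch, num_words, input_dim)
            h, _ = self.lstm(word_vectors)            # (batch, num_words, 2*hidden)
            return self.out(h)                        # per-word break/no-break logits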
In another alternative embodiment, as shown in fig. 12, a speech synthesis apparatus based on prosodic feature prediction is provided.
As shown in fig. 12, the speech synthesis apparatus based on prosodic feature prediction includes:
a text obtaining module 402, configured to obtain a text to be synthesized;
a prosodic feature obtaining module 404, configured to input the text to be synthesized into a preset prosody prediction model, obtain a prosodic feature of the text to be synthesized as a first prosodic feature, and determine a target prosodic feature according to the first prosodic feature, where the prosodic feature of the text to be synthesized includes a prosodic word feature, a prosodic phrase feature, and a prosodic intonation phrase feature;
and a speech synthesis module 406, configured to perform speech synthesis according to the target prosody feature, and generate a target speech corresponding to the text to be synthesized.
In an embodiment, the prosodic feature obtaining module 404 is further configured to input the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature; input the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature; input the text to be synthesized, the first prosodic word feature and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature; and take the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to process the first prosodic feature through a preset optimization algorithm, and obtain a second prosodic feature corresponding to the first prosodic feature; and splicing the first prosodic feature and the second prosodic feature to obtain a target prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to process the first prosodic feature through a preset Viterbi algorithm to obtain a second prosodic feature corresponding to the first prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to process the first prosodic word feature through the preset optimization algorithm to obtain a second prosodic word feature corresponding to the first prosodic word feature; process the first prosodic phrase feature through the preset optimization algorithm to obtain a second prosodic phrase feature corresponding to the first prosodic phrase feature; process the first prosodic intonation phrase feature through the preset optimization algorithm to obtain a second prosodic intonation phrase feature corresponding to the first prosodic intonation phrase feature; and take the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
In an embodiment, the prosodic feature obtaining module 404 is further configured to perform an optimization process on the feature parameters included in the first prosodic feature through a preset Viterbi algorithm.
In one embodiment, the prosodic feature obtaining module 404 is further configured to splice the first prosodic word feature and the second prosodic word feature to obtain a target prosodic word feature; splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features; splicing the first prosodic intonation phrase characteristics and the second prosodic intonation phrase characteristics to obtain target prosodic intonation phrase characteristics; and taking the target prosodic word features, the target prosodic phrase features and the target prosodic intonation phrase features as the target prosodic features.
In one embodiment, as shown in fig. 13, the speech synthesis apparatus further includes a text processing module 403, configured to determine a plurality of word vectors corresponding to the text to be synthesized.
In one embodiment, the prosody prediction model is a BiLSTM model.
In an embodiment, as shown in fig. 14, the apparatus for synthesizing speech based on prosodic feature prediction further includes a training sample obtaining module 412 and a model training module 414, where the training sample obtaining module 412 is configured to obtain a training data set, and the training data set includes a plurality of training texts and corresponding prosodic feature reference values;
the model training module 414 is configured to train the prosody prediction model by using the training text as input and the prosody feature reference value as output.
In one embodiment, the training sample obtaining module 412 is further configured to determine a plurality of word vectors corresponding to the training text;
the model training module 414 is further configured to train the prosody prediction model by taking the plurality of word vectors corresponding to the training text as input and the prosody feature reference value as output.
In one embodiment, the prosodic feature reference values include prosodic word feature reference values, prosodic phrase feature reference values, prosodic intonation phrase feature reference values;
the model training module 414 is further configured to train the prosodic word prediction model by using the training text as input and the prosodic word feature reference value as output; taking the training text and/or the prosodic word feature reference value as input, taking the prosodic phrase feature reference value as output, and training the prosodic phrase prediction model; and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
FIG. 15 is a diagram showing the internal structure of a computer device in one embodiment. The computer device may specifically be a terminal or a server. As shown in fig. 15, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the prosodic feature prediction-based speech synthesis method; the internal memory may likewise store such a computer program. Those skilled in the art will appreciate that the architecture shown in fig. 15 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or arrange them differently.
In one embodiment, a smart terminal is proposed, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
In one embodiment, a computer-readable storage medium is proposed, in which a computer program is stored which, when executed by a processor, causes the processor to carry out the steps of:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosody features of the text to be synthesized as first prosody features, and determining target prosody features according to the first prosody features, wherein the prosody features of the text to be synthesized comprise prosody word features, prosody phrase features and prosody intonation phrase features;
and performing voice synthesis according to the target prosody characteristics to generate target voice corresponding to the text to be synthesized.
After adopting the above speech synthesis method and apparatus based on prosodic feature prediction, the intelligent terminal and the computer-readable storage medium, the prosodic features of a text to be synthesized are predicted by a prosody prediction model during speech synthesis. The predicted prosodic features cover the prosodic hierarchy: prosodic word features, prosodic phrase features and prosodic intonation phrase features. These prosodic features serve as the basis of speech synthesis; the target speech corresponding to the text to be synthesized is then generated from them, completing the synthesis process. In other words, in this embodiment the prosody prediction model can accurately predict the prosodic word features, prosodic phrase features, prosodic intonation phrase features and other features at each level of the prosodic hierarchy, which improves the accuracy of prosodic feature prediction, the speech synthesis effect and the user experience.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several implementations of the present invention, and their description is specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (26)

1. A speech synthesis method based on prosodic feature prediction, comprising:
acquiring a text to be synthesized;
inputting the text to be synthesized into a preset prosody prediction model, acquiring prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
and performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
2. The method according to claim 1, wherein the step of inputting the text to be synthesized into a preset prosody prediction model and acquiring prosodic features of the text to be synthesized as first prosodic features further comprises:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature;
inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
inputting the text to be synthesized, the first prosodic word feature, and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
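Claim 2 describes a cascade in which each prosodic level is predicted conditioned on the levels below it. The following is a minimal sketch of that wiring, assuming each level is a per-token boundary tagger (1 = prosodic boundary after the token, 0 = none); the tagger bodies here return fixed dummy labels purely for illustration and are not the disclosed models.

```python
# Dummy stand-ins for the three trained predictors of claim 2; each maps a
# token sequence (plus lower-level labels) to one boundary label per token.
def predict_word_boundaries(tokens):
    return [0, 1, 0, 1]

def predict_phrase_boundaries(tokens, word_labels):
    return [0, 0, 0, 1]

def predict_intonation_boundaries(tokens, word_labels, phrase_labels):
    return [0, 0, 0, 1]

tokens = ["今天", "天气", "真", "好"]
w = predict_word_boundaries(tokens)              # first prosodic word features
p = predict_phrase_boundaries(tokens, w)         # conditioned on the word level
i = predict_intonation_boundaries(tokens, w, p)  # conditioned on both lower levels
first_prosodic_features = {"word": w, "phrase": p, "intonation_phrase": i}
```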
3. The method of claim 2, wherein the step of determining a target prosodic feature based on the first prosodic feature further comprises:
processing the first prosodic feature through a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature;
and splicing the first prosodic feature and the second prosodic feature to obtain a target prosodic feature.
4. The method according to claim 3, wherein the step of processing the first prosodic feature by a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature further comprises:
and processing the first prosodic feature through a preset Viterbi algorithm to acquire a second prosodic feature corresponding to the first prosodic feature.
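Claim 4 names the Viterbi algorithm as the preset optimization. One standard formulation is sketched below, under the assumption that the first prosodic features are per-token label scores and that a label-transition matrix is available; the patent does not specify where the transition scores come from, so the toy values here are illustrative only.

```python
import numpy as np

def viterbi(emission_log_probs, transition_log_probs):
    """Return the globally best label sequence.

    emission_log_probs:   (T, K) per-token label scores from the prediction model
    transition_log_probs: (K, K) label-to-label transition scores
    """
    T, K = emission_log_probs.shape
    score = np.zeros((T, K))            # best score of any path ending in each label
    back = np.zeros((T, K), dtype=int)  # backpointers for path recovery
    score[0] = emission_log_probs[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition_log_probs  # (K, K)
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + emission_log_probs[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

emissions = np.log([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
transitions = np.log([[0.7, 0.3], [0.6, 0.4]])
print(viterbi(emissions, transitions))  # -> [0, 0, 0] for these toy scores
```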
5. The method according to claim 3, wherein the step of processing the first prosodic feature by a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature further comprises:
processing the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
processing the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
processing the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
and taking the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
6. The method of claim 4, wherein the step of processing the first prosodic feature by a preset Viterbi algorithm to obtain a second prosodic feature corresponding to the first prosodic feature further comprises:
and optimizing the characteristic parameters contained in the first prosodic characteristics through a preset Viterbi algorithm.
7. The method according to claim 5, wherein the step of performing splicing processing on the first prosodic feature and the second prosodic feature to obtain the target prosodic feature further comprises:
splicing the first prosodic word features and the second prosodic word features to obtain target prosodic word features;
splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features;
splicing the first prosodic intonation phrase features and the second prosodic intonation phrase features to obtain target prosodic intonation phrase features;
and taking the target prosodic word features, the target prosodic phrase features and the target prosodic intonation phrase features as the target prosodic features.
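The splicing step of claims 3 and 7 plausibly amounts to concatenating the predicted (first) and optimized (second) features level by level along the feature axis; the sketch below assumes that reading and uses toy tensors, since the claims do not pin the operator down.

```python
import torch

def splice(first, second):
    # Concatenate the two feature sets along the feature axis, level by level --
    # one plausible reading of "splicing", assumed here for illustration.
    return {level: torch.cat([first[level], second[level]], dim=-1)
            for level in ("word", "phrase", "intonation_phrase")}

# Toy (seq_len=4, feat_dim=2) tensors standing in for the per-level features.
first = {k: torch.randn(4, 2) for k in ("word", "phrase", "intonation_phrase")}
second = {k: torch.randn(4, 2) for k in ("word", "phrase", "intonation_phrase")}
target = splice(first, second)  # each level now has shape (4, 4)
```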
8. The method according to claim 1, further comprising, after the step of acquiring the text to be synthesized:
determining a plurality of word vectors corresponding to the text to be synthesized.
9. The method of claim 1, wherein the prosody prediction model is a BiLSTM (bidirectional long short-term memory) model.
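Claim 9 fixes the model family: a bidirectional LSTM over the word vectors of claim 8. The following is a minimal per-token tagger of that kind in PyTorch; the hyperparameters are illustrative assumptions, not values disclosed by the patent.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM sequence tagger; sizes are assumptions, not patent values."""
    def __init__(self, vocab_size=5000, emb_dim=128, hidden=256, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # word vectors (cf. claim 8)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)    # forward + backward states

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))       # (batch, seq, 2*hidden)
        return self.out(h)                              # per-token boundary logits

model = BiLSTMTagger()
logits = model(torch.randint(0, 5000, (1, 6)))          # dummy 6-token input
```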
10. The method of claim 2, further comprising:
acquiring a training data set, wherein the training data set comprises a plurality of training texts and corresponding prosody feature reference values;
and taking the training text as input and the prosody feature reference value as output, and training the prosody prediction model.
11. The method of claim 10, wherein the step of training the prosodic prediction model using the training text as input and the prosodic feature reference value as output further comprises:
determining a plurality of word vectors corresponding to the training text;
and taking the plurality of word vectors corresponding to the training texts as input and the prosody feature reference value as output, and training the prosody prediction model.
12. The method of claim 10, wherein the prosodic feature reference values comprise prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values;
the step of training the prosody prediction model by using the training text as input and the prosody feature reference value as output further includes:
taking the training text as input and the prosodic word feature reference value as output, and training the prosodic word prediction model;
taking the training text and/or the prosodic word feature reference value as input, taking the prosodic phrase feature reference value as output, and training the prosodic phrase prediction model;
and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
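Claim 12 trains the three levels as a cascade, feeding the lower level's reference labels into the next level's input. The sketch below illustrates that regime with simple feed-forward taggers standing in for the BiLSTM of claim 9 and toy tensors in place of a real training set; all sizes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def train_level(model, inputs, labels, epochs=10, lr=1e-3):
    # Generic per-level loop: inputs are the features the claim names for that
    # level, labels are the prosodic reference values from the training set.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(inputs)  # (batch, seq, K)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        loss.backward()
        opt.step()

B, T, D, K = 2, 8, 16, 2
text = torch.randn(B, T, D)                # embedded training text (toy stand-in)
word_ref = torch.randint(0, K, (B, T))     # prosodic word feature reference values
phrase_ref = torch.randint(0, K, (B, T))   # prosodic phrase feature reference values
iphrase_ref = torch.randint(0, K, (B, T))  # intonation phrase feature reference values

tagger = lambda d_in: nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, K))

word_model = tagger(D)                     # training text in, word labels out
train_level(word_model, text, word_ref)

word_feat = nn.functional.one_hot(word_ref, K).float()  # lower level as extra input
phrase_model = tagger(D + K)
train_level(phrase_model, torch.cat([text, word_feat], -1), phrase_ref)

phrase_feat = nn.functional.one_hot(phrase_ref, K).float()
iphrase_model = tagger(D + K)              # text + phrase references, per claim 12
train_level(iphrase_model, torch.cat([text, phrase_feat], -1), iphrase_ref)
```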
13. A speech synthesis apparatus based on prosodic feature prediction, comprising:
the text acquisition module is used for acquiring a text to be synthesized;
the prosodic feature acquisition module is used for inputting the text to be synthesized into a preset prosody prediction model, acquiring prosodic features of the text to be synthesized as first prosodic features, and determining target prosodic features according to the first prosodic features, wherein the prosodic features of the text to be synthesized comprise prosodic word features, prosodic phrase features, and prosodic intonation phrase features;
and the speech synthesis module is used for performing speech synthesis according to the target prosodic features to generate target speech corresponding to the text to be synthesized.
14. The apparatus of claim 13, wherein the prosodic feature obtaining module is further configured to:
inputting the text to be synthesized into a preset prosodic word prediction model to obtain a first prosodic word feature;
inputting the text to be synthesized and/or the first prosodic word feature into a preset prosodic phrase prediction model to obtain a first prosodic phrase feature;
inputting the text to be synthesized, the first prosodic word feature, and/or the first prosodic phrase feature into a preset prosodic intonation phrase prediction model to obtain a first prosodic intonation phrase feature;
and taking the first prosodic word feature, the first prosodic phrase feature and the first prosodic intonation phrase feature as the first prosodic feature.
15. The apparatus of claim 14, wherein the prosodic feature obtaining module is further configured to:
processing the first prosodic feature through a preset optimization algorithm to obtain a second prosodic feature corresponding to the first prosodic feature;
and splicing the first prosodic feature and the second prosodic feature to obtain a target prosodic feature.
16. The apparatus of claim 15, wherein the prosodic feature obtaining module is further configured to:
and processing the first prosodic feature through a preset Viterbi algorithm to acquire a second prosodic feature corresponding to the first prosodic feature.
17. The apparatus of claim 15, wherein the prosodic feature obtaining module is further configured to:
processing the first prosodic word features through the preset optimization algorithm to obtain second prosodic word features corresponding to the first prosodic word features;
processing the first prosodic phrase features through the preset optimization algorithm to obtain second prosodic phrase features corresponding to the first prosodic phrase features;
processing the first prosodic intonation phrase features through the preset optimization algorithm to obtain second prosodic intonation phrase features corresponding to the first prosodic intonation phrase features;
and taking the second prosodic word feature, the second prosodic phrase feature and the second prosodic intonation phrase feature as the second prosodic feature.
18. The apparatus of claim 16, wherein the prosodic feature obtaining module is further configured to:
and optimizing the characteristic parameters contained in the first prosodic characteristics through a preset Viterbi algorithm.
19. The apparatus of claim 17, wherein the prosodic feature obtaining module is further configured to:
splicing the first prosodic word features and the second prosodic word features to obtain target prosodic word features;
splicing the first prosodic phrase features and the second prosodic phrase features to obtain target prosodic phrase features;
splicing the first prosodic intonation phrase features and the second prosodic intonation phrase features to obtain target prosodic intonation phrase features;
and taking the target prosodic word features, the target prosodic phrase features and the target prosodic intonation phrase features as the target prosodic features.
20. The apparatus of claim 13, further comprising a text processing module configured to determine a plurality of word vectors corresponding to the text to be synthesized.
21. The apparatus of claim 13, wherein the prosody prediction model is a BiLSTM model.
22. The apparatus according to claim 14, further comprising a training sample obtaining module and a model training module, wherein the training sample obtaining module is configured to obtain a training data set, and the training data set includes a plurality of training texts and corresponding prosodic feature reference values;
and the model training module is used for training the prosody prediction model by taking the training text as input and the prosody feature reference value as output.
23. The apparatus of claim 22, wherein the training sample acquisition module is further configured to determine a plurality of word vectors corresponding to the training text;
the model training module is further configured to train the prosody prediction model by using the plurality of word vectors corresponding to the training text as input and the prosody feature reference value as output.
24. The apparatus according to claim 22, wherein the prosodic feature reference values comprise prosodic word feature reference values, prosodic phrase feature reference values, and prosodic intonation phrase feature reference values;
the model training module is also used for taking the training text as input and the prosodic word feature reference value as output to train the prosodic word prediction model; taking the training text and/or the prosodic word feature reference value as input, taking the prosodic phrase feature reference value as output, and training the prosodic phrase prediction model; and taking the training text and the prosodic phrase feature reference value as input, taking the prosodic intonation phrase feature reference value as output, and training the prosodic intonation phrase prediction model.
25. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 12.
26. An intelligent terminal comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 12.
CN201980003386.2A 2019-12-31 2019-12-31 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction Pending CN111226275A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130741 WO2021134581A1 (en) 2019-12-31 2019-12-31 Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium

Publications (1)

Publication Number Publication Date
CN111226275A true CN111226275A (en) 2020-06-02

Family

ID=70832798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980003386.2A Pending CN111226275A (en) 2019-12-31 2019-12-31 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction

Country Status (2)

Country Link
CN (1) CN111226275A (en)
WO (1) WO2021134581A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
CN101000764B (en) * 2006-12-18 2011-05-18 黑龙江大学 Speech synthetic text processing method based on rhythm structure
CN101000765B (en) * 2007-01-09 2011-03-30 黑龙江大学 Speech synthetic method based on rhythm character
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021784A (en) * 2014-06-19 2014-09-03 百度在线网络技术(北京)有限公司 Voice synthesis method and device based on large corpus
CN104867491A (en) * 2015-06-17 2015-08-26 百度在线网络技术(北京)有限公司 Training method and device for prosody model used for speech synthesis
CN105185373A (en) * 2015-08-06 2015-12-23 百度在线网络技术(北京)有限公司 Rhythm-level prediction model generation method and apparatus, and rhythm-level prediction method and apparatus
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN106227721A (en) * 2016-08-08 2016-12-14 中国科学院自动化研究所 Chinese Prosodic Hierarchy prognoses system
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN110223671A (en) * 2019-06-06 2019-09-10 标贝(深圳)科技有限公司 Language rhythm Boundary Prediction method, apparatus, system and storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112542167A (en) * 2020-12-02 2021-03-23 上海卓繁信息技术股份有限公司 Non-contact new crown consultation method and system
WO2023045433A1 (en) * 2021-09-24 2023-03-30 华为云计算技术有限公司 Prosodic information labeling method and related device
WO2023085584A1 (en) * 2021-11-09 2023-05-19 Lg Electronics Inc. Speech synthesis device and speech synthesis method
WO2023179506A1 (en) * 2022-03-21 2023-09-28 北京有竹居网络技术有限公司 Prosody prediction method and apparatus, and readable medium and electronic device
WO2023184874A1 (en) * 2022-03-31 2023-10-05 美的集团(上海)有限公司 Speech synthesis method and apparatus

Also Published As

Publication number Publication date
WO2021134581A1 (en) 2021-07-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination