WO2021134591A1 - Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium - Google Patents

Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Info

Publication number
WO2021134591A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
synthesized
features
word segmentation
Prior art date
Application number
PCT/CN2019/130766
Other languages
French (fr)
Chinese (zh)
Inventor
李贤�
黄东延
丁万
张皓
白洛玉
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to CN201980003388.1A priority Critical patent/CN111164674B/en
Priority to PCT/CN2019/130766 priority patent/WO2021134591A1/en
Publication of WO2021134591A1 publication Critical patent/WO2021134591A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L2013/083 Special characters, e.g. punctuation marks

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to a speech synthesis method, device, intelligent terminal and computer readable storage medium.
  • Speech synthesis can convert text, etc. into natural speech output.
  • in the prior art, speech synthesis mostly uses statistical parametric synthesis: spectral characteristic parameters are modeled and a parametric synthesizer is generated to construct the mapping from the text sequence to speech, after which a statistical model generates the speech parameters (including fundamental frequency, formant frequencies, etc.) frame by frame; these parameters are then converted into the corresponding acoustic features, and finally the output speech is generated.
  • however, the result computed by the single sub-module corresponding to each step is not necessarily optimal, so the text cannot be accurately converted into speech suited to multi-language, multi-timbre scenarios, which degrades the overall quality of speech synthesis and greatly affects the user experience.
  • that is to say, the quality of the final synthesized speech is insufficient because the calculation result of a single sub-module is non-optimal.
  • a method of speech synthesis, including: obtaining the text to be synthesized;
  • acquiring text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature;
  • inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features;
  • inputting the text features and the duration features into a preset acoustic model and acquiring the voice features corresponding to the text to be synthesized;
  • converting the voice features into speech and generating the target speech corresponding to the text to be synthesized.
  • in one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
  • the step of acquiring the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model to obtain the prosodic features corresponding to the text to be synthesized.
  • the method further includes: obtaining a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features;
  • using the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model;
  • using the text reference feature and the duration reference feature as the input of the acoustic model and the voice reference feature as the output of the acoustic model to train the acoustic model.
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model, and training the word segmentation model.
  • the training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained.
  • the training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
  • the method further includes: obtaining a plurality of texts to be synthesized through a text iterator and performing, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized;
  • adding the text features corresponding to the texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is executed synchronously.
  • a speech synthesis device is provided.
  • a speech synthesis device includes:
  • the obtaining module is used to obtain the text to be synthesized
  • a text feature determination module configured to obtain text features of the text to be synthesized, where the text features include at least one of word segmentation features, polyphone features, and/or prosodic features;
  • a duration feature determining module configured to input the text feature into a preset duration prediction model, and obtain a duration feature corresponding to the text feature;
  • a voice feature determination module configured to input the text feature and the duration feature into a preset acoustic model, and obtain a voice feature corresponding to the text to be synthesized
  • the conversion module is used to convert the voice feature into a voice, and generate a target voice corresponding to the text to be synthesized.
  • the text feature determination module further includes: a preprocessing unit, configured to perform regularization processing on the text to be synthesized.
  • the text feature determination module further includes: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and obtain the word segmentation feature corresponding to the text to be synthesized; a polyphonic character feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model and obtain the polyphonic character feature corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model and obtain the prosodic feature corresponding to the text to be synthesized.
  • the device further includes: an acquiring training module, configured to acquire a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features;
  • a duration training module, configured to use the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model;
  • a voice training module, configured to use the text reference feature and the duration reference feature as the input of the acoustic model and the speech reference feature as the output of the acoustic model to train the acoustic model.
  • the training sample set further includes word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts
  • the device includes: a word segmentation training module, configured to use the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model to train the word segmentation model.
  • a polyphonic character training module, configured to use the training text and/or the word segmentation reference feature as the input of the polyphonic character prediction model and the polyphonic character reference feature as the output of the polyphonic character prediction model to train the polyphonic character prediction model.
  • the prosody training module is configured to use the training text and/or the word segmentation reference feature as the input of the prosody prediction model, and the prosody reference feature as the output of the prosody prediction model to train the prosody prediction model.
  • the device further includes: a text obtaining module, configured to obtain a plurality of texts to be synthesized through a text iterator and to perform, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized.
  • a text prediction module, configured to add the text features corresponding to the multiple texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, to obtain a preset number of text features from the feature queue and input the preset number of text features into the duration prediction model respectively, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  • an intelligent terminal is proposed.
  • An intelligent terminal includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • obtaining the text to be synthesized; acquiring text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature; inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model and acquiring the voice features corresponding to the text to be synthesized;
  • converting the voice features into speech and generating the target speech corresponding to the text to be synthesized.
  • a computer-readable storage medium is provided.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • obtaining the text to be synthesized; acquiring text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature; inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model and acquiring the voice features corresponding to the text to be synthesized;
  • converting the voice features into speech and generating the target speech corresponding to the text to be synthesized.
  • after adopting the speech synthesis method, device, terminal and storage medium of the present invention, the speech synthesis process first acquires the text features of the text to be synthesized, including word segmentation features, polyphonic character features, and/or prosodic features; the text features are then input into a preset duration prediction model to obtain the corresponding duration features; the text features and the duration features are input into a preset acoustic model to obtain the corresponding voice features; finally, the voice features are converted into speech to generate the target speech corresponding to the text to be synthesized.
  • the text features taken into account include polyphonic character features and prosodic features and, combined with the duration features predicted by the model, yield the final voice features needed to synthesize speech. That is to say, the speech synthesis method, device, terminal and storage medium provided by the present invention take into account voice features generated from multiple text features together with duration features, so the synthesized speech is more accurate, the accuracy of speech synthesis is improved, and the user experience is improved.
  • FIG. 1 is an application environment diagram of a speech synthesis method in an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a speech synthesis method in an embodiment of this application;
  • FIG. 3 is a schematic flowchart of a process of obtaining text features of a text to be synthesized in an embodiment of this application;
  • FIG. 4 is a schematic flowchart of a method for training a duration prediction model and an acoustic model in an embodiment of this application;
  • FIG. 5 is a schematic flowchart of a method for training a word segmentation model, a polyphonic character prediction model, and/or a prosody prediction model in an embodiment of this application;
  • FIG. 6 is a flowchart of a speech synthesis method in an embodiment of this application;
  • FIG. 7 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 8 is a structural block diagram of a text feature determination module in an embodiment of this application;
  • FIG. 9 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 10 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 11 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 12 is a structural block diagram of a computer device that executes the aforementioned speech synthesis method in an embodiment of this application.
  • Fig. 1 is an application environment diagram of a method for speech synthesis in an embodiment.
  • the speech synthesis system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be a terminal device such as a PC, a mobile phone, a tablet computer, or a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to obtain the text to be synthesized, and the server 120 is used to analyze and process the text to be synthesized, and synthesize the target speech corresponding to the text to be synthesized.
  • the execution of the aforementioned speech synthesis-based method can also be based on a terminal device, which can obtain the text to be synthesized, and can also analyze the text to be synthesized and synthesize the target speech corresponding to the text to be synthesized.
  • this embodiment is applied to the terminal as an example.
  • a method for speech synthesis is provided.
  • the speech synthesis method specifically includes the following steps S102-S110:
  • Step S102 Obtain the text to be synthesized.
  • the text to be synthesized is text information that requires speech synthesis, that is, a text message that needs to be converted into speech.
  • the text to be synthesized could be "Since that moment, she will no longer be arrogant."
  • the above-mentioned text to be synthesized may be obtained by directly inputting text information, or may be obtained by scanning and recognizing the text information through a camera or the like.
  • Step S104 Acquire text features of the text to be synthesized, where the text features include at least one of word segmentation features, polyphone features, and/or prosodic features.
  • the text feature is a feature reflecting the regularities of the text information in the text to be synthesized; it may be one or more of a word segmentation feature, a polyphonic character feature, and a prosodic feature.
  • the word segmentation feature is a phrase feature obtained by classifying the words that make up the text to be synthesized, which can be nouns, verbs, prepositions, adjectives, etc.
  • a polyphonic character feature marks a character or word in the text to be synthesized that has multiple pronunciations; because pronunciation serves to distinguish part of speech and word meaning, the same character is pronounced differently under different usage conditions or environments.
  • Prosodic feature is a kind of prosodic structure of language, which is closely related to syntax, text structure, information structure and other linguistic structures.
  • Prosodic features are typical features of natural language, and are common features of different languages, such as: pitch down, accent, pause, etc.
  • Prosodic features can be divided into three main aspects, intonation, temporal distribution, and stress, which are realized through suprasegmental features.
  • Suprasegmental features include pitch, intensity, and duration characteristics, and are carried by phonemes or groups of phonemes.
  • Prosodic features are important forms of language and corresponding emotional expression.
  • the text to be synthesized can also be preprocessed to avoid some minor influences (such as format problems) causing deviations in the output text characteristics.
  • regularization processing is performed on the text to be synthesized.
  • the regularization process normalizes the text to be synthesized, converting the language text into a preset form of language text.
  • the normalization of the text to be synthesized also includes converting content such as numbers and symbols in the text to be synthesized into Chinese, so as to facilitate the subsequent extraction of word segmentation features, polyphonic character features, and/or prosodic features and to reduce feature extraction errors.
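To illustrate this normalization step, here is a minimal Python sketch that converts ASCII digits in a sentence into Chinese numerals. The digit map and rule are illustrative assumptions, not the patent's implementation; real regularization would also cover symbols, dates, units, and similar formats.

```python
import re

# Hypothetical digit map for illustration; a production text normalizer
# would also handle symbols, dates, units, ordinals, and so on.
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def regularize(text: str) -> str:
    """Replace every ASCII digit in the text with its Chinese numeral."""
    return re.sub(r"\d", lambda m: DIGITS[m.group(0)], text)

print(regularize("他买了3本书"))  # -> 他买了三本书
```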
  • the text features of the text to be synthesized may be acquired by inputting the text to be synthesized into a preset neural network model, which calculates the corresponding text features according to its algorithm, or by extracting the corresponding text features from the text to be synthesized according to a preset feature extraction algorithm.
  • the process of obtaining the text features of the text to be synthesized through the neural network model is described.
  • in FIG. 3, a schematic flowchart of the process of acquiring the text features of the text to be synthesized is given.
  • the foregoing process of obtaining the text features of the text to be synthesized includes steps S1041-S1043 as shown in FIG. 3.
  • Step S1041: Input the text to be synthesized into the preset word segmentation model and obtain the word segmentation features corresponding to the text to be synthesized, where the word segmentation features indicate where the text to be synthesized should be segmented or broken, thereby determining the word segmentation feature corresponding to the segmentation result of the text to be synthesized;
  • Step S1042 Input the text to be synthesized and/or the feature of word segmentation into the preset polyphonic character prediction model, and obtain the feature of the polyphonic character corresponding to the text to be synthesized;
  • Step S1043 Input the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model, and obtain the prosody feature corresponding to the text to be synthesized.
  • the word segmentation model is a neural network model that performs word segmentation processing on the text to be synthesized to obtain word segmentation features; through the word segmentation model, the word segmentation features of the text to be synthesized can be predicted. The word segmentation feature is determined by the word vectors obtained from segmentation.
  • the word vector is a vector corresponding to the word or phrase divided according to the word segmentation model, and is used to determine the word segmentation feature of the text to be synthesized.
  • the polyphonic character prediction model can predict the polyphonic character feature in the text to be synthesized or the word segmentation feature, and can be a neural network model.
  • the prosody prediction model is a neural network model that predicts the prosody features of the text to be synthesized or the word segmentation features, and can predict the prosody features of the text to be synthesized, such as prosodic word features, prosodic phrase features, and intonation phrase features.
  • the text features of the text to be synthesized are not limited to the word segmentation features, polyphonic character features, and prosodic features described in this embodiment; other features may also be involved, such as features describing the correlation between preceding and following words.
  • the user can also establish the structure of a general neural network model by building a computation graph and selecting the input data.
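To make the three-model pipeline of steps S1041-S1043 concrete, the following Python sketch chains the preset models in order. The model objects and their predict methods are hypothetical placeholders, since the patent does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TextFeatures:
    segmentation: Any  # word segmentation feature (step S1041)
    polyphones: Any    # polyphonic character feature (step S1042)
    prosody: Any       # prosodic feature (step S1043)

def extract_text_features(text, seg_model, poly_model, prosody_model):
    """Run the preset models in the order described in steps S1041-S1043."""
    seg = seg_model.predict(text)            # S1041: segment the text
    poly = poly_model.predict(text, seg)     # S1042: text and/or segmentation
    pros = prosody_model.predict(text, seg)  # S1043: text and/or segmentation
    return TextFeatures(seg, poly, pros)
```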
  • Step S106 Input the text feature into a preset duration prediction model, and obtain a duration feature corresponding to the text feature.
  • the duration feature is the time length corresponding to each phoneme-level text feature contained in the text to be synthesized.
  • the preset duration prediction model is a neural network model that predicts the time length corresponding to each phoneme-level text feature; it determines the duration of each phoneme contained in the text to be synthesized. This includes the process of converting Pinyin into phonemes: the polyphonic character prediction model first gives the pronunciation of a character (such as ou3), the pronunciation is then converted into phonemes, and the duration prediction model finally predicts the duration of those phonemes.
  • when the pronunciation is converted into phonemes, a pronunciation such as ou3 in "I am in China" can be converted into the single phoneme ou, while the pronunciation guo2 of "国" ("guo") can be converted into the two phonemes g and uo.
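A minimal sketch of this Pinyin-to-phoneme split, assuming the tone digit has already been stripped and using a hand-written list of initials; a full system would rely on a complete Pinyin syllable table.

```python
# Longest initials first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(syllable: str) -> list:
    """Split a toneless Pinyin syllable into its initial and final."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable yields a single phoneme

print(pinyin_to_phonemes("guo"))  # ['g', 'uo']
print(pinyin_to_phonemes("ou"))   # ['ou']
```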
  • Step S108 Input the text feature and the duration feature into a preset acoustic model, and obtain a voice feature corresponding to the text to be synthesized.
  • voice features are features generated based on text features and duration features, and voice features include features such as sound intensity, loudness, pitch, and/or pitch period.
  • sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of sound-wave propagation; loudness reflects the subjectively perceived strength of the sound; pitch reflects the subjectively perceived frequency of the sound; the pitch period is the quasi-period of the voiced waveform during pronunciation, reflecting the time interval between two successive openings and closings of the glottis, i.e., the frequency of glottal opening and closing.
  • the text feature obtained in step S104 and the duration feature obtained in step S106 are input into a preset acoustic model, and the voice feature corresponding to the text to be synthesized is obtained through the acoustic model.
  • the aforementioned acoustic model for predicting voice features is a neural network model, and the acoustic model has the ability to calculate corresponding voice features based on text features and duration features through prior training.
  • Step S110 Convert the voice feature into a voice, and generate a target voice corresponding to the text to be synthesized.
  • the target speech is the speech generated by the text to be synthesized.
  • the voice features can be synthesized by a vocoder; the vocoder outputs the speech corresponding to the voice features, with the corresponding speech durations, to obtain the target speech.
  • the vocoder can be a parallel WaveNet vocoder.
  • with the voice features as input, the speech corresponding to the text to be synthesized is synthesized through the preset vocoder, and the corresponding target speech is output.
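The patent names a parallel WaveNet vocoder, which requires a trained neural network; as a self-contained stand-in for the features-to-waveform step, this sketch inverts a mel spectrogram with Griffin-Lim via librosa. This is a substitute technique for illustration, not the patent's vocoder, and the dummy spectrogram only keeps the example runnable.

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_wave(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
                hop_length: int = 256) -> np.ndarray:
    """Invert a mel spectrogram to a waveform using Griffin-Lim."""
    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)

# In practice mel would come from the acoustic model; a dummy array is
# used here so the snippet runs on its own.
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)
sf.write("target.wav", mel_to_wave(mel), 22050)
```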
  • the aforementioned duration prediction model and acoustic model can make good predictions about the relevant features of the text to be synthesized only after being trained on training data. In other words, before the text features are used to predict the corresponding duration features and voice features, the duration prediction model and the acoustic model need to be trained so that they can accurately predict the duration features and voice features corresponding to the text features.
  • the above-mentioned speech synthesis method further includes steps S1101-S1103 as shown in FIG. 4.
  • Step S1101 Obtain a training sample set, the training sample set including multiple training texts and corresponding text reference features, duration reference features, and/or voice reference features;
  • Step S1102 Use the text reference feature corresponding to the training text as the input of the duration prediction model, and the duration reference feature as the output of the duration prediction model, and train the duration prediction model;
  • Step S1103 Use the text reference feature and the duration reference feature as the input of the acoustic model, and the speech reference feature as the output of the acoustic model, and train the acoustic model.
  • the duration reference feature is the duration feature expected to correspond to the training text, and the speech reference feature is the voice feature expected to correspond to the training text.
  • the duration prediction model and the acoustic model are trained through the pre-training sample set, so that the model has the ability to accurately predict the duration characteristics and voice characteristics corresponding to the text to be synthesized.
  • the text reference feature and duration reference feature corresponding to the training text are used as input, the corresponding voice reference feature is used as output, and the preset acoustic model is trained so that the acoustic model has the voice feature prediction function.
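As a hedged illustration of this input/output training setup, here is a minimal PyTorch loop for the duration prediction model; the synthetic tensors, shapes, and the small network are assumptions for demonstration, not the patent's architecture. The same pattern applies to the acoustic model, with text-plus-duration features as input and voice features as the target.

```python
import torch
from torch import nn

# Toy stand-ins: 32 training items, 20 phonemes each, 64-dim text features.
text_feats = torch.randn(32, 20, 64)  # text reference features (input)
dur_refs = torch.rand(32, 20, 1)      # duration reference features (target)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    optimizer.zero_grad()
    pred = model(text_feats)          # predicted phoneme durations
    loss = loss_fn(pred, dur_refs)    # compare against reference durations
    loss.backward()
    optimizer.step()
```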
  • model training is also performed for each model involved in text feature prediction, specifically including training of the word segmentation model, the polyphonic character prediction model, and the prosody prediction model.
  • the word segmentation model, polyphonic character prediction model, and prosody prediction model involved in text feature prediction are trained on the training sample set, so that these models can respectively predict text features such as word segmentation features, polyphonic character features, and prosodic features from the text to be synthesized.
  • the above-mentioned speech synthesis method further includes steps S2101-S2103 as shown in FIG. 5:
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the plurality of training texts; the method includes:
  • Step S2101 The training text is used as the input of the word segmentation model, and the word segmentation reference feature is used as the output of the word segmentation model, and the word segmentation model is trained.
  • Step S2102 Use the training text and/or the word segmentation reference feature as the input of the polyphone prediction model, and the polyphone reference feature as the output of the polyphone prediction model, and train the polyphone prediction model.
  • Step S2103 Use the training text and/or the word segmentation reference feature as the input of the prosody prediction model, and the prosody reference feature as the output of the prosody prediction model, and train the prosody prediction model.
  • the training sample set may also include multiple training texts and word segmentation features, polyphonic character features, and prosodic features that are expected to be output by the model.
  • the word segmentation reference feature is the word segmentation feature that the word segmentation model is expected to output for the training text;
  • the polyphone reference feature is the polyphonic character feature that the polyphonic character prediction model is expected to output for the training text and the corresponding word segmentation feature;
  • the prosodic reference feature is the prosodic feature that the prosody prediction model is expected to output for the training text and the corresponding word segmentation feature.
  • for each training text contained in the training data set, the training text is used as input, the corresponding word segmentation reference feature is used as output, and the preset word segmentation model is trained so that the word segmentation model has the word segmentation feature prediction function.
  • the training text and the corresponding word segmentation reference feature are used as input, the corresponding polyphonic character reference feature is used as output, and the preset polyphonic character prediction model is trained so that it has the function of predicting polyphonic character features.
  • the training text and the corresponding word segmentation reference feature are used as input, the corresponding prosody reference feature is used as output, and the preset prosody prediction model is trained so that it has the prosodic feature prediction function.
  • the word segmentation model, the polyphonic character prediction model, and the prosody prediction model are trained through pre-processed data, so that the model can accurately predict the word segmentation feature, the polyphonic character feature, and the prosody feature corresponding to the training text.
  • multiple texts to be synthesized can be obtained at the same time, and text characteristics corresponding to each text to be synthesized can be obtained.
  • the text features corresponding to the multiple texts to be synthesized are filtered, sorted, and placed into a preset feature queue; a preset number of text features are then obtained from the feature queue and input into the duration prediction model and the acoustic model for prediction to obtain the corresponding features.
  • the steps of generating text features corresponding to each text to be synthesized and predicting a preset number of text features are performed simultaneously.
  • the above-mentioned speech synthesis method further includes steps S3101-S3102 as shown in FIG. 6:
  • Step S3101 Obtain a plurality of texts to be synthesized through a text iterator, and for each text to be synthesized, respectively execute the step of obtaining the text characteristics of the text to be synthesized;
  • Step S3102: Add the text features corresponding to the multiple texts to be synthesized to a preset feature queue; when the feature queue meets a preset condition, obtain a preset number of text features from the feature queue and input the preset number of text features respectively into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  • the text iterator is used to obtain continuous data in multiple texts to be synthesized and corresponding text features.
  • text features can thus be produced continuously and iterated over across the multiple processes of acquiring text features from the texts to be synthesized.
  • the feature queue contains multiple text features.
  • the preset condition is the condition for inputting text features into the duration prediction model; it can be the number of text features reaching a certain value, or a preset acquisition time for the text features.
  • the preset number is the number of text features output by the feature queue, which can be a fixed value or a value that changes according to a certain rule.
  • a feature queue and a text iterator are added to process multiple texts to be synthesized, so that text conversion of the texts to be synthesized is more effective and faster, improving the efficiency of speech synthesis and model training.
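A sketch of this iterator-plus-queue arrangement using Python's standard library; the batch size, sentinel value, and stand-in feature extractor are illustrative assumptions.

```python
import queue
import threading

feature_queue = queue.Queue()
BATCH_SIZE = 4  # the "preset number" of text features

def producer(texts, extract):
    """Text iterator: extract features for each text and enqueue them."""
    for text in texts:
        feature_queue.put(extract(text))
    feature_queue.put(None)  # sentinel: no more texts

def consumer(predict_duration):
    """Drain the queue in batches; duration prediction runs while the
    producer keeps generating features for the remaining texts."""
    batch = []
    while True:
        item = feature_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == BATCH_SIZE:  # preset condition met
            for feats in batch:
                predict_duration(feats)
            batch.clear()
    for feats in batch:  # flush any remainder
        predict_duration(feats)

texts = ["文本一", "文本二", "文本三", "文本四", "文本五"]
t1 = threading.Thread(target=producer, args=(texts, lambda t: t))
t2 = threading.Thread(target=consumer, args=(print,))
t1.start(); t2.start(); t1.join(); t2.join()
```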
  • the duration prediction model, the acoustic model, the word segmentation model, the polyphonic character prediction model, and/or the prosody prediction model are neural network models.
  • they are bidirectional long short-term memory network models (BiLSTM, Bi-directional Long Short-Term Memory).
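For reference, a minimal BiLSTM module as it might be written in PyTorch; the layer sizes are illustrative assumptions, since the patent does not specify them.

```python
import torch
from torch import nn

class BiLSTMTagger(nn.Module):
    """BiLSTM mapping a feature sequence to a per-step prediction."""
    def __init__(self, in_dim=64, hidden=128, out_dim=1):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):
        out, _ = self.lstm(x)  # (batch, seq, 2 * hidden)
        return self.head(out)  # (batch, seq, out_dim)

y = BiLSTMTagger()(torch.randn(2, 20, 64))
print(y.shape)  # torch.Size([2, 20, 1])
```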
  • as shown in FIG. 7, in one embodiment, a speech synthesis device is provided, and the device includes:
  • the obtaining module 702 is used to obtain the text to be synthesized
  • the text feature determining module 704 is configured to obtain text features of the text to be synthesized, where the text features include at least one of word segmentation features, polyphone features, and/or prosodic features;
  • the duration feature determining module 706 is configured to input the text feature into a preset duration prediction model, and obtain the duration feature corresponding to the text feature;
  • a voice feature determining module 708, configured to input the text feature and the duration feature into a preset acoustic model, and obtain a voice feature corresponding to the text to be synthesized;
  • the conversion module 710 is configured to convert the voice feature into a voice, and generate a target voice corresponding to the text to be synthesized.
  • the text feature determination module 704 further includes: a preprocessing unit, configured to perform regularization processing on the text to be synthesized.
  • the text feature determination module 704 further includes: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and obtain the word segmentation feature corresponding to the text to be synthesized; a polyphonic character feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model and obtain the polyphonic character feature corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model and obtain the prosodic feature corresponding to the text to be synthesized.
  • the device further includes: an acquiring training module 703, configured to acquire a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features; a duration training module 705, configured to use the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model; and a voice training module 707, configured to use the text reference feature and the duration reference feature as the input of the acoustic model and the speech reference feature as the output of the acoustic model to train the acoustic model.
  • the training sample set further includes word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts
  • the device includes: a word segmentation training module 7041, configured to use the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model to train the word segmentation model.
  • a polyphonic character training module 7043, configured to use the training text and/or the word segmentation reference feature as the input of the polyphonic character prediction model and the polyphonic character reference feature as the output of the polyphonic character prediction model to train the polyphonic character prediction model.
  • the prosody training module 7045 is configured to use the training text and/or the word segmentation reference feature as the input of the prosody prediction model, and the prosody reference feature as the output of the prosody prediction model to train the prosody prediction model.
  • the device further includes: a text obtaining module 709, configured to obtain a plurality of texts to be synthesized through a text iterator and to perform, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized; and a text prediction module 711, configured to add the text features corresponding to the multiple texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, to obtain a preset number of text features from the feature queue and input them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
  • the duration prediction model, acoustic model, word segmentation model, polyphonic character prediction model, and/or prosody prediction model are BiLSTM models.
  • Fig. 12 shows an internal structure diagram of a smart terminal in an embodiment.
  • the smart terminal may specifically be a terminal or a server.
  • the smart terminal includes a processor, a memory and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the smart terminal stores an operating system and may also store a computer program.
  • the processor can implement the speech synthesis method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the speech synthesis method.
  • FIG. 12 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • an intelligent terminal is provided, including a memory and a processor, the memory storing a computer program; when the computer program is executed by the processor, the processor executes the following steps: obtaining the text to be synthesized; acquiring text features of the text to be synthesized, the text features including at least one of word segmentation features, polyphonic character features, and/or prosodic features; inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model to obtain the voice features corresponding to the text to be synthesized; and converting the voice features into speech to generate the target speech corresponding to the text to be synthesized.
  • in one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
  • the step of acquiring the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model to obtain the prosodic features corresponding to the text to be synthesized.
  • the method further includes: obtaining a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features; using the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model.
  • the text reference feature and the duration reference feature are used as the input of the acoustic model, and the speech reference feature is used as the output of the acoustic model, and the acoustic model is trained.
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model, and training the word segmentation model.
  • the training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained.
  • the training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
  • the method further includes: obtaining a plurality of texts to be synthesized through a text iterator and performing, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized;
  • adding the text features corresponding to the texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is executed synchronously.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the processor executes the following steps: obtaining the text to be synthesized; acquiring the text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature; inputting the text features into a preset duration prediction model to obtain the duration features corresponding to the text features;
  • inputting the text features and the duration features into a preset acoustic model to obtain the voice features corresponding to the text to be synthesized; and converting the voice features into speech to generate the target speech corresponding to the text to be synthesized.
  • in one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
  • the step of acquiring the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model to obtain the prosodic features corresponding to the text to be synthesized.
  • the method further includes: obtaining a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features; using the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model.
  • the text reference feature and the duration reference feature are used as the input of the acoustic model, and the speech reference feature is used as the output of the acoustic model, and the acoustic model is trained.
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model, and training the word segmentation model.
  • the training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained.
  • the training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
  • the method further includes: obtaining a plurality of texts to be synthesized through a text iterator and performing, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized;
  • adding the text features corresponding to the texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is executed synchronously.
  • after adopting the speech synthesis method, device, terminal and storage medium of the present invention, the speech synthesis process first acquires the text features of the text to be synthesized, including word segmentation features, polyphonic character features, and/or prosodic features; the text features are then input into a preset duration prediction model to obtain the corresponding duration features; the text features and the duration features are input into a preset acoustic model to obtain the corresponding voice features; finally, the voice features are converted into speech to generate the target speech corresponding to the text to be synthesized.
  • the text features taken into account include polyphonic character features and prosodic features and, combined with the duration features predicted by the model, yield the final voice features needed to synthesize speech. That is to say, the speech synthesis method, device, terminal and storage medium provided by the present invention take into account voice features generated from multiple text features together with duration features, so the synthesized speech is more accurate, the accuracy of speech synthesis is improved, and the user experience is improved.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A speech synthesis method, a speech synthesis apparatus, a smart terminal and a storage medium. The method comprises: acquiring a text to be synthesized (S102); acquiring text features of the text to be synthesized, wherein the text features comprise at least one of a word segmentation feature, a polyphone feature and/or a prosodic feature (S104); inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features (S106); inputting the text features and the duration features into a preset acoustic model, and acquiring the speech features corresponding to the text to be synthesized (S108); and converting the speech features to speech, and generating target speech corresponding to the text to be synthesized (S110). According to the speech synthesis method, the speech features generated by various text features and duration features are considered, such that the synthesized speech is more accurate, thereby improving the speech synthesis accuracy, and improving the user experience.

Description

语音合成方法、装置、终端及存储介质Speech synthesis method, device, terminal and storage medium 技术领域Technical field
本发明涉及人工智能技术领域,尤其涉及一种语音合成方法、装置、智能终端及计算机可读存储介质。The present invention relates to the field of artificial intelligence technology, in particular to a speech synthesis method, device, intelligent terminal and computer readable storage medium.
背景技术Background technique
随着移动互联网和人工智能技术的快速发展,语音播报、听小说、听新闻、智能交互等一系列语音合成的场景越来越多。语音合成可以将文本等转换成自然语音输出。With the rapid development of mobile Internet and artificial intelligence technology, there are more and more scenarios for speech synthesis such as voice broadcasting, listening to novels, listening to news, and intelligent interaction. Speech synthesis can convert text, etc. into natural speech output.
技术问题technical problem
In the prior art, speech synthesis mostly uses statistical parametric synthesis: spectral characteristic parameters are modeled and a parametric synthesizer is generated to construct the mapping from text sequences to speech; a statistical model then produces the speech parameters for each moment (including fundamental frequency, formant frequencies, etc.), these parameters are converted into the corresponding speech features, and finally the output speech is generated. However, in this method, the result computed by the single sub-module at each step is not necessarily optimal, so the text cannot be accurately converted into speech suited to multi-language, multi-timbre scenarios. This degrades the overall quality of the synthesized speech and greatly affects the user experience.
In other words, in the above speech synthesis scheme, the quality of the finally synthesized speech is insufficient because the results computed by single sub-modules are not optimal.
Technical solution
On this basis, it is necessary to propose a speech synthesis method, an apparatus, a smart terminal and a computer-readable storage medium to address the above problems.
In a first aspect of the present invention, a speech synthesis method is proposed.
A speech synthesis method, comprising:
acquiring a text to be synthesized;
acquiring text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features;
inputting the text features and the duration features into a preset acoustic model, and acquiring speech features corresponding to the text to be synthesized; and
converting the speech features into speech, and generating target speech corresponding to the text to be synthesized.
In one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further comprises: performing normalization processing on the text to be synthesized.
In one embodiment, the step of acquiring the text features of the text to be synthesized further comprises: inputting the text to be synthesized into a preset word segmentation model, and acquiring word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model, and acquiring polyphone features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation features into a preset prosody prediction model, and acquiring prosodic features corresponding to the text to be synthesized.
In one embodiment, the method further comprises: acquiring a training sample set, the training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features; training the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and training the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
In one embodiment, the training sample set further contains word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts, and the method comprises: training the word segmentation model with the training texts as its input and the word segmentation reference features as its output; training the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output; and training the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
In one embodiment, the method further comprises: acquiring a plurality of texts to be synthesized through a text iterator, and performing the step of acquiring the text features of the text to be synthesized for each text to be synthesized; adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue; and when the feature queue meets a preset condition, acquiring a preset number of text features from the feature queue and inputting them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
In a second aspect of the present invention, a speech synthesis apparatus is proposed.
A speech synthesis apparatus, comprising:
an acquisition module, configured to acquire a text to be synthesized;
a text feature determination module, configured to acquire text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
a duration feature determination module, configured to input the text features into a preset duration prediction model and acquire duration features corresponding to the text features;
a speech feature determination module, configured to input the text features and the duration features into a preset acoustic model and acquire speech features corresponding to the text to be synthesized; and
a conversion module, configured to convert the speech features into speech and generate target speech corresponding to the text to be synthesized.
In one embodiment, the text feature determination module further comprises a preprocessing unit, configured to perform normalization processing on the text to be synthesized.
In one embodiment, the text feature determination module further comprises: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and acquire the word segmentation features corresponding to the text to be synthesized; a polyphone feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model and acquire the polyphone features corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model and acquire the prosodic features corresponding to the text to be synthesized.
In one embodiment, the apparatus further comprises: a training acquisition module, configured to acquire a training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features; a duration training module, configured to train the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and a speech training module, configured to train the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
In one embodiment, the training sample set further contains word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts, and the apparatus comprises: a word segmentation training module, configured to train the word segmentation model with the training texts as its input and the word segmentation reference features as its output; a polyphone training module, configured to train the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output; and a prosody training module, configured to train the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
In one embodiment, the apparatus further comprises: a text acquisition module, configured to acquire a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of acquiring the text features of the text to be synthesized; and a text prediction module, configured to add the text features corresponding to the plurality of texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, acquire a preset number of text features from the feature queue and input them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
In a third aspect of the present invention, a smart terminal is proposed.
A smart terminal, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a text to be synthesized;
acquiring text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features;
inputting the text features and the duration features into a preset acoustic model, and acquiring speech features corresponding to the text to be synthesized; and
converting the speech features into speech, and generating target speech corresponding to the text to be synthesized.
In a fourth aspect of the present invention, a computer-readable storage medium is proposed.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a text to be synthesized;
acquiring text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features;
inputting the text features and the duration features into a preset acoustic model, and acquiring speech features corresponding to the text to be synthesized; and
converting the speech features into speech, and generating target speech corresponding to the text to be synthesized.
Beneficial effects
Implementing the embodiments of the present invention has the following beneficial effects:
With the speech synthesis method, apparatus, terminal and storage medium of the present invention, during speech synthesis the text features of the text to be synthesized are first acquired, the text features including word segmentation features, polyphone features and/or prosodic features; the text features are then input into a preset duration prediction model to acquire the corresponding duration features; the text features and the duration features are input into a preset acoustic model to acquire the corresponding speech features; and finally the speech features are converted into speech to generate the target speech corresponding to the text to be synthesized. During feature extraction for speech synthesis, the text features considered include polyphone features and prosodic features, and these are combined with the duration features predicted by the model to obtain the speech features needed for the final synthesis. That is, the speech synthesis method, apparatus, terminal and storage medium provided by the present invention take into account speech features generated from multiple text features and duration features, so that the synthesized speech is more accurate, which improves speech synthesis accuracy and the user experience.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Among them:
Fig. 1 is an application environment diagram of the speech synthesis method in an embodiment of this application;
Fig. 2 is a schematic flowchart of the speech synthesis method in an embodiment of this application;
Fig. 3 is a schematic flowchart of a process of acquiring the text features of the text to be synthesized in an embodiment of this application;
Fig. 4 is a schematic flowchart of a method for training the duration prediction model and the acoustic model in an embodiment of this application;
Fig. 5 is a schematic flowchart of the word segmentation model, the polyphone prediction model and/or the prosody prediction model in an embodiment of this application;
Fig. 6 is a flowchart of the speech synthesis method in an embodiment of this application;
Fig. 7 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 8 is a structural block diagram of the text feature determination module in an embodiment of this application;
Fig. 9 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 10 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 11 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 12 is a structural block diagram of a computer device that executes the aforementioned speech synthesis method in an embodiment of this application.
Embodiments of the present invention
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
Fig. 1 is an application environment diagram of the speech synthesis method in an embodiment. Referring to Fig. 1, the speech synthesis method is applied to a speech synthesis system. The speech synthesis system includes a terminal 110 and a server 120, connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be a device such as a PC, a mobile phone, a tablet computer or a notebook computer. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 is used to acquire the text to be synthesized, and the server 120 is used to analyze and process the text to be synthesized and synthesize the corresponding target speech.
In another embodiment, the above speech synthesis method may also be executed on a single terminal device, which can both acquire the text to be synthesized and analyze it to synthesize the corresponding target speech.
Since the method can be applied to either a terminal or a server, and the specific speech synthesis process is the same in both cases, this embodiment takes application to a terminal as an example.
As shown in Fig. 2, in one embodiment, a speech synthesis method is provided. The method specifically includes the following steps S102-S110:
Step S102: acquire the text to be synthesized.
Specifically, the text to be synthesized is the text information on which speech synthesis is to be performed, for example text that needs to be converted into speech in scenarios such as voice chat robots or reading news aloud. As an example, the text to be synthesized could be "自从那一刻起，她便不再妄自菲薄。" ("Since that moment, she no longer belittled herself.").
The text to be synthesized may be obtained directly as input text, or obtained by scanning and recognition through a camera or the like.
Step S104: acquire text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature.
Specifically, the text features are features that capture the linguistic regularities of the text information in the text to be synthesized.
In a specific embodiment, the text feature may be one of a word segmentation feature, a polyphone feature and a prosodic feature.
The word segmentation feature is a phrase-level feature obtained by classifying the words that make up the text to be synthesized, such as nouns, verbs, prepositions and adjectives.
The polyphone feature covers characters or words in the text to be synthesized that have multiple pronunciations; since pronunciation distinguishes part of speech and word meaning, the same character is pronounced differently in different usages or contexts.
Prosodic features describe the prosodic structure of language, which is closely related to other linguistic structures such as syntax, discourse structure and information structure. Prosodic features are typical of natural language and common across languages, for example pitch declination, stress and pauses. They can be divided into three main aspects, intonation, temporal distribution and stress, realized through suprasegmental features. Suprasegmental features include pitch, intensity and timing characteristics, carried by phonemes or groups of phonemes. Prosodic features are an important vehicle for language and the corresponding emotional expression.
Before acquiring the text features of the text to be synthesized, the text may first be preprocessed, so that minor issues (such as formatting problems) do not bias the output text features.
In one embodiment, before the text features of the text to be synthesized are acquired, normalization processing is performed on the text to be synthesized.
Here, normalization standardizes the text to be synthesized, converting it into a preset form; for example, for English it handles letter case and can remove punctuation as needed, so that issues such as text format do not bias the output text features. In another specific embodiment, normalizing the text to be synthesized also includes converting numbers, symbols and similar tokens in the text into Chinese, which facilitates the subsequent extraction of word segmentation features, polyphone features and/or prosodic features and reduces feature extraction errors.
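As a concrete illustration of the normalization described above, a minimal Python sketch might look as follows; the digit-to-Chinese reading table and the punctuation policy are illustrative assumptions rather than rules prescribed by this application:

    import re

    # Illustrative digit-to-Chinese table; a real normalizer would also
    # cover multi-digit numbers, dates, units and symbol readings.
    _DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
               "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

    def normalize_text(text: str) -> str:
        """Regularize raw input text before feature extraction."""
        text = text.lower()                                 # unify letter case
        text = "".join(_DIGITS.get(ch, ch) for ch in text)  # spell out digits
        text = re.sub(r"[^\w\s]", "", text)                 # drop punctuation marks
        return text.strip()

    print(normalize_text("GDP增长8%!"))   # -> "gdp增长八"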
The text features of the text to be synthesized may be obtained by inputting the text into a preset neural network model, which computes the corresponding text features according to its algorithm, or by extracting the corresponding text features from the text according to a preset feature extraction algorithm.
In one embodiment, the process of acquiring the text features of the text to be synthesized through neural network models is described.
Specifically, Fig. 3 shows a schematic flowchart of a process of acquiring the text features of the text to be synthesized.
As shown in Fig. 3, the process includes steps S1041-S1043:
Step S1041: input the text to be synthesized into a preset word segmentation model, and acquire the word segmentation features corresponding to the text to be synthesized, where the word segmentation features indicate where the text should be segmented or broken, thereby determining the word segmentation features corresponding to the segmentation result of the text;
Step S1042: input the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model, and acquire the polyphone features corresponding to the text to be synthesized;
Step S1043: input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model, and acquire the prosodic features corresponding to the text to be synthesized.
The word segmentation model is a neural network model that performs word segmentation on the text to be synthesized to obtain word segmentation features; through it, the word segmentation features of the text can be predicted. The word segmentation features are determined by the character vectors obtained from segmentation, where a character vector is the vector corresponding to a word or phrase divided by the word segmentation model and is used to determine the word segmentation features of the text to be synthesized.
The polyphone prediction model predicts the polyphone features in the text to be synthesized or in the word segmentation features, and may be a neural network model.
The prosody prediction model is a neural network model that predicts the prosodic features of the text to be synthesized or of the word segmentation features, for example prosodic word features, prosodic phrase features and intonation phrase features.
The text features of the text to be synthesized in this embodiment are not limited to the word segmentation, polyphone and prosodic features described here.
Users can configure the text features involved: beyond word segmentation, polyphone and prosodic features, other features such as contextual word association features may be used. Users may also build the structure of the overall neural network model by constructing a computation graph and selecting the input data.
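The data flow of steps S1041-S1043 can be summarized in the following sketch; the three model objects and their predict methods are placeholders for the trained networks described above, not a fixed interface:

    def extract_text_features(text, seg_model, poly_model, prosody_model):
        """Front-end of steps S1041-S1043: word segmentation runs first,
        and its output is fed, together with the raw text, to the
        polyphone and prosody predictors."""
        seg_feat = seg_model.predict(text)                    # S1041
        poly_feat = poly_model.predict(text, seg_feat)        # S1042
        prosody_feat = prosody_model.predict(text, seg_feat)  # S1043
        return {"segmentation": seg_feat,
                "polyphone": poly_feat,
                "prosody": prosody_feat}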
Step S106: input the text features into a preset duration prediction model, and acquire the duration features corresponding to the text features.
Specifically, the duration feature is the time length corresponding to the phoneme-level text features contained in the text to be synthesized and its text features. The preset duration prediction model is a neural network model that predicts the time length corresponding to phoneme text features and determines the time length of each phoneme contained in the text to be synthesized. This includes converting pinyin into phonemes: the polyphone prediction model yields a character's pronunciation (for example ou3), the pronunciation is converted into phonemes, and the duration prediction model then predicts the duration of each phoneme. As an example of converting pronunciations into phonemes, in "我在中国" the pronunciation ou3 of "我" converts into the single phoneme ou, while the pronunciation guo2 of "国" converts into the two phonemes g and uo.
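Following the example above, the pinyin-to-phoneme step might be sketched as below; the syllable-to-phoneme table is a hypothetical fragment matching the example, not a complete inventory:

    # Hypothetical syllable-to-phoneme table matching the example in the
    # text: "ou3" yields a single phoneme, "guo2" an initial plus a final.
    _SYLLABLE_TO_PHONES = {
        "ou": ["ou"],
        "guo": ["g", "uo"],
    }

    def pinyin_to_phonemes(syllable: str) -> list:
        """Strip the tone digit and split a pinyin syllable into phonemes."""
        base = syllable.rstrip("012345")          # e.g. "guo2" -> "guo"
        return _SYLLABLE_TO_PHONES.get(base, [base])

    print(pinyin_to_phonemes("guo2"))   # ['g', 'uo']; each phoneme then receives a predicted duration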
Step S108: input the text features and the duration features into a preset acoustic model, and acquire the speech features corresponding to the text to be synthesized.
Specifically, the speech features are features generated from the text features and the duration features, and include sound intensity, loudness, pitch and/or pitch period. Sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of sound wave propagation; loudness reflects the subjectively perceived strength of the sound; pitch reflects the subjectively perceived frequency of the sound; and the pitch period is the quasi-period exhibited by the voiced waveform during pronunciation, reflecting the time interval between, or the frequency of, two adjacent openings and closings of the glottis.
In this embodiment, the text features acquired in step S104 and the duration features acquired in step S106 are input into the preset acoustic model, which predicts the speech features corresponding to the text to be synthesized.
The acoustic model that predicts the speech features is a neural network model; through prior training it acquires the ability to compute the corresponding speech features from the text features and the duration features.
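One common way to combine the two model inputs, sketched here in PyTorch under the assumption that the duration model outputs an integer frame count per phoneme, is to repeat each phoneme's text-feature vector for its predicted number of frames before feeding the acoustic model:

    import torch

    def expand_to_frames(phone_feats: torch.Tensor,
                         frame_counts: torch.Tensor) -> torch.Tensor:
        """Repeat each phoneme's text-feature vector for its predicted duration.

        phone_feats:  (num_phones, feat_dim) per-phoneme text features
        frame_counts: (num_phones,) predicted frames per phoneme
        Returns a (total_frames, feat_dim) frame-level acoustic-model input.
        """
        return torch.repeat_interleave(phone_feats, frame_counts, dim=0)

    feats = torch.randn(3, 8)                     # 3 phonemes, 8-dim features
    frames = torch.tensor([4, 2, 5])              # predicted durations in frames
    print(expand_to_frames(feats, frames).shape)  # torch.Size([11, 8])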
Step S110: convert the speech features into speech, and generate the target speech corresponding to the text to be synthesized.
The target speech is the speech generated from the text to be synthesized. To convert speech features into speech, the features can be synthesized by a vocoder, which outputs the speech and speech duration corresponding to the features, yielding the target speech; the vocoder may be a parallel WaveNet vocoder. Specifically, the speech features are taken as input, the preset vocoder performs speech synthesis on the speech features corresponding to the text to be synthesized, and the corresponding target speech is output.
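This application names a parallel WaveNet vocoder but does not specify its interface; the sketch below therefore uses Griffin-Lim inversion from librosa purely as a stand-in to show the shape of the final feature-to-waveform step:

    import librosa
    import numpy as np
    import soundfile as sf

    def features_to_speech(mag_spectrogram: np.ndarray,
                           sr: int = 22050,
                           out_path: str = "target.wav") -> np.ndarray:
        """Stand-in vocoder: invert a magnitude spectrogram to a waveform.

        A production system would feed the predicted speech features to a
        neural vocoder such as parallel WaveNet instead of Griffin-Lim.
        """
        waveform = librosa.griffinlim(mag_spectrogram)  # iterative phase reconstruction
        sf.write(out_path, waveform, sr)
        return waveform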
Further, the above duration prediction model and acoustic model can predict the relevant features of the text to be synthesized well, but before using them for prediction they must be trained on training data. That is, before the text features are used to predict the corresponding duration and speech features, the duration prediction model and the acoustic model need to be trained so that they can accurately predict the duration features and speech features corresponding to the text features.
As shown in Fig. 4, the speech synthesis method further includes steps S1101-S1103:
Step S1101: acquire a training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features;
Step S1102: train the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output;
Step S1103: train the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
Before model training, the data must first be labeled to determine the duration reference features and speech reference features corresponding to each text; the duration reference features are the duration features corresponding to the text, and the speech reference features are the speech features corresponding to the text. In this embodiment, the duration prediction model and the acoustic model are trained on the pre-built training sample set so that they can accurately predict the duration features and speech features corresponding to the text to be synthesized.
For each training text in the training data set, the text reference features corresponding to the training text are taken as input and the corresponding duration reference features as output, and the preset duration prediction model is trained so that it acquires the ability to predict duration features.
For each training text in the training data set, the text reference features and duration reference features corresponding to the training text are taken as input and the corresponding speech reference features as output, and the preset acoustic model is trained so that it acquires the ability to predict speech features.
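A minimal supervised training loop for the duration prediction model of step S1102 might look as follows, assuming batches of (text-feature, duration-reference) pairs; the mean-squared-error objective and optimizer settings are illustrative choices, not mandated here:

    import torch
    import torch.nn as nn

    def train_duration_model(model: nn.Module, loader, epochs: int = 10,
                             lr: float = 1e-3) -> nn.Module:
        """Fit the duration predictor on (text_feature, duration_target) batches."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for text_feats, duration_refs in loader:
                optimizer.zero_grad()
                pred = model(text_feats)             # predicted per-phoneme durations
                loss = loss_fn(pred, duration_refs)  # error against reference durations
                loss.backward()
                optimizer.step()
        return model

The acoustic model of step S1103 would be trained with the same loop shape, with the concatenated text and duration reference features as input and the speech reference features as the target.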
Further, in one embodiment, the models used for text feature prediction also need to be trained, specifically the word segmentation model, the polyphone prediction model and the prosody prediction model.
That is, the word segmentation model, the polyphone prediction model and the prosody prediction model involved in text feature prediction are trained on the training sample set, so that each acquires the ability to predict, from the text to be synthesized, the corresponding word segmentation features, polyphone features and prosodic features.
As shown in Fig. 5, the speech synthesis method further includes steps S2101-S2103:
In one embodiment, the training sample set further contains the word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts; the method includes:
Step S2101: train the word segmentation model with the training texts as its input and the word segmentation reference features as its output.
Step S2102: train the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output.
Step S2103: train the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
The training sample set may also include multiple training texts together with the word segmentation, polyphone and prosodic features the models are expected to output. The word segmentation reference features are the word segmentation features the word segmentation model is expected to output for a training text; the polyphone reference features are the polyphone features the polyphone prediction model is expected to output from the training text and the corresponding word segmentation features; and the prosody reference features are the prosodic features the prosody prediction model is expected to output from the training text and the corresponding word segmentation features.
For each training text in the training data set, the training text is taken as input and the corresponding word segmentation reference features as output, and the preset word segmentation model is trained so that it can predict word segmentation features.
For each training text in the training data set, the training text and the corresponding word segmentation reference features are taken as input and the corresponding polyphone reference features as output, and the preset polyphone prediction model is trained so that it can predict polyphone features.
For each training text in the training data set, the training text and the corresponding word segmentation reference features are taken as input and the corresponding prosody reference features as output, and the preset prosody prediction model is trained so that it can predict prosodic features.
In this embodiment, the word segmentation model, the polyphone prediction model and the prosody prediction model are trained on pre-processed data, so that the models can accurately predict the word segmentation features, polyphone features and prosodic features corresponding to the training texts.
In the actual prediction process, multiple texts to be synthesized can be acquired at the same time, and the text features corresponding to each are obtained. The text features corresponding to the multiple texts are filtered, sorted and put into a preset feature queue. A preset number of text features are taken from the feature queue and input into the duration prediction model and the acoustic model for prediction, yielding the corresponding features. The step of generating the text features for each text to be synthesized and the step of predicting on the preset number of text features proceed in parallel.
As shown in Fig. 6, the speech synthesis method further includes steps S3101-S3102:
Step S3101: acquire multiple texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of acquiring the text features of the text to be synthesized;
Step S3102: add the text features corresponding to the multiple texts to be synthesized to a preset feature queue; when the feature queue meets a preset condition, acquire a preset number of text features from the feature queue and input them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
Here, the text iterator is used to obtain a continuous stream of data from the multiple texts to be synthesized and their corresponding text features; it can continuously iterate text features across multiple feature-extraction processes. The feature queue is an ordered collection containing multiple text features. The preset condition is the condition for inputting text features into the duration prediction model; it may be that the text features reach a certain number, or that a preset acquisition time is reached. The preset number is the number of text features the feature queue outputs; it may be a fixed value or a value that varies according to some rule.
In this embodiment, the feature queue and the text iterator are added to process multiple texts to be synthesized, making text conversion more effective and faster and improving the efficiency of speech synthesis and model training.
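The producer-consumer arrangement of steps S3101-S3102 might be sketched with Python's standard queue module as below; the batch size of 8, the sentinel handling and the predict_batch method are assumptions for illustration. In practice the producer and consumer would run on separate threads so that feature extraction and duration prediction proceed in parallel:

    import queue

    BATCH_SIZE = 8               # illustrative "preset number" of features per batch
    feature_queue = queue.Queue()

    def produce_features(text_iterator, extract_fn):
        """S3101: iterate over texts and push their text features onto the queue."""
        for text in text_iterator:
            feature_queue.put(extract_fn(text))
        feature_queue.put(None)  # sentinel: no more texts

    def consume_features(duration_model):
        """S3102: drain the queue in fixed-size batches for duration prediction."""
        batch = []
        while True:
            item = feature_queue.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) == BATCH_SIZE:         # "preset condition" reached
                duration_model.predict_batch(batch)
                batch = []
        if batch:                                # flush the final partial batch
            duration_model.predict_batch(batch)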
Exemplarily, in the above training and prediction processes, the duration prediction model, the acoustic model, the word segmentation model, the polyphone prediction model and/or the prosody prediction model are neural network models; in a specific embodiment, they are bidirectional long short-term memory network models (BiLSTM models).
The BiLSTM (Bi-directional Long Short-Term Memory) model gives the data time dependence and processes the data globally, allowing better predictions through features such as the preceding and following words.
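A minimal BiLSTM predictor in PyTorch, shown only to make the architecture concrete; the layer sizes and the single-output head (e.g. a per-phoneme duration) are placeholders:

    import torch
    import torch.nn as nn

    class BiLSTMPredictor(nn.Module):
        """Generic bidirectional LSTM: a sequence of feature vectors in,
        a per-step prediction out."""
        def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden_dim, output_dim)  # forward + backward states

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, _ = self.lstm(x)   # (batch, seq_len, 2 * hidden_dim)
            return self.proj(out)   # (batch, seq_len, output_dim)

    model = BiLSTMPredictor(input_dim=32, hidden_dim=64, output_dim=1)
    durations = model(torch.randn(4, 20, 32))   # e.g. per-phoneme duration predictions
    print(durations.shape)                      # torch.Size([4, 20, 1])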
As shown in Fig. 7, in one embodiment, a speech synthesis apparatus is provided, the apparatus including:
an acquisition module 702, configured to acquire a text to be synthesized;
a text feature determination module 704, configured to acquire text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
a duration feature determination module 706, configured to input the text features into a preset duration prediction model and acquire the duration features corresponding to the text features;
a speech feature determination module 708, configured to input the text features and the duration features into a preset acoustic model and acquire the speech features corresponding to the text to be synthesized; and
a conversion module 710, configured to convert the speech features into speech and generate the target speech corresponding to the text to be synthesized.
As shown in Fig. 8, in one embodiment, the text feature determination module 704 further includes a preprocessing unit, configured to perform normalization processing on the text to be synthesized.
As shown in Fig. 8, in one embodiment, the text feature determination module 704 further includes: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and acquire the word segmentation features corresponding to the text to be synthesized; a polyphone feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model and acquire the polyphone features corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model and acquire the prosodic features corresponding to the text to be synthesized.
As shown in Fig. 9, in one embodiment, the apparatus further includes: a training acquisition module 703, configured to acquire a training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features; a duration training module 705, configured to train the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and a speech training module 707, configured to train the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
As shown in Fig. 10, in one embodiment, the training sample set further contains the word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts, and the apparatus includes: a word segmentation training module 7041, configured to train the word segmentation model with the training texts as its input and the word segmentation reference features as its output; a polyphone training module 7043, configured to train the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output; and a prosody training module 7045, configured to train the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
As shown in Fig. 11, in one embodiment, the apparatus further includes: a text acquisition module 709, configured to acquire multiple texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of acquiring the text features of the text to be synthesized; and a text prediction module 711, configured to add the text features corresponding to the multiple texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, acquire a preset number of text features from the feature queue and input them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
In one embodiment, the duration prediction model, the acoustic model, the word segmentation model, the polyphone prediction model and/or the prosody prediction model are BiLSTM models.
Fig. 12 shows an internal structure diagram of the smart terminal in an embodiment. The smart terminal may specifically be a terminal or a server. As shown in Fig. 12, the smart terminal includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the smart terminal stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech synthesis method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the speech synthesis method. Those skilled in the art will understand that the structure shown in Fig. 12 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
在一个实施例中,提出了一种智能终端,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行以下步骤:获取待合成文本;获取所述待合成文本的文本特征,所述文本特征包括分词特征、多音字特征和/或韵律特征中的至少一个;将所述文本特征输入预设的时长预测模型,获取与所述文本特征对应的时长特征;将所述文本特征和所述时长特征输入预设的声学模型,获取与所述待合成文本对应的语音特征;将所述语音特征转换成语音,生成与所述待合成文本对应的目标语音。In one embodiment, an intelligent terminal is proposed, including a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps: Synthesizing text; acquiring text features of the text to be synthesized, the text features including at least one of word segmentation features, polyphonic character features, and/or prosodic features; inputting the text features into a preset duration prediction model, and acquiring The duration feature corresponding to the text feature; input the text feature and the duration feature into a preset acoustic model to obtain the voice feature corresponding to the text to be synthesized; convert the voice feature into speech, and generate and The target voice corresponding to the text to be synthesized.
在一个实施例中,所述获取所述待合成文本的文本特征的步骤之前,还包括:对所述待合成文本进行正则化处理。In one embodiment, before the step of obtaining the text characteristics of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
在一个实施例中,所述获取所述待合成文本的文本特征的步骤还包括:将所述待合成文本输入预设的分词模型,获取与所述待合成文本对应的分词特征;将所述待合成文本和/或所述分词特征输入预设的多音字预测模型,获取所述待合成文本对应的多音字特征;将所述待合成文本和/或所述分词特征输入预设的韵律预测模型,获取所述待合成文本对应的韵律特征。In one embodiment, the step of obtaining the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation characteristics corresponding to the text to be synthesized; Input the pre-synthesized text and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; input the to-be-synthesized text and/or the word segmentation feature into the preset prosody prediction Model to obtain the prosodic features corresponding to the text to be synthesized.
在一个实施例中,所述方法还包括:获取训练样本集,所述训练样本集包含多个训练文本以及对应的文本参考特征、时长参考特征和/或语音参考特征;将所述训练文本对应的文本参考特征作为所述时长预测模型的输入,所述时长参考特征作为时长预测模型的输出,对所述时长预测模型进行训练。将所述文本参考特征和所述时长参考特征作为所述声学模型的输入,所述语音参考特征作为声学模型的输出,对所述声学模型进行训练。In one embodiment, the method further includes: obtaining a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features; and corresponding to the training text The text reference feature of is used as the input of the duration prediction model, and the duration reference feature is used as the output of the duration prediction model to train the duration prediction model. The text reference feature and the duration reference feature are used as the input of the acoustic model, and the speech reference feature is used as the output of the acoustic model, and the acoustic model is trained.
在一个实施例中,所述训练样本集还包含与所述多个训练文本对应的分词参考特征、多音字参考特征和/或韵律参考特征;所述方法包括:将所述训练文本作为所述分词模型的输入,所述分词参考特征作为分词模型的输出,对所述分词模型进行训练。将所述训练文本和/或所述分词参考特征作为所述多音字预测模型的输入,所述多音字参考特征作为多音字预测模型的输出,对所述多音字预测模型进行训练。将所述训练文本和/或所述分词参考特征作为所述韵律预测模型的输入,所述韵律参考特征作为韵律预测模型的输出,对所述韵律预测模型进行训练。In an embodiment, the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the The word segmentation model is input, and the word segmentation reference feature is used as the output of the word segmentation model, and the word segmentation model is trained. The training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained. The training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
In one embodiment, the method further comprises: obtaining a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, performing the step of obtaining the text features of the text to be synthesized; adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue; and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
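One way to realize this iterator-plus-queue arrangement is a producer thread that extracts features while a consumer drains the queue in fixed-size batches for the duration model. The sketch below assumes the "preset condition" is simply a batch-size threshold, and predict_batch is a hypothetical batched interface:

```python
import queue
import threading

def batched_duration_prediction(text_iterator, feature_extractor,
                                duration_model, batch_size=8):
    feature_queue = queue.Queue()

    def producer():
        # Extract text features for each text and add them to the queue.
        for text in text_iterator:
            feature_queue.put(feature_extractor(text))
        feature_queue.put(None)  # sentinel: no more texts

    threading.Thread(target=producer, daemon=True).start()

    batch = []
    while True:
        item = feature_queue.get()
        done = item is None
        if not done:
            batch.append(item)
        # The preset condition: a full batch, or the input is exhausted.
        if batch and (len(batch) == batch_size or done):
            yield duration_model.predict_batch(batch)
            batch = []
        if done:
            return
```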
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps: obtaining a text to be synthesized; obtaining text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphonic character feature, and a prosodic feature; inputting the text features into a preset duration prediction model to obtain duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model to obtain speech features corresponding to the text to be synthesized; and converting the speech features into speech to generate a target speech corresponding to the text to be synthesized.
In one embodiment, before the step of obtaining the text features of the text to be synthesized, the method further comprises: performing regularization processing on the text to be synthesized.
In one embodiment, the step of obtaining the text features of the text to be synthesized further comprises: inputting the text to be synthesized into a preset word segmentation model to obtain word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation features into a preset polyphonic character prediction model to obtain polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation features into a preset prosody prediction model to obtain prosodic features corresponding to the text to be synthesized.
In one embodiment, the method further comprises: obtaining a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features; training the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and training the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
In one embodiment, the training sample set further contains word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts. The method comprises: training the word segmentation model with the training texts as its input and the word segmentation reference features as its output; training the polyphonic character prediction model with the training texts and/or the word segmentation reference features as its input and the polyphonic character reference features as its output; and training the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosodic reference features as its output.
In one embodiment, the method further comprises: obtaining a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, performing the step of obtaining the text features of the text to be synthesized; adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue; and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
With the speech synthesis method, apparatus, terminal, and storage medium of the present invention, the speech synthesis process first obtains the text features of the text to be synthesized, including text features such as word segmentation features, polyphonic character features, and/or prosodic features; then inputs the text features into a preset duration prediction model to obtain the corresponding duration features; inputs the text features and the duration features into a preset acoustic model to obtain the corresponding speech features; and finally converts the speech features into speech to generate the target speech corresponding to the text to be synthesized. During feature extraction for speech synthesis, the text features considered include polyphonic character features and prosodic features, which are combined with the model-predicted duration features to obtain the speech features needed to synthesize the final speech. In other words, the speech synthesis method, apparatus, terminal, and storage medium provided by the present invention generate speech features from multiple text features together with duration features, making the synthesized speech more accurate, improving the accuracy of speech synthesis, and improving the user experience.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims (13)

  1. A speech synthesis method, characterized in that the method comprises:
    obtaining a text to be synthesized;
    obtaining text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphonic character feature, and a prosodic feature;
    inputting the text features into a preset duration prediction model to obtain duration features corresponding to the text features;
    inputting the text features and the duration features into a preset acoustic model to obtain speech features corresponding to the text to be synthesized;
    converting the speech features into speech to generate a target speech corresponding to the text to be synthesized.
  2. The method according to claim 1, characterized in that, before the step of obtaining the text features of the text to be synthesized, the method further comprises:
    performing regularization processing on the text to be synthesized.
  3. The method according to claim 1, characterized in that the step of obtaining the text features of the text to be synthesized further comprises:
    inputting the text to be synthesized into a preset word segmentation model to obtain word segmentation features corresponding to the text to be synthesized;
    inputting the text to be synthesized and/or the word segmentation features into a preset polyphonic character prediction model to obtain polyphonic character features corresponding to the text to be synthesized;
    inputting the text to be synthesized and/or the word segmentation features into a preset prosody prediction model to obtain prosodic features corresponding to the text to be synthesized.
  4. The method according to claim 1, characterized in that the method further comprises:
    obtaining a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features;
    training the duration prediction model with the text reference features corresponding to the training texts as the input of the duration prediction model and the duration reference features as the output of the duration prediction model;
    training the acoustic model with the text reference features and the duration reference features as the input of the acoustic model and the speech reference features as the output of the acoustic model.
  5. The method according to claim 3, characterized in that the training sample set further contains word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts;
    the method comprises:
    training the word segmentation model with the training texts as the input of the word segmentation model and the word segmentation reference features as the output of the word segmentation model;
    training the polyphonic character prediction model with the training texts and/or the word segmentation reference features as the input of the polyphonic character prediction model and the polyphonic character reference features as the output of the polyphonic character prediction model;
    training the prosody prediction model with the training texts and/or the word segmentation reference features as the input of the prosody prediction model and the prosodic reference features as the output of the prosody prediction model.
  6. The method according to claim 1, characterized in that the method further comprises:
    obtaining a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, performing the step of obtaining the text features of the text to be synthesized;
    adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue, and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting the preset number of text features into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  7. A speech synthesis apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain a text to be synthesized;
    a text feature determination module, configured to obtain text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphonic character feature, and a prosodic feature;
    a duration feature determination module, configured to input the text features into a preset duration prediction model and obtain duration features corresponding to the text features;
    a speech feature determination module, configured to input the text features and the duration features into a preset acoustic model and obtain speech features corresponding to the text to be synthesized;
    a conversion module, configured to convert the speech features into speech and generate a target speech corresponding to the text to be synthesized.
  8. The apparatus according to claim 7, characterized in that the text feature determination module further comprises:
    a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and obtain word segmentation features corresponding to the text to be synthesized;
    a polyphonic character feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset polyphonic character prediction model and obtain polyphonic character features corresponding to the text to be synthesized;
    a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model and obtain prosodic features corresponding to the text to be synthesized.
  9. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a training acquisition module, configured to obtain a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features;
    a duration training module, configured to train the duration prediction model with the text reference features corresponding to the training texts as the input of the duration prediction model and the duration reference features as the output of the duration prediction model;
    a speech training module, configured to train the acoustic model with the text reference features and the duration reference features as the input of the acoustic model and the speech reference features as the output of the acoustic model.
  10. The apparatus according to claim 8, characterized in that the training sample set further contains word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts, and the apparatus comprises:
    a word segmentation training module, configured to train the word segmentation model with the training texts as the input of the word segmentation model and the word segmentation reference features as the output of the word segmentation model;
    a polyphonic character training module, configured to train the polyphonic character prediction model with the training texts and/or the word segmentation reference features as the input of the polyphonic character prediction model and the polyphonic character reference features as the output of the polyphonic character prediction model;
    a prosody training module, configured to train the prosody prediction model with the training texts and/or the word segmentation reference features as the input of the prosody prediction model and the prosodic reference features as the output of the prosody prediction model.
  11. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a text acquisition module, configured to obtain a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of obtaining the text features of the text to be synthesized;
    a text prediction module, configured to add the text features corresponding to the plurality of texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtain a preset number of text features from the feature queue and input the preset number of text features into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  12. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the processor performs the steps of the method according to any one of claims 1 to 6.
  13. A smart terminal, comprising a memory and a processor, the memory storing a computer program, characterized in that, when the computer program is executed by the processor, the processor performs the steps of the method according to any one of claims 1 to 6.
PCT/CN2019/130766 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium WO2021134591A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980003388.1A CN111164674B (en) 2019-12-31 Speech synthesis method, device, terminal and storage medium
PCT/CN2019/130766 WO2021134591A1 (en) 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130766 WO2021134591A1 (en) 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2021134591A1 (en) 2021-07-08

Family

ID=70562373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130766 WO2021134591A1 (en) 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Country Status (1)

Country Link
WO (1) WO2021134591A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
US20160049144A1 (en) * 2014-08-18 2016-02-18 At&T Intellectual Property I, L.P. System and method for unified normalization in text-to-speech and automatic speech recognition
CN106507321A (en) * 2016-11-22 2017-03-15 新疆农业大学 The bilingual GSM message breath voice conversion broadcasting system of a kind of dimension, the Chinese
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN107039034A (en) * 2016-02-04 2017-08-11 科大讯飞股份有限公司 A kind of prosody prediction method and system
US20190172443A1 (en) * 2017-12-06 2019-06-06 International Business Machines Corporation System and method for generating expressive prosody for speech synthesis
CN110299131A (en) * 2019-08-01 2019-10-01 苏州奇梦者网络科技有限公司 A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion

Also Published As

Publication number Publication date
CN111164674A (en) 2020-05-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958274

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19958274

Country of ref document: EP

Kind code of ref document: A1