WO2021134591A1 - Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium - Google Patents

Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Info

Publication number
WO2021134591A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
synthesized
features
word segmentation
Prior art date
Application number
PCT/CN2019/130766
Other languages
French (fr)
Chinese (zh)
Inventor
李贤�
黄东延
丁万
张皓
白洛玉
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to CN201980003388.1A priority Critical patent/CN111164674B/en
Priority to PCT/CN2019/130766 priority patent/WO2021134591A1/en
Publication of WO2021134591A1 publication Critical patent/WO2021134591A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L2013/083 Special characters, e.g. punctuation marks

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to a speech synthesis method, device, intelligent terminal and computer readable storage medium.
  • Speech synthesis can convert text, etc. into natural speech output.
  • in the prior art, speech synthesis mostly uses statistical parametric synthesis: spectral characteristic parameters are modeled and a parametric synthesizer is generated to construct the mapping from the text sequence to speech, after which a statistical model generates the speech parameters (including fundamental frequency, formant frequencies, etc.) frame by frame; these parameters are then converted into the corresponding acoustic features, and finally the output speech is generated.
  • however, the result computed by the single sub-module corresponding to each step is not necessarily optimal, so the text cannot be accurately converted into speech suited to multi-language, multi-timbre scenarios, which degrades the overall quality of speech synthesis and greatly affects the user experience.
  • that is to say, the quality of the final synthesized speech is insufficient because the calculation result of a single sub-module is non-optimal.
  • a method of speech synthesis, including: obtaining the text to be synthesized;
  • acquiring text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature;
  • inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features;
  • inputting the text features and the duration features into a preset acoustic model and acquiring the voice features corresponding to the text to be synthesized;
  • converting the voice features into speech and generating the target speech corresponding to the text to be synthesized.
  • in one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
  • the step of acquiring the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model to obtain the prosodic features corresponding to the text to be synthesized.
  • the method further includes: obtaining a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features;
  • using the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model;
  • using the text reference feature and the duration reference feature as the input of the acoustic model and the voice reference feature as the output of the acoustic model to train the acoustic model.
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model, and training the word segmentation model.
  • the training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained.
  • the training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
  • the method further includes: obtaining a plurality of texts to be synthesized through a text iterator and performing, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized;
  • adding the text features corresponding to the texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is executed synchronously.
  • a speech synthesis device is provided.
  • a speech synthesis device includes:
  • the obtaining module is used to obtain the text to be synthesized
  • a text feature determination module configured to obtain text features of the text to be synthesized, where the text features include at least one of word segmentation features, polyphone features, and/or prosodic features;
  • a duration feature determining module configured to input the text feature into a preset duration prediction model, and obtain a duration feature corresponding to the text feature;
  • a voice feature determination module configured to input the text feature and the duration feature into a preset acoustic model, and obtain a voice feature corresponding to the text to be synthesized
  • the conversion module is used to convert the voice feature into a voice, and generate a target voice corresponding to the text to be synthesized.
  • the text feature determination module further includes: a preprocessing unit, configured to perform regularization processing on the text to be synthesized.
  • the text feature determination module further includes: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and obtain the word segmentation feature corresponding to the text to be synthesized; a polyphonic character feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model and obtain the polyphonic character feature corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model and obtain the prosodic feature corresponding to the text to be synthesized.
  • the device further includes: an acquiring training module, configured to acquire a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features;
  • a duration training module, configured to use the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model;
  • a voice training module, configured to use the text reference feature and the duration reference feature as the input of the acoustic model and the speech reference feature as the output of the acoustic model to train the acoustic model.
  • the training sample set further includes word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts
  • the device includes: a word segmentation training module, configured to use the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model to train the word segmentation model.
  • a polyphonic character training module, configured to use the training text and/or the word segmentation reference feature as the input of the polyphonic character prediction model and the polyphonic character reference feature as the output of the polyphonic character prediction model to train the polyphonic character prediction model.
  • the prosody training module is configured to use the training text and/or the word segmentation reference feature as the input of the prosody prediction model, and the prosody reference feature as the output of the prosody prediction model to train the prosody prediction model.
  • the device further includes: a text obtaining module, configured to obtain a plurality of texts to be synthesized through a text iterator and to perform, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized.
  • a text prediction module, configured to add the text features corresponding to the multiple texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, to obtain a preset number of text features from the feature queue and input the preset number of text features into the duration prediction model respectively, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  • an intelligent terminal is proposed.
  • An intelligent terminal includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • obtaining the text to be synthesized; acquiring text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature; inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model and acquiring the voice features corresponding to the text to be synthesized;
  • converting the voice features into speech and generating the target speech corresponding to the text to be synthesized.
  • a computer-readable storage medium is provided.
  • a computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • obtaining the text to be synthesized; acquiring text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature; inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model and acquiring the voice features corresponding to the text to be synthesized;
  • converting the voice features into speech and generating the target speech corresponding to the text to be synthesized.
  • after adopting the speech synthesis method, device, terminal and storage medium of the present invention, the speech synthesis process first acquires the text features of the text to be synthesized, including word segmentation features, polyphonic character features, and/or prosodic features; the text features are then input into a preset duration prediction model to obtain the corresponding duration features; the text features and the duration features are input into a preset acoustic model to obtain the corresponding voice features; finally, the voice features are converted into speech to generate the target speech corresponding to the text to be synthesized.
  • the text features taken into account include polyphonic character features and prosodic features and, combined with the duration features predicted by the model, yield the final voice features needed to synthesize speech. That is to say, the speech synthesis method, device, terminal and storage medium provided by the present invention take into account voice features generated from multiple text features together with duration features, so the synthesized speech is more accurate, the accuracy of speech synthesis is improved, and the user experience is improved.
  • FIG. 1 is an application environment diagram of a speech synthesis method in an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a speech synthesis method in an embodiment of this application;
  • FIG. 3 is a schematic flowchart of a process of obtaining text features of a text to be synthesized in an embodiment of this application;
  • FIG. 4 is a schematic flowchart of a method for training a duration prediction model and an acoustic model in an embodiment of this application;
  • FIG. 5 is a schematic flowchart of a method for training a word segmentation model, a polyphonic character prediction model, and/or a prosody prediction model in an embodiment of this application;
  • FIG. 6 is a flowchart of a speech synthesis method in an embodiment of this application;
  • FIG. 7 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 8 is a structural block diagram of a text feature determination module in an embodiment of this application;
  • FIG. 9 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 10 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 11 is a structural block diagram of a speech synthesis device in an embodiment of this application;
  • FIG. 12 is a structural block diagram of a computer device that executes the aforementioned speech synthesis method in an embodiment of this application.
  • Fig. 1 is an application environment diagram of a method for speech synthesis in an embodiment.
  • the speech synthesis system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be a terminal device such as a PC, a mobile phone, a tablet computer, or a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to obtain the text to be synthesized, and the server 120 is used to analyze and process the text to be synthesized, and synthesize the target speech corresponding to the text to be synthesized.
  • the execution of the aforementioned speech synthesis-based method can also be based on a terminal device, which can obtain the text to be synthesized, and can also analyze the text to be synthesized and synthesize the target speech corresponding to the text to be synthesized.
  • this embodiment is applied to the terminal as an example.
  • a method for speech synthesis is provided.
  • the speech synthesis method specifically includes the following steps S102-S110:
  • Step S102 Obtain the text to be synthesized.
  • the text to be synthesized is text information that requires speech synthesis, that is, a text message that needs to be converted into speech.
  • the text to be synthesized could be "Since that moment, she will no longer be arrogant."
  • the above-mentioned text to be synthesized may be obtained by directly inputting text information, or may be obtained by scanning and recognizing the text information through a camera or the like.
  • Step S104 Acquire text features of the text to be synthesized, where the text features include at least one of word segmentation features, polyphone features, and/or prosodic features.
  • the text feature is a feature reflecting the regularities of the text information in the text to be synthesized; it may be one or more of a word segmentation feature, a polyphonic character feature, and a prosodic feature.
  • the word segmentation feature is a phrase feature obtained by classifying the words that make up the text to be synthesized, which can be nouns, verbs, prepositions, adjectives, etc.
  • a polyphonic character feature marks a character or word in the text to be synthesized that has multiple pronunciations; because pronunciation serves to distinguish part of speech and word meaning, the same character is pronounced differently under different usage conditions or environments.
  • Prosodic feature is a kind of prosodic structure of language, which is closely related to syntax, text structure, information structure and other linguistic structures.
  • Prosodic features are typical features of natural language, and are common features of different languages, such as: pitch down, accent, pause, etc.
  • Prosodic features can be divided into three main aspects, intonation, temporal distribution, and stress, which are realized through suprasegmental features.
  • Suprasegmental features include pitch, intensity, and duration characteristics, and are carried by phonemes or groups of phonemes.
  • Prosodic features are important forms of language and corresponding emotional expression.
  • the text to be synthesized can also be preprocessed to avoid some minor influences (such as format problems) causing deviations in the output text characteristics.
  • regularization processing is performed on the text to be synthesized.
  • the regularization process normalizes the text to be synthesized, converting the language text into a preset form of language text.
  • the normalization of the text to be synthesized also includes converting content such as numbers and symbols in the text to be synthesized into Chinese, so as to facilitate the subsequent extraction of word segmentation features, polyphonic character features, and/or prosodic features and to reduce feature extraction errors.
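To illustrate this normalization step, here is a minimal Python sketch that converts ASCII digits in a sentence into Chinese numerals. The digit map and rule are illustrative assumptions, not the patent's implementation; real regularization would also cover symbols, dates, units, and similar formats.

```python
import re

# Hypothetical digit map for illustration; a production text normalizer
# would also handle symbols, dates, units, ordinals, and so on.
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def regularize(text: str) -> str:
    """Replace every ASCII digit in the text with its Chinese numeral."""
    return re.sub(r"\d", lambda m: DIGITS[m.group(0)], text)

print(regularize("他买了3本书"))  # -> 他买了三本书
```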
  • the text features of the text to be synthesized may be acquired by inputting the text to be synthesized into a preset neural network model, which calculates the corresponding text features according to its algorithm, or by extracting the corresponding text features from the text to be synthesized according to a preset feature extraction algorithm.
  • the process of obtaining the text features of the text to be synthesized through the neural network model is described.
  • in FIG. 3, a schematic flowchart of the process of acquiring the text features of the text to be synthesized is given.
  • the foregoing process of obtaining the text features of the text to be synthesized includes steps S1041-S1043 as shown in FIG. 3.
  • Step S1041: Input the text to be synthesized into the preset word segmentation model and obtain the word segmentation features corresponding to the text to be synthesized, where the word segmentation features indicate where the text to be synthesized should be segmented or broken, thereby determining the word segmentation feature corresponding to the segmentation result of the text to be synthesized;
  • Step S1042 Input the text to be synthesized and/or the feature of word segmentation into the preset polyphonic character prediction model, and obtain the feature of the polyphonic character corresponding to the text to be synthesized;
  • Step S1043 Input the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model, and obtain the prosody feature corresponding to the text to be synthesized.
  • the word segmentation model is a neural network model that performs word segmentation processing on the text to be synthesized to obtain word segmentation features; through the word segmentation model, the word segmentation features of the text to be synthesized can be predicted. The word segmentation feature is determined by the word vectors obtained from segmentation.
  • the word vector is a vector corresponding to the word or phrase divided according to the word segmentation model, and is used to determine the word segmentation feature of the text to be synthesized.
  • the polyphonic character prediction model can predict the polyphonic character feature in the text to be synthesized or the word segmentation feature, and can be a neural network model.
  • the prosody prediction model is a neural network model that predicts the prosody features of the text to be synthesized or the word segmentation features, and can predict the prosody features of the text to be synthesized, such as prosodic word features, prosodic phrase features, and intonation phrase features.
  • the text features of the text to be synthesized are not limited to the word segmentation features, polyphonic character features, and prosodic features described in this embodiment; other features may also be involved, such as features describing the correlation between preceding and following words.
  • the user can also establish the structure of a general neural network model by building a computation graph and selecting the input data.
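To make the three-model pipeline of steps S1041-S1043 concrete, the following Python sketch chains the preset models in order. The model objects and their predict methods are hypothetical placeholders, since the patent does not specify their interfaces.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TextFeatures:
    segmentation: Any  # word segmentation feature (step S1041)
    polyphones: Any    # polyphonic character feature (step S1042)
    prosody: Any       # prosodic feature (step S1043)

def extract_text_features(text, seg_model, poly_model, prosody_model):
    """Run the preset models in the order described in steps S1041-S1043."""
    seg = seg_model.predict(text)            # S1041: segment the text
    poly = poly_model.predict(text, seg)     # S1042: text and/or segmentation
    pros = prosody_model.predict(text, seg)  # S1043: text and/or segmentation
    return TextFeatures(seg, poly, pros)
```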
  • Step S106 Input the text feature into a preset duration prediction model, and obtain a duration feature corresponding to the text feature.
  • the duration feature is the time length corresponding to each phoneme-level text feature contained in the text to be synthesized.
  • the preset duration prediction model is a neural network model that predicts the time length corresponding to each phoneme-level text feature; it determines the duration of each phoneme contained in the text to be synthesized. This includes the process of converting Pinyin into phonemes: the polyphonic character prediction model first gives the pronunciation of a character (such as ou3), the pronunciation is then converted into phonemes, and the duration prediction model finally predicts the duration of those phonemes.
  • when the pronunciation is converted into phonemes, a pronunciation such as ou3 in "I am in China" can be converted into the single phoneme ou, while the pronunciation guo2 of "国" ("guo") can be converted into the two phonemes g and uo.
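A minimal sketch of this Pinyin-to-phoneme split, assuming the tone digit has already been stripped and using a hand-written list of initials; a full system would rely on a complete Pinyin syllable table.

```python
# Longest initials first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(syllable: str) -> list:
    """Split a toneless Pinyin syllable into its initial and final."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable yields a single phoneme

print(pinyin_to_phonemes("guo"))  # ['g', 'uo']
print(pinyin_to_phonemes("ou"))   # ['ou']
```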
  • Step S108 Input the text feature and the duration feature into a preset acoustic model, and obtain a voice feature corresponding to the text to be synthesized.
  • voice features are features generated based on text features and duration features, and voice features include features such as sound intensity, loudness, pitch, and/or pitch period.
  • sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of sound-wave propagation; loudness reflects the subjectively perceived strength of the sound; pitch reflects the subjectively perceived frequency of the sound; the pitch period is the quasi-period of the voiced waveform during pronunciation, reflecting the time interval between two successive openings and closings of the glottis, i.e., the frequency of glottal opening and closing.
  • the text feature obtained in step S104 and the duration feature obtained in step S106 are input into a preset acoustic model, and the voice feature corresponding to the text to be synthesized is obtained through the acoustic model.
  • the aforementioned acoustic model for predicting voice features is a neural network model, and the acoustic model has the ability to calculate corresponding voice features based on text features and duration features through prior training.
  • Step S110 Convert the voice feature into a voice, and generate a target voice corresponding to the text to be synthesized.
  • the target speech is the speech generated by the text to be synthesized.
  • the voice features can be synthesized by a vocoder; the vocoder outputs the speech corresponding to the voice features, with the corresponding speech durations, to obtain the target speech.
  • the vocoder can be a parallel WaveNet vocoder.
  • with the voice features as input, the speech corresponding to the text to be synthesized is synthesized through the preset vocoder, and the corresponding target speech is output.
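The patent names a parallel WaveNet vocoder, which requires a trained neural network; as a self-contained stand-in for the features-to-waveform step, this sketch inverts a mel spectrogram with Griffin-Lim via librosa. This is a substitute technique for illustration, not the patent's vocoder, and the dummy spectrogram only keeps the example runnable.

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_wave(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
                hop_length: int = 256) -> np.ndarray:
    """Invert a mel spectrogram to a waveform using Griffin-Lim."""
    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)

# In practice mel would come from the acoustic model; a dummy array is
# used here so the snippet runs on its own.
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)
sf.write("target.wav", mel_to_wave(mel), 22050)
```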
  • the aforementioned duration prediction model and acoustic model can make good predictions about the relevant features of the text to be synthesized only after being trained on training data. In other words, before the text features are used to predict the corresponding duration features and voice features, the duration prediction model and the acoustic model need to be trained so that they can accurately predict the duration features and voice features corresponding to the text features.
  • the above-mentioned speech synthesis method further includes steps S1101-S1103 as shown in FIG. 4.
  • Step S1101 Obtain a training sample set, the training sample set including multiple training texts and corresponding text reference features, duration reference features, and/or voice reference features;
  • Step S1102 Use the text reference feature corresponding to the training text as the input of the duration prediction model, and the duration reference feature as the output of the duration prediction model, and train the duration prediction model;
  • Step S1103 Use the text reference feature and the duration reference feature as the input of the acoustic model, and the speech reference feature as the output of the acoustic model, and train the acoustic model.
  • the duration reference feature is the duration feature expected to correspond to the training text, and the speech reference feature is the voice feature expected to correspond to the training text.
  • the duration prediction model and the acoustic model are trained through the pre-training sample set, so that the model has the ability to accurately predict the duration characteristics and voice characteristics corresponding to the text to be synthesized.
  • the text reference feature and duration reference feature corresponding to the training text are used as input, the corresponding voice reference feature is used as output, and the preset acoustic model is trained so that the acoustic model has the voice feature prediction function.
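As a hedged illustration of this input/output training setup, here is a minimal PyTorch loop for the duration prediction model; the synthetic tensors, shapes, and the small network are assumptions for demonstration, not the patent's architecture. The same pattern applies to the acoustic model, with text-plus-duration features as input and voice features as the target.

```python
import torch
from torch import nn

# Toy stand-ins: 32 training items, 20 phonemes each, 64-dim text features.
text_feats = torch.randn(32, 20, 64)  # text reference features (input)
dur_refs = torch.rand(32, 20, 1)      # duration reference features (target)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    optimizer.zero_grad()
    pred = model(text_feats)          # predicted phoneme durations
    loss = loss_fn(pred, dur_refs)    # compare against reference durations
    loss.backward()
    optimizer.step()
```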
  • model training is also performed for each model involved in text feature prediction, specifically including training of the word segmentation model, the polyphonic character prediction model, and the prosody prediction model.
  • the word segmentation model, polyphonic character prediction model, and prosody prediction model involved in text feature prediction are trained on the training sample set, so that these models can respectively predict text features such as word segmentation features, polyphonic character features, and prosodic features from the text to be synthesized.
  • the above-mentioned speech synthesis method further includes steps S2101-S2103 as shown in FIG. 5:
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the plurality of training texts; the method includes:
  • Step S2101 The training text is used as the input of the word segmentation model, and the word segmentation reference feature is used as the output of the word segmentation model, and the word segmentation model is trained.
  • Step S2102 Use the training text and/or the word segmentation reference feature as the input of the polyphone prediction model, and the polyphone reference feature as the output of the polyphone prediction model, and train the polyphone prediction model.
  • Step S2103 Use the training text and/or the word segmentation reference feature as the input of the prosody prediction model, and the prosody reference feature as the output of the prosody prediction model, and train the prosody prediction model.
  • the training sample set may also include multiple training texts and word segmentation features, polyphonic character features, and prosodic features that are expected to be output by the model.
  • the word segmentation reference feature is the word segmentation feature that the word segmentation model is expected to output for the training text;
  • the polyphone reference feature is the polyphonic character feature that the polyphonic character prediction model is expected to output for the training text and the corresponding word segmentation feature;
  • the prosodic reference feature is the prosodic feature that the prosody prediction model is expected to output for the training text and the corresponding word segmentation feature.
  • for each training text contained in the training data set, the training text is used as input, the corresponding word segmentation reference feature is used as output, and the preset word segmentation model is trained so that the word segmentation model has the word segmentation feature prediction function.
  • the training text and the corresponding word segmentation reference feature are used as input, the corresponding polyphonic character reference feature is used as output, and the preset polyphonic character prediction model is trained so that it has the function of predicting polyphonic character features.
  • the training text and the corresponding word segmentation reference feature are used as input, the corresponding prosody reference feature is used as output, and the preset prosody prediction model is trained so that it has the prosodic feature prediction function.
  • the word segmentation model, the polyphonic character prediction model, and the prosody prediction model are trained through pre-processed data, so that the model can accurately predict the word segmentation feature, the polyphonic character feature, and the prosody feature corresponding to the training text.
  • multiple texts to be synthesized can be obtained at the same time, and text characteristics corresponding to each text to be synthesized can be obtained.
  • the text features corresponding to the multiple texts to be synthesized are filtered, sorted, and placed into a preset feature queue; a preset number of text features are then obtained from the feature queue and input into the duration prediction model and the acoustic model for prediction to obtain the corresponding features.
  • the steps of generating text features corresponding to each text to be synthesized and predicting a preset number of text features are performed simultaneously.
  • the above-mentioned speech synthesis method further includes steps S3101-S3102 as shown in FIG. 6:
  • Step S3101 Obtain a plurality of texts to be synthesized through a text iterator, and for each text to be synthesized, respectively execute the step of obtaining the text characteristics of the text to be synthesized;
  • Step S3102: Add the text features corresponding to the multiple texts to be synthesized to a preset feature queue; when the feature queue meets a preset condition, obtain a preset number of text features from the feature queue and input the preset number of text features respectively into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  • the text iterator is used to obtain continuous data in multiple texts to be synthesized and corresponding text features.
  • text features can thus be produced continuously and iterated over across the multiple processes of acquiring text features from the texts to be synthesized.
  • the feature queue contains multiple text features.
  • the preset condition is the condition for inputting text features into the duration prediction model; it can be the number of text features reaching a certain value, or a preset acquisition time for the text features.
  • the preset number is the number of text features output by the feature queue, which can be a fixed value or a value that changes according to a certain rule.
  • a feature queue and a text iterator are added to process multiple texts to be synthesized, so that text conversion of the texts to be synthesized is more effective and faster, improving the efficiency of speech synthesis and model training.
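A sketch of this iterator-plus-queue arrangement using Python's standard library; the batch size, sentinel value, and stand-in feature extractor are illustrative assumptions.

```python
import queue
import threading

feature_queue = queue.Queue()
BATCH_SIZE = 4  # the "preset number" of text features

def producer(texts, extract):
    """Text iterator: extract features for each text and enqueue them."""
    for text in texts:
        feature_queue.put(extract(text))
    feature_queue.put(None)  # sentinel: no more texts

def consumer(predict_duration):
    """Drain the queue in batches; duration prediction runs while the
    producer keeps generating features for the remaining texts."""
    batch = []
    while True:
        item = feature_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == BATCH_SIZE:  # preset condition met
            for feats in batch:
                predict_duration(feats)
            batch.clear()
    for feats in batch:  # flush any remainder
        predict_duration(feats)

texts = ["文本一", "文本二", "文本三", "文本四", "文本五"]
t1 = threading.Thread(target=producer, args=(texts, lambda t: t))
t2 = threading.Thread(target=consumer, args=(print,))
t1.start(); t2.start(); t1.join(); t2.join()
```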
  • the duration prediction model, the acoustic model, the word segmentation model, the polyphonic character prediction model, and/or the prosody prediction model are neural network models.
  • they are bidirectional long short-term memory network models (BiLSTM, Bi-directional Long Short-Term Memory).
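For reference, a minimal BiLSTM module as it might be written in PyTorch; the layer sizes are illustrative assumptions, since the patent does not specify them.

```python
import torch
from torch import nn

class BiLSTMTagger(nn.Module):
    """BiLSTM mapping a feature sequence to a per-step prediction."""
    def __init__(self, in_dim=64, hidden=128, out_dim=1):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):
        out, _ = self.lstm(x)  # (batch, seq, 2 * hidden)
        return self.head(out)  # (batch, seq, out_dim)

y = BiLSTMTagger()(torch.randn(2, 20, 64))
print(y.shape)  # torch.Size([2, 20, 1])
```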
  • as shown in FIG. 7, in one embodiment, a speech synthesis device is provided, and the device includes:
  • the obtaining module 702 is used to obtain the text to be synthesized
  • the text feature determining module 704 is configured to obtain text features of the text to be synthesized, where the text features include at least one of word segmentation features, polyphone features, and/or prosodic features;
  • the duration feature determining module 706 is configured to input the text feature into a preset duration prediction model, and obtain the duration feature corresponding to the text feature;
  • a voice feature determining module 708, configured to input the text feature and the duration feature into a preset acoustic model, and obtain a voice feature corresponding to the text to be synthesized;
  • the conversion module 710 is configured to convert the voice feature into a voice, and generate a target voice corresponding to the text to be synthesized.
  • the text feature determination module 704 further includes: a preprocessing unit, configured to perform regularization processing on the text to be synthesized.
  • the text feature determination module 704 further includes: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and obtain the word segmentation feature corresponding to the text to be synthesized; a polyphonic character feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model and obtain the polyphonic character feature corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model and obtain the prosodic feature corresponding to the text to be synthesized.
  • the device further includes: an acquiring training module 703, configured to acquire a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features; a duration training module 705, configured to use the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model; and a voice training module 707, configured to use the text reference feature and the duration reference feature as the input of the acoustic model and the speech reference feature as the output of the acoustic model to train the acoustic model.
  • the training sample set further includes word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts
  • the device includes: a word segmentation training module 7041, configured to use the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model to train the word segmentation model.
  • a polyphonic character training module 7043, configured to use the training text and/or the word segmentation reference feature as the input of the polyphonic character prediction model and the polyphonic character reference feature as the output of the polyphonic character prediction model to train the polyphonic character prediction model.
  • the prosody training module 7045 is configured to use the training text and/or the word segmentation reference feature as the input of the prosody prediction model, and the prosody reference feature as the output of the prosody prediction model to train the prosody prediction model.
  • the device further includes: a text obtaining module 709, configured to obtain a plurality of texts to be synthesized through a text iterator and to perform, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized; and a text prediction module 711, configured to add the text features corresponding to the multiple texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, to obtain a preset number of text features from the feature queue and input them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
  • the duration prediction model, acoustic model, word segmentation model, polyphonic character prediction model, and/or prosody prediction model are BiLSTM models.
  • Fig. 12 shows an internal structure diagram of a smart terminal in an embodiment.
  • the smart terminal may specifically be a terminal or a server.
  • the smart terminal includes a processor, a memory and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the smart terminal stores an operating system and may also store a computer program.
  • the processor can implement the speech synthesis method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the speech synthesis method.
  • FIG. 12 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • an intelligent terminal is provided, including a memory and a processor, the memory storing a computer program; when the computer program is executed by the processor, the processor executes the following steps: obtaining the text to be synthesized; acquiring text features of the text to be synthesized, the text features including at least one of word segmentation features, polyphonic character features, and/or prosodic features; inputting the text features into a preset duration prediction model and acquiring the duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model to obtain the voice features corresponding to the text to be synthesized; and converting the voice features into speech to generate the target speech corresponding to the text to be synthesized.
  • in one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
  • the step of acquiring the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model to obtain the prosodic features corresponding to the text to be synthesized.
  • the method further includes: obtaining a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features; using the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model.
  • the text reference feature and the duration reference feature are used as the input of the acoustic model, and the speech reference feature is used as the output of the acoustic model, and the acoustic model is trained.
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model, and training the word segmentation model.
  • the training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained.
  • the training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
  • the method further includes: obtaining a plurality of texts to be synthesized through a text iterator and performing, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized;
  • adding the text features corresponding to the texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is executed synchronously.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the processor executes the following steps: obtaining the text to be synthesized; acquiring the text features of the text to be synthesized, where the text features include at least one of a word segmentation feature, a polyphonic character feature, and/or a prosodic feature; inputting the text features into a preset duration prediction model to obtain the duration features corresponding to the text features;
  • inputting the text features and the duration features into a preset acoustic model to obtain the voice features corresponding to the text to be synthesized; and converting the voice features into speech to generate the target speech corresponding to the text to be synthesized.
  • in one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
  • the step of acquiring the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation feature into a preset prosody prediction model to obtain the prosodic features corresponding to the text to be synthesized.
  • the method further includes: obtaining a training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features; using the text reference feature corresponding to the training text as the input of the duration prediction model and the duration reference feature as the output of the duration prediction model to train the duration prediction model.
  • the text reference feature and the duration reference feature are used as the input of the acoustic model, and the speech reference feature is used as the output of the acoustic model, and the acoustic model is trained.
  • the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the input of the word segmentation model and the word segmentation reference feature as the output of the word segmentation model, and training the word segmentation model.
  • the training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained.
  • the training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
  • the method further includes: obtaining a plurality of texts to be synthesized through a text iterator and performing, for each text to be synthesized, the step of acquiring the text features of the text to be synthesized;
  • adding the text features corresponding to the texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model respectively, so that the step of acquiring the duration features corresponding to the text features is executed synchronously.
  • after adopting the speech synthesis method, device, terminal and storage medium of the present invention, the speech synthesis process first acquires the text features of the text to be synthesized, including word segmentation features, polyphonic character features, and/or prosodic features; the text features are then input into a preset duration prediction model to obtain the corresponding duration features; the text features and the duration features are input into a preset acoustic model to obtain the corresponding voice features; finally, the voice features are converted into speech to generate the target speech corresponding to the text to be synthesized.
  • the text features taken into account include polyphonic character features and prosodic features and, combined with the duration features predicted by the model, yield the final voice features needed to synthesize speech. That is to say, the speech synthesis method, device, terminal and storage medium provided by the present invention take into account voice features generated from multiple text features together with duration features, so the synthesized speech is more accurate, the accuracy of speech synthesis is improved, and the user experience is improved.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

A speech synthesis method, a speech synthesis apparatus, a smart terminal and a storage medium. The method comprises: acquiring a text to be synthesized (S102); acquiring text features of the text to be synthesized, wherein the text features comprise at least one of a word segmentation feature, a polyphone feature and/or a prosodic feature (S104); inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features (S106); inputting the text features and the duration features into a preset acoustic model, and acquiring the speech features corresponding to the text to be synthesized (S108); and converting the speech features to speech, and generating target speech corresponding to the text to be synthesized (S110). According to the speech synthesis method, the speech features generated by various text features and duration features are considered, such that the synthesized speech is more accurate, thereby improving the speech synthesis accuracy, and improving the user experience.

Description

语音合成方法、装置、终端及存储介质Speech synthesis method, device, terminal and storage medium 技术领域Technical field
本发明涉及人工智能技术领域,尤其涉及一种语音合成方法、装置、智能终端及计算机可读存储介质。The present invention relates to the field of artificial intelligence technology, in particular to a speech synthesis method, device, intelligent terminal and computer readable storage medium.
背景技术Background technique
随着移动互联网和人工智能技术的快速发展,语音播报、听小说、听新闻、智能交互等一系列语音合成的场景越来越多。语音合成可以将文本等转换成自然语音输出。With the rapid development of mobile Internet and artificial intelligence technology, there are more and more scenarios for speech synthesis such as voice broadcasting, listening to novels, listening to news, and intelligent interaction. Speech synthesis can convert text, etc. into natural speech output.
技术问题technical problem
In the prior art, speech synthesis mostly uses statistical parametric synthesis: spectral characteristic parameters are modeled and a parametric synthesizer is generated to construct the mapping from text sequences to speech; a statistical model then produces the speech parameters for each moment (including fundamental frequency, formant frequencies, etc.), these parameters are converted into the corresponding speech features, and finally the output speech is generated. However, in this method, the result computed by the single sub-module at each step is not necessarily optimal, so the text cannot be accurately converted into speech suited to multi-language, multi-timbre scenarios. This degrades the overall quality of the synthesized speech and greatly affects the user experience.
In other words, in the above speech synthesis scheme, the quality of the finally synthesized speech is insufficient because the results computed by single sub-modules are not optimal.
Technical solution
On this basis, it is necessary to propose a speech synthesis method, an apparatus, a smart terminal and a computer-readable storage medium to address the above problems.
In a first aspect of the present invention, a speech synthesis method is proposed.
A speech synthesis method, comprising:
acquiring a text to be synthesized;
acquiring text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features;
inputting the text features and the duration features into a preset acoustic model, and acquiring speech features corresponding to the text to be synthesized; and
converting the speech features into speech, and generating target speech corresponding to the text to be synthesized.
In one embodiment, before the step of acquiring the text features of the text to be synthesized, the method further comprises: performing normalization processing on the text to be synthesized.
In one embodiment, the step of acquiring the text features of the text to be synthesized further comprises: inputting the text to be synthesized into a preset word segmentation model, and acquiring word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model, and acquiring polyphone features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation features into a preset prosody prediction model, and acquiring prosodic features corresponding to the text to be synthesized.
In one embodiment, the method further comprises: acquiring a training sample set, the training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features; training the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and training the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
In one embodiment, the training sample set further contains word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts, and the method comprises: training the word segmentation model with the training texts as its input and the word segmentation reference features as its output; training the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output; and training the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
In one embodiment, the method further comprises: acquiring a plurality of texts to be synthesized through a text iterator, and performing the step of acquiring the text features of the text to be synthesized for each text to be synthesized; adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue; and when the feature queue meets a preset condition, acquiring a preset number of text features from the feature queue and inputting them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
In a second aspect of the present invention, a speech synthesis apparatus is proposed.
A speech synthesis apparatus, comprising:
an acquisition module, configured to acquire a text to be synthesized;
a text feature determination module, configured to acquire text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
a duration feature determination module, configured to input the text features into a preset duration prediction model and acquire duration features corresponding to the text features;
a speech feature determination module, configured to input the text features and the duration features into a preset acoustic model and acquire speech features corresponding to the text to be synthesized; and
a conversion module, configured to convert the speech features into speech and generate target speech corresponding to the text to be synthesized.
In one embodiment, the text feature determination module further comprises a preprocessing unit, configured to perform normalization processing on the text to be synthesized.
In one embodiment, the text feature determination module further comprises: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and acquire the word segmentation features corresponding to the text to be synthesized; a polyphone feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model and acquire the polyphone features corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model and acquire the prosodic features corresponding to the text to be synthesized.
In one embodiment, the apparatus further comprises: a training acquisition module, configured to acquire a training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features; a duration training module, configured to train the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and a speech training module, configured to train the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
In one embodiment, the training sample set further contains word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts, and the apparatus comprises: a word segmentation training module, configured to train the word segmentation model with the training texts as its input and the word segmentation reference features as its output; a polyphone training module, configured to train the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output; and a prosody training module, configured to train the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
In one embodiment, the apparatus further comprises: a text acquisition module, configured to acquire a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of acquiring the text features of the text to be synthesized; and a text prediction module, configured to add the text features corresponding to the plurality of texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, acquire a preset number of text features from the feature queue and input them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
In a third aspect of the present invention, a smart terminal is proposed.
A smart terminal, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a text to be synthesized;
acquiring text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features;
inputting the text features and the duration features into a preset acoustic model, and acquiring speech features corresponding to the text to be synthesized; and
converting the speech features into speech, and generating target speech corresponding to the text to be synthesized.
In a fourth aspect of the present invention, a computer-readable storage medium is proposed.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a text to be synthesized;
acquiring text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
inputting the text features into a preset duration prediction model, and acquiring duration features corresponding to the text features;
inputting the text features and the duration features into a preset acoustic model, and acquiring speech features corresponding to the text to be synthesized; and
converting the speech features into speech, and generating target speech corresponding to the text to be synthesized.
Beneficial effects
Implementing the embodiments of the present invention has the following beneficial effects:
With the speech synthesis method, apparatus, terminal and storage medium of the present invention, during speech synthesis the text features of the text to be synthesized are first acquired, the text features including word segmentation features, polyphone features and/or prosodic features; the text features are then input into a preset duration prediction model to acquire the corresponding duration features; the text features and the duration features are input into a preset acoustic model to acquire the corresponding speech features; and finally the speech features are converted into speech to generate the target speech corresponding to the text to be synthesized. During feature extraction for speech synthesis, the text features considered include polyphone features and prosodic features, and these are combined with the duration features predicted by the model to obtain the speech features needed for the final synthesis. That is, the speech synthesis method, apparatus, terminal and storage medium provided by the present invention take into account speech features generated from multiple text features and duration features, so that the synthesized speech is more accurate, which improves speech synthesis accuracy and the user experience.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Among them:
Fig. 1 is an application environment diagram of the speech synthesis method in an embodiment of this application;
Fig. 2 is a schematic flowchart of the speech synthesis method in an embodiment of this application;
Fig. 3 is a schematic flowchart of a process of acquiring the text features of the text to be synthesized in an embodiment of this application;
Fig. 4 is a schematic flowchart of a method for training the duration prediction model and the acoustic model in an embodiment of this application;
Fig. 5 is a schematic flowchart of the word segmentation model, the polyphone prediction model and/or the prosody prediction model in an embodiment of this application;
Fig. 6 is a flowchart of the speech synthesis method in an embodiment of this application;
Fig. 7 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 8 is a structural block diagram of the text feature determination module in an embodiment of this application;
Fig. 9 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 10 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 11 is a structural block diagram of the speech synthesis apparatus in an embodiment of this application;
Fig. 12 is a structural block diagram of a computer device that executes the aforementioned speech synthesis method in an embodiment of this application.
Embodiments of the present invention
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
Fig. 1 is an application environment diagram of the speech synthesis method in an embodiment. Referring to Fig. 1, the speech synthesis method is applied to a speech synthesis system. The speech synthesis system includes a terminal 110 and a server 120, connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be a device such as a PC, a mobile phone, a tablet computer or a notebook computer. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 is used to acquire the text to be synthesized, and the server 120 is used to analyze and process the text to be synthesized and synthesize the corresponding target speech.
In another embodiment, the above speech synthesis method may also be executed on a single terminal device, which can both acquire the text to be synthesized and analyze it to synthesize the corresponding target speech.
Since the method can be applied to either a terminal or a server, and the specific speech synthesis process is the same in both cases, this embodiment takes application to a terminal as an example.
As shown in Fig. 2, in one embodiment, a speech synthesis method is provided. The method specifically includes the following steps S102-S110:
Step S102: acquire the text to be synthesized.
Specifically, the text to be synthesized is the text information on which speech synthesis is to be performed, for example text that needs to be converted into speech in scenarios such as voice chat robots or reading news aloud. As an example, the text to be synthesized could be "自从那一刻起，她便不再妄自菲薄。" ("Since that moment, she no longer belittled herself.").
The text to be synthesized may be obtained directly as input text, or obtained by scanning and recognition through a camera or the like.
Step S104: acquire text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature.
Specifically, the text features are features that capture the linguistic regularities of the text information in the text to be synthesized.
In a specific embodiment, the text feature may be one of a word segmentation feature, a polyphone feature and a prosodic feature.
The word segmentation feature is a phrase-level feature obtained by classifying the words that make up the text to be synthesized, such as nouns, verbs, prepositions and adjectives.
The polyphone feature covers characters or words in the text to be synthesized that have multiple pronunciations; since pronunciation distinguishes part of speech and word meaning, the same character is pronounced differently in different usages or contexts.
Prosodic features describe the prosodic structure of language, which is closely related to other linguistic structures such as syntax, discourse structure and information structure. Prosodic features are typical of natural language and common across languages, for example pitch declination, stress and pauses. They can be divided into three main aspects, intonation, temporal distribution and stress, realized through suprasegmental features. Suprasegmental features include pitch, intensity and timing characteristics, carried by phonemes or groups of phonemes. Prosodic features are an important vehicle for language and the corresponding emotional expression.
Before acquiring the text features of the text to be synthesized, the text may first be preprocessed, so that minor issues (such as formatting problems) do not bias the output text features.
In one embodiment, before the text features of the text to be synthesized are acquired, normalization processing is performed on the text to be synthesized.
Here, normalization standardizes the text to be synthesized, converting it into a preset form; for example, for English it handles letter case and can remove punctuation as needed, so that issues such as text format do not bias the output text features. In another specific embodiment, normalizing the text to be synthesized also includes converting numbers, symbols and similar tokens in the text into Chinese, which facilitates the subsequent extraction of word segmentation features, polyphone features and/or prosodic features and reduces feature extraction errors.
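As a concrete illustration of the normalization described above, a minimal Python sketch might look as follows; the digit-to-Chinese reading table and the punctuation policy are illustrative assumptions rather than rules prescribed by this application:

    import re

    # Illustrative digit-to-Chinese table; a real normalizer would also
    # cover multi-digit numbers, dates, units and symbol readings.
    _DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
               "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

    def normalize_text(text: str) -> str:
        """Regularize raw input text before feature extraction."""
        text = text.lower()                                 # unify letter case
        text = "".join(_DIGITS.get(ch, ch) for ch in text)  # spell out digits
        text = re.sub(r"[^\w\s]", "", text)                 # drop punctuation marks
        return text.strip()

    print(normalize_text("GDP增长8%!"))   # -> "gdp增长八"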
The text features of the text to be synthesized may be obtained by inputting the text into a preset neural network model, which computes the corresponding text features according to its algorithm, or by extracting the corresponding text features from the text according to a preset feature extraction algorithm.
In one embodiment, the process of acquiring the text features of the text to be synthesized through neural network models is described.
Specifically, Fig. 3 shows a schematic flowchart of a process of acquiring the text features of the text to be synthesized.
As shown in Fig. 3, the process includes steps S1041-S1043:
Step S1041: input the text to be synthesized into a preset word segmentation model, and acquire the word segmentation features corresponding to the text to be synthesized, where the word segmentation features indicate where the text should be segmented or broken, thereby determining the word segmentation features corresponding to the segmentation result of the text;
Step S1042: input the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model, and acquire the polyphone features corresponding to the text to be synthesized;
Step S1043: input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model, and acquire the prosodic features corresponding to the text to be synthesized.
The word segmentation model is a neural network model that performs word segmentation on the text to be synthesized to obtain word segmentation features; through it, the word segmentation features of the text can be predicted. The word segmentation features are determined by the character vectors obtained from segmentation, where a character vector is the vector corresponding to a word or phrase divided by the word segmentation model and is used to determine the word segmentation features of the text to be synthesized.
The polyphone prediction model predicts the polyphone features in the text to be synthesized or in the word segmentation features, and may be a neural network model.
The prosody prediction model is a neural network model that predicts the prosodic features of the text to be synthesized or of the word segmentation features, for example prosodic word features, prosodic phrase features and intonation phrase features.
The text features of the text to be synthesized in this embodiment are not limited to the word segmentation, polyphone and prosodic features described here.
Users can configure the text features involved: beyond word segmentation, polyphone and prosodic features, other features such as contextual word association features may be used. Users may also build the structure of the overall neural network model by constructing a computation graph and selecting the input data.
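The data flow of steps S1041-S1043 can be summarized in the following sketch; the three model objects and their predict methods are placeholders for the trained networks described above, not a fixed interface:

    def extract_text_features(text, seg_model, poly_model, prosody_model):
        """Front-end of steps S1041-S1043: word segmentation runs first,
        and its output is fed, together with the raw text, to the
        polyphone and prosody predictors."""
        seg_feat = seg_model.predict(text)                    # S1041
        poly_feat = poly_model.predict(text, seg_feat)        # S1042
        prosody_feat = prosody_model.predict(text, seg_feat)  # S1043
        return {"segmentation": seg_feat,
                "polyphone": poly_feat,
                "prosody": prosody_feat}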
Step S106: input the text features into a preset duration prediction model, and acquire the duration features corresponding to the text features.
Specifically, the duration feature is the time length corresponding to the phoneme-level text features contained in the text to be synthesized and its text features. The preset duration prediction model is a neural network model that predicts the time length corresponding to phoneme text features and determines the time length of each phoneme contained in the text to be synthesized. This includes converting pinyin into phonemes: the polyphone prediction model yields a character's pronunciation (for example ou3), the pronunciation is converted into phonemes, and the duration prediction model then predicts the duration of each phoneme. As an example of converting pronunciations into phonemes, in "我在中国" the pronunciation ou3 of "我" converts into the single phoneme ou, while the pronunciation guo2 of "国" converts into the two phonemes g and uo.
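Following the example above, the pinyin-to-phoneme step might be sketched as below; the syllable-to-phoneme table is a hypothetical fragment matching the example, not a complete inventory:

    # Hypothetical syllable-to-phoneme table matching the example in the
    # text: "ou3" yields a single phoneme, "guo2" an initial plus a final.
    _SYLLABLE_TO_PHONES = {
        "ou": ["ou"],
        "guo": ["g", "uo"],
    }

    def pinyin_to_phonemes(syllable: str) -> list:
        """Strip the tone digit and split a pinyin syllable into phonemes."""
        base = syllable.rstrip("012345")          # e.g. "guo2" -> "guo"
        return _SYLLABLE_TO_PHONES.get(base, [base])

    print(pinyin_to_phonemes("guo2"))   # ['g', 'uo']; each phoneme then receives a predicted duration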
Step S108: input the text features and the duration features into a preset acoustic model, and acquire the speech features corresponding to the text to be synthesized.
Specifically, the speech features are features generated from the text features and the duration features, and include sound intensity, loudness, pitch and/or pitch period. Sound intensity is the average sound energy per unit time passing through a unit area perpendicular to the direction of sound wave propagation; loudness reflects the subjectively perceived strength of the sound; pitch reflects the subjectively perceived frequency of the sound; and the pitch period is the quasi-period exhibited by the voiced waveform during pronunciation, reflecting the time interval between, or the frequency of, two adjacent openings and closings of the glottis.
In this embodiment, the text features acquired in step S104 and the duration features acquired in step S106 are input into the preset acoustic model, which predicts the speech features corresponding to the text to be synthesized.
The acoustic model that predicts the speech features is a neural network model; through prior training it acquires the ability to compute the corresponding speech features from the text features and the duration features.
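One common way to combine the two model inputs, sketched here in PyTorch under the assumption that the duration model outputs an integer frame count per phoneme, is to repeat each phoneme's text-feature vector for its predicted number of frames before feeding the acoustic model:

    import torch

    def expand_to_frames(phone_feats: torch.Tensor,
                         frame_counts: torch.Tensor) -> torch.Tensor:
        """Repeat each phoneme's text-feature vector for its predicted duration.

        phone_feats:  (num_phones, feat_dim) per-phoneme text features
        frame_counts: (num_phones,) predicted frames per phoneme
        Returns a (total_frames, feat_dim) frame-level acoustic-model input.
        """
        return torch.repeat_interleave(phone_feats, frame_counts, dim=0)

    feats = torch.randn(3, 8)                     # 3 phonemes, 8-dim features
    frames = torch.tensor([4, 2, 5])              # predicted durations in frames
    print(expand_to_frames(feats, frames).shape)  # torch.Size([11, 8])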
Step S110: convert the speech features into speech, and generate the target speech corresponding to the text to be synthesized.
The target speech is the speech generated from the text to be synthesized. To convert speech features into speech, the features can be synthesized by a vocoder, which outputs the speech and speech duration corresponding to the features, yielding the target speech; the vocoder may be a parallel WaveNet vocoder. Specifically, the speech features are taken as input, the preset vocoder performs speech synthesis on the speech features corresponding to the text to be synthesized, and the corresponding target speech is output.
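This application names a parallel WaveNet vocoder but does not specify its interface; the sketch below therefore uses Griffin-Lim inversion from librosa purely as a stand-in to show the shape of the final feature-to-waveform step:

    import librosa
    import numpy as np
    import soundfile as sf

    def features_to_speech(mag_spectrogram: np.ndarray,
                           sr: int = 22050,
                           out_path: str = "target.wav") -> np.ndarray:
        """Stand-in vocoder: invert a magnitude spectrogram to a waveform.

        A production system would feed the predicted speech features to a
        neural vocoder such as parallel WaveNet instead of Griffin-Lim.
        """
        waveform = librosa.griffinlim(mag_spectrogram)  # iterative phase reconstruction
        sf.write(out_path, waveform, sr)
        return waveform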
Further, the above duration prediction model and acoustic model can predict the relevant features of the text to be synthesized well, but before using them for prediction they must be trained on training data. That is, before the text features are used to predict the corresponding duration and speech features, the duration prediction model and the acoustic model need to be trained so that they can accurately predict the duration features and speech features corresponding to the text features.
As shown in Fig. 4, the speech synthesis method further includes steps S1101-S1103:
Step S1101: acquire a training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features;
Step S1102: train the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output;
Step S1103: train the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
Before model training, the data must first be labeled to determine the duration reference features and speech reference features corresponding to each text; the duration reference features are the duration features corresponding to the text, and the speech reference features are the speech features corresponding to the text. In this embodiment, the duration prediction model and the acoustic model are trained on the pre-built training sample set so that they can accurately predict the duration features and speech features corresponding to the text to be synthesized.
For each training text in the training data set, the text reference features corresponding to the training text are taken as input and the corresponding duration reference features as output, and the preset duration prediction model is trained so that it acquires the ability to predict duration features.
For each training text in the training data set, the text reference features and duration reference features corresponding to the training text are taken as input and the corresponding speech reference features as output, and the preset acoustic model is trained so that it acquires the ability to predict speech features.
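A minimal supervised training loop for the duration prediction model of step S1102 might look as follows, assuming batches of (text-feature, duration-reference) pairs; the mean-squared-error objective and optimizer settings are illustrative choices, not mandated here:

    import torch
    import torch.nn as nn

    def train_duration_model(model: nn.Module, loader, epochs: int = 10,
                             lr: float = 1e-3) -> nn.Module:
        """Fit the duration predictor on (text_feature, duration_target) batches."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for text_feats, duration_refs in loader:
                optimizer.zero_grad()
                pred = model(text_feats)             # predicted per-phoneme durations
                loss = loss_fn(pred, duration_refs)  # error against reference durations
                loss.backward()
                optimizer.step()
        return model

The acoustic model of step S1103 would be trained with the same loop shape, with the concatenated text and duration reference features as input and the speech reference features as the target.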
Further, in one embodiment, the models used for text feature prediction also need to be trained, specifically the word segmentation model, the polyphone prediction model and the prosody prediction model.
That is, the word segmentation model, the polyphone prediction model and the prosody prediction model involved in text feature prediction are trained on the training sample set, so that each acquires the ability to predict, from the text to be synthesized, the corresponding word segmentation features, polyphone features and prosodic features.
As shown in Fig. 5, the speech synthesis method further includes steps S2101-S2103:
In one embodiment, the training sample set further contains the word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts; the method includes:
Step S2101: train the word segmentation model with the training texts as its input and the word segmentation reference features as its output.
Step S2102: train the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output.
Step S2103: train the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
The training sample set may also include multiple training texts together with the word segmentation, polyphone and prosodic features the models are expected to output. The word segmentation reference features are the word segmentation features the word segmentation model is expected to output for a training text; the polyphone reference features are the polyphone features the polyphone prediction model is expected to output from the training text and the corresponding word segmentation features; and the prosody reference features are the prosodic features the prosody prediction model is expected to output from the training text and the corresponding word segmentation features.
For each training text in the training data set, the training text is taken as input and the corresponding word segmentation reference features as output, and the preset word segmentation model is trained so that it can predict word segmentation features.
For each training text in the training data set, the training text and the corresponding word segmentation reference features are taken as input and the corresponding polyphone reference features as output, and the preset polyphone prediction model is trained so that it can predict polyphone features.
For each training text in the training data set, the training text and the corresponding word segmentation reference features are taken as input and the corresponding prosody reference features as output, and the preset prosody prediction model is trained so that it can predict prosodic features.
In this embodiment, the word segmentation model, the polyphone prediction model and the prosody prediction model are trained on pre-processed data, so that the models can accurately predict the word segmentation features, polyphone features and prosodic features corresponding to the training texts.
In the actual prediction process, multiple texts to be synthesized can be acquired at the same time, and the text features corresponding to each are obtained. The text features corresponding to the multiple texts are filtered, sorted and put into a preset feature queue. A preset number of text features are taken from the feature queue and input into the duration prediction model and the acoustic model for prediction, yielding the corresponding features. The step of generating the text features for each text to be synthesized and the step of predicting on the preset number of text features proceed in parallel.
As shown in Fig. 6, the speech synthesis method further includes steps S3101-S3102:
Step S3101: acquire multiple texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of acquiring the text features of the text to be synthesized;
Step S3102: add the text features corresponding to the multiple texts to be synthesized to a preset feature queue; when the feature queue meets a preset condition, acquire a preset number of text features from the feature queue and input them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
Here, the text iterator is used to obtain a continuous stream of data from the multiple texts to be synthesized and their corresponding text features; it can continuously iterate text features across multiple feature-extraction processes. The feature queue is an ordered collection containing multiple text features. The preset condition is the condition for inputting text features into the duration prediction model; it may be that the text features reach a certain number, or that a preset acquisition time is reached. The preset number is the number of text features the feature queue outputs; it may be a fixed value or a value that varies according to some rule.
In this embodiment, the feature queue and the text iterator are added to process multiple texts to be synthesized, making text conversion more effective and faster and improving the efficiency of speech synthesis and model training.
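The producer-consumer arrangement of steps S3101-S3102 might be sketched with Python's standard queue module as below; the batch size of 8, the sentinel handling and the predict_batch method are assumptions for illustration. In practice the producer and consumer would run on separate threads so that feature extraction and duration prediction proceed in parallel:

    import queue

    BATCH_SIZE = 8               # illustrative "preset number" of features per batch
    feature_queue = queue.Queue()

    def produce_features(text_iterator, extract_fn):
        """S3101: iterate over texts and push their text features onto the queue."""
        for text in text_iterator:
            feature_queue.put(extract_fn(text))
        feature_queue.put(None)  # sentinel: no more texts

    def consume_features(duration_model):
        """S3102: drain the queue in fixed-size batches for duration prediction."""
        batch = []
        while True:
            item = feature_queue.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) == BATCH_SIZE:         # "preset condition" reached
                duration_model.predict_batch(batch)
                batch = []
        if batch:                                # flush the final partial batch
            duration_model.predict_batch(batch)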
Exemplarily, in the above training and prediction processes, the duration prediction model, the acoustic model, the word segmentation model, the polyphone prediction model and/or the prosody prediction model are neural network models; in a specific embodiment, they are bidirectional long short-term memory network models (BiLSTM models).
The BiLSTM (Bi-directional Long Short-Term Memory) model gives the data time dependence and processes the data globally, allowing better predictions through features such as the preceding and following words.
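A minimal BiLSTM predictor in PyTorch, shown only to make the architecture concrete; the layer sizes and the single-output head (e.g. a per-phoneme duration) are placeholders:

    import torch
    import torch.nn as nn

    class BiLSTMPredictor(nn.Module):
        """Generic bidirectional LSTM: a sequence of feature vectors in,
        a per-step prediction out."""
        def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden_dim, output_dim)  # forward + backward states

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out, _ = self.lstm(x)   # (batch, seq_len, 2 * hidden_dim)
            return self.proj(out)   # (batch, seq_len, output_dim)

    model = BiLSTMPredictor(input_dim=32, hidden_dim=64, output_dim=1)
    durations = model(torch.randn(4, 20, 32))   # e.g. per-phoneme duration predictions
    print(durations.shape)                      # torch.Size([4, 20, 1])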
As shown in Fig. 7, in one embodiment, a speech synthesis apparatus is provided, the apparatus including:
an acquisition module 702, configured to acquire a text to be synthesized;
a text feature determination module 704, configured to acquire text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphone feature and a prosodic feature;
a duration feature determination module 706, configured to input the text features into a preset duration prediction model and acquire the duration features corresponding to the text features;
a speech feature determination module 708, configured to input the text features and the duration features into a preset acoustic model and acquire the speech features corresponding to the text to be synthesized; and
a conversion module 710, configured to convert the speech features into speech and generate the target speech corresponding to the text to be synthesized.
As shown in Fig. 8, in one embodiment, the text feature determination module 704 further includes a preprocessing unit, configured to perform normalization processing on the text to be synthesized.
As shown in Fig. 8, in one embodiment, the text feature determination module 704 further includes: a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and acquire the word segmentation features corresponding to the text to be synthesized; a polyphone feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset polyphone prediction model and acquire the polyphone features corresponding to the text to be synthesized; and a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model and acquire the prosodic features corresponding to the text to be synthesized.
As shown in Fig. 9, in one embodiment, the apparatus further includes: a training acquisition module 703, configured to acquire a training sample set containing a plurality of training texts and the corresponding text reference features, duration reference features and/or speech reference features; a duration training module 705, configured to train the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and a speech training module 707, configured to train the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
As shown in Fig. 10, in one embodiment, the training sample set further contains the word segmentation reference features, polyphone reference features and/or prosody reference features corresponding to the plurality of training texts, and the apparatus includes: a word segmentation training module 7041, configured to train the word segmentation model with the training texts as its input and the word segmentation reference features as its output; a polyphone training module 7043, configured to train the polyphone prediction model with the training texts and/or the word segmentation reference features as its input and the polyphone reference features as its output; and a prosody training module 7045, configured to train the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosody reference features as its output.
As shown in Fig. 11, in one embodiment, the apparatus further includes: a text acquisition module 709, configured to acquire multiple texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of acquiring the text features of the text to be synthesized; and a text prediction module 711, configured to add the text features corresponding to the multiple texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, acquire a preset number of text features from the feature queue and input them into the duration prediction model, so that the step of acquiring the duration features corresponding to the text features is performed synchronously.
In one embodiment, the duration prediction model, the acoustic model, the word segmentation model, the polyphone prediction model and/or the prosody prediction model are BiLSTM models.
Fig. 12 shows an internal structure diagram of the smart terminal in an embodiment. The smart terminal may specifically be a terminal or a server. As shown in Fig. 12, the smart terminal includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the smart terminal stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech synthesis method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the speech synthesis method. Those skilled in the art will understand that the structure shown in Fig. 12 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
在一个实施例中,提出了一种智能终端,包括存储器和处理器,所述存储器存储有计算机程序,所述计算机程序被所述处理器执行时,使得所述处理器执行以下步骤:获取待合成文本;获取所述待合成文本的文本特征,所述文本特征包括分词特征、多音字特征和/或韵律特征中的至少一个;将所述文本特征输入预设的时长预测模型,获取与所述文本特征对应的时长特征;将所述文本特征和所述时长特征输入预设的声学模型,获取与所述待合成文本对应的语音特征;将所述语音特征转换成语音,生成与所述待合成文本对应的目标语音。In one embodiment, an intelligent terminal is proposed, including a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps: Synthesizing text; acquiring text features of the text to be synthesized, the text features including at least one of word segmentation features, polyphonic character features, and/or prosodic features; inputting the text features into a preset duration prediction model, and acquiring The duration feature corresponding to the text feature; input the text feature and the duration feature into a preset acoustic model to obtain the voice feature corresponding to the text to be synthesized; convert the voice feature into speech, and generate and The target voice corresponding to the text to be synthesized.
在一个实施例中,所述获取所述待合成文本的文本特征的步骤之前,还包括:对所述待合成文本进行正则化处理。In one embodiment, before the step of obtaining the text characteristics of the text to be synthesized, the method further includes: performing regularization processing on the text to be synthesized.
在一个实施例中,所述获取所述待合成文本的文本特征的步骤还包括:将所述待合成文本输入预设的分词模型,获取与所述待合成文本对应的分词特征;将所述待合成文本和/或所述分词特征输入预设的多音字预测模型,获取所述待合成文本对应的多音字特征;将所述待合成文本和/或所述分词特征输入预设的韵律预测模型,获取所述待合成文本对应的韵律特征。In one embodiment, the step of obtaining the text features of the text to be synthesized further includes: inputting the text to be synthesized into a preset word segmentation model to obtain the word segmentation characteristics corresponding to the text to be synthesized; Input the pre-synthesized text and/or the word segmentation feature into a preset polyphonic character prediction model to obtain the polyphonic character features corresponding to the text to be synthesized; input the to-be-synthesized text and/or the word segmentation feature into the preset prosody prediction Model to obtain the prosodic features corresponding to the text to be synthesized.
在一个实施例中,所述方法还包括:获取训练样本集,所述训练样本集包含多个训练文本以及对应的文本参考特征、时长参考特征和/或语音参考特征;将所述训练文本对应的文本参考特征作为所述时长预测模型的输入,所述时长参考特征作为时长预测模型的输出,对所述时长预测模型进行训练。将所述文本参考特征和所述时长参考特征作为所述声学模型的输入,所述语音参考特征作为声学模型的输出,对所述声学模型进行训练。In one embodiment, the method further includes: obtaining a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or voice reference features; and corresponding to the training text The text reference feature of is used as the input of the duration prediction model, and the duration reference feature is used as the output of the duration prediction model to train the duration prediction model. The text reference feature and the duration reference feature are used as the input of the acoustic model, and the speech reference feature is used as the output of the acoustic model, and the acoustic model is trained.
在一个实施例中,所述训练样本集还包含与所述多个训练文本对应的分词参考特征、多音字参考特征和/或韵律参考特征;所述方法包括:将所述训练文本作为所述分词模型的输入,所述分词参考特征作为分词模型的输出,对所述分词模型进行训练。将所述训练文本和/或所述分词参考特征作为所述多音字预测模型的输入,所述多音字参考特征作为多音字预测模型的输出,对所述多音字预测模型进行训练。将所述训练文本和/或所述分词参考特征作为所述韵律预测模型的输入,所述韵律参考特征作为韵律预测模型的输出,对所述韵律预测模型进行训练。In an embodiment, the training sample set further includes word segmentation reference features, polyphone reference features, and/or prosodic reference features corresponding to the multiple training texts; the method includes: using the training text as the The word segmentation model is input, and the word segmentation reference feature is used as the output of the word segmentation model, and the word segmentation model is trained. The training text and/or the word segmentation reference feature is used as the input of the polyphonic character prediction model, and the polyphonic character reference feature is used as the output of the polyphonic character prediction model, and the polyphonic character prediction model is trained. The training text and/or the word segmentation reference feature is used as the input of the prosody prediction model, and the prosody reference feature is used as the output of the prosody prediction model, and the prosody prediction model is trained.
In one embodiment, the method further comprises: obtaining a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, performing the step of obtaining the text features of the text to be synthesized; adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue; and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
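One way to realize this iterator-plus-queue arrangement is a producer thread that extracts features while a consumer drains the queue in fixed-size batches for the duration model. The sketch below assumes the "preset condition" is simply a batch-size threshold, and predict_batch is a hypothetical batched interface:

```python
import queue
import threading

def batched_duration_prediction(text_iterator, feature_extractor,
                                duration_model, batch_size=8):
    feature_queue = queue.Queue()

    def producer():
        # Extract text features for each text and add them to the queue.
        for text in text_iterator:
            feature_queue.put(feature_extractor(text))
        feature_queue.put(None)  # sentinel: no more texts

    threading.Thread(target=producer, daemon=True).start()

    batch = []
    while True:
        item = feature_queue.get()
        done = item is None
        if not done:
            batch.append(item)
        # The preset condition: a full batch, or the input is exhausted.
        if batch and (len(batch) == batch_size or done):
            yield duration_model.predict_batch(batch)
            batch = []
        if done:
            return
```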
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps: obtaining a text to be synthesized; obtaining text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphonic character feature, and a prosodic feature; inputting the text features into a preset duration prediction model to obtain duration features corresponding to the text features; inputting the text features and the duration features into a preset acoustic model to obtain speech features corresponding to the text to be synthesized; and converting the speech features into speech to generate a target speech corresponding to the text to be synthesized.
In one embodiment, before the step of obtaining the text features of the text to be synthesized, the method further comprises: performing regularization processing on the text to be synthesized.
In one embodiment, the step of obtaining the text features of the text to be synthesized further comprises: inputting the text to be synthesized into a preset word segmentation model to obtain word segmentation features corresponding to the text to be synthesized; inputting the text to be synthesized and/or the word segmentation features into a preset polyphonic character prediction model to obtain polyphonic character features corresponding to the text to be synthesized; and inputting the text to be synthesized and/or the word segmentation features into a preset prosody prediction model to obtain prosodic features corresponding to the text to be synthesized.
In one embodiment, the method further comprises: obtaining a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features; training the duration prediction model with the text reference features corresponding to the training texts as its input and the duration reference features as its output; and training the acoustic model with the text reference features and the duration reference features as its input and the speech reference features as its output.
In one embodiment, the training sample set further contains word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts. The method comprises: training the word segmentation model with the training texts as its input and the word segmentation reference features as its output; training the polyphonic character prediction model with the training texts and/or the word segmentation reference features as its input and the polyphonic character reference features as its output; and training the prosody prediction model with the training texts and/or the word segmentation reference features as its input and the prosodic reference features as its output.
In one embodiment, the method further comprises: obtaining a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, performing the step of obtaining the text features of the text to be synthesized; adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue; and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting them into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
With the speech synthesis method, apparatus, terminal, and storage medium of the present invention, the speech synthesis process first obtains the text features of the text to be synthesized, including text features such as word segmentation features, polyphonic character features, and/or prosodic features; then inputs the text features into a preset duration prediction model to obtain the corresponding duration features; inputs the text features and the duration features into a preset acoustic model to obtain the corresponding speech features; and finally converts the speech features into speech to generate the target speech corresponding to the text to be synthesized. During feature extraction for speech synthesis, the text features considered include polyphonic character features and prosodic features, which are combined with the model-predicted duration features to obtain the speech features needed to synthesize the final speech. In other words, the speech synthesis method, apparatus, terminal, and storage medium provided by the present invention generate speech features from multiple text features together with duration features, making the synthesized speech more accurate, improving the accuracy of speech synthesis, and improving the user experience.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this application patent shall be subject to the appended claims.

Claims (13)

  1. A speech synthesis method, characterized in that the method comprises:
    obtaining a text to be synthesized;
    obtaining text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphonic character feature, and a prosodic feature;
    inputting the text features into a preset duration prediction model to obtain duration features corresponding to the text features;
    inputting the text features and the duration features into a preset acoustic model to obtain speech features corresponding to the text to be synthesized;
    converting the speech features into speech to generate a target speech corresponding to the text to be synthesized.
  2. The method according to claim 1, characterized in that, before the step of obtaining the text features of the text to be synthesized, the method further comprises:
    performing regularization processing on the text to be synthesized.
  3. The method according to claim 1, characterized in that the step of obtaining the text features of the text to be synthesized further comprises:
    inputting the text to be synthesized into a preset word segmentation model to obtain word segmentation features corresponding to the text to be synthesized;
    inputting the text to be synthesized and/or the word segmentation features into a preset polyphonic character prediction model to obtain polyphonic character features corresponding to the text to be synthesized;
    inputting the text to be synthesized and/or the word segmentation features into a preset prosody prediction model to obtain prosodic features corresponding to the text to be synthesized.
  4. The method according to claim 1, characterized in that the method further comprises:
    obtaining a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features;
    training the duration prediction model with the text reference features corresponding to the training texts as the input of the duration prediction model and the duration reference features as the output of the duration prediction model;
    training the acoustic model with the text reference features and the duration reference features as the input of the acoustic model and the speech reference features as the output of the acoustic model.
  5. The method according to claim 3, characterized in that the training sample set further contains word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts;
    the method comprises:
    training the word segmentation model with the training texts as the input of the word segmentation model and the word segmentation reference features as the output of the word segmentation model;
    training the polyphonic character prediction model with the training texts and/or the word segmentation reference features as the input of the polyphonic character prediction model and the polyphonic character reference features as the output of the polyphonic character prediction model;
    training the prosody prediction model with the training texts and/or the word segmentation reference features as the input of the prosody prediction model and the prosodic reference features as the output of the prosody prediction model.
  6. The method according to claim 1, characterized in that the method further comprises:
    obtaining a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, performing the step of obtaining the text features of the text to be synthesized;
    adding the text features corresponding to the plurality of texts to be synthesized to a preset feature queue, and, when the feature queue meets a preset condition, obtaining a preset number of text features from the feature queue and inputting the preset number of text features into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  7. A speech synthesis apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain a text to be synthesized;
    a text feature determination module, configured to obtain text features of the text to be synthesized, the text features comprising at least one of a word segmentation feature, a polyphonic character feature, and a prosodic feature;
    a duration feature determination module, configured to input the text features into a preset duration prediction model and obtain duration features corresponding to the text features;
    a speech feature determination module, configured to input the text features and the duration features into a preset acoustic model and obtain speech features corresponding to the text to be synthesized;
    a conversion module, configured to convert the speech features into speech and generate a target speech corresponding to the text to be synthesized.
  8. The apparatus according to claim 7, characterized in that the text feature determination module further comprises:
    a word segmentation feature determination unit, configured to input the text to be synthesized into a preset word segmentation model and obtain word segmentation features corresponding to the text to be synthesized;
    a polyphonic character feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset polyphonic character prediction model and obtain polyphonic character features corresponding to the text to be synthesized;
    a prosodic feature determination unit, configured to input the text to be synthesized and/or the word segmentation features into a preset prosody prediction model and obtain prosodic features corresponding to the text to be synthesized.
  9. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a training acquisition module, configured to obtain a training sample set, the training sample set containing a plurality of training texts and corresponding text reference features, duration reference features, and/or speech reference features;
    a duration training module, configured to train the duration prediction model with the text reference features corresponding to the training texts as the input of the duration prediction model and the duration reference features as the output of the duration prediction model;
    a speech training module, configured to train the acoustic model with the text reference features and the duration reference features as the input of the acoustic model and the speech reference features as the output of the acoustic model.
  10. The apparatus according to claim 8, characterized in that the training sample set further contains word segmentation reference features, polyphonic character reference features, and/or prosodic reference features corresponding to the plurality of training texts, and the apparatus comprises:
    a word segmentation training module, configured to train the word segmentation model with the training texts as the input of the word segmentation model and the word segmentation reference features as the output of the word segmentation model;
    a polyphonic character training module, configured to train the polyphonic character prediction model with the training texts and/or the word segmentation reference features as the input of the polyphonic character prediction model and the polyphonic character reference features as the output of the polyphonic character prediction model;
    a prosody training module, configured to train the prosody prediction model with the training texts and/or the word segmentation reference features as the input of the prosody prediction model and the prosodic reference features as the output of the prosody prediction model.
  11. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a text acquisition module, configured to obtain a plurality of texts to be synthesized through a text iterator and, for each text to be synthesized, perform the step of obtaining the text features of the text to be synthesized;
    a text prediction module, configured to add the text features corresponding to the plurality of texts to be synthesized to a preset feature queue and, when the feature queue meets a preset condition, obtain a preset number of text features from the feature queue and input the preset number of text features into the duration prediction model, so that the step of obtaining the duration features corresponding to the text features is performed synchronously.
  12. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the processor performs the steps of the method according to any one of claims 1 to 6.
  13. A smart terminal, comprising a memory and a processor, the memory storing a computer program, characterized in that, when the computer program is executed by the processor, the processor performs the steps of the method according to any one of claims 1 to 6.
PCT/CN2019/130766 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium WO2021134591A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980003388.1A CN111164674B (en) 2019-12-31 Speech synthesis method, device, terminal and storage medium
PCT/CN2019/130766 WO2021134591A1 (en) 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130766 WO2021134591A1 (en) 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2021134591A1 (en) 2021-07-08

Family

ID=70562373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130766 WO2021134591A1 (en) 2019-12-31 2019-12-31 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium

Country Status (1)

Country Link
WO (1) WO2021134591A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105336322A (en) * 2015-09-30 2016-02-17 百度在线网络技术(北京)有限公司 Polyphone model training method, and speech synthesis method and device
US20160049144A1 (en) * 2014-08-18 2016-02-18 At&T Intellectual Property I, L.P. System and method for unified normalization in text-to-speech and automatic speech recognition
CN106507321A (en) * 2016-11-22 2017-03-15 新疆农业大学 The bilingual GSM message breath voice conversion broadcasting system of a kind of dimension, the Chinese
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN107039034A (en) * 2016-02-04 2017-08-11 科大讯飞股份有限公司 A kind of prosody prediction method and system
US20190172443A1 (en) * 2017-12-06 2019-06-06 International Business Machines Corporation System and method for generating expressive prosody for speech synthesis
CN110299131A (en) * 2019-08-01 2019-10-01 苏州奇梦者网络科技有限公司 A kind of phoneme synthesizing method, device, the storage medium of controllable rhythm emotion

Also Published As

Publication number Publication date
CN111164674A (en) 2020-05-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958274

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19958274

Country of ref document: EP

Kind code of ref document: A1