CN114360488A - Speech synthesis, speech synthesis model training method, apparatus and storage medium - Google Patents
Speech synthesis, speech synthesis model training method, apparatus and storage medium
- Publication number
- CN114360488A (application CN202210029807.2A)
- Authority
- CN
- China
- Prior art keywords
- voice
- feature recognition
- voice feature
- model
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a speech synthesis method, a speech synthesis model training method, a speech synthesis apparatus and a storage medium. The speech synthesis method comprises the following steps: obtaining an input text and inputting it into a speech synthesis model; segmenting the input text by using a word segmentation sub-model in the speech synthesis model to obtain word vectors; performing speech feature recognition on the word vectors by using at least two speech feature recognition sub-models in the speech synthesis model to correspondingly obtain at least two groups of speech features; and converting the input text into audio for output according to the at least two groups of speech features. In the technical scheme provided by the invention, the speech synthesis task comprises a plurality of speech feature recognition subtasks, and combining these speech feature recognition tasks within the speech synthesis task improves speech synthesis efficiency.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for speech synthesis and speech synthesis model training and a storage medium.
Background
Speech synthesis can be divided into front-end and back-end processing. The front end can be understood as mapping text characters into artificial phonetic features such as phonemes, and the back end converts those features into a raw waveform, which is the output audio. The embodiments of the invention address the problem of how to perform such speech synthesis efficiently.
It is to be noted that the information disclosed in the above background section is only intended to enhance understanding of the background of the invention, and therefore may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a speech synthesis method, a speech synthesis model training method, an apparatus and a storage medium which overcome the difficulties in the prior art and can improve the efficiency of speech synthesis.
An embodiment of the present invention provides a speech synthesis method, including:
acquiring an input text;
inputting the input text into a voice synthesis model, performing word segmentation on the input text by using a word segmentation sub-model in the voice synthesis model to obtain word vectors, and performing voice feature recognition on the word vectors by using at least two voice feature recognition sub-models in the voice synthesis model to correspondingly obtain at least two groups of voice features;
and converting the input text into audio output according to the at least two groups of voice characteristics.
In an alternative embodiment, the word segmentation sub-model is obtained based on a TinyBERT model.
In an optional embodiment, the at least two voice feature recognition submodels are a prosody pause recognition submodel and a polyphonic character recognition submodel, respectively, where the voice feature output by the prosody pause recognition submodel is a prosody pause position recognition result, and the voice feature output by the polyphonic character recognition submodel is a polyphonic character recognition result.
In an alternative embodiment, the obtaining the input text includes:
acquiring an original text;
and carrying out regularization processing on the original text to obtain the input text.
The embodiment of the invention also provides a speech synthesis model training method, which comprises the following steps:
acquiring a text sample, wherein the text sample is provided with voice feature marking information corresponding to at least two voice feature recognition tasks;
inputting the text sample into a word segmentation sub-model, and outputting a word vector which carries the voice feature labeling information of the at least two voice feature recognition tasks;
inputting a word vector into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, correspondingly outputting at least two groups of voice feature recognition results, and respectively adjusting model parameters of the at least two voice feature recognition submodels by using the at least two groups of voice feature recognition results until the at least two voice feature recognition submodels are converged;
and constructing to obtain a voice synthesis model according to the word segmentation submodel and the converged at least two voice feature recognition submodels.
In an alternative embodiment, obtaining a text sample comprises:
obtaining an original text sample;
and performing regularization processing on the text information of the original text sample to obtain the text sample.
In an optional embodiment, the at least two voice feature recognition submodels may be a prosody pause recognition submodel and a polyphone recognition submodel, and the voice feature labeling information of the at least two voice feature recognition tasks is prosody pause position information corresponding to the prosody pause recognition submodel and polyphone labeling information corresponding to the polyphone recognition submodel.
In an alternative embodiment, the prosody pause recognition sub-model includes at least two serially connected fully connected layers, and a conditional random field module connected to the at least two serially connected fully connected layers;
under the condition that the word vector is input into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, at least two groups of voice feature recognition results are correspondingly output, and the method comprises the following steps:
performing dimension transformation on the word vectors by using the at least two serially connected full-connection layers to obtain transformation vectors;
and inputting the transformation vector into the conditional random field module, and outputting the identified prosody pause position information.
In an optional embodiment, the adjusting the model parameters of the at least two speech feature recognition submodels using the at least two sets of speech feature recognition results respectively includes:
calculating the overall loss value of the speech synthesis model and the ratio of the loss values of the at least two groups of speech feature recognition results to the overall loss value respectively, and taking the ratio as a weight;
and calculating gradient values according to the weights, and adjusting model parameters of the at least two voice feature recognition submodels according to the gradient values.
An embodiment of the present invention further provides a speech synthesis apparatus, including:
the first acquisition module is used for acquiring an input text;
the voice feature recognition module is used for inputting the input text into a voice synthesis model, performing word segmentation on the input text by using a word segmentation submodel in the voice synthesis model to obtain a word vector, and performing voice feature recognition on the word vector by using at least two voice feature recognition submodels in the voice synthesis model respectively to correspondingly obtain at least two groups of voice features;
and the audio synthesis module is used for converting the input text into audio output according to the at least two groups of voice characteristics.
An embodiment of the present invention further provides a speech synthesis model training apparatus, including:
the second acquisition module is used for acquiring a text sample, and the text sample is provided with voice feature marking information corresponding to at least two voice feature recognition tasks;
the word segmentation module is used for inputting the text sample into a word segmentation sub-model and outputting a word vector, wherein the word vector carries the voice feature labeling information of the at least two voice feature recognition tasks;
the model training module inputs the word vectors into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, correspondingly outputs at least two groups of voice feature recognition results, and respectively adjusts model parameters of the at least two voice feature recognition submodels by using the at least two groups of voice feature recognition results until the at least two voice feature recognition submodels are converged;
and the model construction module is used for constructing a voice synthesis model according to the word segmentation sub-model and the converged at least two voice feature recognition sub-models.
An embodiment of the present invention further provides an electronic device, including:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the above-described speech synthesis method or speech synthesis model training method via execution of executable instructions.
Embodiments of the present invention also provide a computer-readable storage medium for storing a program, which when executed implements the steps of the above-mentioned speech synthesis method or speech synthesis model training method.
The invention provides a speech synthesis method, a speech synthesis model training method, an apparatus and a storage medium. An input text is obtained and fed into a speech synthesis model; the input text is segmented by a word segmentation sub-model in the speech synthesis model to obtain word vectors; speech feature recognition is performed on the word vectors by at least two speech feature recognition sub-models in the speech synthesis model to correspondingly obtain at least two groups of speech features; and the input text is converted into audio for output according to the at least two groups of speech features. In the technical scheme provided by the invention, the speech synthesis task comprises a plurality of speech feature recognition subtasks, and combining these speech feature recognition tasks within the speech synthesis task improves speech synthesis efficiency.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow diagram of one embodiment of a speech synthesis method of the present invention;
FIG. 2 is a schematic diagram of a speech synthesis model in the speech synthesis method of the present invention;
FIG. 3 is a flow diagram for one embodiment of a speech synthesis model training method of the present invention;
FIG. 4 is a block diagram of one embodiment of a speech synthesis apparatus of the present invention;
FIG. 5 is a block diagram of an embodiment of a speech synthesis model training apparatus of the present invention;
FIG. 6 is a schematic diagram of the operation of the speech synthesis apparatus or speech synthesis model training apparatus of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
The drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In addition, the flow shown in the drawings is only an exemplary illustration, and not necessarily includes all the steps. For example, some steps may be divided, some steps may be combined or partially combined, and the actual execution sequence may be changed according to the actual situation. The use of "first," "second," and similar terms in the detailed description is not intended to imply any order, quantity, or importance, but rather is used to distinguish one element from another. It should be noted that features of the embodiments of the invention and of the different embodiments may be combined with each other without conflict.
The inventors have found that, with the development of deep learning and the growing computing power of computer hardware, speech synthesis based on deep learning has become increasingly mature. The regularization-based front-end processing method now appears increasingly bloated compared with deep learning methods, and how to rapidly handle the several different front-end flows in speech synthesis has become a difficult problem.
The embodiments of the invention provide a multi-task learning model, MTFP (Multi Task Front Processing on Text to Speech), that handles multiple front-end tasks of speech synthesis. The inventive concept is as follows:
the voice synthesis model comprises a public module and at least two branch modules, wherein the output of the public module is connected with the input of the two branch modules, the public module is a word segmentation sub-model, the at least two branch modules are respectively different voice feature recognition sub-models, and each voice feature recognition sub-model is used for recognizing different types of voice features;
the original text is subjected to word segmentation through the word segmentation submodel, word segmentation feature data are input into at least two voice feature recognition submodels, at least two types of voice feature recognition results are obtained, and the original text is converted into audio frequency to be output according to the at least two types of voice feature recognition results.
The embodiment of the invention identifies the subtasks according to the common part of the subtasks in the audio synthesis task, such as word segmentation task, and the voice characteristics which are independent mutually. And then, constructing a word segmentation submodel according to the word segmentation task of the common part, and connecting a plurality of mutually independent voice feature recognition submodels behind the word segmentation submodel, wherein each voice feature recognition submodel is used for recognizing corresponding voice features, and the voice features recognized by different voice feature recognition submodels are different.
Therefore, the embodiment of the invention combines a plurality of voice feature recognition tasks in the voice synthesis task, and can improve the voice synthesis efficiency.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention, and as shown in fig. 1, the speech synthesis method according to the embodiment of the present invention includes the following steps:
step 110: acquiring an input text;
step 120: inputting an input text into a voice synthesis model, segmenting the input text by using a segmentation sub-model in the voice synthesis model to obtain word vectors, and respectively performing voice feature recognition on the word vectors by using at least two voice feature recognition sub-models in the voice synthesis model to correspondingly obtain at least two groups of voice features;
step 130: the input text is converted to an audio output based on at least two sets of speech features.
The speech synthesis task comprises a plurality of speech feature recognition subtasks, such as prosodic pause prediction and polyphone recognition.
In the embodiment of the invention, the word segmentation aims to identify and convert the word meaning of the text information of the input text to obtain word vectors, and each word vector represents the context information of the word.
In an alternative embodiment of the invention, the word segmentation sub-model is obtained based on a TinyBERT model.
Compared with BERT (Bidirectional Encoder Representations from Transformers), the TinyBERT model is a lightweight word-level model that is smaller and runs faster. TinyBERT is obtained by a knowledge distillation method designed specifically for Transformer models: its size is less than 1/7 that of BERT, its speed is about 9 times higher, and its performance is not significantly reduced.
In an optional embodiment of the present invention, the at least two voice feature recognition submodels may be a prosody pause recognition submodel and a polyphonic character recognition submodel, respectively, where the voice feature output by the prosody pause recognition submodel is a prosody pause location recognition result, and the voice feature output by the polyphonic character recognition submodel is a polyphonic character recognition result.
The prosodic pause position recognition result may indicate either no pause or a pause; in the latter case it contains the pause position information recognized in the corresponding input text.
The polyphone recognition result may indicate that the text contains no polyphonic character or that it contains one; in the latter case it further includes the polyphonic character recognized in the original text data and its phoneme information, the phoneme information representing the pronunciation.
The prosodic pause recognition result and the polyphonic recognition result reflect the phoneme characteristics of the corresponding words or phrases, and correspond to the voice characteristics of the above, so that when the original input text is converted into audio and output, the output audio can more accurately reflect the semantics of the input text.
In this case, referring to fig. 2, the speech synthesis model specifically includes:
a word segmentation submodel 21;
a prosody pause recognition submodel 22A and a polyphonic character recognition submodel 22B connected with the output end of the word segmentation submodel 21;
a speech synthesis submodel 23 connected to the output terminals of the prosody pause recognition submodel 22A and the polyphonic recognition submodel 22B.
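The branch layout of Fig. 2 can be illustrated with a minimal sketch, assuming PyTorch. The encoder is stubbed with an embedding layer for self-containment, and the vocabulary size, number of pause tags and number of pinyin classes are illustrative assumptions rather than values from the patent; only the hidden size 312 follows the TinyBERT dimension mentioned later.

```python
import torch
import torch.nn as nn

class MTFP(nn.Module):
    def __init__(self, vocab_size=8000, hidden=312, n_pause_tags=5, n_pinyin=1500):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)    # stand-in for the TinyBERT word segmentation sub-model 21
        self.pause_head = nn.Linear(hidden, n_pause_tags)  # simplified prosody pause recognition sub-model 22A
        self.poly_head = nn.Linear(hidden, n_pinyin)       # simplified polyphone recognition sub-model 22B

    def forward(self, token_ids):
        h = self.encoder(token_ids)                        # shared word vectors, shape (B, T, H)
        return self.pause_head(h), self.poly_head(h)       # two groups of speech features for the synthesis sub-model 23

model = MTFP()
ids = torch.randint(0, 8000, (2, 16))                      # toy batch of character indices
pause_logits, poly_logits = model(ids)
print(pause_logits.shape, poly_logits.shape)               # torch.Size([2, 16, 5]) torch.Size([2, 16, 1500])
```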
In an optional embodiment of the present invention, acquiring the input text specifically may include:
acquiring an original text;
and carrying out regularization processing on the original text to obtain an input text.
Regularizing the text information of the original text removes garbled characters, non-standard symbols and the like, or replaces Chinese symbols with the corresponding English symbols. In addition, numbers are pronounced differently in different scenes, so a number can be rewritten into different Chinese characters according to keywords obtained from matching statistics, for example: "the room price is 423 yuan" becomes "the room price is four hundred and twenty-three yuan", and "room number 501" becomes "room number five zero one".
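The regularization step can be sketched as follows, assuming Python; the symbol filter and the single keyword rule ("房间号") are illustrative assumptions, and a production system would also need quantity-style readings such as "423元" → "四百二十三元".

```python
import re

DIGITS = "零一二三四五六七八九"

def read_digit_by_digit(num: str) -> str:
    """Read a number digit by digit, e.g. '501' -> '五零一'."""
    return "".join(DIGITS[int(d)] for d in num)

def normalize(text: str) -> str:
    # Drop garbled or non-standard symbols, keeping Chinese characters,
    # letters, digits and a few common punctuation marks.
    text = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff。，！？]", "", text)
    # Read room numbers digit by digit when the keyword "房间号" precedes them.
    text = re.sub(r"(房间号)(\d+)",
                  lambda m: m.group(1) + read_digit_by_digit(m.group(2)), text)
    return text

print(normalize("房间号501@#"))   # -> 房间号五零一
```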
Fig. 3 is a flowchart of a speech synthesis model training method according to an embodiment of the present invention. As shown in fig. 3, the speech synthesis model training method includes the following steps:
step 310: acquiring a text sample, wherein the text sample is provided with voice feature marking information corresponding to at least two voice feature recognition tasks;
step 320: inputting a text sample into a word segmentation sub-model, and outputting a word vector which carries voice feature labeling information of at least two voice feature recognition tasks;
step 330: inputting the word vector into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, correspondingly outputting at least two groups of voice feature recognition results, and respectively adjusting model parameters of the at least two voice feature recognition submodels by using the at least two groups of voice feature recognition results until the at least two voice feature recognition submodels are converged;
step 340: and constructing to obtain a voice synthesis model according to the word segmentation submodel and the converged at least two voice feature recognition submodels.
The embodiment of the invention provides a multi-task joint training scheme for a speech synthesis model, which can improve the model training efficiency. The speech synthesis model obtained by training using the embodiment of the present invention can be used in the speech synthesis process shown in fig. 1.
In this embodiment, a supervised training scheme is adopted for training each speech feature recognition submodel, speech feature labeling is performed on a text sample corresponding to a speech feature recognition task to be realized by each speech feature recognition submodel, and speech feature labeling information corresponding to different speech feature recognition submodels is different.
In an alternative embodiment of the present invention, obtaining the text sample may include the following steps:
obtaining an original text sample;
and carrying out regularization processing on the text information of the original text sample to obtain the text sample.
For example, regularizing the text information of the original text sample removes garbled characters, non-standard symbols and the like, or replaces Chinese symbols with the corresponding English symbols. In addition, numbers are pronounced differently in different scenes, so a number can be rewritten into different Chinese characters according to keywords obtained from matching statistics, for example: "the room price is 423 yuan" becomes "the room price is four hundred and twenty-three yuan", and "room number 501" becomes "room number five zero one".
In an optional embodiment of the present invention, the word segmentation sub-model selects a TinyBERT model, and the TinyBERT model may refer to the above, which is not limited herein.
In this embodiment, the normalized text sample may first be split into characters; the characters are then converted into index values of a corresponding dictionary according to a self-built index dictionary, so as to construct a vector that the at least two subsequent speech feature recognition sub-models can process, and the TinyBERT model converts this input into word vectors that the model can process.
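A minimal sketch of this character-to-index conversion, assuming Python; the special tokens and the padding scheme are illustrative assumptions, and the subsequent TinyBERT vectorization step is not reproduced here.

```python
def build_index_dict(corpus):
    """Self-built index dictionary: one index per character, 0 reserved for padding."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for sentence in corpus:
        for ch in sentence:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(sentence, vocab, max_length):
    """Convert a normalized sentence into padded index values for the model."""
    ids = [vocab.get(ch, vocab["[UNK]"]) for ch in sentence]
    ids += [vocab["[PAD]"]] * (max_length - len(ids))   # pad shorter samples with 0
    return ids[:max_length]

vocab = build_index_dict(["房价是四百二十三元", "房间号五零一"])
print(encode("房间号五零一", vocab, max_length=12))
```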
In an optional embodiment of the present invention, the at least two voice feature recognition submodels may be a prosody pause recognition submodel and a polyphone recognition submodel, and the voice feature labeling information of the at least two voice feature recognition tasks is prosody pause position information corresponding to the prosody pause recognition submodel and polyphone labeling information corresponding to the polyphone recognition submodel.
In this case, in an alternative embodiment of the invention, prosodic pause prediction and polyphone recognition differ greatly: the former finds positions where a sentence break is needed, while the latter decides whether a character is a polyphone. Multi-task learning should satisfy the criterion of 'high cohesion and low coupling': the common part of the multi-task model should contribute as much as possible to every task, while the task-specific modules should remain independent and not influence each other. Since it is difficult in practice to keep all tasks perfectly parallel and non-interfering, interaction between the two modules is avoided as much as possible.
Prosodic pause prediction is a very critical step of a speech synthesis task; it can be regarded as a sequence labeling task that identifies the positions where the utterance needs to be broken. Polyphone recognition can be regarded as a classification task: given an input text, judge whether certain characters are polyphones and, if so, further judge which pronunciation is required. Therefore, for the prosody pause task (a sequence labeling problem), the hidden vectors (word vectors) produced by TinyBERT undergo dimension transformation through two fully connected layers, so that information loss is reduced as much as possible, and finally the transformed vectors are used for prosody pause learning with a conditional random field (CRF);
for the polyphone recognition problem, considering that a text sample may contain no polyphone at all, a special label is designed to represent the absence of a polyphone, so that a normal vector is still generated to participate in training; for normal polyphonic characters, the hidden vector corresponding to the polyphone position is extracted, and the polyphone category of that character is learned through a fully connected layer network.
Therefore, prosodic pause is a very critical step of the speech synthesis task, which can be regarded as a sequence labeling task and identifies the position where the speech needs to be cut off. Therefore, the prosody pause recognition sub-model is substantially a text sequence recognition model.
The task of polyphone recognition can be regarded as a classification task, namely, under the scene of inputting a text, whether certain characters are polyphones is judged, and if so, which polyphone is needed to be further judged. Therefore, the polyphonic character recognition submodel employs a classification model.
In an alternative embodiment of the present invention, the prosody pause recognition sub-model includes at least two serially connected fully connected (FC) layers, and a conditional random field module connected to the at least two serially connected fully connected layers;
under the condition that the word vectors are input into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, performing dimension transformation on the word vectors by utilizing at least two serially connected full-connection layers to obtain transformation vectors;
and inputting the transformation vector into a conditional random field module, and outputting the identified prosody pause position information.
The prosody pause recognition submodel in the embodiment is composed of at least two fully connected layers and a conditional random field which are connected in series, wherein each node of the fully connected layers is connected with all nodes of the previous layer and used for integrating the extracted features. Thus, for two adjacent fully-connected layers in series, the input of the next fully-connected layer is the output of the previous fully-connected layer.
The fully connected layers perform dimension transformation, which reduces information loss from the text sample and improves the accuracy of the prosody pause recognition sub-model.
A conditional random field (CRF) is a conditional probability distribution model of one set of output sequences given another set of input sequences. In the present embodiment, based on the input word vector sequence, the CRF can output the probability that a prosodic pause occurs between the two phonemes corresponding to two adjacent text elements, that is, the probability of a prosodic pause position between two adjacent phonemes.
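A hedged sketch of this prosody pause head, assuming PyTorch and the third-party pytorch-crf package; the ReLU activation and the number of pause tags are assumptions not stated in the patent, while the 312 and 256 dimensions follow the values given further below.

```python
import torch.nn as nn
from torchcrf import CRF   # assumption: the pytorch-crf package provides this CRF layer

class PausePredictor(nn.Module):
    def __init__(self, hidden=312, mid=256, n_tags=5):
        super().__init__()
        # Two serially connected fully connected layers of sizes (H x L) and (L x N).
        self.fc = nn.Sequential(nn.Linear(hidden, mid), nn.ReLU(), nn.Linear(mid, n_tags))
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, word_vectors, tags, mask):
        emissions = self.fc(word_vectors)              # (B, T, n_tags) dimension-transformed vectors
        return -self.crf(emissions, tags, mask=mask)   # negative log-likelihood for training

    def decode(self, word_vectors, mask):
        return self.crf.decode(self.fc(word_vectors), mask=mask)   # most likely pause label sequence
```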
In an alternative embodiment of the invention, the polyphonic character recognizer model includes a fully connected layer for recognizing the category of the learned polyphonic characters.
In an optional embodiment of the present invention, the adjusting the model parameters of the at least two speech feature recognition submodels using the at least two sets of speech feature recognition results respectively includes:
calculating the total loss value of the speech synthesis model, and the proportion of the loss values of at least two groups of speech feature recognition results to the total loss value respectively, and taking the proportion as weight;
and calculating gradient values according to the weights, and adjusting model parameters of at least two voice feature recognition submodels according to the gradient values.
In this embodiment, back-propagation needs to combine the loss values of the two tasks. Because the two tasks differ, their loss values may differ greatly in magnitude, and adding them directly may make the returned gradient inconsistent with the rate at which each task's loss is currently decreasing; the direction in which each loss changes during gradient descent therefore has to be considered. For this reason, an adaptive loss calculation is designed, as shown in formula (1):
Loss = Σ_{i=1..k} (loss_i / Σ_{j=1..k} loss_j) · loss_i        (1)
where k is the number of tasks and loss_i denotes the loss value of task i. The formula shows that, when the loss is apportioned, each task's contribution is its current share of the total loss. In short, when the loss values are used for gradient back-propagation, each task has its own loss range and descent trend, and direct addition changes that range; to ensure that the correct gradient is returned to each task after the losses are combined, the proportion of each task's loss to the total is taken as its weight, and the weighted losses are accumulated to form the value whose gradient is propagated back. This improves the overall accuracy of the model.
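A minimal sketch of this adaptive combination, assuming PyTorch; detaching the weights so that the proportions act as constants during back-propagation is an assumption, since the patent does not state whether the weights themselves receive gradients.

```python
import torch

def combine_losses(losses):
    """Weight each task's loss by its share of the current total, then sum (formula (1))."""
    total = sum(l.detach() for l in losses)
    weights = [l.detach() / total for l in losses]   # per-task proportions treated as constants
    return sum(w * l for w, l in zip(weights, losses))

pause_loss = torch.tensor(2.0, requires_grad=True)
poly_loss = torch.tensor(0.5, requires_grad=True)
combine_losses([pause_loss, poly_loss]).backward()   # gradients scaled by each task's share
```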
Based on the scheme proposed in this embodiment, the TinyBERT word segmentation sub-model converts the input text into a three-dimensional tensor of shape (batch_size, max_length, hidden_size), written here as (B×C×H), where batch_size is the batch size, max_length is the length of the longest sequence in the batch (shorter sequences are padded with 0), and hidden_size is the hidden-layer dimension after TinyBERT vectorization. In experiments, batch_size is usually set to 256, max_length varies with the input length, and hidden_size is 312.
The main structure of the audio synthesis model branches: the vectors obtained after word segmentation are sent separately into the designed prosody pause recognition sub-model and the polyphone recognition sub-model. For the prosody pause recognition task, the obtained vector passes through two fully connected layers of sizes (H×L) and (L×N), where L is a fully connected layer parameter set to 256 and N is the number of polyphone pinyins. After this intermediate vector is obtained, it is fed, together with the input mask matrix and label matrix, into a conditional random field (CRF) to obtain the final prosody pause output.
For the polyphone recognition task, the polyphone position in each batch is used to find the corresponding embedding word vector; the polyphone vectors are extracted and concatenated along the batch dimension, and for samples without a polyphone an all-zero vector of length H is concatenated instead. The concatenated vectors are finally fed into a fully connected layer of size (H×M) to obtain the polyphone category output vectors.
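A hedged sketch of this extraction-and-classification step, assuming PyTorch; handling a single polyphone position per sample and the value of M are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolyphoneClassifier(nn.Module):
    def __init__(self, hidden=312, n_pinyin=1500):
        super().__init__()
        self.fc = nn.Linear(hidden, n_pinyin)          # the (H x M) fully connected layer

    def forward(self, word_vectors, positions):
        # word_vectors: (B, T, H); positions: one polyphone index per sample, or None if absent
        rows = []
        for b, pos in enumerate(positions):
            if pos is None:
                rows.append(word_vectors.new_zeros(word_vectors.size(-1)))   # all-zero vector of length H
            else:
                rows.append(word_vectors[b, pos])                            # embedding at the polyphone position
        return self.fc(torch.stack(rows))                                    # (B, M) polyphone category logits

head = PolyphoneClassifier()
h = torch.randn(2, 16, 312)
print(head(h, [3, None]).shape)   # torch.Size([2, 1500])
```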
Alternative embodiments of the present invention use text samples that are permitted by law or authorized by the user. The dialogues in the text samples can be transcribed from daily order data by automatic speech recognition (ASR) technology, derived statistically, and checked by specialized personnel.
Because real scenarios contain a large amount of simple English, for example "WIFI" and "big bed room A", a Chinese-English phoneme table can be constructed: Chinese is converted directly into its initials and finals, while English, apart from the common words of the real scenario that are converted through the CMU dictionary, is converted into capital letters so that it can subsequently be pronounced letter by letter.
During model training, the embodiments of the invention train the polyphone recognition sub-model and the prosody pause recognition sub-model simultaneously. When training the multi-task model, the batch size is set to 256, the model input is text, and the model output is a text vector with polyphone categories and pause positions. The loss function comprises two parts, a polyphone recognition loss and a prosody pause prediction loss; both use cross-entropy losses together with the adaptive loss strategy, calculated as shown in formula (1). The optimizer is Adam; after every several training iterations the model is evaluated and the loss decrease is observed, until the loss no longer decreases.
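The joint training loop can be sketched as follows, assuming PyTorch; the dataloader format, the heads' call signatures and the learning rate are illustrative assumptions, while the cross-entropy losses, the adaptive weighting and the Adam optimizer follow the description above. pause_head is assumed here to return per-position logits rather than a CRF loss.

```python
import torch
import torch.nn as nn

def train(encoder, pause_head, poly_head, dataloader, epochs=10, lr=1e-4):
    params = list(encoder.parameters()) + list(pause_head.parameters()) + list(poly_head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for token_ids, pause_tags, poly_positions, poly_tags in dataloader:
            h = encoder(token_ids)                                       # shared word vectors (B, T, H)
            pause_loss = ce(pause_head(h).transpose(1, 2), pause_tags)   # per-position pause labels
            poly_loss = ce(poly_head(h, poly_positions), poly_tags)      # per-sample polyphone class
            total = (pause_loss + poly_loss).detach()
            loss = (pause_loss.detach() / total) * pause_loss \
                 + (poly_loss.detach() / total) * poly_loss              # adaptive weighting, formula (1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # evaluate periodically and stop once the validation loss no longer decreases
```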
Fig. 4 is a block diagram of an embodiment of a speech synthesis apparatus of the present invention. The speech synthesis apparatus of the present invention, as shown in fig. 4, includes but is not limited to:
a first obtaining module 410, obtaining an input text;
the speech feature recognition module 420 is configured to input the input text into a speech synthesis model, perform word segmentation on the input text by using a word segmentation sub-model in the speech synthesis model to obtain word vectors, perform speech feature recognition on the word vectors by using at least two speech feature recognition sub-models in the speech synthesis model, and correspondingly obtain at least two groups of speech features;
and an audio synthesis module 430 for converting the input text into audio output according to at least two groups of speech features.
The implementation principle of the above modules is described in the related description of the speech synthesis method, and will not be described herein again.
In the speech synthesis apparatus provided by the invention, the speech synthesis task comprises a plurality of speech feature recognition subtasks, and combining these speech feature recognition tasks within the speech synthesis task improves speech synthesis efficiency.
FIG. 5 is a block diagram of a speech synthesis model training apparatus according to an embodiment of the present invention. The speech synthesis model training device of the present invention, as shown in fig. 5, includes but is not limited to:
a second obtaining module 510, configured to obtain a text sample, where the text sample has speech feature labeling information corresponding to at least two speech feature recognition tasks;
the word segmentation module 520 inputs the text sample into a word segmentation sub-model and outputs a word vector, wherein the word vector carries voice feature labeling information of at least two voice feature recognition tasks;
the model training module 530 inputs the word vectors into at least two voice feature recognition submodels respectively corresponding to the at least two voice feature recognition tasks, correspondingly outputs at least two groups of voice feature recognition results, and respectively adjusts model parameters of the at least two voice feature recognition submodels by using the at least two groups of voice feature recognition results until the at least two voice feature recognition submodels converge;
and the model construction module 540 is used for constructing a voice synthesis model according to the word segmentation sub-model and the converged at least two voice feature recognition sub-models.
The implementation principle of the above modules is described in the speech synthesis model training method, and will not be described herein again.
The speech synthesis model training device provided by the embodiment of the invention can provide a multi-task joint training scheme, so that the model training efficiency can be improved.
In this embodiment, a supervised training scheme is adopted for training each speech feature recognition submodel, speech feature labeling is performed on a text sample corresponding to a speech feature recognition task to be realized by each speech feature recognition submodel, and speech feature labeling information corresponding to different speech feature recognition submodels is different.
Optionally, the model training module 530 is specifically configured to:
calculating the total loss value of the speech synthesis model, and the proportion of the loss values of at least two groups of speech feature recognition results to the total loss value respectively, and taking the proportion as weight;
and calculating gradient values according to the weights, and adjusting model parameters of at least two voice feature recognition submodels according to the gradient values.
The embodiments of the invention also provide an electronic device, which comprises a processor and a memory storing executable instructions of the processor, wherein the processor is configured to perform the steps of the speech synthesis method or the speech synthesis model training method by executing the executable instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "platform."
Fig. 6 is a schematic structural diagram of a speech synthesis apparatus or a speech synthesis model training apparatus of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including the memory unit 620 and the processing unit 610), a display unit 640, etc.
Wherein the storage unit stores program code which is executable by the processing unit 610 such that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention as described in the above-mentioned speech synthesis method or speech synthesis model training method section of the present specification. For example, processing unit 610 may perform the steps as shown in fig. 1-3.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)621 and/or a cache memory unit 622, and may further include a read only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: a processing system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 600 may also communicate with one or more external devices 670 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650.
Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
Embodiments of the present invention also provide a computer-readable storage medium for storing a program which, when executed, implements the steps of the above-mentioned speech synthesis method or speech synthesis model training method. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to carry out the steps according to the various exemplary embodiments of the invention described in the speech synthesis method or speech synthesis model training method section above.
According to the program product for realizing the method, the portable compact disc read only memory (CD-ROM) can be adopted, the program code is included, and the program product can be operated on terminal equipment, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out processes of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (13)
1. A method of speech synthesis, comprising:
acquiring an input text;
inputting the input text into a voice synthesis model, performing word segmentation on the input text by using a word segmentation sub-model in the voice synthesis model to obtain word vectors, and performing voice feature recognition on the word vectors by using at least two voice feature recognition sub-models in the voice synthesis model to correspondingly obtain at least two groups of voice features;
and converting the input text into audio output according to the at least two groups of voice characteristics.
2. The speech synthesis method of claim 1, wherein the word segmentation sub-model is derived based on a TinyBERT model.
3. The speech synthesis method of claim 1, wherein the at least two speech feature recognition submodels are a prosody pause recognition submodel and a polyphonic character recognition submodel, respectively, wherein the speech feature output by the prosody pause recognition submodel is a prosody pause location recognition result, and the speech feature output by the polyphonic character recognition submodel is a polyphonic character recognition result.
4. The speech synthesis method of claim 1, wherein the obtaining input text comprises:
acquiring an original text;
and carrying out regularization processing on the original text to obtain the input text.
5. A method for training a speech synthesis model, comprising:
acquiring a text sample, wherein the text sample is provided with voice feature marking information corresponding to at least two voice feature recognition tasks;
inputting the text sample into a word segmentation sub-model, and outputting a word vector which carries the voice feature labeling information of the at least two voice feature recognition tasks;
inputting a word vector into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, correspondingly outputting at least two groups of voice feature recognition results, and respectively adjusting model parameters of the at least two voice feature recognition submodels by using the at least two groups of voice feature recognition results until the at least two voice feature recognition submodels are converged;
and constructing to obtain a voice synthesis model according to the word segmentation submodel and the converged at least two voice feature recognition submodels.
6. The method of claim 5, wherein obtaining text samples comprises:
obtaining an original text sample;
and performing regularization processing on the text information of the original text sample to obtain the text sample.
7. The method of claim 5, wherein the at least two speech feature recognition submodels are a prosody pause recognition submodel and a polyphone recognition submodel, respectively, and the speech feature labeling information of the at least two speech feature recognition tasks are prosody pause location information corresponding to the prosody pause recognition submodel and polyphone labeling information corresponding to the polyphone recognition submodel, respectively.
8. The method of training a speech synthesis model according to claim 7, wherein the prosody pause recognition sub-model comprises at least two serially connected fully connected layers, and a conditional random field module connected to the at least two serially connected fully connected layers;
under the condition that the word vector is input into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, at least two groups of voice feature recognition results are correspondingly output, and the method comprises the following steps:
performing dimension transformation on the word vectors by using the at least two serially connected full-connection layers to obtain transformation vectors;
and inputting the transformation vector into the conditional random field module, and outputting the identified prosody pause position information.
9. The method of claim 5, wherein the using the at least two sets of speech feature recognition results to adjust model parameters of the at least two speech feature recognition submodels respectively comprises:
calculating the overall loss value of the speech synthesis model and the ratio of the loss values of the at least two groups of speech feature recognition results to the overall loss value respectively, and taking the ratio as a weight;
and calculating gradient values according to the weights, and adjusting model parameters of the at least two voice feature recognition submodels according to the gradient values.
10. A speech synthesis apparatus, comprising:
the first acquisition module is used for acquiring an input text;
the voice feature recognition module is used for inputting the input text into a voice synthesis model, performing word segmentation on the input text by using a word segmentation submodel in the voice synthesis model to obtain a word vector, and performing voice feature recognition on the word vector by using at least two voice feature recognition submodels in the voice synthesis model respectively to correspondingly obtain at least two groups of voice features;
and the audio synthesis module is used for converting the input text into audio output according to the at least two groups of voice characteristics.
11. A speech synthesis model training apparatus, comprising:
the second acquisition module is used for acquiring a text sample, and the text sample is provided with voice feature marking information corresponding to at least two voice feature recognition tasks;
the word segmentation module is used for inputting the text sample into a word segmentation sub-model and outputting a word vector, wherein the word vector carries the voice feature labeling information of the at least two voice feature recognition tasks;
the model training module inputs the word vectors into at least two voice feature recognition submodels respectively corresponding to at least two voice feature recognition tasks, correspondingly outputs at least two groups of voice feature recognition results, and respectively adjusts model parameters of the at least two voice feature recognition submodels by using the at least two groups of voice feature recognition results until the at least two voice feature recognition submodels are converged;
and the model construction module is used for constructing a voice synthesis model according to the word segmentation sub-model and the converged at least two voice feature recognition sub-models.
12. An electronic device, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the speech synthesis method of any one of claims 1 to 4, or the speech synthesis model training method of any one of claims 5 to 9, via execution of the executable instructions.
13. A computer-readable storage medium storing a program which, when executed by a processor, performs the steps of the speech synthesis method of any one of claims 1 to 4, or the speech synthesis model training method of any one of claims 5 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210029807.2A CN114360488A (en) | 2022-01-12 | 2022-01-12 | Speech synthesis, speech synthesis model training method, apparatus and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210029807.2A CN114360488A (en) | 2022-01-12 | 2022-01-12 | Speech synthesis, speech synthesis model training method, apparatus and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114360488A true CN114360488A (en) | 2022-04-15 |
Family
ID=81108568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210029807.2A Pending CN114360488A (en) | 2022-01-12 | 2022-01-12 | Speech synthesis, speech synthesis model training method, apparatus and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114360488A (en) |
- 2022-01-12: Application CN202210029807.2A filed in China; published as CN114360488A; status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3680894B1 (en) | Real-time speech recognition method and apparatus based on truncated attention, device and computer-readable storage medium | |
CN111145728B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN113205817B (en) | Speech semantic recognition method, system, device and medium | |
CN109523989B (en) | Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus | |
CN111402861B (en) | Voice recognition method, device, equipment and storage medium | |
CN111477216A (en) | Training method and system for pronunciation understanding model of conversation robot | |
CN112259089B (en) | Speech recognition method and device | |
CN115309877B (en) | Dialogue generation method, dialogue model training method and device | |
CN112397056B (en) | Voice evaluation method and computer storage medium | |
CN112420050B (en) | Voice recognition method and device and electronic equipment | |
CN112116907A (en) | Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium | |
CN112489634A (en) | Language acoustic model training method and device, electronic equipment and computer medium | |
CN113268989B (en) | Multi-tone word processing method and device | |
CN113362801A (en) | Audio synthesis method, system, device and storage medium based on Mel spectrum alignment | |
CN117153142A (en) | Speech signal synthesis method and device, electronic equipment and storage medium | |
CN112257432A (en) | Self-adaptive intention identification method and device and electronic equipment | |
CN114373445B (en) | Voice generation method and device, electronic equipment and storage medium | |
CN115240712A (en) | Multi-mode-based emotion classification method, device, equipment and storage medium | |
CN114360488A (en) | Speech synthesis, speech synthesis model training method, apparatus and storage medium | |
CN111583902B (en) | Speech synthesis system, method, electronic device and medium | |
CN113990293A (en) | Voice recognition method and device, storage medium and electronic equipment | |
CN114550692A (en) | Text processing and training method, device, equipment and storage medium of model thereof | |
CN114238605A (en) | Automatic conversation method and device for intelligent voice customer service robot | |
CN113470617A (en) | Speech recognition method, electronic device and storage device | |
CN114898754B (en) | Decoding image generation method, decoding image generation device, speech recognition method, speech recognition device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||