CN112863480A - Method and device for optimizing end-to-end speech synthesis model and electronic equipment - Google Patents

Info

Publication number
CN112863480A
CN112863480A
Authority
CN
China
Prior art keywords
preset
text
weight
synthesis model
speech synthesis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011530802.5A
Other languages
Chinese (zh)
Other versions
CN112863480B (en)
Inventor
李睿端
李健
陈明
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd filed Critical Beijing Sinovoice Technology Co Ltd
Priority to CN202011530802.5A priority Critical patent/CN112863480B/en
Publication of CN112863480A publication Critical patent/CN112863480A/en
Application granted granted Critical
Publication of CN112863480B publication Critical patent/CN112863480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers

Abstract

The invention provides a method and an apparatus for optimizing an end-to-end speech synthesis model, an electronic device, and a storage medium. The method includes: performing, according to a first preset rule, a first soft masking on the phonemes contained in the text input into the end-to-end speech synthesis model to generate a second text; encoding the second text with a phoneme encoder, and then performing prediction processing on the encoded second text with a variable information predictor to obtain a first output; performing a second soft masking on the first output according to a second preset rule; and inputting the first output after the second soft masking into a preset decoder for decoding to obtain a mel spectrum. By adding soft masking both at the input of the end-to-end speech synthesis model and at the input of the decoder, the method increases data perturbation and can thereby improve the robustness of the end-to-end speech synthesis model.

Description

Method and device for optimizing end-to-end speech synthesis model and electronic equipment
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a method and an apparatus for optimizing an end-to-end speech synthesis model, and an electronic device.
Background
Generally, as shown in fig. 1, a TTS (Text-To-Speech) system is divided into a text analysis module (e.g., text regularization, polyphone disambiguation), a prosody prediction module, a duration model, an acoustic model, and a vocoder. The processed text passes through the prosody prediction module, which outputs text annotated with prosody symbols, followed by steps such as grapheme-to-phoneme conversion. Mainstream end-to-end models currently merge the duration model and the acoustic model into a single model: the front end converts the text into phoneme information, the end-to-end model takes the phonemes as input and generates a mel spectrum, and an external vocoder then converts the acoustic feature information into audio. In the field of speech synthesis, commonly used end-to-end models fall into two categories: autoregressive models and fully parallel models.
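The conventional pipeline described above can be sketched as a chain of stages. This is an illustrative sketch only, not part of the original disclosure; every function name here is a hypothetical placeholder for the corresponding module:

```python
def tts_pipeline(text, text_analysis, prosody, end_to_end_model, vocoder):
    """Toy TTS pipeline mirroring the stages described in the background.

    The front end normalizes the text and adds prosody marks before phoneme
    conversion; the end-to-end model replaces the separate duration and
    acoustic models and emits a mel spectrum, which the vocoder turns into
    audio. All callables are stand-ins supplied by the caller.
    """
    phonemes = prosody(text_analysis(text))  # text analysis + prosody prediction
    mel = end_to_end_model(phonemes)         # duration + acoustic model, fused
    return vocoder(mel)                      # mel spectrum -> audio


# Usage with trivial stand-in stages: "audio" here is just the frame count.
audio = tts_pipeline(
    "hi",
    text_analysis=lambda t: t.lower(),
    prosody=lambda t: list(t),
    end_to_end_model=lambda p: [[0.0]] * len(p),  # one fake mel frame per phoneme
    vocoder=lambda m: len(m),
)
```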
TTS technology generates audio from text. Speech synthesis has gone through three main development stages: concatenative synthesis, parametric synthesis, and end-to-end synthesis. End-to-end synthesis is now the mainstream approach in industry, because speech synthesized end to end is largely free of a machine-like quality, has high naturalness, and requires comparatively little recorded data. However, end-to-end synthesis also has problems related to the black-box structure of end-to-end models. For example, in models with an explicit duration module, the decoder is prone to overfitting erroneous duration information, which degrades the quality of the final synthesized speech. Existing end-to-end models therefore have poor robustness, and a solution for improving their robustness is urgently needed.
Disclosure of Invention
In view of the above, embodiments of the present invention are proposed to provide a method and an apparatus for optimizing an end-to-end speech synthesis model, and an electronic device, which overcome the above problems or at least partially solve the above problems.
In a first aspect, an embodiment of the present invention discloses a method for optimizing an end-to-end speech synthesis model, including: performing, according to a first preset rule, a first soft masking on phonemes contained in a text input into the end-to-end speech synthesis model to generate a second text;
encoding the second text with a phoneme encoder, and then performing prediction processing on the encoded second text with a variable information predictor to obtain a first output;
performing a second soft masking on the first output according to a second preset rule;
and inputting the first output after the second soft masking into a preset decoder for decoding to obtain a mel spectrum.
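The four steps above can be sketched end to end as a toy program. This is an illustrative sketch, not the actual model: the encoder, predictor, and decoder are stand-in functions, and the mask probability and seed values are arbitrary assumptions:

```python
import random

MASK = "<mask>"

def soft_mask_text(phonemes, mask_prob=0.15, rng=None):
    # Step 1: first soft masking, randomly occluding input phonemes.
    rng = rng or random.Random(0)
    return [MASK if rng.random() < mask_prob else p for p in phonemes]

def encode_and_predict(phonemes):
    # Step 2: stand-in for the phoneme encoder plus variable information
    # predictor; each phoneme becomes a (token, predicted_duration) pair.
    return [(p, 1 if p == MASK else 2) for p in phonemes]

def soft_mask_output(first_output, mask_prob=0.15, rng=None):
    # Step 3: second soft masking, applied to the predictor's output.
    rng = rng or random.Random(1)
    return [((MASK, d) if rng.random() < mask_prob else (p, d))
            for p, d in first_output]

def decode_to_mel(masked_output):
    # Step 4: stand-in for the preset decoder; expand each token by its
    # predicted duration, as a length-regulated mel frame sequence would be.
    return [p for p, d in masked_output for _ in range(d)]

second_text = soft_mask_text(["sil", "n", "i", "h", "ao", "sil"])
first_output = encode_and_predict(second_text)
mel = decode_to_mel(soft_mask_output(first_output))
```

With the fixed seeds, step 1 happens to mask nothing while step 3 masks the first position, illustrating that the two soft maskings act independently on the model input and the decoder input.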
Optionally, the step of performing, according to a first preset rule, a first soft masking on phonemes contained in the text input into the end-to-end speech synthesis model to generate a second text includes:
masking phonemes contained in the text input into the end-to-end speech synthesis model according to a first preset weight in the end-to-end speech synthesis model;
predicting, through a detection network, a first probability that the phoneme at each position in the text is erroneous;
determining a first correction weight according to the first probability and the first preset weight, wherein the first correction weight is used as the first preset weight for the next soft masking;
and for each position in the text, masking the phoneme at that position according to the first correction weight and a preset mask feature.
Optionally, the step of determining a first correction weight according to the first probability and the first preset weight includes:
determining a product of the first probability and the first preset weight as the first correction weight.
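The correction-weight rule above, together with the weighted blend of input and mask features discussed in the detailed description, might be sketched as follows. The helper names are hypothetical, and the blend form follows the soft-mask style the description attributes to NLP spelling correction:

```python
def correction_weight(error_prob, preset_weight):
    # First correction weight: the product of the detection network's error
    # probability and the current preset weight. It also serves as the
    # preset weight for the next round of soft masking.
    return error_prob * preset_weight

def soft_mask_embedding(input_vec, mask_vec, w):
    # Soft masking as a weighted blend of the input feature and the preset
    # mask feature: weight w goes to the mask, (1 - w) to the input.
    return [w * m + (1.0 - w) * x for x, m in zip(input_vec, mask_vec)]
```

A higher predicted error probability thus pushes the blended vector closer to the mask feature, while a confident position stays close to its original input.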
Optionally, the step of performing a second soft masking on the first output according to a second preset rule includes:
masking the first output according to a second preset weight in the end-to-end speech synthesis model;
predicting, through a detection network, a second probability that the phoneme at each position in the first output is erroneous;
determining a second correction weight according to the second probability and the second preset weight, wherein the second correction weight is used as the second preset weight for the next soft masking;
and for each position in the first output, masking the phoneme at that position according to the second correction weight and a preset mask feature.
Optionally, the variable information predictor comprises: at least one of a duration predictor, a pitch predictor, and an energy predictor.
In a second aspect, an embodiment of the present invention discloses an apparatus for optimizing an end-to-end speech synthesis model, the apparatus being applied to the end-to-end speech synthesis model, wherein the apparatus comprises:
a first masking module, configured to perform, according to a first preset rule, a first soft masking on phonemes contained in a text input into the end-to-end speech synthesis model to generate a second text;
a first processing module, configured to encode the second text with a phoneme encoder, and then perform prediction processing on the encoded second text with a variable information predictor to obtain a first output;
a second masking module, configured to perform a second soft masking on the first output according to a second preset rule;
and a second processing module, configured to input the first output after the second soft masking into a preset decoder for decoding to obtain a mel spectrum.
Optionally, the first masking module comprises:
a first submodule, configured to mask phonemes contained in a text input into the end-to-end speech synthesis model according to a first preset weight in the end-to-end speech synthesis model;
a second submodule, configured to predict, through a detection network, a first probability that the phoneme at each position in the text is erroneous;
a third submodule, configured to determine a first correction weight according to the first probability and the first preset weight, where the first correction weight is used as the first preset weight for the next soft masking;
and a fourth submodule, configured to mask, for each position in the text, the phoneme at that position according to the first correction weight and a preset mask feature.
Optionally, the third submodule is specifically configured to:
determine the product of the first probability and the first preset weight as the first correction weight.
Optionally, the second masking module comprises:
a first submodule, configured to mask the first output according to a second preset weight in the end-to-end speech synthesis model;
a second submodule, configured to predict, through a detection network, a second probability that the phoneme at each position in the first output is erroneous;
a third submodule, configured to determine a second correction weight according to the second probability and the second preset weight, where the second correction weight is used as the second preset weight for the next soft masking;
and a fourth submodule, configured to mask, for each position in the first output, the phoneme at that position according to the second correction weight and a preset mask feature.
In a third aspect, an embodiment of the present invention discloses an electronic device, including: one or more processors; and one or more machine-readable media having instructions stored thereon; the instructions, when executed by the one or more processors, cause the processors to perform a method of optimizing an end-to-end speech synthesis model as claimed in any one of the preceding claims.
In a fourth aspect, an embodiment of the present invention discloses a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for optimizing an end-to-end speech synthesis model according to any one of the above.
In the embodiments of the present invention, a first soft masking is performed, according to a first preset rule, on the phonemes contained in the text input into the end-to-end speech synthesis model to generate a second text; the second text is encoded with a phoneme encoder, and the encoded second text is then predicted with a variable information predictor to obtain a first output; a second soft masking is performed on the first output according to a second preset rule; and the first output after the second soft masking is input into a preset decoder for decoding to obtain a mel spectrum. Because soft masking is added both at the input of the end-to-end speech synthesis model and at the input of the decoder, data perturbation is increased and the robustness of the end-to-end speech synthesis model can be improved.
Drawings
FIG. 1 is a schematic diagram of a TTS model;
FIG. 2 is a flow chart of the steps of a method for optimizing an end-to-end speech synthesis model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a FastSpeech model according to an embodiment of the present invention;
fig. 4 is a block diagram of an end-to-end speech synthesis model optimization apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 2, a flowchart illustrating the steps of a method for optimizing an end-to-end speech synthesis model according to an embodiment of the present invention is shown.
The end-to-end speech synthesis model optimization method of the embodiment of the invention can comprise the following steps:
step 101: and according to a first preset rule, performing first soft occlusion on phonemes contained in the text in the input end-to-end speech synthesis model to generate a second text.
The end-to-end speech synthesis model contains a duration module that must be trained, and training it usually requires other models or forced-alignment tools to generate the supervision information. If that information is wrong, the errors propagate through the rest of the pipeline; in addition, the decoder is prone to overfitting, so overfitting to erroneous information has an even larger impact on subsequent training. A first soft mask is therefore added to the input text to occlude part of the information, thereby improving the robustness of the end-to-end speech synthesis model.
The most critical alignment information in the duration module's training data is typically generated by forced-alignment tools. Because it is produced automatically rather than labeled manually, its accuracy is imperfect, and an additional mechanism is needed to reduce the impact of such erroneous information on model robustness. One exemplary way to add soft masking follows the BERT masking scheme: randomly mask 15% of the input positions and train the model to correctly predict the masked inputs (of the selected positions, 80% are replaced with a mask token, 10% are randomly replaced with other inputs, and the remaining 10% are left unchanged). The model is thereby forced to use context; as long as the masking probability is kept at an appropriate level, occluding inputs with some probability does not substantially harm the model's expressive capacity, and the model gains strong robustness. The soft-mask added in the embodiments of the present application follows a similar principle. The soft-mask was originally used in NLP (Natural Language Processing) for spelling-correction tasks, and consists of a detection network and a BERT-based correction network. The detection network predicts the probability that the character at each position is erroneous, and that predicted probability is then used to perform the soft masking. The soft-mask is in fact a weighting of the input and the mask: the probability output by the detection network serves as the weight, and the mask feature and the input feature are combined accordingly. The soft-masked information is then fed into the correction network, which completes the error correction.
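The BERT-style masking scheme described above (15% of positions selected; 80%/10%/10% treatment of the selected positions) can be sketched as follows. This is illustrative only; the mask token string and helper name are assumptions, not part of the disclosure:

```python
import random

def bert_style_mask(tokens, vocab, mask_token="[MASK]", rng=None):
    """Select ~15% of positions; of those, replace 80% with the mask token,
    10% with a random token from `vocab`, and leave 10% unchanged.

    Returns the corrupted sequence and a dict mapping each selected position
    to its original token (the prediction targets during training).
    """
    rng = rng or random.Random(42)
    out, targets = list(tokens), {}
    for i in range(len(tokens)):
        if rng.random() >= 0.15:
            continue                    # position not selected for masking
        targets[i] = tokens[i]          # model must recover this token
        roll = rng.random()
        if roll < 0.8:
            out[i] = mask_token         # 80%: hard mask
        elif roll < 0.9:
            out[i] = rng.choice(vocab)  # 10%: random replacement
        # else 10%: keep the original token unchanged
    return out, targets
```

Even the positions left unchanged still carry a prediction target, which is what forces the model to attend to context rather than trust any single input.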
In the embodiments of the present invention, the soft-mask is applied to the TTS model input, and in particular to the result of the duration module, because the duration information serves as the input of the decoder and therefore directly affects decoder overfitting.
Step 102: encode the second text with a phoneme encoder, and then perform prediction processing on the encoded second text with a variable information predictor to obtain a first output.
In the embodiments of the present application, the end-to-end speech synthesis model is described as a FastSpeech model with reference to fig. 3. Models such as FastSpeech differ from autoregressive models in that the latter must generate the mel spectrum frame by frame and cannot fully exploit the alignment information between text and speech; this is also why speech synthesized by autoregressive models lacks controllability. FastSpeech uses a feed-forward network based on the Transformer and attention, and by taking the text-speech alignment information into account it gives better control over text synthesis. The component with the greatest influence on that alignment information is the duration module in the model, which must be trained on it; if the information is not accurate enough, the training of the subsequent decoder suffers greatly, and overfitting to erroneous information can even degrade the final synthesized speech quality. The soft-mask is therefore introduced, which is equivalent to adding a certain proportion of noise to the input of the model and the input of the decoder, thereby increasing the robustness of the model.
The duration module (duration predictor) in the end-to-end model predicts how many frames correspond to each phoneme. The training data includes alignment information, which can be understood simply as assigning each phoneme in the label text some number of audio frames; for example, in 'hello (sil i h ao sil)', the phoneme 'i' may be aligned to frames 5 through 10 of the audio. Taking the FastSpeech model of fig. 3 as an example, duration prediction takes place in the duration predictor module, whose purpose is to expand the encoder output to a length consistent with the mel spectrum. The encoder output first enters the duration predictor, which returns the number of times each vector needs to be replicated. Soft-masks are added at the input of the FastSpeech model and at the input of the decoder respectively, so that the model no longer depends on single-frame data but refers as much as possible to the information of the preceding and following frames. Following the soft-mask approach in BERT, the probability can be predicted globally and then used to weight the input; this probability is similar to a transition probability, i.e., the probability that the current frame appears at the current position.
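The duration predictor's expansion of the encoder output described above amounts to a length regulator. Here is a minimal sketch, operating on plain lists rather than hidden-state tensors, purely for illustration:

```python
def length_regulate(encoder_outputs, durations):
    """Repeat each encoder vector by its predicted duration so the expanded
    sequence length matches the target mel-spectrogram length."""
    expanded = []
    for vec, dur in zip(encoder_outputs, durations):
        expanded.extend([vec] * dur)  # replicate this vector `dur` times
    return expanded

# Usage: two encoder vectors with predicted durations 2 and 3 frames
# expand into a 5-element sequence.
frames = length_regulate(["h1", "h2"], [2, 3])
```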
Step 103: perform a second soft masking on the first output according to a second preset rule.
The first preset rule and the second preset rule may be the same or different, as long as the soft masking of the first output is achieved.
Step 104: input the first output after the second soft masking into a preset decoder for decoding to obtain a mel spectrum.
The preset decoder may be a mel-spectrum decoder.
It should be noted that the optimization method provided in the embodiments of the present application is applicable not only to the FastSpeech model mentioned in the above example but also to other end-to-end speech synthesis models, such as Transformer-TTS, to which a soft-mask mechanism may likewise be added. Unlike in the FastSpeech model, the soft-mask in Transformer-TTS is applied to the encoder, which is related to the structure of that model.
According to the method for optimizing an end-to-end speech synthesis model provided by the embodiments of the present invention, a first soft masking is performed, according to a first preset rule, on the phonemes contained in the text input into the end-to-end speech synthesis model to generate a second text; the second text is encoded with a phoneme encoder, and the encoded second text is then predicted with a variable information predictor to obtain a first output; a second soft masking is performed on the first output according to a second preset rule; and the first output after the second soft masking is input into a preset decoder for decoding to obtain a mel spectrum. Because soft masking is added both at the input of the end-to-end speech synthesis model and at the input of the decoder, data perturbation is increased and the robustness of the end-to-end speech synthesis model can be improved.
In an optional embodiment, the step of performing, according to a first preset rule, a first soft masking on phonemes contained in a text input into the end-to-end speech synthesis model to generate a second text includes:
first, masking phonemes contained in the text input into the end-to-end speech synthesis model according to a first preset weight in the end-to-end speech synthesis model;
second, predicting, through a detection network, a first probability that the phoneme at each position in the text is erroneous;
third, determining a first correction weight according to the first probability and the first preset weight, wherein the first correction weight is used as the first preset weight for the next soft masking;
and finally, for each position in the text, masking the phoneme at that position according to the first correction weight and the preset mask feature.
In this optional embodiment, the preset weight in the model is adjusted every time text is input, so that the robustness of the end-to-end speech synthesis model is gradually improved in the training process.
In an optional embodiment, when the first correction weight is determined according to the first probability and the first preset weight, the product of the first probability and the first preset weight is determined as the first correction weight.
In an optional embodiment, the second soft masking may be performed on the first output according to the second preset rule as follows:
masking the first output according to a second preset weight in the end-to-end speech synthesis model, and predicting, through the detection network, a second probability that the phoneme at each position in the first output is erroneous;
determining a second correction weight according to the second probability and the second preset weight, wherein the second correction weight is used as the second preset weight for the next soft masking;
and for each position in the first output, masking the phoneme at that position according to the second correction weight and the preset mask feature.
In this optional embodiment, by soft-masking the first output that is input into the decoder, the robustness of the decoder can be gradually improved during training.
In an alternative embodiment, the variable information predictor includes: at least one of a duration predictor, a pitch predictor, and an energy predictor.
Referring to fig. 4, a block diagram of an end-to-end speech synthesis model optimization apparatus according to an embodiment of the present invention is shown.
The end-to-end speech synthesis model optimization device of the embodiment of the invention can comprise the following modules:
a first masking module 401, configured to perform, according to a first preset rule, a first soft masking on phonemes contained in a text input into the end-to-end speech synthesis model to generate a second text;
a first processing module 402, configured to encode the second text with a phoneme encoder, and then perform prediction processing on the encoded second text with a variable information predictor to obtain a first output;
a second masking module 403, configured to perform a second soft masking on the first output according to a second preset rule;
and a second processing module 404, configured to input the first output after the second soft masking into a preset decoder for decoding to obtain a mel spectrum.
Optionally, the first masking module comprises:
a first submodule, configured to mask phonemes contained in a text input into the end-to-end speech synthesis model according to a first preset weight in the end-to-end speech synthesis model;
a second submodule, configured to predict, through a detection network, a first probability that the phoneme at each position in the text is erroneous;
a third submodule, configured to determine a first correction weight according to the first probability and the first preset weight, where the first correction weight is used as the first preset weight for the next soft masking;
and a fourth submodule, configured to mask, for each position in the text, the phoneme at that position according to the first correction weight and a preset mask feature.
Optionally, the third submodule is specifically configured to:
determine the product of the first probability and the first preset weight as the first correction weight.
Optionally, the second masking module comprises:
a first submodule, configured to mask the first output according to a second preset weight in the end-to-end speech synthesis model;
a second submodule, configured to predict, through a detection network, a second probability that the phoneme at each position in the first output is erroneous;
a third submodule, configured to determine a second correction weight according to the second probability and the second preset weight, where the second correction weight is used as the second preset weight for the next soft masking;
and a fourth submodule, configured to mask, for each position in the first output, the phoneme at that position according to the second correction weight and a preset mask feature.
Optionally, the variable information predictor comprises: at least one of a duration predictor, a pitch predictor, and an energy predictor.
According to the apparatus for optimizing an end-to-end speech synthesis model provided by the embodiments of the present invention, a first soft masking is performed, according to a first preset rule, on the phonemes contained in the text input into the end-to-end speech synthesis model to generate a second text; the second text is encoded with a phoneme encoder, and the encoded second text is then predicted with a variable information predictor to obtain a first output; a second soft masking is performed on the first output according to a second preset rule; and the first output after the second soft masking is input into a preset decoder for decoding to obtain a mel spectrum. Because soft masking is added both at the input of the end-to-end speech synthesis model and at the input of the decoder, data perturbation is increased and the robustness of the end-to-end speech synthesis model can be improved.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
In an embodiment of the invention, an electronic device is also provided. The electronic device may include one or more processors and one or more machine-readable media having instructions, such as an application program, stored thereon. The instructions, when executed by the one or more processors, cause the processors to perform the method for optimizing an end-to-end speech synthesis model described above.
In an embodiment of the present invention, there is also provided a non-transitory computer-readable storage medium having stored thereon a computer program executable by a processor of an electronic device to perform the above-described method for optimizing an end-to-end speech synthesis model. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises it.
The method and apparatus for optimizing an end-to-end speech synthesis model, the electronic device, and the storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to aid understanding of the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for optimizing an end-to-end speech synthesis model, the method comprising:
performing, according to a first preset rule, first soft masking on the phonemes contained in a text input into the end-to-end speech synthesis model, to generate a second text;
encoding the second text with a phoneme encoder, and then performing prediction processing on the encoded second text with a variable information predictor, to obtain a first output;
performing second soft masking on the first output according to a second preset rule; and
inputting the first output subjected to the second soft masking into a preset decoder for decoding, to obtain a Mel spectrum.
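The claimed forward pass (mask the phonemes, encode, predict variable information, mask again, decode to a Mel spectrum) can be sketched end to end. This is a toy illustration only: the encoder, variable information predictor, decoder, and the random hard replacement standing in for "masking" are all hypothetical placeholders, not the patented networks.

```python
import random

MASK = "<mask>"  # hypothetical token standing in for the preset mask feature

def soft_mask(seq, weights, rng):
    # Replace each position by the mask token with probability given by its weight.
    return [MASK if rng.random() < w else x for x, w in zip(seq, weights)]

def phoneme_encoder(seq):
    # Stand-in encoder: maps each phoneme (or mask token) to a scalar hidden state.
    return [float(len(str(x))) for x in seq]

def variable_info_predictor(hidden):
    # Stand-in variance predictor: adds a constant "duration" offset per position.
    return [h + 1.0 for h in hidden]

def decoder(hidden):
    # Stand-in decoder producing one Mel "frame" (a scalar here) per position.
    return [h * 0.5 for h in hidden]

def forward(text_phonemes, w1, w2, seed=0):
    rng = random.Random(seed)
    second_text = soft_mask(text_phonemes, w1, rng)       # first soft masking
    first_output = variable_info_predictor(phoneme_encoder(second_text))
    masked_output = soft_mask(first_output, w2, rng)      # second soft masking
    masked_output = [0.0 if h == MASK else h for h in masked_output]
    return second_text, decoder(masked_output)            # decode to Mel spectrum

phonemes = ["n", "i", "h", "ao"]
second_text, mel = forward(phonemes, w1=[0.1] * 4, w2=[0.1] * 4)
```

One Mel frame is produced per input position regardless of which positions were masked, so the output length always matches the input length.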
2. The method according to claim 1, wherein the step of performing the first soft masking on the phonemes contained in the text input into the end-to-end speech synthesis model according to the first preset rule to generate the second text comprises:
masking, according to a first preset weight in the end-to-end speech synthesis model, the phonemes contained in the text input into the end-to-end speech synthesis model;
predicting, through a detection network, a first probability that the phoneme at each position in the text is erroneous;
determining a first correction weight according to the first probability and the first preset weight, wherein the first correction weight serves as the first preset weight for the next soft masking; and
masking, for each position in the text, the phoneme corresponding to that position according to the first correction weight and a preset mask feature.
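Claims 2 and 3 together describe a soft-masking step in the spirit of the Soft-Masked BERT paper cited in this patent's non-patent literature: a detection network scores each position, the score is multiplied by the preset weight, and the resulting correction weight interpolates between the position's own embedding and a mask embedding. A minimal sketch under those assumptions; the toy detection network is hypothetical, and the "preset weight" is treated here as a scalar (the excerpt does not specify scalar versus per-position):

```python
def detection_network(embeddings):
    # Hypothetical stand-in: returns, per position, a probability in [0, 1]
    # that the phoneme at that position is erroneous.
    return [min(1.0, abs(e[0]) / 10.0) for e in embeddings]

def soft_mask_step(embeddings, mask_embedding, preset_weight):
    probs = detection_network(embeddings)
    # Claim 3: correction weight = predicted probability * preset weight.
    corr = [p * preset_weight for p in probs]
    # Mask each position: interpolate toward the preset mask feature.
    masked = [
        [w * m + (1.0 - w) * x for m, x in zip(mask_embedding, e)]
        for w, e in zip(corr, embeddings)
    ]
    return masked, corr  # corr serves as the preset weight for the next pass

emb = [[2.0, 0.0], [6.0, 1.0]]
mask_emb = [0.0, 0.0]
masked, corr = soft_mask_step(emb, mask_emb, preset_weight=0.5)
```

With error probabilities 0.2 and 0.6 and a preset weight of 0.5, the correction weights become 0.1 and 0.3, so the second (more suspect) position is pulled further toward the mask embedding than the first.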
3. The method according to claim 2, wherein the step of determining the first correction weight according to the first probability and the first preset weight comprises:
determining the product of the first probability and the first preset weight as the first correction weight.
4. The method according to claim 1, wherein the step of performing the second soft masking on the first output according to the second preset rule comprises:
masking the first output according to a second preset weight in the end-to-end speech synthesis model;
predicting, through a detection network, a second probability that the phoneme at each position in the first output is erroneous;
determining a second correction weight according to the second probability and the second preset weight, wherein the second correction weight serves as the second preset weight for the next soft masking; and
masking, for each position in the first output, the phoneme corresponding to that position according to the second correction weight and a preset mask feature.
5. The method according to claim 1, wherein the variable information predictor comprises at least one of: a duration predictor, a pitch predictor, and an energy predictor.
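The variable information predictor of claim 5 resembles a variance-adaptor stage: duration, pitch, and energy are each predicted per position, and the duration is also used to expand the sequence to frame level. A toy sketch; the linear and quadratic "predictors" here are placeholders, not the claimed networks:

```python
def duration_predictor(hidden):
    # Placeholder: predict at least one frame per position.
    return [max(1, round(h)) for h in hidden]

def pitch_predictor(hidden):
    return [h * 2.0 for h in hidden]    # placeholder pitch values

def energy_predictor(hidden):
    return [h ** 2 for h in hidden]     # placeholder energy values

def variable_info_predictor(hidden):
    # Apply each predictor, then use the predicted duration to repeat each
    # position to frame level (length regulation) before decoding.
    durations = duration_predictor(hidden)
    pitch = pitch_predictor(hidden)
    energy = energy_predictor(hidden)
    expanded = []
    for h, d, p, e in zip(hidden, durations, pitch, energy):
        expanded.extend([(h, p, e)] * d)
    return expanded

hidden = [1.6, 2.4]
out = variable_info_predictor(hidden)
```

Each of the two positions is predicted to span two frames here, so the frame-level output has four entries carrying the hidden state together with its pitch and energy values.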
6. An apparatus for optimizing an end-to-end speech synthesis model, the apparatus comprising:
a first masking module, configured to perform, according to a first preset rule, first soft masking on the phonemes contained in a text input into the end-to-end speech synthesis model, to generate a second text;
a first processing module, configured to encode the second text with a phoneme encoder and then perform prediction processing on the encoded second text with a variable information predictor, to obtain a first output;
a second masking module, configured to perform second soft masking on the first output according to a second preset rule; and
a second processing module, configured to input the first output subjected to the second soft masking into a preset decoder for decoding, to obtain a Mel spectrum.
7. The apparatus according to claim 6, wherein the first masking module comprises:
a first submodule, configured to mask, according to a first preset weight in the end-to-end speech synthesis model, the phonemes contained in the text input into the end-to-end speech synthesis model;
a second submodule, configured to predict, through a detection network, a first probability that the phoneme at each position in the text is erroneous;
a third submodule, configured to determine a first correction weight according to the first probability and the first preset weight, wherein the first correction weight serves as the first preset weight for the next soft masking; and
a fourth submodule, configured to mask, for each position in the text, the phoneme corresponding to that position according to the first correction weight and a preset mask feature.
8. The apparatus according to claim 7, wherein the third submodule is specifically configured to:
determine the product of the first probability and the first preset weight as the first correction weight.
9. The apparatus according to claim 6, wherein the second masking module comprises:
a first submodule, configured to mask the first output according to a second preset weight in the end-to-end speech synthesis model;
a second submodule, configured to predict, through a detection network, a second probability that the phoneme at each position in the first output is erroneous;
a third submodule, configured to determine a second correction weight according to the second probability and the second preset weight, wherein the second correction weight serves as the second preset weight for the next soft masking; and
a fourth submodule, configured to mask, for each position in the first output, the phoneme corresponding to that position according to the second correction weight and a preset mask feature.
10. An electronic device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon,
wherein the instructions, when executed by the one or more processors, cause the processors to perform the method for optimizing an end-to-end speech synthesis model according to any one of claims 1 to 5.
CN202011530802.5A 2020-12-22 2020-12-22 Method and device for optimizing end-to-end speech synthesis model and electronic equipment Active CN112863480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011530802.5A CN112863480B (en) 2020-12-22 2020-12-22 Method and device for optimizing end-to-end speech synthesis model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011530802.5A CN112863480B (en) 2020-12-22 2020-12-22 Method and device for optimizing end-to-end speech synthesis model and electronic equipment

Publications (2)

Publication Number Publication Date
CN112863480A true CN112863480A (en) 2021-05-28
CN112863480B CN112863480B (en) 2022-08-09

Family

ID=75996275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011530802.5A Active CN112863480B (en) 2020-12-22 2020-12-22 Method and device for optimizing end-to-end speech synthesis model and electronic equipment

Country Status (1)

Country Link
CN (1) CN112863480B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110224980A1 (en) * 2010-03-11 2011-09-15 Honda Motor Co., Ltd. Speech recognition system and speech recognizing method
KR20130028180A (en) * 2011-08-11 2013-03-19 주식회사 옵토원 Method for preparing soft-mask and roll mold, and exposure apparatus using thereof
CN110476206A (en) * 2017-03-29 2019-11-19 谷歌有限责任公司 End-to-end Text To Speech conversion
CN111475649A (en) * 2020-04-02 2020-07-31 中国人民解放军国防科技大学 False news prediction method, system, device and medium based on deep learning


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAOHUA ZHANG et al.: "Spelling Error Correction with Soft-Masked BERT", The 58th Annual Meeting of the Association for Computational Linguistics *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724686A (en) * 2021-11-03 2021-11-30 中国科学院自动化研究所 Method and device for editing audio, electronic equipment and storage medium
US11462207B1 (en) 2021-11-03 2022-10-04 Institute Of Automation, Chinese Academy Of Sciences Method and apparatus for editing audio, electronic device and storage medium

Also Published As

Publication number Publication date
CN112863480B (en) 2022-08-09

Similar Documents

Publication Publication Date Title
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
Yu et al. DurIAN: Duration Informed Attention Network for Speech Synthesis.
CN110648658B (en) Method and device for generating voice recognition model and electronic equipment
Guo et al. A new GAN-based end-to-end TTS training algorithm
CN110299131B (en) Voice synthesis method and device capable of controlling prosodic emotion and storage medium
US11587569B2 (en) Generating and using text-to-speech data for speech recognition models
JP6802958B2 (en) Speech synthesis system, speech synthesis program and speech synthesis method
CN112786005B (en) Information synthesis method, apparatus, electronic device, and computer-readable storage medium
GB2326320A (en) Text to speech synthesis using neural network
US20230036020A1 (en) Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score
CN101395659A (en) Method for limiting adaptive excitation gain in an audio decoder
CN113112995B (en) Word acoustic feature system, and training method and system of word acoustic feature system
CN112489629A (en) Voice transcription model, method, medium, and electronic device
CN112908294A (en) Speech synthesis method and speech synthesis system
CN112562655A (en) Residual error network training and speech synthesis method, device, equipment and medium
CN112863480B (en) Method and device for optimizing end-to-end speech synthesis model and electronic equipment
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN113450758B (en) Speech synthesis method, apparatus, device and medium
Rao et al. SFNet: A computationally efficient source filter model based neural speech synthesis
Yasuda et al. Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN114420083A (en) Audio generation method, training method of related model and related device
JP4684770B2 (en) Prosody generation device and speech synthesis device
JP2021135314A (en) Learning device, voice recognition device, learning method, and, learning program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant