WO2022141671A1 - Speech synthesis method and apparatus, device, and storage medium - Google Patents

Speech synthesis method and apparatus, device, and storage medium

Info

Publication number
WO2022141671A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
matching
synthesis
speech
feature
Prior art date
Application number
PCT/CN2021/071672
Other languages
French (fr)
Chinese (zh)
Inventor
周良
孟廷
侯秋侠
刘丹
江源
胡亚军
Original Assignee
科大讯飞股份有限公司
Priority date
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 filed Critical 科大讯飞股份有限公司
Publication of WO2022141671A1 publication Critical patent/WO2022141671A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the technical field of speech processing, and more particularly, to a speech synthesis method, apparatus, device and storage medium.
  • Speech synthesis is a research hot spot in human-computer interaction both in China and abroad. Speech synthesis is the process of converting the input original text to be synthesized into speech output.
  • the traditional speech synthesis model is generally based on an end-to-end speech synthesis scheme: the training text and the corresponding speech data or waveform data are used directly to train the speech synthesis model, and based on the input original text to be synthesized, the trained speech synthesis model can output the synthesized speech directly, or output waveform data from which the corresponding synthesized speech is then obtained.
  • the present application is proposed to provide a speech synthesis method, apparatus, device and storage medium to improve the quality of synthesized speech.
  • the specific solutions are as follows:
  • a speech synthesis method comprising:
  • obtaining the original text to be synthesized;
  • obtaining an auxiliary synthesis feature corresponding to a matching text, where the matching text and the original text have text fragments that match, and the auxiliary synthesis feature is a feature for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • with reference to the auxiliary synthesis feature, speech synthesis is performed on the original text to obtain synthesized speech.
  • the method according to claim 1, wherein the obtaining auxiliary synthesis features corresponding to the matching text comprises:
  • the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matched text is acquired.
  • the auxiliary synthesis features include:
  • the phoneme sequence corresponding to the matching text determined based on the pronunciation audio corresponding to the matching text;
  • the acoustic feature of the pronunciation audio corresponding to the matched text.
  • the obtaining of the matching text that has text fragments matching the original text includes:
  • in the preconfigured template texts, matching texts that match text fragments within the original text are determined.
  • the obtaining of the matching text that has text fragments matching the original text includes:
  • the uploaded text in the uploaded data is acquired as the matching text, the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text and the original text have text fragments that match.
  • the preconfigured template text includes:
  • Template text in each preconfigured resource package, wherein each resource package includes a template text and an auxiliary synthesis feature corresponding to the template text determined based on the pronunciation audio corresponding to the template text.
  • determining the matching text that matches the text fragment in the original text includes:
  • in the template texts of the preconfigured resource packages, the matching text that matches the text fragment in the original text is determined.
  • the obtaining of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
  • the process of determining the preconfigured resource package includes:
  • the phoneme sequence and prosody information are used as auxiliary synthesis features corresponding to the template text, and the auxiliary synthesis features and the template text are organized into a resource package.
  • the process of determining the preconfigured resource package further includes:
  • the phoneme-level prosodic encoding is incorporated into the resource bundle.
  • the phoneme-level prosodic coding corresponding to the template text is determined based on the template text and the corresponding pronunciation audio, including:
  • the encoding prediction network and the generation network are trained with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information; when the training ends, the phoneme-level prosodic coding predicted by the trained encoding prediction network is obtained.
  • before acquiring the uploaded text in the uploaded data, the method further includes:
  • the uploaded text is the text segment that is synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the incorrectly synthesized text segment;
  • or, the uploaded text is an extended text that includes a text fragment synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the extended text.
  • the obtaining of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
  • the auxiliary synthesis feature corresponding to the matching text is determined.
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech including:
  • speech synthesis is performed on the original text to obtain synthesized speech.
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech further comprising:
  • the phoneme-level prosody coding corresponding to the same text segment is used as a supplementary input of the speech synthesis model to obtain synthesized speech.
  • determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matched text includes:
  • the pronunciation dictionary is queried to determine the phoneme sequences of the other text segments in the original text except the same text segment, which are combined with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech including:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the target acoustic feature required for predicting the current speech frame is determined;
  • the current speech frame is predicted, and after all speech frames are predicted, synthesized speech is composed of the predicted speech frames.
  • the target acoustic features required for predicting the current speech frame are determined based on the context information, the matched text and the acoustic features of the pronunciation audio, including:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the correlation degree between the context information and each frame of acoustic features of the pronunciation audio is obtained;
  • based on the correlation degree, the target acoustic feature required for predicting the current speech frame is determined.
  • the obtaining of the correlation degree between the context information and each frame of acoustic features includes:
  • obtaining a first attention weight matrix of the acoustic features to the matched text, the first attention weight matrix including the attention weight of each frame of acoustic features to each text unit in the matched text;
  • obtaining a second attention weight matrix of the context information to the matched text, where the second attention weight matrix includes the attention weight of the context information to each text unit in the matched text;
  • based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information to the acoustic features is obtained, where the third attention weight matrix includes the attention weight of the context information to the acoustic features of each frame, which is used as the correlation degree between the context information and the acoustic features of each frame.
  • determining the target acoustic feature required for predicting the current speech frame based on the correlation degree includes:
  • Each of the correlation degrees is normalized, and each normalized correlation degree is used as a weight, and the acoustic features of each frame of the pronunciation audio are weighted and added to obtain a target acoustic feature.
  • the prediction of the current speech frame based on the context information and the determined target acoustic feature includes:
  • the target acoustic feature and the context information are fused, and the current speech frame is predicted based on the fusion result.
  • a speech synthesis apparatus comprising:
  • an original text obtaining unit used to obtain the original text to be synthesized
  • Auxiliary synthesis feature acquisition unit used to obtain the auxiliary synthesis feature corresponding to the matching text, the matching text and the original text have matching text fragments, and the auxiliary synthesis feature is determined based on the pronunciation audio corresponding to the matching text.
  • An auxiliary speech synthesis unit configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  • a speech synthesis device comprising: a memory and a processor;
  • the memory for storing programs
  • the processor is configured to execute the program to implement each step of the speech synthesis method described above.
  • a storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements each step of the above-mentioned speech synthesis method.
  • a computer program product which, when run on a terminal device, causes the terminal device to execute each step of the above-mentioned speech synthesis method.
  • In the speech synthesis method of the present application, in the process of performing speech synthesis on the original text to be synthesized, reference is made to the auxiliary synthesis feature corresponding to the matching text that has text fragments matching the original text, the auxiliary synthesis feature being a feature for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text.
  • By referring to the auxiliary synthesis feature corresponding to the matching text, the present application can use the pronunciation information in the pronunciation audio corresponding to the matching text to assist the speech synthesis of the original text, which enriches the information referenced in the speech synthesis of the original text, thereby improving the speech synthesis quality of the original text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing.
  • the solution of the present application can be applied to these two types of speech synthesis systems at the same time.
  • the auxiliary synthesis feature corresponding to the above matching text can be used as the analysis result of the speech synthesis front end, or used to assist in correcting the analysis result of the speech synthesis front end, and the analysis result is then sent to the speech synthesis back end to assist in the speech synthesis of the original text.
  • the auxiliary synthesis feature corresponding to the matched text can be directly used as the reference information when the speech synthesis system synthesizes the original text.
  • the speech synthesis of the original text is performed with reference to the auxiliary synthesis feature of the present application, which can enrich the reference information during speech synthesis, thereby improving the quality of the synthesized speech.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 2 illustrates a schematic diagram of a phoneme sequence extraction model architecture;
  • FIG. 3 illustrates a schematic diagram of a synthesis flow of a speech synthesis back end;
  • FIG. 4 illustrates a schematic diagram of a speech synthesis system architecture;
  • FIG. 5 illustrates a process schematic diagram of a prediction-generation network determining phoneme-level prosodic coding;
  • FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present application.
  • the present application provides a speech synthesis solution, which can be applied to various speech synthesis tasks.
  • the speech synthesis solution of the present application can be applied to speech synthesis work in a human-computer interaction scenario, as well as various other scenarios that require speech synthesis.
  • the solution of the present application can be implemented based on a terminal with data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud, or the like.
  • the speech synthesis method of the present application may include the following steps:
  • Step S100 acquiring the original text to be synthesized.
  • the original text is the text to be synthesized into speech.
  • the original text may be provided by the user, or may be the text provided by other devices or applications that needs to be synthesized by speech.
  • Step S110 acquiring auxiliary synthesis features corresponding to the matched text, where the matched text and the original text have text fragments that match.
  • the matching text can be text that matches the original text or a text fragment within the original text; for example, if the original text is "this pair of pants is not discounted", the matching text may be "this pair of pants is not discounted" or "discounted".
  • the matching text may also be text that contains text fragments that match the text fragments within the original text. Still taking the above-mentioned original text as an example, the matching text may be "do you have a discount on this dress?", that is, the matching text includes a text fragment "discount” that matches the original text.
  • the matching text may be the text that is pre-configured and stored in the present application.
  • the fixed utterance texts may be recorded in advance, and the utterance texts may be stored. Then, the utterance text matching the original text is searched for among the stored utterance texts as the matching text.
  • there are some fixed utterance texts, such as the prompt content texts that an intelligent customer service or terminal needs to use to prompt the user for information, for example, "May I ask what you need to inquire about", "Hello, how may I help you?", "Press 1 for phone bills, press 2 for data", and so on.
  • the pronunciation audio of these fixed utterance texts may be pre-recorded as prompt sounds and stored together with the utterance texts.
  • the matching text can also be user-uploaded text.
  • when uploading the original text to be synthesized, the user uploads the text in the original text that is prone to synthesis errors as the matching text, and can also upload the pronunciation audio corresponding to the matching text.
  • the synthesis system outputs the initial synthesized speech. The user can determine the incorrectly synthesized text in the initial synthesized speech, record the pronunciation audio corresponding to the incorrectly synthesized text, and upload the incorrectly synthesized text and the corresponding pronunciation audio to the speech synthesis system. Alternatively, the user uploads the extended text containing the incorrectly synthesized text, together with the pronunciation audio corresponding to the extended text.
  • the auxiliary synthesis feature corresponding to the matching text may be a feature used for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text.
  • the auxiliary synthesis feature includes the pronunciation information of the pronunciation audio corresponding to the matching text, such as the phoneme sequence of the pronunciation, pause information, stress, rhythm, emotion, and other pronunciation information.
  • the pronunciation information can assist the speech synthesis of the original text and improve the speech synthesis quality of the original text.
  • the pronunciation of the text segment in which the matched text matches the original text is the standard pronunciation of that text segment in the original text.
  • the original text is "These pants are not discounted”.
  • the matching text is "discount”
  • the pronunciation audio corresponding to the matching text is the audio corresponding to "da zhe", not other pronunciation audio such as "da she".
  • auxiliary synthesis features can be determined based on the corresponding pronunciation audio of the matched text to assist speech synthesis of the original text.
  • the auxiliary synthesis feature can be determined in advance based on the pronunciation audio corresponding to the matched text and stored in a local or third-party device.
  • the process of obtaining the auxiliary synthesis feature corresponding to the matching text may be to search for the auxiliary synthesis feature corresponding to the pre-stored matching text in the local or third-party storage.
  • Alternatively, the process of obtaining the auxiliary synthesis feature corresponding to the matching text in this step may be: after obtaining the pronunciation audio corresponding to the matching text, the auxiliary synthesis feature is determined based on the pronunciation audio.
  • Step S120 referring to the auxiliary synthesis feature, perform speech synthesis on the original text to obtain synthesized speech.
  • the speech synthesis system when it performs speech synthesis on the original text in this step, in addition to referring to the original text, it may further refer to the auxiliary synthesis feature corresponding to the matched text, that is, to enrich the information referenced in the speech synthesis process of the original text.
  • since the auxiliary synthesis feature includes the pronunciation information of the pronunciation audio corresponding to the matching text, the pronunciation information can assist the speech synthesis of the original text and improve the speech synthesis quality of the original text.
  • In the speech synthesis method, in the process of performing speech synthesis on the original text to be synthesized, reference is made to the auxiliary synthesis feature corresponding to the matched text of the text segment that matches the original text, and the auxiliary synthesis feature is a feature used for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text. It can be seen that, by referring to the auxiliary synthesis feature corresponding to the matching text, the present application can utilize the pronunciation information in the pronunciation audio corresponding to the matching text to assist in the speech synthesis of the original text, which enriches the information referenced in the speech synthesis of the original text, thereby improving the speech synthesis quality of the original text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing.
  • the solution of the present application can be applied to these two types of speech synthesis systems at the same time.
  • the auxiliary synthesis feature corresponding to the above matching text can be used as the analysis result of the speech synthesis front end, or used to assist in correcting the analysis result of the speech synthesis front end, and the analysis result is then sent to the speech synthesis back end to assist in the speech synthesis of the original text.
  • the auxiliary synthesis feature corresponding to the matched text can be directly used as the reference information when the speech synthesis system synthesizes the original text.
  • the speech synthesis of the original text is carried out with reference to the auxiliary synthesis feature of the present application, which can enrich the reference information during speech synthesis, thereby improving the quality of the synthesized speech.
  • Next, the auxiliary synthesis feature corresponding to the matching text mentioned above and the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature are described.
  • the auxiliary synthesis feature is a feature used for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text.
  • the auxiliary synthesis feature includes the pronunciation information of the pronunciation audio corresponding to the matching text, and the pronunciation information can assist the speech synthesis of the original text and improve the speech synthesis quality of the original text.
  • the auxiliary synthesis feature is the phoneme sequence corresponding to the matching text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing. Among them, a speech synthesis system with front-end preprocessing first performs front-end analysis on the original text before performing speech synthesis, for example, querying a pre-built pronunciation dictionary to obtain the phoneme sequence of the original text, and the back end then uses the phoneme sequence for speech synthesis.
  • This processing method can improve the quality of speech synthesis to a certain extent, but when there is an error in the pre-built pronunciation dictionary, it will lead to errors in the back-end synthesized speech.
  • a phoneme sequence corresponding to the matched text may be determined as an auxiliary synthesis feature.
  • the pronunciation audio corresponding to the matched text is the correct pronunciation, so the correct phoneme sequence corresponding to the matched text can be extracted from the pronunciation audio.
  • the correct phoneme sequence can be used as an auxiliary synthesis feature to participate in the speech synthesis process of the original text.
  • This embodiment provides an implementation manner of extracting the phoneme sequence from the pronunciation audio corresponding to the matched text.
  • FIG. 2 exemplifies a schematic diagram of the architecture of a phoneme sequence extraction model.
  • This application can pre-train a phoneme sequence extraction model for extracting phoneme sequences from pronunciation audio.
  • the phoneme sequence extraction model can adopt the LSTM (long short-term memory) network architecture or other optional network architectures such as HMM and CNN. FIG. 2 exemplifies a phoneme sequence extraction model with an encoder-attention-decoder architecture.
  • the encoding end uses the LSTM network to encode the audio feature sequence (x_1, x_2, ..., x_n) of the pronunciation audio to obtain the hidden-layer encoding sequence (h_1, h_2, ..., h_n). The decoding end also uses an LSTM network: at decoding time t, the decoder hidden-layer state at time t-1 and the context vector c_{t-1} calculated by the attention module are jointly used to compute the decoder hidden-layer vector s_t, and the phoneme y_t at time t is then obtained through a projection.
  • the decoding stops when the special end-of-sequence symbol is decoded, and the phoneme sequence (y_1, y_2, ..., y_t) is obtained.
  • An example is as follows:
  • the phoneme sequence extracted from the pronunciation audio corresponding to the matching text is: [zh e4 jian4 i1 f u7 b u4 d a3 zh e2].
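  • As an illustration of such a model, the following is a minimal sketch of an encoder-attention-decoder phoneme extractor, assuming PyTorch; the dimensions, the dot-product attention, and the greedy decoding loop are illustrative assumptions rather than the patent's exact model.

    import torch
    import torch.nn as nn

    class PhonemeExtractor(nn.Module):
        def __init__(self, feat_dim=80, hidden=256, n_phonemes=100, eos_id=0):
            super().__init__()
            self.eos_id = eos_id
            # Encoder LSTM: audio features (x_1..x_n) -> hidden encodings (h_1..h_n)
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.embed = nn.Embedding(n_phonemes, hidden)       # previous phoneme y_{t-1}
            self.decoder_cell = nn.LSTMCell(hidden * 2, hidden) # input: [emb(y_{t-1}); c_{t-1}]
            self.proj = nn.Linear(hidden, n_phonemes)           # s_t -> phoneme logits

        def attend(self, s, enc):
            # Dot-product attention: weight each encoder frame by its match to s_t
            scores = torch.bmm(enc, s.unsqueeze(2))             # (B, n, 1)
            weights = torch.softmax(scores, dim=1)
            return (weights * enc).sum(dim=1)                   # context vector c_t: (B, hidden)

        def forward(self, audio_feats, max_len=100):
            enc, _ = self.encoder(audio_feats)                  # (B, n, hidden)
            B, hidden = audio_feats.size(0), enc.size(2)
            s = audio_feats.new_zeros(B, hidden)                # decoder hidden state s_t
            cell = audio_feats.new_zeros(B, hidden)
            c = audio_feats.new_zeros(B, hidden)                # context vector c_t
            y = torch.full((B,), self.eos_id, dtype=torch.long, device=audio_feats.device)
            phonemes = []
            for _ in range(max_len):                            # greedy decode until end symbol
                s, cell = self.decoder_cell(torch.cat([self.embed(y), c], dim=1), (s, cell))
                c = self.attend(s, enc)
                y = self.proj(s).argmax(dim=1)                  # phoneme y_t at step t
                phonemes.append(y)
                if (y == self.eos_id).all():
                    break
            return torch.stack(phonemes, dim=1)                 # (B, T) phoneme ids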
  • the process of performing speech synthesis on the original text may include:
  • the same text segment shared by the matched text and the original text may be determined, and the phoneme sequence corresponding to the same text segment is extracted from the phoneme sequence corresponding to the matched text.
  • the pronunciation dictionary is queried to determine the phoneme sequences of other text segments in the original text except the same text segment, and combine with the phoneme sequences corresponding to the same text segment to obtain the phoneme sequence of the original text.
  • Alternatively, the initial phoneme sequence corresponding to the original text can be determined by querying the pronunciation dictionary, and the phoneme sequence corresponding to the same text segment, extracted from the phoneme sequence corresponding to the matching text, is used to replace the part of the initial phoneme sequence corresponding to that same text segment, so as to obtain the replaced phoneme sequence corresponding to the original text.
  • the phoneme sequence of the original text can be used as the text analysis result of the speech synthesis front end, and sent to the speech synthesis back end to assist in the speech synthesis of the original text.
  • the phoneme sequence of the original text obtained in this embodiment includes the phoneme sequence corresponding to the matching text, and this part of the phoneme sequence is determined based on the correct pronunciation audio corresponding to the matching text, so using the phoneme sequence of the original text to assist speech synthesis can improve the accuracy of the synthesized speech, especially for some polyphonic words and easily mispronounced words. A sketch of this splicing is given below.
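  • The following is a minimal sketch of that splicing, in Python; the texts, the phonemes, and the one-phoneme-per-character dictionary are illustrative assumptions (a real front end aligns phonemes to words rather than single characters).

    def splice_phonemes(original_text, matching_text, matched_phonemes, dictionary):
        # Locate the text fragment that the original text shares with the matching text.
        start = original_text.find(matching_text)
        if start < 0:
            # No shared fragment: fall back to the pronunciation dictionary only.
            return [dictionary[ch] for ch in original_text]
        # Dictionary lookup for the surrounding segments; audio-derived phonemes
        # (assumed one per character here) for the shared segment.
        before = [dictionary[ch] for ch in original_text[:start]]
        after = [dictionary[ch] for ch in original_text[start + len(matching_text):]]
        return before + matched_phonemes + after

    # Hypothetical usage: the audio-derived "da3 zhe2" overrides a wrong dictionary entry.
    dictionary = {"这": "zhe4", "条": "tiao2", "裤": "ku4", "子": "zi5",
                  "不": "bu4", "打": "da2", "折": "zhe2"}  # "打" deliberately wrong here
    print(splice_phonemes("这条裤子不打折", "打折", ["da3", "zhe2"], dictionary))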
  • the auxiliary synthesis feature is the prosodic information corresponding to the matching text.
  • the speech synthesis front end can perform text analysis on the original text.
  • the process of text analysis can also predict the prosody information of the original text, and then the synthesis back-end performs speech synthesis based on the original text and prosody information.
  • the naturalness of the synthesized speech can be improved.
  • the prosody information predicted for the original text may also be wrong, which in turn leads to errors in the prosody of the back-end synthesized speech and affects the quality of the synthesized speech.
  • prosodic information corresponding to the matched text may be determined based on the pronunciation audio corresponding to the matched text, as an auxiliary synthesis feature.
  • the prosody information corresponding to the matched text may be phoneme-level prosody information, which includes the prosody information of each phoneme unit in the phoneme sequence corresponding to the matched text.
  • the pronunciation audio corresponding to the matched text is the correct pronunciation, so the correct prosody information corresponding to the matched text can be extracted from the pronunciation audio.
  • the correct prosody information can be used as an auxiliary synthesis feature to participate in the speech synthesis process of the original text. For example, the corrected prosody information of the original text is determined based on the correct prosody information, and then sent to the synthesis back-end for speech synthesis.
  • the process of performing speech synthesis on the original text may include:
  • the same text segment shared by the matched text and the original text may be determined, and the prosody information corresponding to the same text segment is acquired based on the prosody information corresponding to the matched text.
  • prosody prediction technology can be used to predict the prosody information of the remaining text segments in the original text except the same text segment, and combine with the prosody information corresponding to the same text segment to obtain the prosody information of the original text.
  • the process of performing speech synthesis on the original text may include:
  • the auxiliary synthesis feature is the phoneme-level prosodic coding corresponding to the matching text.
  • the phoneme-level prosodic coding corresponding to the matched text includes some pronunciation information of the pronunciation audio corresponding to the matched text, such as prosodic features such as pronunciation duration, stress and emphasis.
  • When the speech synthesis back end performs speech synthesis, it can model the prosody information of the original text, thereby improving the naturalness of the synthesized speech.
  • the phoneme-level prosody coding corresponding to the matched text can be used as an auxiliary synthesis feature, and sent to the speech synthesis backend to assist in speech synthesis.
  • the phoneme-level prosody coding corresponding to the matching text contains the correct pronunciation information corresponding to the matching text.
  • when the speech synthesis back end performs speech synthesis based on the phoneme-level prosody coding corresponding to the matching text, the same text fragment shared by the original text and the matching text can be synthesized into speech that is consistent with the pronunciation audio of the matching text.
  • the back end of speech synthesis performs operations such as convolution on the original text, and in this processing refers to the phoneme-level prosodic coding corresponding to the same text fragment, so as to use the phoneme-level prosodic encoding of the same text segment to assist in improving the speech synthesis quality of the remaining text segments in the original text.
  • In another scheme, speech synthesis is only performed on the non-identical text segments in the original text, and the synthesized speech of the non-identical text segments is then spliced with the pre-configured speech of the same text segment to obtain the whole synthesized speech corresponding to the original text.
  • This processing method will lead to the problem of inconsistent timbre of the overall synthesized speech of the original text, and reduce the quality of the synthesized speech.
  • the speech synthesis system of the present application is still a complete synthesis system. By performing overall speech synthesis on the original text, it can be ensured that the timbre of the synthesized speech is consistent.
  • depending on the structure of the speech synthesis back end, the phoneme-level prosody coding in this embodiment may also differ.
  • FIG. 3 illustrates a schematic diagram of a synthesis flow of a speech synthesis back-end.
  • the back-end of speech synthesis includes a duration model and an acoustic model, and the duration prosody information and the acoustic parameter prosody information are modeled respectively by the duration model and the acoustic model.
  • the phoneme-level prosodic coding corresponding to the matched text in this embodiment of the present application may include duration coding and acoustic parameter coding.
  • the duration encoding can be sent to the duration model to assist in phoneme-level duration modeling;
  • the acoustic parameter encoding can be sent to the acoustic model to assist in phoneme-level acoustic parameter modeling.
  • the acoustic parameter encoding may include one or more different acoustic parameter encodings, such as fundamental frequency acoustic parameter encoding or other acoustic parameter encoding, and the like.
  • With reference to the auxiliary synthesis feature, the above step S120 of performing speech synthesis on the original text may further include:
  • the same text segment shared by the matched text and the original text is determined, and then the phoneme-level prosodic coding corresponding to the same text segment is extracted from the phoneme-level prosodic coding corresponding to the matching text.
  • the phoneme-level prosody coding corresponding to the same text segment is used as a supplementary input of the speech synthesis model to obtain synthesized speech.
  • Phoneme-level prosody coding includes duration coding and acoustic parameter coding.
  • the speech synthesis back end can send the duration encoding corresponding to the same text segment into the duration model for phoneme-level duration modeling, and send the acoustic parameter encoding corresponding to the same text segment into the acoustic model for phoneme-level acoustic parameter modeling, and the synthesized speech is finally obtained by the speech synthesis back end. A sketch of such supplementary inputs follows.
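  • The following is a minimal sketch of such supplementary input, assuming PyTorch: a duration model consumes phoneme embeddings plus an optional per-phoneme prosody code, with zeros standing in for phonemes outside the matched segment. The dimensions and architecture are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DurationModel(nn.Module):
        def __init__(self, n_phonemes=100, emb=128, code_dim=16, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(n_phonemes, emb)
            self.rnn = nn.LSTM(emb + code_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)   # predicted duration (frames) per phoneme

        def forward(self, phoneme_ids, prosody_codes):
            # Concatenate the supplementary prosody code to each phoneme embedding.
            x = torch.cat([self.embed(phoneme_ids), prosody_codes], dim=-1)
            h, _ = self.rnn(x)
            return self.out(h).squeeze(-1)

    model = DurationModel()
    ids = torch.randint(0, 100, (1, 8))       # 8 phonemes of the original text
    codes = torch.zeros(1, 8, 16)             # default: no supplementary code
    codes[0, 3:5] = torch.randn(2, 16)        # codes for the matched segment's phonemes
    durations = model(ids, codes)             # (1, 8) phoneme-level durations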
  • the auxiliary synthesis feature is the acoustic feature of the pronunciation audio corresponding to the matching text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing.
  • the speech synthesis system without front-end preprocessing does not perform front-end analysis on the original text, but directly performs speech synthesis on the original text.
  • the acoustic feature of the pronunciation audio corresponding to the matching text may be used as an auxiliary synthesis feature, and sent to the speech synthesis system to assist in the speech synthesis of the original text.
  • the acoustic feature contains the pronunciation information of the pronunciation audio corresponding to the matching text.
  • when synthesizing each speech frame, the acoustic features associated with that frame can be extracted from the acoustic features to assist the synthesis, which can be used to correct pronunciation errors, such as the pronunciation errors of rare words, special symbols, polyphonic words, and foreign words that are prone to errors, finally obtaining high-quality synthesized speech.
  • the acoustic features include but are not limited to cepstral features of pronunciation audio.
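  • For instance, cepstral features could be extracted as in the following minimal sketch, assuming librosa is available; the file name, sample rate, and number of coefficients are illustrative assumptions.

    import librosa

    # Load the pronunciation audio of the matching text (hypothetical file name).
    audio, sr = librosa.load("matching_text_pronunciation.wav", sr=16000)
    # Mel-frequency cepstral coefficients, one 13-dimensional vector per frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    acoustic_features = mfcc.T                               # frame-major layout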
  • the process of performing speech synthesis on the original text may include:
  • S1. Process the original text based on a speech synthesis model to obtain context information for predicting the current speech frame.
  • the speech synthesis model can adopt an encoder-decoder architecture, with the encoding and decoding layers further connected through an attention module. Then, through the encoder-decoder architecture and the attention module, the context information C_t required when synthesizing the current speech frame y_t can be obtained from the original text.
  • the context information C_t indicates the text information in the original text needed to synthesize the current speech frame y_t.
  • S2. Based on the context information, the matched text, and the acoustic features of the pronunciation audio, determine the target acoustic feature required for predicting the current speech frame.
  • step S2 may include:
  • through the attention mechanism, the similarity between the context information and the matching text can be obtained, and through the attention matrix of the acoustic features of the pronunciation audio to the matching text, the correlation between each frame of acoustic features and the matching text can be obtained.
  • the correlation between the context information and the acoustic features of each frame can be obtained.
  • the correlation indicates the proximity between the context information and each frame of acoustic features. It can be understood that, when the context information is highly correlated with the acoustic feature of a target frame, the pronunciation of the text corresponding to the context information is strongly correlated with the acoustic feature of the target frame.
  • an optional implementation manner of step S21 is introduced below, which may include the following steps:
  • the first attention weight matrix W_mx includes the attention weight of each frame of acoustic features to each text unit in the matched text.
  • the size of the matrix W_mx is T_my * T_mx, where T_my represents the frame length of the acoustic features corresponding to the pronunciation audio, and T_mx represents the length of the matched text.
  • the second attention weight matrix W_cmx includes the attention weight of the context information C_t to each text unit in the matched text.
  • the size of the matrix W_cmx is 1 * T_mx.
  • the third attention weight matrix W_cmy includes the attention weight of the context information C_t on the acoustic features of each frame, as the degree of correlation between the context information and the acoustic features of each frame.
  • the size of the matrix W_cmy is 1 * T_my.
  • the matrix W_cmy can be expressed as: W_cmy = W_cmx * W_mx', where W_mx' represents the transpose of the matrix W_mx.
  • each correlation degree may be normalized first, and with each normalized correlation degree used as a weight, the acoustic features of each frame of the pronunciation audio are weighted and added to obtain the target acoustic feature required for predicting the current speech frame.
  • the target acoustic feature can be denoted as C_mt. A numerical sketch of this composition is given below.
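  • The following is a minimal numpy sketch of this composition, following the matrix sizes given above; the random inputs are placeholders for real attention weights and acoustic features.

    import numpy as np

    T_my, T_mx, feat_dim = 50, 8, 80           # frames, matched-text units, feature size
    W_mx = np.random.rand(T_my, T_mx)          # acoustic frames -> matched text units
    W_cmx = np.random.rand(1, T_mx)            # context C_t -> matched text units
    acoustic = np.random.rand(T_my, feat_dim)  # per-frame acoustic features

    W_cmy = W_cmx @ W_mx.T                     # (1, T_my): C_t -> each acoustic frame
    w = np.exp(W_cmy) / np.exp(W_cmy).sum()    # softmax normalization of correlations
    C_mt = (w.T * acoustic).sum(axis=0)        # weighted sum: target acoustic feature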
  • this embodiment provides a solution so that, when performing speech synthesis on the original text, the amount of information of the referenced target acoustic feature C_mt can be controlled for different speech frames to be predicted.
  • the specific implementation process can include:
  • a threshold mechanism or other strategies may be used to determine the fusion coefficient a_gate of the target acoustic feature C_mt when predicting the current speech frame.
  • a_gate can be expressed as: a_gate = sigmoid(g_g(C_mt, s_t)), where s_t represents the current hidden-layer vector at the decoding end, and g_g() represents the set functional relationship.
  • the current speech frame y_t can then be expressed as a set functional relationship g() of the gated target acoustic feature a_gate * C_mt and the context information C_t.
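  • A minimal sketch of this threshold mechanism follows, assuming the gate is a learned linear layer over the concatenation [C_mt; s_t] followed by a sigmoid; the exact forms of g_g() and the fusion g() are not specified in the text and are assumptions here.

    import torch
    import torch.nn as nn

    hidden = 256
    gate_layer = nn.Linear(hidden * 2, 1)   # a hypothetical g_g()
    C_mt = torch.randn(1, hidden)           # target acoustic feature
    s_t = torch.randn(1, hidden)            # decoder hidden-layer vector
    a_gate = torch.sigmoid(gate_layer(torch.cat([C_mt, s_t], dim=1)))
    gated = a_gate * C_mt                   # controlled amount of acoustic information
    # 'gated' is then fused with the context information C_t to predict the frame y_t.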
  • FIG. 4 illustrates a schematic diagram of a speech synthesis system architecture.
  • the speech synthesis system illustrated in FIG. 4 adopts an end-to-end synthesis process with an encoder-decoder and an attention mechanism.
  • the original text is encoded by the encoding end to obtain the encoding vector of the original text, and the context information C_t required for predicting the current speech frame y_t can be obtained through the first attention module.
  • the matching text is encoded by the encoding end to obtain the encoding vector of the matching text. Further, through the second attention module, the attention weight of the context information C_t to each text unit in the matched text can be obtained, forming the second attention weight matrix.
  • the attention weight of the acoustic features of the pronunciation audio of the matched text to the matched text can be obtained, forming the first attention weight matrix. Further, based on the first attention weight matrix and the second attention weight matrix, the third attention weight matrix of the context information C_t to the acoustic features is obtained.
  • the third attention weight matrix includes the correlation degree between the context information C_t and the acoustic features of each frame.
  • the target acoustic feature C_mt required for predicting the current speech frame y_t is obtained by performing softmax regularization on the third attention weight matrix and performing weighted addition with the acoustic features of each frame of the pronunciation audio.
  • the decoder can predict the current speech frame y_t based on the target acoustic feature C_mt and the context information C_t.
  • the expression used by the decoding end to predict the current speech frame y_t may refer to the foregoing related introduction.
  • Each predicted speech frame is mapped to synthesized speech by a vocoder.
  • the process of acquiring the auxiliary synthesis feature corresponding to the matching text is introduced.
  • the process may include:
  • the application can collect and record a large number of fixed utterance texts in advance for the speech synthesis scenario, take the collected utterance texts as template texts, and store the template texts together with the corresponding pronunciation audio.
  • the auxiliary synthesis feature is determined based on the pronunciation audio of the template text, and then the template text and the auxiliary synthesis feature are stored together.
  • step S1 may include:
  • each resource package includes a template text and an auxiliary synthesis feature corresponding to the template text determined based on the pronunciation audio corresponding to the template text.
  • the auxiliary synthesis features may include phoneme sequences and prosodic information corresponding to the template text. Further, the auxiliary synthesis feature may also include phoneme-level prosodic coding corresponding to the template text.
  • the template text is "Welcome to AI Voice Assistant".
  • the auxiliary synthesis features that can be determined may include the phoneme sequence of the template text, prosodic information, phoneme-level prosodic coding, and the like. Furthermore, the template text and auxiliary synthesis features can be packaged into a resource package.
  • the packaged resource package can be encoded into binary resource data, so as to reduce storage space occupation and facilitate processing and identification by the subsequent speech synthesis system. A sketch of such packaging follows.
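  • A minimal sketch of such packaging follows; pickle and the field names are illustrative assumptions, not the patent's binary format.

    import pickle
    from dataclasses import dataclass, field

    @dataclass
    class ResourcePackage:
        template_text: str
        phoneme_sequence: list
        prosody_info: list
        phoneme_prosody_codes: list = field(default_factory=list)

    pkg = ResourcePackage(
        template_text="Welcome to AI Voice Assistant",
        phoneme_sequence=["w", "eh1", "l", "k", "ah0", "m"],   # hypothetical phonemes
        prosody_info=[{"pause_after": False}, {"pause_after": True}],
    )
    blob = pickle.dumps(pkg)        # binary resource for compact storage
    restored = pickle.loads(blob)   # read back by the synthesis system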
  • the process of determining the phoneme-level prosodic coding corresponding to the template text is introduced.
  • the phoneme-level prosodic coding corresponding to the template text can be determined based on the coding prediction network and the generation network, which can specifically include the following steps:
  • A1. Extract the phoneme-level prosody information from the pronunciation audio corresponding to the template text.
  • A2. Input the template text and the phoneme-level prosody information into a coding prediction network to obtain a predicted phoneme-level prosodic code.
  • A3. Input the predicted phoneme-level prosodic code and the template text into a generation network to obtain the generated phoneme-level prosody information.
  • A4. Train the coding prediction network and the generation network with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information; when the training ends, the trained coding prediction network is obtained.
  • the process of training the coding prediction network and the generation network with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information may specifically be: calculating the mean square error (MSE) between the generated phoneme-level prosody information and the extracted phoneme-level prosody information, and adjusting the network parameters through iterative training; when the MSE reaches a preset threshold, the training can be ended. A sketch of this training loop is given below.
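  • The following is a minimal sketch of that loop, assuming PyTorch; both networks are stand-in MLPs, and the dimensions, batch, and threshold are illustrative assumptions.

    import torch
    import torch.nn as nn

    dim = 64
    predictor = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 8))
    generator = nn.Sequential(nn.Linear(dim + 8, dim), nn.ReLU(), nn.Linear(dim, dim))
    optim = torch.optim.Adam(list(predictor.parameters()) + list(generator.parameters()))
    mse = nn.MSELoss()

    text_repr = torch.randn(32, dim)   # encoded template text (placeholder)
    extracted = torch.randn(32, dim)   # prosody extracted from the pronunciation audio

    for step in range(10000):
        code = predictor(torch.cat([text_repr, extracted], dim=1))   # predicted prosody code
        generated = generator(torch.cat([text_repr, code], dim=1))   # regenerated prosody
        loss = mse(generated, extracted)
        optim.zero_grad(); loss.backward(); optim.step()
        if loss.item() < 1e-3:         # preset MSE threshold ends the training
            break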
  • the implementation process of the above-mentioned step S11, determining, in the pre-configured and stored template texts, the matching text that matches the text fragment in the original text, may include:
  • if a completely matching template text exists, it is determined as the matching text. If it does not exist, partial matching can be performed, for example, starting from one or both ends of the original text and looking for the maximum-length match in the template text of each resource package as the matching text.
  • For example, the original text is "Are you Wang Ning?"; no identical template text is matched, but the template text "Is your name Liu Wu?" is matched. By matching the original text against the above template text at maximum length, the matching texts "Are you" and "Are you?" can be obtained. A sketch of such maximum-length matching follows.
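  • The following is a minimal sketch of such matching in Python: an exact match is tried first, then the longest fragment shared with any template, scanning from both ends of the original text. The template below is hypothetical and the strategy is an illustrative assumption.

    def find_matching_texts(original, templates):
        if original in templates:
            return [original]                      # complete match
        matches = []
        for n in range(len(original), 0, -1):      # longest fragments first
            for frag in (original[:n], original[-n:]):
                if any(frag in t for t in templates) and frag not in matches:
                    matches.append(frag)
            if matches:
                break
        return matches

    templates = ["Are you Liu Wu?"]                # hypothetical stored template text
    print(find_matching_texts("Are you Wang Ning?", templates))   # ['Are you ']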
  • the present application can obtain the data uploaded by the user.
  • the uploaded data includes the uploaded text and the pronunciation audio corresponding to the uploaded text.
  • the uploaded text has a matching text fragment with the original text.
  • the uploaded text can be used as matching text.
  • Before the uploaded data is acquired, initial speech synthesis may be performed on the original text, and the initial synthesized speech of the original text may be output.
  • the process of initial speech synthesis on the original text can use various existing or possible future speech synthesis schemes.
  • the user can determine the incorrectly synthesized text segment in the initial synthesized speech and the correct pronunciation corresponding to that text segment, then use the incorrectly synthesized text segment as the uploaded text and the correct pronunciation corresponding to it as the pronunciation audio corresponding to the uploaded text, and upload them as the upload data.
  • Alternatively, the user can obtain the extended text containing the incorrectly synthesized text segment in the initial synthesized speech and the correct pronunciation corresponding to the extended text, take the extended text as the uploaded text and the correct pronunciation corresponding to the extended text as the pronunciation audio corresponding to the uploaded text, and upload them together as the upload data.
  • the auxiliary synthesis feature can be determined in advance based on the pronunciation audio corresponding to the matching text and stored in the local or a third-party device.
  • the process of obtaining the auxiliary synthesis feature corresponding to the matching text may be to search for the auxiliary synthesis feature corresponding to the pre-stored matching text in the local or third-party storage.
  • Alternatively, the process of obtaining the auxiliary synthesis feature corresponding to the matching text in this step may be: after obtaining the pronunciation audio corresponding to the matching text, the auxiliary synthesis feature is determined based on the pronunciation audio.
  • the matching text in the above step S1 may be obtained by the first method 1), that is, the original text is respectively matched against the template text in each pre-configured resource package and the matching degree is calculated.
  • the implementation process of the above-mentioned step S2 may specifically include:
  • the resource package contains auxiliary synthesis features corresponding to the template text, such as phoneme sequences, prosody information, and phoneme-level prosodic coding.
  • the matching text is the same as the template text or belongs to a partial text segment in the template text. Therefore, auxiliary synthesis features corresponding to the matching text can be extracted from the auxiliary synthesis features corresponding to the template text.
  • the implementation process of the above step S2 may specifically include:
  • the auxiliary synthesis feature corresponding to the matching text is determined.
  • the speech synthesis apparatus provided by the embodiments of the present application is described below, and the speech synthesis apparatus described below and the speech synthesis method described above can be referred to each other correspondingly.
  • FIG. 6 is a schematic structural diagram of a speech synthesis apparatus disclosed in an embodiment of the present application.
  • the apparatus may include:
  • the original text acquisition unit 11 is used to acquire the original text to be synthesized
  • Auxiliary synthesis feature acquisition unit 12 is used to obtain an auxiliary synthesis feature corresponding to a matching text, where the matching text and the original text have matching text fragments, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • the auxiliary speech synthesis unit 13 is configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  • the process of obtaining the auxiliary synthesis feature corresponding to the matching text by the above-mentioned auxiliary synthesis feature obtaining unit may include:
  • the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matched text is acquired.
  • auxiliary synthesis features may include:
  • the phoneme sequence corresponding to the matching text determined based on the pronunciation audio corresponding to the matching text
  • the acoustic feature of the pronunciation audio corresponding to the matched text.
  • the process in which the above-mentioned auxiliary synthesis feature obtaining unit obtains the matching text that has text fragments matching the original text may include:
  • matching texts that match text fragments within the original text are determined.
  • the above preconfigured template text may include:
  • each resource package includes a template text, and an auxiliary synthesis feature corresponding to the template text determined based on the pronunciation audio corresponding to the template text.
  • the process of determining the matching text that matches the text fragment in the original text in the preconfigured template text by the above-mentioned auxiliary synthesis feature obtaining unit may include:
  • in the template texts of the preconfigured resource packages, the matching text that matches the text fragment in the original text is determined.
  • the process in which the above-mentioned auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text may include:
  • the apparatus of the present application may further include: a resource package configuration unit for configuring resource packages, and the process may include:
  • the phoneme sequence and prosody information are used as auxiliary synthesis features corresponding to the template text, and the auxiliary synthesis features and the template text are organized into a resource package.
  • the process of configuring the resource package by the resource package configuration unit may further include:
  • the phoneme-level prosodic encoding is incorporated into the resource bundle.
  • the process in which the above-mentioned resource package configuration unit determines the phoneme-level prosodic coding corresponding to the template text based on the template text and the corresponding pronunciation audio may include:
  • the encoding prediction network and the generation network are trained with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information; when the training ends, the phoneme-level prosodic coding predicted by the trained encoding prediction network is obtained.
  • the process in which the above-mentioned auxiliary synthesis feature obtaining unit obtains the matching text that has text fragments matching the original text may include:
  • the uploaded text in the uploaded data is acquired as the matching text, the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text and the original text have text fragments that match.
  • the process for the auxiliary synthesis feature acquisition unit to obtain the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text may include:
  • the auxiliary synthesis feature corresponding to the matching text is determined.
  • the apparatus of the present application may further include: an initial synthesized speech output unit, configured to output an initial synthesized speech of the original text before acquiring the uploaded text in the uploaded data.
  • the uploaded text is the incorrectly synthesized text segment in the initial synthesized speech
  • the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the incorrectly synthesized text segment
  • or, the uploaded text is the extended text that contains the text fragments synthesized incorrectly in the initial synthesized speech,
  • the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the expanded text.
  • the process in which the above-mentioned auxiliary speech synthesis unit, referring to the auxiliary synthesis feature, performs speech synthesis on the original text to obtain synthesized speech may include:
  • speech synthesis is performed on the original text to obtain synthesized speech.
  • the process in which the above-mentioned auxiliary speech synthesis unit, referring to the auxiliary synthesis feature, performs speech synthesis on the original text to obtain synthesized speech may also include:
  • the phoneme-level prosody coding corresponding to the same text segment is used as a supplementary input of the speech synthesis model to obtain synthesized speech.
  • the process in which the above-mentioned auxiliary speech synthesis unit determines the phoneme sequence of the original text based on the phoneme sequence corresponding to the matched text may include:
  • the pronunciation dictionary is queried to determine the phoneme sequences of the other text segments in the original text except the same text segment, which are combined with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
  • the process in which the above-mentioned auxiliary speech synthesis unit, referring to the auxiliary synthesis feature, performs speech synthesis on the original text to obtain synthesized speech may include:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the target acoustic feature required for predicting the current speech frame is determined;
  • the current speech frame is predicted, and after all speech frames are predicted, synthesized speech is composed of the predicted speech frames.
  • the process in which the above-mentioned auxiliary speech synthesis unit determines the target acoustic feature required for predicting the current speech frame may include:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the correlation degree between the context information and each frame of acoustic features of the pronunciation audio is obtained;
  • based on the correlation degree, the target acoustic feature required for predicting the current speech frame is determined.
  • the process in which the above-mentioned auxiliary speech synthesis unit obtains the correlation degree between the context information and each frame of acoustic features of the pronunciation audio may include:
  • obtaining a first attention weight matrix of the acoustic features to the matched text, the first attention weight matrix including the attention weight of each frame of acoustic features to each text unit in the matched text;
  • obtaining a second attention weight matrix of the context information to the matched text, where the second attention weight matrix includes the attention weight of the context information to each text unit in the matched text;
  • based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information to the acoustic features is obtained, where the third attention weight matrix includes the attention weight of the context information to the acoustic features of each frame, which is used as the correlation degree between the context information and the acoustic features of each frame.
  • the process in which the above-mentioned auxiliary speech synthesis unit determines, based on the correlation degrees, the target acoustic feature required for predicting the current speech frame may include:
  • normalizing the correlation degrees, and using the normalized correlation degrees as weights to compute a weighted sum of the frame-level acoustic features of the pronunciation audio, obtaining the target acoustic feature
  • the process in which the above-mentioned auxiliary speech synthesis unit predicts the current speech frame based on the context information and the determined target acoustic feature may include:
  • fusing the target acoustic feature and the context information, and predicting the current speech frame based on the fusion result
  • FIG. 7 shows a block diagram of the hardware structure of the speech synthesis device.
  • the hardware structure of the speech synthesis device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
  • the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory;
  • the memory stores a program;
  • the processor can call the program stored in the memory, and the program is used for:
  • acquiring the auxiliary synthesis feature corresponding to the matching text, where the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  • for the refinement and extension functions of the program, reference may be made to the description above.
  • An embodiment of the present application further provides a storage medium, where the storage medium stores a program suitable for execution by a processor, and the program is used for:
  • acquiring the auxiliary synthesis feature corresponding to the matching text, where the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech;
  • for the refinement and extension functions of the program, reference may be made to the description above.
  • An embodiment of the present application further provides a computer program product, which, when run on a terminal device, causes the terminal device to execute any one of the above-mentioned speech synthesis methods.

Abstract

A speech synthesis method and apparatus, a device, and a storage medium. In the method, during speech synthesis of original text to be synthesized, an auxiliary synthesis feature corresponding to matching text is referenced, where the matching text contains a text segment matching a text segment of the original text, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined on the basis of the pronunciation audio corresponding to the matching text. By referencing the auxiliary synthesis feature corresponding to the matching text, the pronunciation information in the pronunciation audio corresponding to the matching text can be used to assist the speech synthesis of the original text, thereby enriching the information referenced during the speech synthesis of the original text and improving its speech synthesis quality.

Description

Speech synthesis method, apparatus, device and storage medium
This application claims priority to the Chinese patent application No. 202011607966.3, entitled "Speech Synthesis Method, Apparatus, Device and Storage Medium", filed with the State Intellectual Property Office of China on December 30, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of speech processing, and more particularly, to a speech synthesis method, apparatus, device and storage medium.
Background
In recent years, with the development of information technology and the rise of artificial intelligence, human-computer interaction has become increasingly important. Speech synthesis is a research hotspot of human-computer interaction both at home and abroad. Speech synthesis is the process of synthesizing input original text to be synthesized into speech output.
A traditional speech synthesis model is generally an end-to-end speech synthesis scheme: training text and the corresponding speech data or waveform data are used directly to train the speech synthesis model, and the trained model, given input original text to be synthesized, outputs synthesized speech, or outputs waveform data from which the corresponding synthesized speech is then obtained.
Existing speech synthesis solutions refer only to the original text during synthesis, which makes the synthesized speech error-prone and the synthesis effect poor.
SUMMARY OF THE INVENTION
In view of the above problems, the present application is proposed to provide a speech synthesis method, apparatus, device and storage medium, so as to improve the quality of synthesized speech. The specific solutions are as follows:
In a first aspect of the present application, a speech synthesis method is provided, comprising:
acquiring original text to be synthesized;
acquiring an auxiliary synthesis feature corresponding to matching text, where the matching text and the original text have a matching text segment, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on pronunciation audio corresponding to the matching text;
performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
Preferably, the acquiring of the auxiliary synthesis feature corresponding to the matching text includes:
acquiring matching text having a text segment that matches the original text;
acquiring the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text.
Preferably, the auxiliary synthesis feature includes:
a phoneme sequence corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
prosody information corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
phoneme-level prosody coding corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
acoustic features of the pronunciation audio corresponding to the matching text.
Preferably, the acquiring of matching text having a text segment that matches the original text includes:
determining, among preconfigured template texts, matching text that matches a text segment in the original text.
Preferably, the acquiring of matching text having a text segment that matches the original text includes:
acquiring uploaded text in uploaded data as the matching text, where the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text and the original text have a matching text segment.
Preferably, the preconfigured template texts include:
template texts in respective preconfigured resource packages, where each resource package contains a template text and an auxiliary synthesis feature corresponding to the template text, determined based on the pronunciation audio corresponding to the template text.
Preferably, the determining, among the preconfigured template texts, of matching text that matches a text segment in the original text includes:
performing matching calculation between the original text and the template text in each preconfigured resource package;
determining, in the template text contained in the resource package with the highest matching degree, the matching text that matches the text segment in the original text.
Preferably, the acquiring of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
acquiring the auxiliary synthesis feature corresponding to the matching text contained in the resource package with the highest matching degree.
Preferably, the process of determining a preconfigured resource package includes:
acquiring a preconfigured template text and the corresponding pronunciation audio;
determining, based on the pronunciation audio, the phoneme sequence and prosody information corresponding to the template text;
using the phoneme sequence and prosody information as the auxiliary synthesis feature corresponding to the template text, and organizing the auxiliary synthesis feature and the template text into a resource package.
Preferably, the process of determining a preconfigured resource package further includes:
determining, based on the template text and the corresponding pronunciation audio, the phoneme-level prosody coding corresponding to the template text;
incorporating the phoneme-level prosody coding into the resource package.
Preferably, the determining, based on the template text and the corresponding pronunciation audio, of the phoneme-level prosody coding corresponding to the template text includes:
extracting phoneme-level prosody information based on the template text and the corresponding pronunciation audio;
inputting the template text and the phoneme-level prosody information into a coding prediction network to obtain predicted phoneme-level prosody coding;
inputting the predicted phoneme-level prosody coding and the template text into a generation network to obtain generated phoneme-level prosody information;
training the coding prediction network and the generation network with the goal of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information, and obtaining, at the end of the training, the phoneme-level prosody coding predicted by the trained coding prediction network.
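As an illustration only (not part of the original disclosure), the following Python sketch shows one way the coding prediction network and generation network described above could be trained jointly; the module structures, dimensions, optimizer and MSE objective are all assumptions.

```python
import torch
import torch.nn as nn

class CodingPredictionNetwork(nn.Module):
    """Predicts phoneme-level prosody codes from text and extracted prosody."""
    def __init__(self, text_dim=128, prosody_dim=8, code_dim=32):
        super().__init__()
        self.proj = nn.Linear(text_dim + prosody_dim, code_dim)

    def forward(self, text_emb, prosody):
        # text_emb: (batch, n_phonemes, text_dim); prosody: (batch, n_phonemes, prosody_dim)
        return self.proj(torch.cat([text_emb, prosody], dim=-1))

class GenerationNetwork(nn.Module):
    """Regenerates phoneme-level prosody from the predicted codes and the text."""
    def __init__(self, text_dim=128, code_dim=32, prosody_dim=8):
        super().__init__()
        self.proj = nn.Linear(text_dim + code_dim, prosody_dim)

    def forward(self, text_emb, codes):
        return self.proj(torch.cat([text_emb, codes], dim=-1))

predictor = CodingPredictionNetwork()
generator = GenerationNetwork()
optimizer = torch.optim.Adam(
    list(predictor.parameters()) + list(generator.parameters()), lr=1e-3)
criterion = nn.MSELoss()

def train_step(text_emb, extracted_prosody):
    # Train both networks so that the regenerated prosody approaches the
    # prosody extracted from the template text and its pronunciation audio.
    codes = predictor(text_emb, extracted_prosody)
    generated = generator(text_emb, codes)
    loss = criterion(generated, extracted_prosody)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: one batch of a template text with 10 phonemes.
loss = train_step(torch.randn(1, 10, 128), torch.randn(1, 10, 8))
```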
Preferably, before the acquiring of the uploaded text in the uploaded data, the method further includes:
acquiring and outputting an initial synthesized speech of the original text;
in this case, the uploaded text is a text segment that was synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the incorrectly synthesized text segment;
or, the uploaded text is an extended text containing the text segment that was synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the extended text.
Preferably, the acquiring of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
determining the auxiliary synthesis feature corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
Preferably, the performing of speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech includes:
determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text;
and/or,
determining the prosody information of the original text based on the prosody information corresponding to the matching text;
performing speech synthesis on the original text based on the phoneme sequence and/or prosody information of the original text to obtain synthesized speech.
Preferably, the performing of speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech further includes:
acquiring, based on the phoneme-level prosody coding corresponding to the matching text, the phoneme-level prosody coding corresponding to the same text segment in the matching text and the original text;
during the speech synthesis of the original text, using the phoneme-level prosody coding corresponding to the same text segment as a supplementary input of the speech synthesis model to obtain synthesized speech.
Preferably, the determining of the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text includes:
acquiring, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the same text segment in the matching text and the original text;
querying a pronunciation dictionary to determine the phoneme sequences of the text segments in the original text other than the same text segment, and combining them with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
Preferably, the performing of speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech includes:
processing the original text based on a speech synthesis model to obtain context information for predicting the current speech frame;
determining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the target acoustic feature required for predicting the current speech frame;
predicting the current speech frame based on the context information and the determined target acoustic feature, and, after all speech frames have been predicted, composing the synthesized speech from the predicted speech frames.
Preferably, the determining, based on the context information, the matching text and the acoustic features of the pronunciation audio, of the target acoustic feature required for predicting the current speech frame includes:
obtaining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the correlation degree between the context information and each frame of acoustic features of the pronunciation audio;
determining, based on the correlation degrees, the target acoustic feature required for predicting the current speech frame.
Preferably, the obtaining of the correlation degree between the context information and each frame of acoustic features of the pronunciation audio includes:
acquiring a first attention weight matrix of the acoustic features of the pronunciation audio with respect to the matching text, the first attention weight matrix including the attention weight of each frame of acoustic features to each text unit in the matching text;
acquiring a second attention weight matrix of the context information with respect to the matching text, the second attention weight matrix including the attention weight of the context information to each text unit in the matching text;
obtaining, based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information with respect to the acoustic features, the third attention weight matrix including the attention weight of the context information to each frame of acoustic features, which is used as the correlation degree between the context information and each frame of acoustic features.
Preferably, the determining, based on the correlation degrees, of the target acoustic feature required for predicting the current speech frame includes:
normalizing the correlation degrees, and using the normalized correlation degrees as weights to compute a weighted sum of the frame-level acoustic features of the pronunciation audio, obtaining the target acoustic feature.
Preferably, the predicting of the current speech frame based on the context information and the determined target acoustic feature includes:
determining, based on the current hidden layer vector at the decoding end of the speech synthesis model and the target acoustic feature, a fusion coefficient of the target acoustic feature for predicting the current speech frame;
fusing the target acoustic feature and the context information with reference to the fusion coefficient, and predicting the current speech frame based on the fusion result.
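As a minimal sketch of the fusion step just described (the sigmoid gating form, dimensions and module names are assumptions, not the patent's implementation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuses the target acoustic feature with the decoder context, using a
    fusion coefficient computed from the decoder hidden vector and the
    target acoustic feature (a sigmoid gate is assumed here)."""
    def __init__(self, hidden_dim=256, feat_dim=80):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + feat_dim, 1)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, decoder_hidden, context, target_feat):
        # decoder_hidden, context: (batch, hidden_dim); target_feat: (batch, feat_dim)
        g = torch.sigmoid(self.gate(torch.cat([decoder_hidden, target_feat], dim=-1)))
        # Fusion result, from which the current speech frame would be predicted.
        return g * self.feat_proj(target_feat) + (1 - g) * context

fusion = GatedFusion()
fused = fusion(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 80))
```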
In a second aspect of the present application, a speech synthesis apparatus is provided, comprising:
an original text acquisition unit, configured to acquire original text to be synthesized;
an auxiliary synthesis feature acquisition unit, configured to acquire an auxiliary synthesis feature corresponding to matching text, where the matching text and the original text have a matching text segment, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
an auxiliary speech synthesis unit, configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
In a third aspect of the present application, a speech synthesis device is provided, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech synthesis method described above.
In a fourth aspect of the present application, a storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the speech synthesis method described above are implemented.
In a fifth aspect of the present application, a computer program product is provided, which, when run on a terminal device, causes the terminal device to execute the steps of the speech synthesis method described above.
By virtue of the above technical solutions, the speech synthesis method of the present application, in the process of performing speech synthesis on original text to be synthesized, refers to the auxiliary synthesis feature corresponding to matching text that has a text segment matching the original text, the auxiliary synthesis feature being a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. It can thus be seen that, by referring to the auxiliary synthesis feature corresponding to the matching text, the present application can use the pronunciation information in the pronunciation audio corresponding to the matching text to assist the speech synthesis of the original text, which enriches the information referenced during the speech synthesis of the original text and thereby improves the speech synthesis quality of the original text.
It can be understood that speech synthesis systems can be divided into two types, with front-end preprocessing and without front-end preprocessing, and the solution of the present application is applicable to both. For a speech synthesis system with front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can serve as the analysis result of the speech synthesis front end, or assist in correcting that analysis result, and the analysis result is then sent to the speech synthesis back end to assist the speech synthesis of the original text. For a speech synthesis system without front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can directly serve as reference information when the speech synthesis system synthesizes the original text. For both types of speech synthesis systems, performing the speech synthesis of the original text with reference to the auxiliary synthesis feature of the present application enriches the reference information during speech synthesis and can thereby improve the quality of the synthesized speech.
Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are for the purpose of illustrating the preferred embodiments only and are not to be considered limiting of the present application. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a phoneme sequence extraction model architecture;
FIG. 3 illustrates a schematic diagram of the synthesis flow of a speech synthesis back end;
FIG. 4 illustrates a schematic diagram of a speech synthesis system architecture;
FIG. 5 illustrates a schematic diagram of the process by which a prediction-generation network determines phoneme-level prosody coding;
FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The present application provides a speech synthesis solution that is applicable to various speech synthesis tasks, such as speech synthesis in human-computer interaction scenarios and various other scenarios in which speech synthesis is required.
The solution of the present application can be implemented on a terminal with data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud, or the like.
Next, in conjunction with FIG. 1, the speech synthesis method of the present application may include the following steps:
Step S100: acquire the original text to be synthesized.
Specifically, the original text is the text from which speech is to be synthesized. The original text may be provided by a user, or may be text requiring speech synthesis provided by another device or application.
Step S110: acquire the auxiliary synthesis feature corresponding to the matching text, where the matching text and the original text have a matching text segment.
The matching text may be text that matches the original text or a text segment within the original text. For example, if the original text is "这条裤子不打折" ("these pants are not discounted"), the matching text may be "这条裤子不打折" or "打折" ("discounted"). In addition, the matching text may be text containing a text segment that matches a text segment within the original text. Still taking the above original text as an example, the matching text may be "你这件衣服打折吗" ("is this piece of clothing discounted"), that is, the matching text contains the text segment "打折" that matches the original text.
The matching text may be text preconfigured and stored by the present application. For example, in customer service, interaction and similar scenarios, fixed script texts may be recorded in advance and stored, and the script text matching the original text is then looked up among the stored script texts as the matching text. Taking customer service and interaction scenarios as an example, there are some fixed script texts, such as the prompt texts with which an intelligent customer service agent or a terminal prompts the user, for example "what would you like to inquire about", "hello, how may I help you", or "press 1 to check your phone bill, press 2 to check your data usage". Correspondingly, these fixed script texts can be recorded in advance, and the recordings can be stored together with the script texts as prompt audio.
In addition, the matching text may be text uploaded by a user. For example, when uploading the original text to be synthesized, the user also uploads, as the matching text, the text in the original text that is prone to synthesis errors, and may further upload the pronunciation audio corresponding to the matching text. For another example, after the user uploads the original text to be synthesized, the synthesis system outputs the synthesized initial synthesized speech. The user can identify the incorrectly synthesized text in the initial synthesized speech, record the pronunciation audio corresponding to that text, and upload the incorrectly synthesized text and the corresponding pronunciation audio to the speech synthesis system. Alternatively, the user uploads extended text containing the incorrectly synthesized text, together with the pronunciation audio corresponding to the extended text.
The auxiliary synthesis feature corresponding to the matching text may be a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. The auxiliary synthesis feature contains pronunciation information of the pronunciation audio corresponding to the matching text, such as the phoneme sequence of the pronunciation, pause information, stress, prosody, emotion and other pronunciation information, which can assist the speech synthesis of the original text and improve its speech synthesis quality.
In the pronunciation audio corresponding to the matching text, the pronunciation of the text segment in which the matching text matches the original text is the standard pronunciation of that text segment in the original text. For example, the original text is "这条裤子不打折" and the matching text is "打折"; the pronunciation audio corresponding to the matching text is then the audio corresponding to "da zhe", rather than other pronunciations such as "da she". On this basis, the auxiliary synthesis feature can be determined based on the pronunciation audio corresponding to the matching text, so as to assist the speech synthesis of the original text.
It can be understood that, if the pronunciation audio corresponding to the matching text can be obtained before speech synthesis is performed on the original text, the auxiliary synthesis feature can be determined in advance based on that pronunciation audio and stored locally or on a third-party device. In this case, the process of acquiring the auxiliary synthesis feature corresponding to the matching text in this step may be searching local or third-party storage for the prestored auxiliary synthesis feature corresponding to the matching text.
In addition, if the pronunciation audio corresponding to the matching text is obtained on the fly during the speech synthesis of the original text, the process of acquiring the auxiliary synthesis feature corresponding to the matching text in this step may be determining the auxiliary synthesis feature based on the pronunciation audio once it has been obtained.
Step S120: perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
Specifically, in this step, when performing speech synthesis on the original text, the speech synthesis system may refer not only to the original text but also to the auxiliary synthesis feature corresponding to the matching text, which enriches the information referenced during the speech synthesis of the original text. Meanwhile, since the auxiliary synthesis feature contains pronunciation information of the pronunciation audio corresponding to the matching text, this pronunciation information can assist the speech synthesis of the original text and improve its speech synthesis quality.
In the speech synthesis method provided by the embodiments of the present application, in the process of performing speech synthesis on the original text to be synthesized, reference is made to the auxiliary synthesis feature corresponding to matching text that has a text segment matching the original text, the auxiliary synthesis feature being a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. It can thus be seen that, by referring to the auxiliary synthesis feature corresponding to the matching text, the present application can use the pronunciation information in the pronunciation audio corresponding to the matching text to assist the speech synthesis of the original text, which enriches the information referenced during the speech synthesis of the original text and thereby improves the speech synthesis quality of the original text.
It can be understood that speech synthesis systems can be divided into two types, with front-end preprocessing and without front-end preprocessing, and the solution of the present application is applicable to both. For a speech synthesis system with front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can serve as the analysis result of the speech synthesis front end, or assist in correcting that analysis result, and the analysis result is then sent to the speech synthesis back end to assist the speech synthesis of the original text. For a speech synthesis system without front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can directly serve as reference information when the speech synthesis system synthesizes the original text. For both types of speech synthesis systems, performing the speech synthesis of the original text with reference to the auxiliary synthesis feature of the present application enriches the reference information during speech synthesis and can thereby improve the quality of the synthesized speech.
In some embodiments of the present application, the auxiliary synthesis feature corresponding to the matching text mentioned above, and the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature, are described.
The auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. It contains pronunciation information of the pronunciation audio corresponding to the matching text, and this pronunciation information can assist the speech synthesis of the original text and improve its speech synthesis quality.
Several optional forms of the auxiliary synthesis feature are provided in this embodiment and are introduced separately below:
1) The auxiliary synthesis feature is the phoneme sequence corresponding to the matching text.
Specifically, speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing. A speech synthesis system with front-end preprocessing first performs front-end analysis on the original text before synthesizing it, for example predicting the phoneme sequence corresponding to the original text by querying a pronunciation dictionary; the speech synthesis back end then performs speech synthesis based on the original text and the phoneme sequence.
This processing can improve the quality of speech synthesis to a certain extent; however, when the pre-built pronunciation dictionary contains errors, the back end will synthesize incorrect speech.
For this reason, in this embodiment, the phoneme sequence corresponding to the matching text may be determined based on the pronunciation audio corresponding to the matching text and used as the auxiliary synthesis feature.
It can be understood that the pronunciation audio corresponding to the matching text is the correct pronunciation, so the correct phoneme sequence corresponding to the matching text can be extracted from the pronunciation audio. This correct phoneme sequence can serve as the auxiliary synthesis feature and participate in the speech synthesis process of the original text.
This embodiment provides an implementation of extracting the phoneme sequence from the pronunciation audio corresponding to the matching text.
As shown in FIG. 2, which illustrates a schematic diagram of a phoneme sequence extraction model architecture:
The present application may pre-train a phoneme sequence extraction model for extracting phoneme sequences from pronunciation audio.
The phoneme sequence extraction model may adopt an LSTM (long short-term memory) network architecture, or other optional network architectures such as HMM or CNN. FIG. 2 illustrates a phoneme sequence extraction model adopting an encoder-attention-decoder architecture.
The encoder uses an LSTM network to encode the audio feature sequence (x_1, x_2, ..., x_n) of the pronunciation audio into a hidden encoding sequence (h_1, h_2, ..., h_n). The decoder also uses an LSTM network: at decoding time t, the decoder hidden vector s_t is computed jointly from the input hidden state h_(t-1) at time t-1 and the context vector c_(t-1) computed by the attention module, and the phoneme y_t at time t is then obtained by projection. Decoding stops when the special end-of-sequence symbol is decoded, yielding the phoneme sequence (y_1, y_2, ..., y_t).
An illustrative example: when the matching text is "这件衣服不打折" ("this piece of clothing is not discounted"), the phoneme sequence extracted from the pronunciation audio corresponding to the matching text is: [zh e4 j ian4 i1 f u7 b u4 d a3 zh e2].
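As an illustration only, the following Python sketch mirrors the encoder-attention-decoder extraction loop described above; the dimensions, the dot-product attention and the greedy decoding are assumptions, and this is not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhonemeExtractor(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_phonemes=100, sos_id=0, eos_id=1):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.decoder_cell = nn.LSTMCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, n_phonemes)
        self.sos_id, self.eos_id = sos_id, eos_id

    def forward(self, audio_feats, max_len=100):
        # audio_feats: (1, n_frames, feat_dim) -> hidden encodings (h_1, ..., h_n)
        enc, _ = self.encoder(audio_feats)
        h = enc.new_zeros(1, enc.size(-1))
        c = enc.new_zeros(1, enc.size(-1))
        y = torch.full((1,), self.sos_id, dtype=torch.long)
        phonemes = []
        for _ in range(max_len):
            # Attention of the decoder state over the encoder outputs gives
            # the context vector for this decoding step.
            scores = torch.einsum("bh,bnh->bn", h, enc)
            ctx = (F.softmax(scores, dim=-1).unsqueeze(-1) * enc).sum(dim=1)
            h, c = self.decoder_cell(torch.cat([self.embed(y), ctx], dim=-1), (h, c))
            y = self.out(h).argmax(dim=-1)   # phoneme y_t obtained by projection
            if y.item() == self.eos_id:      # stop at the special end symbol
                break
            phonemes.append(y.item())
        return phonemes

extractor = PhonemeExtractor()
print(extractor(torch.randn(1, 120, 80)))  # untrained, so the output is arbitrary
```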
When the auxiliary synthesis feature is a phoneme sequence, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Determine the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text.
Specifically, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the same text segment in the matching text and the original text may be acquired.
For example, the same text segment in the matching text and the original text is determined, and the phoneme sequence corresponding to that segment is then extracted from the phoneme sequence corresponding to the matching text.
Further, a pronunciation dictionary is queried to determine the phoneme sequences of the text segments in the original text other than the same text segment, and these are combined with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
Of course, the initial phoneme sequence corresponding to the original text may also be determined by querying the pronunciation dictionary, and the phoneme sequence corresponding to the same text segment, extracted from the phoneme sequence corresponding to the matching text, may be used to replace the phoneme sequence of that segment within the initial phoneme sequence, yielding the replaced phoneme sequence corresponding to the original text. A sketch of this strategy is shown below.
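The following plain Python sketch illustrates the combination strategy; the character-level tokenization, the toy dictionary entries (including the deliberately error-prone entry for "折") and the function name are assumptions made for illustration.

```python
from typing import Dict, List

def phonemes_for_original(original: str,
                          segment: str,
                          segment_phonemes: List[str],
                          dictionary: Dict[str, List[str]]) -> List[str]:
    """Build the original text's phoneme sequence, using audio-derived phonemes
    for the segment shared with the matching text and dictionary lookups for
    the remaining characters."""
    start = original.find(segment)
    phonemes: List[str] = []
    for i, ch in enumerate(original):
        if start != -1 and start <= i < start + len(segment):
            if i == start:
                phonemes.extend(segment_phonemes)  # correct, audio-derived phonemes
            continue
        phonemes.extend(dictionary.get(ch, []))    # dictionary phonemes elsewhere
    return phonemes

# Toy usage: the shared segment "打折" uses the phonemes extracted from the
# pronunciation audio instead of the (possibly wrong) dictionary entry.
dictionary = {"这": ["zh e4"], "条": ["t iao2"], "裤": ["k u4"], "子": ["z i7"],
              "不": ["b u4"], "打": ["d a3"], "折": ["sh e2"]}
print(phonemes_for_original("这条裤子不打折", "打折", ["d a3", "zh e2"], dictionary))
```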
S2. Perform speech synthesis on the original text based on its phoneme sequence to obtain synthesized speech.
Specifically, the phoneme sequence of the original text may be used as the text analysis result of the speech synthesis front end and sent to the speech synthesis back end to assist the speech synthesis of the original text.
Since the phoneme sequence of the original text obtained in this embodiment contains the phoneme sequence corresponding to the matching text, and that part of the sequence is determined based on the correct pronunciation audio corresponding to the matching text, assisting speech synthesis with the phoneme sequence of the original text can improve the accuracy of the synthesized speech; in particular, for polyphonic and error-prone characters, the accuracy of the synthesized speech is greatly improved.
2) The auxiliary synthesis feature is the prosody information corresponding to the matching text.
As introduced above, the speech synthesis front end can perform text analysis on the original text, and this text analysis process can also predict the prosody information of the original text, so that the synthesis back end performs speech synthesis based on the original text and the prosody information. By taking prosody information into account, the naturalness of the synthesized speech can be improved.
It can be understood that the prosody information predicted for the original text may also be wrong, which in turn leads to errors in the prosody of the back-end synthesized speech and affects the quality of the synthesized speech.
For this reason, in this embodiment, the prosody information corresponding to the matching text may be determined based on the pronunciation audio corresponding to the matching text and used as the auxiliary synthesis feature. Here, the prosody information corresponding to the matching text may be phoneme-level prosody information, which includes the prosody information of each phoneme unit in the phoneme sequence corresponding to the matching text.
It can be understood that the pronunciation audio corresponding to the matching text is the correct pronunciation, so the correct prosody information corresponding to the matching text can be extracted from the pronunciation audio. This correct prosody information can serve as the auxiliary synthesis feature and participate in the speech synthesis process of the original text. For example, the corrected prosody information of the original text is determined based on this correct prosody information and then sent to the synthesis back end for speech synthesis.
When the auxiliary synthesis feature is prosody information, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Determine the prosody information of the original text based on the prosody information corresponding to the matching text.
Specifically, based on the prosody information corresponding to the matching text, the prosody information corresponding to the same text segment in the matching text and the original text may be acquired.
Further, prosody prediction technology may be used to predict the prosody information of the text segments in the original text other than the same text segment, and this is combined with the prosody information corresponding to the same text segment to obtain the prosody information of the original text.
S2. Perform speech synthesis on the original text based on its prosody information to obtain synthesized speech.
In another case, when the auxiliary synthesis feature contains both the phoneme sequence and the prosody information, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Determine the phoneme sequence and prosody information of the original text based on the phoneme sequence and prosody information corresponding to the matching text.
S2. Perform speech synthesis on the original text based on its phoneme sequence and prosody information to obtain synthesized speech.
3) The auxiliary synthesis feature is the phoneme-level prosody coding corresponding to the matching text.
Specifically, the phoneme-level prosody coding corresponding to the matching text contains some pronunciation information of the pronunciation audio corresponding to the matching text, such as prosodic features like pronunciation duration and stress or emphasis.
When performing speech synthesis, the speech synthesis back end can model the prosody information of the original text, thereby improving the naturalness of the synthesized speech. In this embodiment, in order to improve the accuracy with which the speech synthesis back end models the prosody information of the original text, the phoneme-level prosody coding corresponding to the matching text may be used as the auxiliary synthesis feature and sent to the speech synthesis back end to assist the speech synthesis.
It can be understood that the phoneme-level prosody coding corresponding to the matching text contains the correct pronunciation information corresponding to the matching text. When the speech synthesis back end performs speech synthesis based on the phoneme-level prosody coding corresponding to the matching text, it can, for the same text segment shared by the original text and the matching text, synthesize speech consistent with the pronunciation audio of the matching text.
Meanwhile, the speech synthesis back end performs convolution and other processing on the original text; for the text segments in the original text other than the same text segment, this processing also refers to the phoneme-level prosody coding corresponding to the same text segment, so that the phoneme-level prosody coding of the same text segment assists in improving the speech synthesis quality of the remaining text segments in the original text.
In addition, in some prior art, speech synthesis is performed only on the non-identical text segments in the original text, and the synthesized speech of the non-identical text segments is then spliced with preconfigured speech of the identical text segments to obtain the overall synthesized speech corresponding to the original text. This processing leads to inconsistent timbre in the overall synthesized speech of the original text and reduces the quality of the synthesized speech.
In contrast, the speech synthesis system of the present application remains a complete synthesis system; by performing speech synthesis on the original text as a whole, it can ensure that the timbre of the synthesized speech is consistent.
Further, depending on the different ways in which the speech synthesis back end models prosody information, the phoneme-level prosody coding in this embodiment may also differ.
FIG. 3 illustrates a schematic diagram of the synthesis flow of a speech synthesis back end.
As can be seen from FIG. 3, the speech synthesis back end includes a duration model and an acoustic model, which model the duration prosody information and the acoustic parameter prosody information respectively.
Then, in order to adapt to the model structure of the speech synthesis back end shown in FIG. 3, the phoneme-level prosody coding corresponding to the matching text in this embodiment of the present application may include duration coding and acoustic parameter coding.
When the prosody coding corresponding to the matching text is sent to the speech synthesis back end to assist speech synthesis, the duration coding may be sent to the duration model to assist phoneme-level duration modeling, and the acoustic parameter coding may be sent to the acoustic model to assist phoneme-level acoustic parameter modeling.
The acoustic parameter coding may include one or more different acoustic parameter codings, for example fundamental frequency acoustic parameter coding or other acoustic parameter codings.
On the basis that the auxiliary synthesis feature of the foregoing examples includes the phoneme sequence and prosody information, and further when the auxiliary synthesis feature also includes phoneme-level prosody coding, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may further include:
S3. Acquire, based on the phoneme-level prosody coding corresponding to the matching text, the phoneme-level prosody coding corresponding to the same text segment in the matching text and the original text.
Specifically, the same text segment in the matching text and the original text may be determined, and the phoneme-level prosody coding corresponding to that segment may then be extracted from the phoneme-level prosody coding corresponding to the matching text.
S4. During the speech synthesis of the original text, use the phoneme-level prosody coding corresponding to the same text segment as a supplementary input of the speech synthesis model to obtain synthesized speech.
Still taking FIG. 3 as an example:
The phoneme-level prosody coding includes duration coding and acoustic parameter coding.
Then, during the speech synthesis of the original text, the speech synthesis back end may send the duration coding corresponding to the same text segment into the duration model for phoneme-level duration modeling, and send the acoustic parameter coding corresponding to the same text segment into the acoustic model for phoneme-level acoustic parameter modeling, with the speech synthesis back end finally producing the synthesized speech.
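As an illustration only, the following sketch shows one way the duration coding and acoustic parameter coding of the same text segment could enter a duration model and an acoustic model as supplementary inputs; the zero vectors for phonemes not covered by the segment, the concatenation scheme and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BackEnd(nn.Module):
    def __init__(self, phone_dim=64, dur_code_dim=8, ac_code_dim=8, feat_dim=80):
        super().__init__()
        # Stand-ins for the duration model and acoustic model of FIG. 3.
        self.duration_model = nn.Linear(phone_dim + dur_code_dim, 1)
        self.acoustic_model = nn.Linear(phone_dim + ac_code_dim, feat_dim)

    def forward(self, phone_emb, dur_code, ac_code):
        # phone_emb: (n_phonemes, phone_dim); the codes are zero everywhere
        # except at the positions of the segment shared with the matching text.
        durations = self.duration_model(torch.cat([phone_emb, dur_code], dim=-1))
        acoustics = self.acoustic_model(torch.cat([phone_emb, ac_code], dim=-1))
        return durations, acoustics

n = 7  # phonemes of the original text (toy example)
phone_emb = torch.randn(n, 64)
dur_code = torch.zeros(n, 8)
ac_code = torch.zeros(n, 8)
dur_code[5:7] = torch.randn(2, 8)  # prosody codes extracted for the shared segment
ac_code[5:7] = torch.randn(2, 8)
durations, acoustics = BackEnd()(phone_emb, dur_code, ac_code)
```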
4) The auxiliary synthesis feature is an acoustic feature of the pronunciation audio corresponding to the matching text.
As described above, speech synthesis systems fall into two types: with and without front-end preprocessing. A system without front-end preprocessing performs no front-end analysis of the original text and synthesizes speech from it directly. To better control the quality of the speech synthesized from the original text, in this embodiment the acoustic features of the pronunciation audio corresponding to the matching text may be used as auxiliary synthesis features and fed into the speech synthesis system to assist synthesis of the original text. These acoustic features carry the pronunciation information of the matching text's audio, so while synthesizing the original text frame by frame the speech synthesis system can extract from them the acoustic features associated with each frame to assist synthesis of that frame. This corrects pronunciation errors, for example those of error-prone rare characters, special symbols, polyphonic characters and loanwords, and ultimately yields higher-quality synthesized speech.
The acoustic features include, but are not limited to, cepstral features of the pronunciation audio.
When the auxiliary synthesis feature is an acoustic feature of the pronunciation audio corresponding to the matching text, the process of the above step S120 of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Process the original text with the speech synthesis model to obtain the context information for predicting the current speech frame.
Specifically, the speech synthesis model may adopt an encoder-decoder architecture, with an attention module connecting the encoder and decoder layers. Passing the original text through the encoder-decoder architecture and the attention module yields the context information C_t needed to synthesize the current speech frame y_t. The context information C_t indicates the text information in the original text that is needed to synthesize the current speech frame y_t.
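As a minimal sketch of how such a context vector might be computed, assuming a simple dot-product attention over the encoder outputs (the actual attention variant is not specified here):

```python
import numpy as np

def attention_context(decoder_state, encoder_outputs):
    """Dot-product attention: one context vector C_t for the current frame.

    decoder_state:   (d,)    current decoder hidden state s_t
    encoder_outputs: (T, d)  encoded representation of the original text
    """
    scores = encoder_outputs @ decoder_state   # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over text positions
    return weights @ encoder_outputs           # (d,) context vector C_t
```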
S2. Based on the context information, the matching text and the acoustic features of the pronunciation audio, determine the target acoustic feature needed to predict the current speech frame.
In an optional implementation, step S2 may include:
S21. Based on the context information, the matching text and the acoustic features of the pronunciation audio, obtain the degree of association between the context information and each frame of the acoustic features of the pronunciation audio.
Specifically, the similarity between the context information and the matching text can be obtained through the attention mechanism, and the association between each frame of acoustic features and the matching text can be obtained from the attention matrix of the pronunciation audio's acoustic features over the matching text. On that basis, from the similarity between the context information and the matching text together with the association between each frame of acoustic features and the matching text, the degree of association between the context information and each frame of acoustic features can be obtained; this degree of association indicates how close the context information is to each frame of acoustic features. Understandably, when the context information is strongly associated with the acoustic features of a target frame, the pronunciation of the text corresponding to the context information is strongly correlated with the acoustic features of that frame.
An optional implementation of step S21 is introduced next, and may include the following steps:
S211. Obtain a first attention weight matrix W_mx of the acoustic features of the pronunciation audio over the matching text.
The first attention weight matrix W_mx contains the attention weights of each frame of acoustic features over the text units of the matching text. The matrix W_mx has size T_my × T_mx, where T_my is the frame length of the acoustic features of the pronunciation audio and T_mx is the length of the matching text.
S212. Obtain a second attention weight matrix W_cmx of the context information C_t over the matching text.
The second attention weight matrix W_cmx contains the attention weights of the context information C_t over the text units of the matching text. The matrix W_cmx has size 1 × T_mx.
S213. Based on the first attention weight matrix W_mx and the second attention weight matrix W_cmx, obtain a third attention weight matrix W_cmy of the context information C_t over the acoustic features.
The third attention weight matrix W_cmy contains the attention weights of the context information C_t over each frame of acoustic features, which serve as the degrees of association between the context information and each frame of acoustic features. The matrix W_cmy has size 1 × T_my and can be expressed as:
W_cmy = W_cmx * W_mx′
where W_mx′ denotes the transpose of the matrix W_mx.
S22. Based on the degrees of association, determine the target acoustic feature needed to predict the current speech frame.
Specifically, after the degrees of association between the context information and each frame of the pronunciation audio's acoustic features have been obtained in the above steps, the degrees of association may first be normalized, and the normalized degrees may then be used as weights in a weighted sum of the per-frame acoustic features of the pronunciation audio, yielding the target acoustic feature needed to predict the current speech frame. The target acoustic feature may be denoted C_mt.
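The composition of the attention matrices and the subsequent normalization and pooling in steps S213 and S22 can be sketched in a few lines of NumPy; this is a minimal illustration of the formulas above, with the shapes as defined, not the patented implementation:

```python
import numpy as np

def target_acoustic_feature(W_cmx, W_mx, acoustic_frames):
    """Compose the attention matrices and pool the reference audio frames.

    W_cmx:           (1, T_mx)    context-over-matching-text weights
    W_mx:            (T_my, T_mx) audio-frames-over-matching-text weights
    acoustic_frames: (T_my, d)    per-frame acoustic features of the audio
    """
    W_cmy = W_cmx @ W_mx.T            # (1, T_my): association per frame
    w = np.exp(W_cmy - W_cmy.max())
    w /= w.sum()                      # softmax normalization of the degrees
    return (w @ acoustic_frames)[0]   # (d,): target acoustic feature C_mt
```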
S3. Based on the context information and the determined target acoustic feature, predict the current speech frame; after all speech frames have been predicted, the predicted frames make up the synthesized speech.
Understandably, although the original text and the matching text share matching text segments, the original text is not necessarily identical to the matching text. Consequently, the target acoustic feature C_mt obtained in the above steps for predicting the current speech frame is only of use when synthesizing the segments of the original text that are identical to the matching text; synthesizing the remaining segments does not need the target acoustic feature C_mt. This embodiment therefore provides a solution whereby, during speech synthesis of the original text, the amount of information drawn from the referenced target acoustic feature C_mt can be controlled for each speech frame to be predicted. The specific implementation may include:
S31. Based on the current hidden-layer vector of the speech synthesis model's decoder and the target acoustic feature C_mt, determine the fusion coefficient a_gate of the target acoustic feature C_mt for predicting the current speech frame.
Specifically, in this embodiment a gating mechanism or another strategy may be used to decide the fusion coefficient a_gate of the target acoustic feature C_mt when predicting the current speech frame. Taking the gating mechanism as an example, a_gate can be expressed as:
a_gate = sigmoid(g_g(C_mt, s_t))
where s_t is the current hidden-layer vector of the decoder and g_g(·) is a preset functional relationship.
S32. With reference to the fusion coefficient a_gate, fuse the target acoustic feature C_mt with the context information C_t, and predict the current speech frame based on the fusion result.
Specifically, the current speech frame y_t can be expressed as:
y_t = g(y_{t-1}, s_t, (1 - a_gate) * C_t + a_gate * C_mt)
where g(·) is a preset functional relationship.
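A minimal sketch of the gating of steps S31 and S32 follows; `W_gate` is a hypothetical learned projection standing in for the preset function g_g(·), and the frame-prediction function g(·) itself is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(C_t, C_mt, s_t, W_gate):
    """Gate how much reference-audio information enters frame prediction.

    C_t, C_mt: (d,) text context and target acoustic feature (same dim
               assumed here for the convex combination); s_t: decoder state.
    """
    a_gate = sigmoid(W_gate @ np.concatenate([C_mt, s_t]))
    # Intuitively, for segments absent from the matching text the learned
    # gate should approach 0, so prediction falls back to the text context.
    return (1.0 - a_gate) * C_t + a_gate * C_mt
```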
Figure 4 illustrates a schematic architecture of a speech synthesis system.
The speech synthesis system of Figure 4 adopts an end-to-end synthesis flow of encoding and decoding plus an attention mechanism.
The original text is encoded by the encoder into an encoding vector of the original text, and the context information C_t needed to predict the current speech frame y_t is obtained through a first attention module.
The matching text is encoded by the encoder into an encoding vector of the matching text. Further, the attention weights of the context information C_t over the text units of the matching text can be obtained through a second attention module, forming the second attention weight matrix.
Meanwhile, in this embodiment the attention weights of the acoustic features of the matching text's pronunciation audio over the matching text can be obtained, forming the first attention weight matrix. Based on the first and second attention weight matrices, the third attention weight matrix of the context information C_t over the acoustic features is then obtained; it contains the degree of association between C_t and each frame of acoustic features. The third attention weight matrix is softmax-normalized and used in a weighted sum of the per-frame acoustic features of the pronunciation audio, yielding the target acoustic feature C_mt needed to predict the current speech frame y_t.
The decoder can then predict the current speech frame y_t based on the target acoustic feature C_mt and the context information C_t.
For the expression by which the decoder predicts the current speech frame y_t, refer to the related description above.
Each predicted speech frame is mapped to synthesized speech by a vocoder.
In some embodiments of the present application, the process of the foregoing step S110 of obtaining the auxiliary synthesis features corresponding to the matching text is introduced. Specifically, the process may include:
S1. Obtain a matching text that contains text segments matching the original text.
This embodiment provides two different implementations, introduced separately below:
1) In an optional implementation, a large number of fixed scripted texts in the speech synthesis scenario may be collected and recorded in advance; the collected scripted texts serve as template texts, and the template texts are stored together with their corresponding pronunciation audio. Alternatively, the auxiliary synthesis features are determined from the pronunciation audio of a template text, and the template text is then stored together with its auxiliary synthesis features.
On this basis, the implementation of step S1 may include:
S11. Among the preconfigured, stored template texts, determine the matching text that matches text segments in the original text.
Optionally, in this embodiment the collected template texts and their corresponding pronunciation audio may be organized and packaged into resource packages. Specifically, each resource package contains one template text and the auxiliary synthesis features corresponding to that template text, determined based on its pronunciation audio.
The auxiliary synthesis features may include the phoneme sequence and prosody information corresponding to the template text. Further, the auxiliary synthesis features may also include the phoneme-level prosody codes corresponding to the template text.
For example:
The template text is "欢迎使用人工智能语音助手" ("Welcome to the AI voice assistant").
Based on the pronunciation audio corresponding to the template text, the auxiliary synthesis features that can be determined may include the template text's phoneme sequence, prosody information, phoneme-level prosody codes, and the like. The template text and the auxiliary synthesis features can then be packaged into one resource package.
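For illustration only, a resource package of this kind might be represented as follows; the field names are hypothetical and the prosody annotation is truncated:

```python
from dataclasses import dataclass, field

@dataclass
class ResourcePackage:
    """One preconfigured package: a template text plus the auxiliary
    synthesis features derived from its recorded pronunciation audio."""
    template_text: str
    phoneme_sequence: list[str]
    prosody_info: str                               # annotated text, format below
    prosody_codes: list[list[float]] = field(default_factory=list)

pkg = ResourcePackage(
    template_text="欢迎使用人工智能语音助手",
    phoneme_sequence=["huan1", "ying2", "shi3", "yong4", "ren2", "gong1",
                      "zhi4", "neng2", "yu3", "yin1", "zhu4", "shou3"],
    prosody_info="欢[=huan1]迎[=ying2][w1]...",      # truncated for brevity
)
```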
Taking the prosody information of the template text as an example, an exemplary format can be as follows:
"欢[=huan1]迎[=ying2][w1]使[=shi3]用[=yong4][w3]人[=ren2]工[=gong1]智[=zhi4]能[=neng2][w1]语[=yu3]音[=yin1][w1]助[=zhu4]手[=shou3]"
Here the pronunciation of each character is specified by [=pinyin], and "[w1]" and "[w3]" denote different prosodic pause levels.
Understandably, the above is merely one way of representing prosody information exemplified in the present application; those skilled in the art may also use other markup formats to represent the prosody information of the template text.
A packaged resource package can be encoded into a binary resource file, reducing storage usage and facilitating processing and recognition by the subsequent speech synthesis system.
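A minimal parser for the exemplary markup above might look as follows; it assumes exactly the [=pinyin] and [wN] conventions shown and would need adapting for any other markup format:

```python
import re

def parse_prosody_markup(markup):
    """Parse markup of the form 字[=pinyin] with optional [wN] pauses.

    Returns (pinyins, pauses), where pauses[i] is the pause level after
    the i-th character (0 if no pause marker follows it).
    """
    pinyins, pauses = [], []
    for pinyin, pause in re.findall(r"\[=([a-z]+\d)\](?:\[w(\d)\])?", markup):
        pinyins.append(pinyin)
        pauses.append(int(pause) if pause else 0)
    return pinyins, pauses

# parse_prosody_markup("欢[=huan1]迎[=ying2][w1]") -> (["huan1", "ying2"], [0, 1])
```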
The process of determining the phoneme-level prosody codes corresponding to a template text is introduced with reference to Figure 5.
As shown in Figure 5, the phoneme-level prosody codes corresponding to the template text can be determined based on a coding prediction network and a generation network; specifically, the following steps may be included:
A1. Based on the template text and the corresponding pronunciation audio, extract phoneme-level prosody information.
A2. Input the template text and the phoneme-level prosody information into the coding prediction network to obtain predicted phoneme-level prosody codes.
A3. Input the predicted phoneme-level prosody codes and the template text into the generation network to obtain generated phoneme-level prosody information.
A4. Train the coding prediction network and the generation network with the objective of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information; when training ends, the phoneme-level prosody codes predicted by the trained coding prediction network are obtained as those corresponding to the template text.
Specifically, training toward this objective may consist of computing the mean square error (MSE) between the generated phoneme-level prosody information and the extracted phoneme-level prosody information, and adjusting the network parameters through iterative training; once the MSE reaches a preset threshold, training can end.
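A schematic training loop for steps A1–A4 might look as follows (PyTorch-style; `encoder` and `generator` stand for the coding prediction network and the generation network, whose architectures are not specified here and are assumed to accept these arguments):

```python
import torch
import torch.nn as nn

def train_prosody_codes(encoder, generator, text_feats, prosody_target,
                        threshold=1e-3, max_steps=10000, lr=1e-3):
    """Jointly train the coding prediction network and the generation network
    until the MSE between generated and extracted prosody information falls
    below a preset threshold, then return the predicted prosody codes."""
    params = list(encoder.parameters()) + list(generator.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(max_steps):
        codes = encoder(text_feats, prosody_target)   # predicted prosody codes
        generated = generator(codes, text_feats)      # reconstructed prosody
        loss = loss_fn(generated, prosody_target)     # MSE objective (A4)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < threshold:
            break
    return codes.detach()   # codes stored with the template's resource package
```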
Further, based on the above preconfigured resource packages, the implementation of the above step S11 of determining, among the preconfigured, stored template texts, the matching text that matches text segments in the original text may include:
S111. Perform a matching computation between the original text and the template text in each preconfigured resource package.
S112. In the template text contained in the resource package with the highest matching degree, determine the matching text that matches text segments in the original text.
Specifically, the matching computation may first determine whether a template text exists that completely matches the original text; if so, the completely matching template text is taken as the matching text. If not, partial matching may be performed, for example by starting from one or both ends of the original text and searching the template text of each resource package for the maximum-length matching text, which is taken as the matching text.
For example, suppose the original text is "请问您是王宁吗？" ("Excuse me, are you Wang Ning?"). When it is matched against the template texts in the resource packages, no identical template text is found, but the template text "请问您是刘武吗？" ("Excuse me, are you Liu Wu?") is matched. Maximum-length matching of the original text against this template text yields the matching texts "请问您是" and "吗？".
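The maximum-length matching from both ends illustrated by this example can be sketched as follows:

```python
def end_matches(original, template):
    """Maximum-length matching texts from both ends, as in the example above."""
    # Longest common prefix.
    p = 0
    while p < min(len(original), len(template)) and original[p] == template[p]:
        p += 1
    # Longest common suffix that does not overlap the prefix of the original.
    s = 0
    while (s < min(len(original), len(template)) - p
           and original[-1 - s] == template[-1 - s]):
        s += 1
    return original[:p], (original[len(original) - s:] if s else "")

# end_matches("请问您是王宁吗？", "请问您是刘武吗？") -> ("请问您是", "吗？")
```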
2) In another optional implementation, the present application may obtain data uploaded by a user. The uploaded data contains an uploaded text and the pronunciation audio corresponding to the uploaded text. The uploaded text contains text segments matching the original text, and may therefore be taken as the matching text.
In an optional scenario, after the original text to be synthesized is obtained in the above step S100, initial speech synthesis may be performed and the initial synthesized speech of the original text output. The initial speech synthesis of the original text may use any existing speech synthesis scheme or one that may appear in the future. After receiving the initial synthesized speech, the user can identify the incorrectly synthesized text segment in it and determine the correct pronunciation of that segment; the user may then upload the incorrectly synthesized text segment as the uploaded text and its correct pronunciation as the corresponding pronunciation audio, together forming the uploaded data. Alternatively, the user may obtain an extended text containing the incorrectly synthesized text segment of the initial synthesized speech, together with the correct pronunciation of the extended text, and upload the extended text as the uploaded text and its correct pronunciation as the corresponding pronunciation audio, together forming the uploaded data.
S2. Obtain the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text.
Specifically, as described above, if the pronunciation audio corresponding to the matching text can be obtained before speech synthesis of the original text, the auxiliary synthesis features may be determined in advance from that audio and stored locally or on a third-party device. In that case, obtaining the auxiliary synthesis features corresponding to the matching text in this step may consist of looking up the prestored auxiliary synthesis features in local or third-party storage.
In addition, if the pronunciation audio corresponding to the matching text is obtained on the fly during speech synthesis of the original text, obtaining the auxiliary synthesis features in this step may consist of determining them from the pronunciation audio once it has been obtained.
It should be noted that, if the matching text in the above step S1 is obtained in the first manner 1), that is, by matching the original text against the template text of each preconfigured resource package and determining, in the template text of the resource package with the highest matching degree, the matching text that matches text segments in the original text, then the implementation of the above step S2 may specifically include:
S21. Obtain the auxiliary synthesis features corresponding to the matching text contained in the resource package with the highest matching degree.
Understandably, a resource package contains the auxiliary synthesis features corresponding to its template text, such as the phoneme sequence, prosody information and phoneme-level prosody codes. Since the matching text is identical to the template text or is a partial segment of it, the auxiliary synthesis features corresponding to the matching text can be extracted from those corresponding to the template text.
Further, if the matching text in the above step S1 is obtained in the second manner 2), that is, by taking the uploaded text in the user's uploaded data as the matching text, then the implementation of the above step S2 may specifically include:
determining the auxiliary synthesis features corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
The speech synthesis apparatus provided by the embodiments of the present application is described below; the apparatus described below and the speech synthesis method described above may be referred to correspondingly.
Referring to Figure 6, Figure 6 is a schematic structural diagram of a speech synthesis apparatus disclosed in an embodiment of the present application.
As shown in Figure 6, the apparatus may include:
an original text obtaining unit 11, configured to obtain the original text to be synthesized;
an auxiliary synthesis feature obtaining unit 12, configured to obtain auxiliary synthesis features corresponding to a matching text, where the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
an auxiliary speech synthesis unit 13, configured to perform speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
Optionally, the process by which the auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis features corresponding to the matching text may include:
obtaining a matching text that contains text segments matching the original text;
obtaining the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text.
Optionally, the auxiliary synthesis features may include:
the phoneme sequence corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
the prosody information corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
the phoneme-level prosody codes corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
the acoustic features of the pronunciation audio corresponding to the matching text.
Optionally, the process by which the auxiliary synthesis feature obtaining unit obtains a matching text containing text segments matching the original text may include:
determining, among preconfigured template texts, the matching text that matches text segments in the original text.
Optionally, the preconfigured template texts may include:
the template texts in the preconfigured resource packages, where each resource package contains one template text and the auxiliary synthesis features corresponding to that template text, determined based on the pronunciation audio corresponding to the template text.
Optionally, the process by which the auxiliary synthesis feature obtaining unit determines, among the preconfigured template texts, the matching text that matches text segments in the original text may include:
performing a matching computation between the original text and the template text in each preconfigured resource package;
determining, in the template text contained in the resource package with the highest matching degree, the matching text that matches text segments in the original text.
Optionally, the process by which the auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text may include:
obtaining the auxiliary synthesis features corresponding to the matching text contained in the resource package with the highest matching degree.
Optionally, the apparatus of the present application may further include a resource package configuration unit, configured to configure resource packages; the process may include:
obtaining a preconfigured template text and the corresponding pronunciation audio;
determining, based on the pronunciation audio, the phoneme sequence and prosody information corresponding to the template text;
taking the phoneme sequence and prosody information as the auxiliary synthesis features corresponding to the template text, and organizing the auxiliary synthesis features and the template text into one resource package.
Optionally, the resource package configuration unit's process of configuring resource packages may further include:
determining, based on the template text and the corresponding pronunciation audio, the phoneme-level prosody codes corresponding to the template text;
merging the phoneme-level prosody codes into the resource package.
Optionally, the process by which the resource package configuration unit determines, based on the template text and the corresponding pronunciation audio, the phoneme-level prosody codes corresponding to the template text may include:
extracting phoneme-level prosody information based on the template text and the corresponding pronunciation audio;
inputting the template text and the phoneme-level prosody information into a coding prediction network to obtain predicted phoneme-level prosody codes;
inputting the predicted phoneme-level prosody codes and the template text into a generation network to obtain generated phoneme-level prosody information;
training the coding prediction network and the generation network with the objective of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information, and, when training ends, obtaining the phoneme-level prosody codes predicted by the trained coding prediction network.
In another optional case, the process by which the auxiliary synthesis feature obtaining unit obtains a matching text containing text segments matching the original text may include:
obtaining the uploaded text in uploaded data as the matching text, where the uploaded data further includes the pronunciation audio corresponding to the uploaded text, and the uploaded text contains text segments matching the original text. On this basis, the process by which the auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text may include:
determining the auxiliary synthesis features corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
Optionally, the apparatus of the present application may further include an initial synthesized speech output unit, configured to output the initial synthesized speech of the original text before the uploaded text in the uploaded data is obtained. On this basis, the uploaded text is the incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of that incorrectly synthesized text segment; or, the uploaded text is an extended text containing the incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the extended text.
Optionally, when the auxiliary synthesis features include the phoneme sequence and/or prosody information corresponding to the matching text, the process by which the auxiliary speech synthesis unit performs speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech may include:
determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text;
and/or,
determining the prosody information of the original text based on the prosody information corresponding to the matching text;
performing speech synthesis on the original text based on the phoneme sequence and/or prosody information of the original text to obtain synthesized speech.
Further optionally, when the auxiliary synthesis features also include the phoneme-level prosody codes corresponding to the matching text, the process by which the auxiliary speech synthesis unit performs speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech may further include:
obtaining, based on the phoneme-level prosody codes corresponding to the matching text, the phoneme-level prosody codes corresponding to the text segments shared by the matching text and the original text;
during speech synthesis of the original text, using the phoneme-level prosody codes corresponding to the shared text segments as a supplementary input of the speech synthesis model to obtain synthesized speech.
Optionally, the process by which the auxiliary speech synthesis unit determines the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text may include:
obtaining, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the text segments shared by the matching text and the original text;
querying a pronunciation dictionary to determine the phoneme sequences of the remaining text segments of the original text other than the shared segments, and combining them with the phoneme sequences corresponding to the shared segments to obtain the phoneme sequence of the original text.
Optionally, when the auxiliary synthesis features include the acoustic features of the pronunciation audio corresponding to the matching text, the process by which the auxiliary speech synthesis unit performs speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech may include:
processing the original text with the speech synthesis model to obtain the context information for predicting the current speech frame;
determining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the target acoustic feature needed to predict the current speech frame;
predicting the current speech frame based on the context information and the determined target acoustic feature, and, after all speech frames have been predicted, composing the synthesized speech from the predicted speech frames.
Optionally, the process by which the auxiliary speech synthesis unit determines, based on the context information, the matching text and the acoustic features of the pronunciation audio, the target acoustic feature needed to predict the current speech frame may include:
obtaining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the degree of association between the context information and each frame of the acoustic features of the pronunciation audio;
determining, based on the degrees of association, the target acoustic feature needed to predict the current speech frame.
Optionally, the process by which the auxiliary speech synthesis unit obtains the degree of association between the context information and each frame of the acoustic features of the pronunciation audio may include:
obtaining a first attention weight matrix of the acoustic features of the pronunciation audio over the matching text, the first attention weight matrix containing the attention weights of each frame of acoustic features over the text units of the matching text;
obtaining a second attention weight matrix of the context information over the matching text, the second attention weight matrix containing the attention weights of the context information over the text units of the matching text;
obtaining, based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information over the acoustic features, the third attention weight matrix containing the attention weights of the context information over each frame of acoustic features, which serve as the degrees of association between the context information and each frame of acoustic features.
Optionally, the process by which the auxiliary speech synthesis unit determines, based on the degrees of association, the target acoustic feature needed to predict the current speech frame may include:
normalizing the degrees of association and, using the normalized degrees as weights, computing a weighted sum of the per-frame acoustic features of the pronunciation audio to obtain the target acoustic feature.
Optionally, the process by which the auxiliary speech synthesis unit predicts the current speech frame based on the context information and the determined target acoustic feature may include:
determining, based on the current hidden-layer vector of the speech synthesis model's decoder and the target acoustic feature, the fusion coefficient of the target acoustic feature for predicting the current speech frame;
fusing the target acoustic feature with the context information with reference to the fusion coefficient, and predicting the current speech frame based on the fusion result.
The speech synthesis apparatus provided by the embodiments of the present application can be applied to speech synthesis devices, such as terminals: mobile phones, computers, and the like. Optionally, Figure 7 shows a block diagram of the hardware structure of a speech synthesis device. Referring to Figure 7, the hardware structure of the speech synthesis device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In this embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor may invoke the program stored in the memory, the program being used to:
obtain the original text to be synthesized;
obtain auxiliary synthesis features corresponding to a matching text, where the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
perform speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of the present application further provides a storage medium storing a program executable by a processor, the program being used to:
obtain the original text to be synthesized;
obtain auxiliary synthesis features corresponding to a matching text, where the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
perform speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
Optionally, for the refined and extended functions of the program, refer to the description above.
Further, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above speech synthesis method.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments; the embodiments may be combined as needed, and for the same or similar parts reference may be made between them.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (25)

  1. A speech synthesis method, characterized by comprising:
    obtaining an original text to be synthesized;
    obtaining auxiliary synthesis features corresponding to a matching text, wherein the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on pronunciation audio corresponding to the matching text;
    performing speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
  2. The method according to claim 1, characterized in that the obtaining of the auxiliary synthesis features corresponding to the matching text comprises:
    obtaining a matching text that contains text segments matching the original text;
    obtaining the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text.
  3. The method according to claim 1 or 2, characterized in that the auxiliary synthesis features comprise:
    a phoneme sequence corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
    and/or,
    prosody information corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
    and/or,
    phoneme-level prosody codes corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
    and/or,
    acoustic features of the pronunciation audio corresponding to the matching text.
  4. The method according to claim 2, characterized in that the obtaining of a matching text containing text segments matching the original text comprises:
    determining, among preconfigured template texts, the matching text that matches text segments in the original text.
  5. The method according to claim 2, characterized in that the obtaining of a matching text containing text segments matching the original text comprises:
    obtaining an uploaded text in uploaded data as the matching text, wherein the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text contains text segments matching the original text.
  6. The method according to claim 4, characterized in that the preconfigured template texts comprise:
    the template texts in preconfigured resource packages, wherein each resource package contains one template text and auxiliary synthesis features corresponding to the template text, determined based on the pronunciation audio corresponding to the template text.
  7. The method according to claim 6, characterized in that the determining, among the preconfigured template texts, of the matching text that matches text segments in the original text comprises:
    performing a matching computation between the original text and the template text in each preconfigured resource package;
    determining, in the template text contained in the resource package with the highest matching degree, the matching text that matches text segments in the original text.
  8. The method according to claim 7, characterized in that the obtaining of the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text comprises:
    obtaining the auxiliary synthesis features corresponding to the matching text contained in the resource package with the highest matching degree.
  9. The method according to any one of claims 6 to 8, characterized in that the process of determining a preconfigured resource package comprises:
    obtaining a preconfigured template text and corresponding pronunciation audio;
    determining, based on the pronunciation audio, the phoneme sequence and prosody information corresponding to the template text;
    taking the phoneme sequence and prosody information as the auxiliary synthesis features corresponding to the template text, and organizing the auxiliary synthesis features and the template text into one resource package.
  10. The method according to claim 9, characterized in that the process of determining a preconfigured resource package further comprises:
    determining, based on the template text and the corresponding pronunciation audio, phoneme-level prosody codes corresponding to the template text;
    merging the phoneme-level prosody codes into the resource package.
  11. The method according to claim 10, characterized in that the determining, based on the template text and the corresponding pronunciation audio, of the phoneme-level prosody codes corresponding to the template text comprises:
    extracting phoneme-level prosody information based on the template text and the corresponding pronunciation audio;
    inputting the template text and the phoneme-level prosody information into a coding prediction network to obtain predicted phoneme-level prosody codes;
    inputting the predicted phoneme-level prosody codes and the template text into a generation network to obtain generated phoneme-level prosody information;
    training the coding prediction network and the generation network with the objective of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information, and, when training ends, obtaining the phoneme-level prosody codes predicted by the trained coding prediction network.
  12. The method according to claim 5, characterized in that, before the obtaining of the uploaded text in the uploaded data, the method further comprises:
    obtaining and outputting an initial synthesized speech of the original text;
    wherein the uploaded text is an incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the incorrectly synthesized text segment;
    or, the uploaded text is an extended text containing an incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the extended text.
  13. The method according to claim 5 or 12, characterized in that the obtaining of the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text comprises:
    determining the auxiliary synthesis features corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
  14. The method according to claim 3, wherein the performing, with reference to the auxiliary synthesis feature, of speech synthesis on the original text to obtain synthesized speech comprises:
    determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text;
    and/or,
    determining the prosody information of the original text based on the prosody information corresponding to the matching text;
    performing speech synthesis on the original text based on the phoneme sequence and/or the prosody information of the original text to obtain synthesized speech.
  15. The method according to claim 14, wherein the performing, with reference to the auxiliary synthesis feature, of speech synthesis on the original text to obtain synthesized speech further comprises:
    acquiring, based on the phoneme-level prosody coding corresponding to the matching text, the phoneme-level prosody coding corresponding to the text segment shared by the matching text and the original text;
    during speech synthesis of the original text, using the phoneme-level prosody coding corresponding to the shared text segment as a supplementary input of the speech synthesis model to obtain synthesized speech.
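
As a hedged sketch of claim 15's supplementary input, the phoneme-level prosody codes retrieved for the shared text segment might be aligned to the original text's phoneme positions, with zero vectors elsewhere; the zero-padding convention and all names here are assumptions, not claim requirements.

```python
# Illustrative alignment of claim-15's supplementary prosody codes with the
# original text's phonemes; zero vectors outside the shared segment are an
# assumption, not something the claims require.
import numpy as np

def build_supplementary_input(num_phonemes, shared_span, shared_codes):
    """num_phonemes: phoneme count of the original text.
    shared_span: (start, end) phoneme indices covered by the matching text.
    shared_codes: array of shape (end - start, code_dim) from the resource package."""
    start, end = shared_span
    codes = np.zeros((num_phonemes, shared_codes.shape[1]), dtype=np.float32)
    codes[start:end] = shared_codes  # inject codes only where the texts overlap
    return codes  # concatenated to the synthesis model's encoder input
```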
  16. The method according to claim 14, wherein the determining of the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text comprises:
    acquiring, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the text segment shared by the matching text and the original text;
    querying a pronunciation dictionary to determine the phoneme sequences of the remaining text segments of the original text other than the shared text segment, and combining these with the phoneme sequence corresponding to the shared text segment to obtain the phoneme sequence of the original text.
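
A minimal sketch of the splicing step in claim 16, assuming word-level alignment and a simple dict-based pronunciation lexicon; the phoneme notation and helper names are hypothetical.

```python
# Sketch of claim 16: reuse the matched segment's phonemes and fill the rest
# from a pronunciation dictionary, preserving original text order.
def build_phoneme_sequence(original_words, shared_span, shared_phonemes, lexicon):
    """original_words: tokenized original text.
    shared_span: (start, end) word indices that also occur in the matching text.
    shared_phonemes: phoneme list for those words, taken from the matching text.
    lexicon: dict mapping a word to its phoneme list."""
    start, end = shared_span
    pieces = []
    for i, word in enumerate(original_words):
        if start <= i < end:
            continue                      # covered by the shared segment below
        pieces.append((i, lexicon[word])) # dictionary lookup for remaining words
    pieces.append((start, shared_phonemes))  # phonemes inherited from matching text
    pieces.sort(key=lambda p: p[0])          # restore original text order
    return [ph for _, seq in pieces for ph in seq]
```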
  17. The method according to claim 3, wherein the performing, with reference to the auxiliary synthesis feature, of speech synthesis on the original text to obtain synthesized speech comprises:
    processing the original text with a speech synthesis model to obtain context information for predicting a current speech frame;
    determining, based on the context information, the matching text, and acoustic features of the pronunciation audio, a target acoustic feature required for predicting the current speech frame;
    predicting the current speech frame based on the context information and the determined target acoustic feature, and, after all speech frames have been predicted, composing the synthesized speech from the predicted speech frames.
  18. The method according to claim 17, wherein the determining, based on the context information, the matching text, and the acoustic features of the pronunciation audio, of the target acoustic feature required for predicting the current speech frame comprises:
    acquiring, based on the context information, the matching text, and the acoustic features of the pronunciation audio, the degree of association between the context information and each frame of the acoustic features of the pronunciation audio;
    determining, based on the degrees of association, the target acoustic feature required for predicting the current speech frame.
  19. The method according to claim 18, wherein the acquiring of the degree of association between the context information and each frame of the acoustic features of the pronunciation audio comprises:
    acquiring a first attention weight matrix of the acoustic features of the pronunciation audio over the matching text, the first attention weight matrix including the attention weight of each frame of the acoustic features over each text unit in the matching text;
    acquiring a second attention weight matrix of the context information over the matching text, the second attention weight matrix including the attention weights of the context information over each text unit in the matching text;
    obtaining, based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information over the acoustic features, the third attention weight matrix including the attention weight of the context information over each frame of the acoustic features, taken as the degree of association between the context information and that frame of the acoustic features.
  20. The method according to claim 18, wherein the determining, based on the degrees of association, of the target acoustic feature required for predicting the current speech frame comprises:
    normalizing the degrees of association, and, using the normalized degrees of association as weights, performing a weighted sum over the frames of the acoustic features of the pronunciation audio to obtain the target acoustic feature.
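
Claims 19 and 20 together amount to composing two text-anchored attention matrices into a context-to-audio association and then averaging the reference audio's frames under those weights. The sketch below assumes a softmax normalization, which the claims do not mandate; all names are illustrative.

```python
# Hedged sketch of claims 19-20: compose the context-to-frame association
# from two text-anchored attention matrices, then take a weighted sum of
# the pronunciation audio's acoustic frames.
import numpy as np

def target_acoustic_feature(attn_audio_to_text, attn_ctx_to_text, acoustic_frames):
    """attn_audio_to_text: (F, U), weight of each audio frame over each text unit.
    attn_ctx_to_text: (U,), weight of the current context over each text unit.
    acoustic_frames: (F, D), acoustic features of the pronunciation audio."""
    assoc = attn_audio_to_text @ attn_ctx_to_text  # (F,), per-frame association
    assoc = np.exp(assoc - assoc.max())            # numerically stable softmax
    weights = assoc / assoc.sum()                  # normalized association degrees
    return weights @ acoustic_frames               # (D,), target acoustic feature
```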
  21. The method according to any one of claims 17 to 20, wherein the predicting of the current speech frame based on the context information and the determined target acoustic feature comprises:
    determining, based on the current hidden-layer vector of the decoder of the speech synthesis model and the target acoustic feature, a fusion coefficient for the target acoustic feature when predicting the current speech frame;
    fusing the target acoustic feature and the context information with reference to the fusion coefficient, and predicting the current speech frame based on the fusion result.
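
A hedged sketch of the fusion in claim 21, assuming a scalar sigmoid gate computed from the decoder state and the target acoustic feature; the claims do not fix the functional form of the fusion coefficient, and all module names and dimensions are assumptions.

```python
# Illustrative gated fusion for claim 21 (PyTorch); a sigmoid gate is one
# plausible realization of the fusion coefficient, not the patent's design.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, hidden_dim, acoustic_dim, ctx_dim, out_dim):
        super().__init__()
        # fusion coefficient from decoder hidden vector + target acoustic feature
        self.gate = nn.Sequential(nn.Linear(hidden_dim + acoustic_dim, 1), nn.Sigmoid())
        self.acoustic_proj = nn.Linear(acoustic_dim, out_dim)
        self.ctx_proj = nn.Linear(ctx_dim, out_dim)

    def forward(self, decoder_hidden, target_acoustic, context):
        g = self.gate(torch.cat([decoder_hidden, target_acoustic], dim=-1))
        fused = g * self.acoustic_proj(target_acoustic) + (1 - g) * self.ctx_proj(context)
        return fused  # fed to the frame predictor for the current speech frame
```

The gate lets the model lean on the reference audio where the association is strong (e.g., over the shared text segment) and fall back to the plain synthesis context elsewhere.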
  22. A speech synthesis apparatus, comprising:
    an original text acquiring unit, configured to acquire original text to be synthesized;
    an auxiliary synthesis feature acquiring unit, configured to acquire an auxiliary synthesis feature corresponding to matching text, wherein the matching text has a text segment matching the original text, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on pronunciation audio corresponding to the matching text;
    an auxiliary speech synthesis unit, configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  23. A speech synthesis device, comprising a memory and a processor;
    the memory being configured to store a program;
    the processor being configured to execute the program to implement the steps of the speech synthesis method according to any one of claims 1 to 21.
  24. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech synthesis method according to any one of claims 1 to 21.
  25. A computer program product which, when run on a terminal device, causes the terminal device to execute the steps of the speech synthesis method according to any one of claims 1 to 21.
PCT/CN2021/071672 2020-12-30 2021-01-14 Speech synthesis method and apparatus, device, and storage medium WO2022141671A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011607966.3 2020-12-30
CN202011607966.3A CN112802444B (en) 2020-12-30 2020-12-30 Speech synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022141671A1 true WO2022141671A1 (en) 2022-07-07

Family

ID=75804405

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/071672 WO2022141671A1 (en) 2020-12-30 2021-01-14 Speech synthesis method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112802444B (en)
WO (1) WO2022141671A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421547B (en) * 2021-06-03 2023-03-17 华为技术有限公司 Voice processing method and related equipment
CN113672144A (en) * 2021-09-06 2021-11-19 北京搜狗科技发展有限公司 Data processing method and device
CN114373445B (en) * 2021-12-23 2022-10-25 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071300B (en) * 2020-11-12 2021-04-06 深圳追一科技有限公司 Voice conversation method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101171624A (en) * 2005-03-11 2008-04-30 株式会社建伍 Speech synthesis device, speech synthesis method, and program
US20200335080A1 (en) * 2017-10-31 2020-10-22 Sk Telecom Co., Ltd. Speech synthesis apparatus and method
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111816158A (en) * 2019-09-17 2020-10-23 北京京东尚科信息技术有限公司 Voice synthesis method and device and storage medium
CN111930900A (en) * 2020-09-28 2020-11-13 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765926A (en) * 2024-02-19 2024-03-26 上海蜜度科技股份有限公司 Speech synthesis method, system, electronic equipment and medium
CN117765926B (en) * 2024-02-19 2024-05-14 上海蜜度科技股份有限公司 Speech synthesis method, system, electronic equipment and medium

Also Published As

Publication number Publication date
CN112802444B (en) 2023-07-25
CN112802444A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
WO2022141671A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
JP6550068B2 (en) Pronunciation prediction in speech recognition
US20230043916A1 (en) Text-to-speech processing using input voice characteristic data
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
US11410684B1 (en) Text-to-speech (TTS) processing with transfer of vocal characteristics
CN106971709B (en) Statistical parameter model establishing method and device and voice synthesis method and device
WO2019165748A1 (en) Speech translation method and apparatus
CN108899009B (en) Chinese speech synthesis system based on phoneme
US6119086A (en) Speech coding via speech recognition and synthesis based on pre-enrolled phonetic tokens
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
Peyser et al. Improving performance of end-to-end ASR on numeric sequences
CN112634856A (en) Speech synthesis model training method and speech synthesis method
WO2021051765A1 (en) Speech synthesis method and apparatus, and storage medium
CN111930900B (en) Standard pronunciation generating method and related device
WO2024088262A1 (en) Data processing system and method for speech recognition model, and speech recognition method
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
CN114944149A (en) Speech recognition method, speech recognition apparatus, and computer-readable storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN113450760A (en) Method and device for converting text into voice and electronic equipment
US20230360633A1 (en) Speech processing techniques
WO2023116243A1 (en) Data conversion method and computer storage medium
WO2021231050A1 (en) Automatic audio content generation

Legal Events

Date Code Title Description

121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21912454
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: PCT application non-entry in European phase
    Ref document number: 21912454
    Country of ref document: EP
    Kind code of ref document: A1