CN112530405A - End-to-end speech synthesis error correction method, system and device

Info

Publication number
CN112530405A
Authority
CN
China
Prior art keywords
word
words
wrong
target text
characters
Prior art date
Legal status
Granted
Application number
CN201910884128.1A
Other languages
Chinese (zh)
Other versions
CN112530405B (en)
Inventor
杜慷
冯大航
陈孝良
常乐
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201910884128.1A
Publication of CN112530405A
Application granted
Publication of CN112530405B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 Detection of language
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/69 Speech or voice analysis techniques specially adapted for evaluating synthetic or decoded voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method, system and device for end-to-end speech synthesis error correction. The method comprises: acquiring a target sentence, where the target sentence is the target text output by performing speech recognition on the speech synthesized by an end-to-end speech synthesis system; judging whether the fluency of the target text meets a preset condition, and if so: predicting the wrong character or word after embedding vectorization of the target text; determining an alternative character or word to replace the wrong character or word; and acquiring the audio of the alternative character or word, removing the audio of the wrong character or word from the synthesized speech corresponding to the target text, and inserting the audio of the alternative character or word at the corresponding position. The method can accurately locate wrong characters or words and effectively solves the misread-word and extra-word problems of end-to-end speech synthesis, making the voice interaction process more accurate and smooth.

Description

End-to-end speech synthesis error correction method, system and device
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a method, a system, and a device for end-to-end speech synthesis error correction.
Background
The development of speech synthesis falls mainly into three stages: waveform concatenation, parametric synthesis, and end-to-end speech synthesis. Waveform concatenation searches a speech database for the pronunciation segments required, selects the needed speech units with suitable techniques, and splices and adjusts the audio of the whole sentence in sequence to obtain the target audio.
A parametric synthesis method based on statistical modeling was developed next. It divides the whole TTS (Text To Speech) process into several stages, separately models information such as prosody and duration, and then synthesizes speech with an acoustic model. Its advantages are that only the relevant parameters need to be stored, no large database is required, and real-time operation can be guaranteed; moreover, problems arising in synthesis, such as fundamental frequency, its fluctuation range, speaking rate and even timbre, can be corrected manually. Its disadvantage is that the overall system is too complex and many models must be maintained separately: besides the generally necessary parametric synthesis model, dedicated models must be added for specific problems such as polyphonic characters and retroflex (erhua) pronunciation in order to synthesize a good timbre. This makes the overall system large and hard to maintain, and combining the modules creates new problems: once one model goes wrong, all models may need to change, which easily accumulates errors; meanwhile, sound details are hard to reproduce, alignment problems need manual correction, and the result sounds mechanical.
To address the problems of the parametric method, end-to-end TTS has become the new mainstream synthesis method. It abandons the combination of many complex modules used in parametric synthesis and generates audio directly from text. The end-to-end approach reduces feature engineering: only text needs to be input, and the other feature models are modeled implicitly by the end-to-end model. It avoids the error propagation and accumulation of multiple sub-models, and makes it convenient to add conditions such as language, speaker and emotional information. Meanwhile, the speech generated by the model is rich in detail and closely reproduces the human voice. Its disadvantage is that misread and extra words often occur; for example, misreading of polyphonic characters is a front-end problem, while extra words are a back-end model problem that is hard to locate, and both occur more or less inevitably. With parametric synthesis such a problem can be located precisely and fixed directly, but an end-to-end model is very hard to modify: because the model is a black box, the position of the problem is hard to locate; data must be prepared again and the model retrained for debugging; retraining cannot guarantee the problem is overcome; the cost is high, many experiments and continuous retraining are needed, and the training period is long, so a misread or extra word at a given position is difficult to fix. For speech synthesis there are two basic criteria: accuracy and naturalness. For synthesized audio, the primary task is to read the text accurately; the secondary one is that the reading flows naturally.
Compared with parametric synthesis, existing end-to-end TTS models greatly improve naturalness, but the misread-word and extra-word problem directly affects the accuracy of speech synthesis. For a voice interaction system, if accuracy suffers, being natural and smooth is meaningless.
Therefore, how to effectively solve the misread-word and extra-word problem of end-to-end speech synthesis, and thereby make the voice interaction process more accurate and smooth, is a problem urgently awaiting a solution.
Disclosure of Invention
In view of this, the invention provides an end-to-end speech synthesis error correction method that can accurately locate a wrong character or word and effectively solve the misread-word and extra-word problems of end-to-end speech synthesis, thereby making the voice interaction process more accurate and smooth.
The invention provides an end-to-end speech synthesis error correction method, which comprises the following steps:
acquiring a target sentence, wherein the target sentence is the target text output by performing speech recognition on the speech synthesized by an end-to-end speech synthesis system;
judging whether the fluency of the target text meets preset conditions, if so, performing:
predicting wrong characters or words after embedding vectorization is carried out on the target text;
determining an alternative word or word for replacing the wrong word or word based on the wrong word or word;
and acquiring the audio of the alternative character or word, removing the audio of the wrong character or word from the synthesized speech corresponding to the target text, and inserting the audio of the alternative character or word at the corresponding position.
Preferably, the determining whether the fluency of the target text meets a preset condition includes:
performing word segmentation on the target text to obtain a word segmentation set;
inputting the word segmentation set into an n-gram language model to obtain the probability value of each word or word;
multiplying probability values of all characters or words of the target text and taking the logarithm of the probability values to obtain fluency scores of the target text;
determining whether the fluency score is less than a first threshold.
Preferably, the predicting the wrong word or word after the embedding vectorization of the target text includes:
embedding the target text index information into a feature vector by utilizing a fully trained bidirectional GRU model;
predicting an output vector for each word or word after word segmentation of the target text;
and performing binary classification on the vector predicted for each character or word after word segmentation, outputting 0 or 1, wherein 0 indicates that the character or word at that position is wrong and 1 indicates that it is correct.
Preferably, the determining an alternative word or word for replacing the wrong word or word based on the wrong word or word comprises:
acquiring a previous word of the wrong character or word according to the index information of the wrong character or word, and querying a collocation table to obtain a candidate word set, wherein the candidate word set comprises a null value;
computing a weighted score of the longest common substring and the edit distance between the pinyin of the wrong character or word and the pinyin of each character or word in the candidate word set, and selecting the top N characters or words whose weighted scores exceed a second threshold as alternatives;
substituting each of the N alternative characters or words for the wrong character or word in the language model, and detecting the fluency score of each resulting new text;
and determining the alternative character or word corresponding to the new text with the highest fluency score as the alternative character or word for replacing the wrong character or word.
An end-to-end speech synthesis error correction system comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a target sentence, and the target sentence is a target text which is output by an end-to-end voice synthesis system for performing voice recognition on synthesized voice;
the judging module is used for judging whether the fluency of the target text meets a preset condition or not;
the prediction module is used for predicting wrong characters or words after embedding vectorization is carried out on the target text when the fluency of the target text meets a preset condition;
a determining module, configured to determine, based on the wrong word or phrase, an alternative word or phrase for replacing the wrong word or phrase;
and the audio splicing module is used for acquiring the audio of the alternative character or word, removing the audio of the wrong character or word from the synthesized speech corresponding to the target text, and inserting the audio of the alternative character or word at the corresponding position.
Preferably, the judging module includes:
the word segmentation unit is used for segmenting words of the target text to obtain a word segmentation set;
the first calculation unit is used for inputting the word segmentation set into an n-gram language model to obtain the probability of each character or word;
the second calculation unit is used for multiplying the probability values of all the characters or words of the target text and taking the logarithm of the probability values to obtain the fluency score of the target text;
and the fluency score judging unit is used for judging whether the fluency score is smaller than a first threshold value or not.
Preferably, the prediction module comprises:
the embedding unit is used for embedding the target text index information into the feature vector by utilizing a fully trained bidirectional GRU model;
the prediction unit is used for predicting an output vector for each word or word after word segmentation of the target text;
and the output unit is used for performing binary classification on the vector predicted for each character or word after word segmentation and outputting 0 or 1, wherein 0 indicates that the character or word at that position is wrong and 1 indicates that it is correct.
Preferably, the determining module comprises:
the query unit is used for acquiring a previous word of the wrong character or word according to the index information of the wrong character or word and querying a collocation table to obtain a candidate word set, wherein the candidate word set comprises a null value;
the alternative unit is used for computing a weighted score of the longest common substring and the edit distance between the pinyin of the wrong character or word and the pinyin of each character or word in the candidate word set, and selecting the top N characters or words whose weighted scores exceed a second threshold as alternatives;
the third calculation unit is used for substituting each of the N alternative characters or words for the wrong character or word in the language model and detecting the fluency score of each resulting new text;
and the determining unit is used for determining the alternative character or word corresponding to the new text with the highest fluency score as the alternative character or word for replacing the wrong character or word.
An end-to-end speech synthesis error correction apparatus comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program for performing:
acquiring a target sentence, wherein the target sentence is the target text output by performing speech recognition on the speech synthesized by an end-to-end speech synthesis system;
judging whether the fluency of the target text meets preset conditions, if so, performing:
predicting wrong characters or words after embedding vectorization is carried out on the target text;
determining an alternative word or word for replacing the wrong word or word based on the wrong word or word;
and acquiring the audio of the alternative character or word, removing the audio of the wrong character or word from the synthesized speech corresponding to the target text, and inserting the audio of the alternative character or word at the corresponding position.
In summary, the present invention discloses an end-to-end speech synthesis error correction method. When an error in speech synthesis needs correcting, a target sentence is first acquired, where the target sentence is the target text output by performing speech recognition on the speech synthesized by an end-to-end speech synthesis system; whether the fluency of the target text meets a preset condition is then judged, and if so: the wrong character or word is predicted after embedding vectorization of the target text; an alternative character or word to replace it is determined; the audio of the alternative character or word is acquired, the audio of the wrong character or word is removed from the synthesized speech corresponding to the target text, and the audio of the alternative is inserted at the corresponding position. The method can accurately predict the wrong characters or words of the target text, replace them with alternatives, and splice the audio of the alternatives back into the synthesized speech, effectively solving the misread-word and extra-word problems of end-to-end speech synthesis and making the voice interaction process more accurate and smooth.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an end-to-end speech synthesis error correction method disclosed in the present invention;
FIG. 2 is a flowchart of a method for determining whether fluency of a target text meets a predetermined condition according to the present invention;
FIG. 3 is a flowchart of a method for predicting incorrect words or phrases after embedding vectorization of a target text according to the present invention;
FIG. 4 is a flowchart of a method for determining alternative words or phrases for replacing an incorrect word or phrase based on the incorrect word or phrase disclosed herein;
FIG. 5 is a schematic diagram of an end-to-end speech synthesis error correction system according to the present invention;
FIG. 6 is a schematic structural diagram of a judging module according to the present disclosure;
FIG. 7 is a schematic diagram of a prediction module according to the present disclosure;
FIG. 8 is a schematic diagram of a determining module according to the present disclosure;
fig. 9 is a schematic structural diagram of an end-to-end speech synthesis error correction apparatus according to embodiment 1 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, which is a flowchart of a method of embodiment 1 of the end-to-end speech synthesis error correction method disclosed in the present invention, the method may include the following steps:
s101, obtaining a target sentence, wherein the target sentence is a target text output by an end-to-end voice synthesis system for voice recognition of synthesized voice;
when the speech synthesis needs to be corrected, the target sentence, that is, the sentence needing to be corrected is acquired first. The target sentence is a target text which is output by the end-to-end voice synthesis system through voice recognition of the synthesized voice.
When the end-to-end speech synthesis system outputs the target text, the input text is first normalized, segmented into words, converted into phonemes, and so on, to obtain the input of the end-to-end speech synthesis system; this input is then fed into the system, speech is synthesized through a vocoder, and speech recognition is performed on the synthesized speech to obtain the target text.
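The flow just described can be sketched as a composition of stages. Every function passed in below is a hypothetical stub standing in for a real component (text normalizer, word segmenter, grapheme-to-phoneme converter, end-to-end acoustic model with vocoder, ASR system); none of these names come from the patent itself:

```python
def tts_with_feedback(text, synthesize, recognize, normalize, segment, to_phonemes):
    """Synthesize speech from text, then run ASR on the result to get
    the target text that the error-correction steps operate on."""
    normalized = normalize(text)        # text normalization
    tokens = segment(normalized)        # word segmentation
    phonemes = to_phonemes(tokens)      # phoneme conversion
    audio = synthesize(phonemes)        # end-to-end model + vocoder
    target_text = recognize(audio)      # ASR on the synthesized speech
    return audio, target_text
```

With identity stubs plugged in, the function simply threads the text through each stage, which is enough to see the data flow.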
S102, judging whether the fluency of the target text meets a preset condition, if so, entering S103:
after the target sentence is obtained, the fluency of the target text is further judged, and whether the fluency of the target text meets a preset condition is judged; the preset condition is a condition that the target text needs to be corrected.
S103, predicting wrong characters or words after embedding vectorization is carried out on the target text;
and when the fluency of the target text meets the preset condition, further predicting wrong characters or words. When predicting the wrong characters or words of the target text, the time sequence characteristics of the sentence structure need to be considered, the characters or words of the target text need to be converted into vectors which can be processed, and then the wrong characters or words are predicted.
S104, determining alternative characters or words for replacing the wrong characters or words based on the wrong characters or words;
after the wrong characters or words in the target text are predicted, alternative characters or words for replacing the wrong characters or words are further determined according to the wrong characters or words, and the determined alternative characters or words are used for replacing the corresponding wrong characters or words in the target text.
S105, obtaining the audio frequency of the alternative characters or words, removing the audio frequency of the wrong characters or words in the synthesized voice corresponding to the target text, and inserting the audio frequency of the alternative characters or words at the corresponding position.
Finally, the position of the wrong character or word in the target text is cut: the audio segment of the wrong character or word is removed from the full synthesized audio of the target text, leaving the segment before it and the segment after it, recorded as "original 1" and "original 2" respectively.
The audio of the alternative character or word is inserted using database-based concatenative synthesis: it is selected from a large database of recorded speech whose timbre is the same as that of the training database of the end-to-end speech synthesis system, so that the spliced audio blends seamlessly into the target sentence. The selected audio of the alternative character or word is recorded as the "alternative sound".
The sequence "original 1" + "alternative sound" + "original 2" is spliced with the `sox.Combiner` class of the SoX (Sound eXchange) tool, and appropriate silence is added between the segments. Since the synthesized speech units all come from natural original pronunciations, the intelligibility and naturalness of the spliced sentence are very high.
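As a rough stand-in for the SoX-based splicing step, the concatenation of "original 1" + "alternative sound" + "original 2" with silence in between can be sketched with Python's standard `wave` module. The file ordering and the default 50 ms gap are illustrative choices, not values from the patent; a production system would use SoX itself:

```python
import wave

def splice_wavs(paths, out_path, silence_ms=50):
    """Concatenate WAV segments in order, inserting silence_ms of
    silence between consecutive segments."""
    frames, params = [], None
    for p in paths:
        with wave.open(p, "rb") as w:
            if params is None:
                params = w.getparams()  # all segments share one format
            frames.append(w.readframes(w.getnframes()))
    # One silent gap: framerate * seconds frames, each sampwidth * nchannels bytes.
    n_gap = int(params.framerate * silence_ms / 1000)
    gap = b"\x00" * (n_gap * params.sampwidth * params.nchannels)
    with wave.open(out_path, "wb") as out:
        out.setparams(params)  # nframes in the header is fixed up on close
        out.writeframes(gap.join(frames))
```

Calling `splice_wavs(["original1.wav", "alternative.wav", "original2.wav"], "out.wav")` yields the spliced sentence with silence at each join.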
In summary, in the above embodiment, when an error in speech synthesis needs correcting, a target sentence is first acquired, where the target sentence is the target text output by performing speech recognition on the speech synthesized by an end-to-end speech synthesis system; whether the fluency of the target text meets a preset condition is then judged, and if so: the wrong character or word is predicted after embedding vectorization of the target text; an alternative character or word to replace it is determined; the audio of the alternative character or word is acquired, the audio of the wrong character or word is removed from the synthesized speech corresponding to the target text, and the audio of the alternative is inserted at the corresponding position. The method can accurately predict the wrong characters or words of the target text, replace them with alternatives, and splice the audio of the alternatives back into the synthesized speech, effectively solving the misread-word and extra-word problems of end-to-end speech synthesis and making the voice interaction process more accurate and smooth.
Specifically, in the above embodiment, one implementation manner of determining whether the fluency of the target text in step S102 meets the preset condition is shown in fig. 2, and the implementation manner may include the following steps:
s201, performing word segmentation on the target text to obtain a word segmentation set;
s202, inputting the word segmentation set into an n-gram language model to obtain the probability value of each word or word;
s203, multiplying the probability values of all the characters or words of the target text and taking the logarithm of the probability values to obtain the fluency score of the target text;
and S204, judging whether the fluency score is smaller than a first threshold value.
Firstly, the target text is segmented to obtain its word segmentation set; the text is then scanned and the word segmentation set is input into an n-gram language model to obtain the probability value of each character or word; the probability values of all characters or words detected in the whole sentence are then multiplied together and the logarithm taken, giving the fluency score of the target text. If the fluency score of the target text is greater than or equal to the first threshold, its fluency is normal and no modification is needed; if the fluency score is smaller than the first threshold, error correction is performed.
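A minimal sketch of the fluency check in S201-S204, assuming a toy hand-filled bigram table in place of a trained n-gram model; the probabilities, back-off floor, sentence markers and threshold below are all illustrative, not values from the patent:

```python
import math

# Hypothetical toy bigram probabilities; a real system would query a
# trained n-gram language model here.
BIGRAM_P = {
    ("<s>", "today"): 0.2, ("today", "weather"): 0.3,
    ("weather", "good"): 0.25, ("good", "</s>"): 0.4,
}
FLOOR = 1e-8            # back-off probability for unseen bigrams
FIRST_THRESHOLD = -30.0  # illustrative first threshold

def fluency_score(tokens):
    """Sum of log bigram probabilities, i.e. the log of their product."""
    padded = ["<s>"] + tokens + ["</s>"]
    score = 0.0
    for prev, cur in zip(padded, padded[1:]):
        score += math.log(BIGRAM_P.get((prev, cur), FLOOR))
    return score

tokens = ["today", "weather", "good"]
needs_correction = fluency_score(tokens) < FIRST_THRESHOLD
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow on long sentences while giving the same ordering of scores.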
Specifically, in the above embodiment, one implementation manner of predicting the incorrect word or phrase after embedding and vectorizing the target text in step S103 is shown in fig. 3, and may include the following steps:
s301, embedding target text index information into a feature vector by using a fully trained bidirectional GRU model;
s302, predicting an output vector of each word or word after word segmentation of the target text;
and S303, carrying out secondary classification on the vector predicted and output by each word or word after word segmentation, and outputting information 0 or 1, wherein 0 represents that the word or word corresponding to the position is wrong, and 1 represents that the word or word corresponding to the position is correct.
When the position of a wrong character or word needs to be located, the temporal structure of the sentence must be considered. The invention embeds the text index information into a feature vector with a fully trained bidirectional GRU model, predicts an output vector for each character or word after word segmentation, and performs binary classification: 1 is output when the character or word at the current position is correct, and 0 when it is wrong.
The specific steps are as follows:
1) count word frequencies and label the vocabulary;
2) build a "word"-"index" table: to convert a word into a processable vector, the word is first converted to its index and then embedded into a vector. The index conversion step also allows the input data to be padded, ensuring that the input vectors of the model have a consistent length.
The target text to be processed is segmented, labeled as numeric information and embedded into vectors; the vectorized data is input into the fully trained bidirectional GRU network, the output-layer node of each time step is taken out and fed into a fully connected layer with two neurons, and the output is 1 or 0, where 0 indicates that the character or word at that position is wrong and 1 that it is correct. The wrong character or word of the target text is thus accurately determined by the output 0.
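Steps 1)-2) above, building the "word"-"index" table and padding inputs to a fixed length before embedding, can be sketched as follows; the special-token ids, token names and corpus are illustrative choices, not specified by the patent:

```python
PAD, UNK = 0, 1  # reserved indices for padding and out-of-vocabulary tokens

def build_vocab(corpus):
    """Assign each token seen in the corpus a unique index."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for sent in corpus:
        for tok in sent:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_padded_indices(sentences, vocab, max_len):
    """Convert token sequences to index sequences of uniform length."""
    batch = []
    for sent in sentences:
        idx = [vocab.get(tok, UNK) for tok in sent][:max_len]
        batch.append(idx + [PAD] * (max_len - len(idx)))
    return batch
```

The resulting equal-length index batches are what an embedding layer (and then the bidirectional GRU) would consume.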
Specifically, in the above embodiment, one implementation manner of determining, in step S104, an alternative word or word for replacing the wrong word or word based on the wrong word or word is shown in fig. 4, and the method may include the following steps:
s401, acquiring a previous word of the wrong character or word according to the index information of the wrong character or word, and querying a collocation table to obtain a candidate word set, wherein the candidate word set comprises a null value;
S402, computing a weighted score of the longest common substring and the edit distance between the pinyin of the wrong character or word and the pinyin of each character or word in the candidate word set, and selecting the top N characters or words whose weighted scores exceed a second threshold as alternatives;
S403, substituting each of the N alternative characters or words for the wrong character or word in the language model, and detecting the fluency score of each resulting new text;
s404, determining the alternative character or word corresponding to the new text with the highest fluency score as the alternative character or word for replacing the wrong character or word.
The word preceding the wrong character or word is acquired according to the index information of the wrong character or word, and the collocation table is queried to obtain a candidate word set that includes a null value; the pinyin of the wrong character or word and of each character or word in the candidate word set is obtained with a pinyin conversion tool. For the pinyin of the wrong character or word and the pinyin of each candidate, a weighted score of the longest common substring and the edit distance is computed; a corresponding second threshold is set, and the first N characters or words exceeding the second threshold are taken as candidates. N can be set flexibly according to actual requirements.
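The weighted similarity score over pinyin strings can be sketched as below. The patent does not specify the weights or the normalization, so the 0.5/0.5 weighting and the normalization by string length are illustrative assumptions:

```python
# Weighted score combining longest-common-substring length and edit
# distance between two pinyin strings; higher means more similar.
def longest_common_substring(a, b):
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def edit_distance(a, b):
    # Levenshtein distance with a rolling single-row table.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[-1]

def weighted_score(p1, p2, w_lcs=0.5, w_ed=0.5):
    m = max(len(p1), len(p2)) or 1
    # Both terms normalized to [0, 1]; the weights are an assumption.
    return (w_lcs * longest_common_substring(p1, p2) / m
            + w_ed * (1 - edit_distance(p1, p2) / m))

score = weighted_score("hao3", "hao4")  # near-identical pinyin scores high
```

Candidates whose score exceeds the second threshold would then survive into the language-model rescoring step.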
The N candidate characters or words are substituted in turn for the wrong character or word in the n-gram model, the fluency of each new sentence is detected, the scores are calculated and compared, and the candidate with the highest final score is selected as the output character or word at that position. This step also eliminates false positives on sentences that were in fact fully correct, in which case the optimal solution remains the original sentence.
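The substitute-and-rescore step can be sketched with a toy bigram table standing in for the trained n-gram model. The probabilities below are invented for illustration; keeping the original token in the candidate pool is what lets a fully correct sentence survive unchanged:

```python
import math

# Toy bigram "language model" with invented probabilities; a real system
# would query a trained n-gram model instead.
BIGRAM_P = {("天气", "好"): 0.30, ("天气", "号"): 0.001, ("天气", "耗"): 0.002}

def fluency(tokens, p=BIGRAM_P, floor=1e-6):
    # Sum of log bigram probabilities == log of the product of probabilities.
    return sum(math.log(p.get(bigram, floor))
               for bigram in zip(tokens, tokens[1:]))

def best_replacement(tokens, err_pos, candidates):
    # Include the original token so a false positive keeps the original.
    scored = []
    for cand in [tokens[err_pos]] + candidates:
        trial = tokens[:err_pos] + [cand] + tokens[err_pos + 1:]
        scored.append((fluency(trial), cand))
    return max(scored)[1]

choice = best_replacement(["天气", "号"], 1, ["好", "耗"])
```

Here the homophone candidate "好" produces the most fluent sentence and is selected.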
In summary, the invention evaluates, locates, and corrects errors in the synthesized audio: the wrong character or word in the erroneous sentence is replaced with a new candidate character or word, and the waveform information of the candidate is found in the speech database and spliced with the original sentence, yielding a new, complete synthesized sentence free of wrong or extra characters.
As shown in fig. 5, which is a schematic structural diagram of an embodiment 1 of an end-to-end speech synthesis error correction system disclosed in the present invention, the system may include:
an obtaining module 501, configured to obtain a target sentence, where the target sentence is a target text output by an end-to-end speech synthesis system performing speech recognition on a synthesized speech;
when the speech synthesis needs to be corrected, the target sentence, that is, the sentence needing to be corrected is acquired first. The target sentence is a target text which is output by the end-to-end voice synthesis system through voice recognition of the synthesized voice.
When the end-to-end speech synthesis system outputs a target text, firstly, the input text is normalized, divided into words, converted into phonemes and the like to obtain the input information of the end-to-end speech synthesis system, then the input information is input into the end-to-end speech synthesis system, speech synthesis is carried out through a vocoder, and speech recognition is carried out on the synthesized speech to obtain the target text.
A judging module 502, configured to judge whether the fluency of the target text meets a preset condition;
after the target sentence is obtained, the fluency of the target text is further judged, and whether the fluency of the target text meets a preset condition is judged; the preset condition is a condition that the target text needs to be corrected.
The prediction module 503 is configured to predict wrong characters or words after embedding vectorization is performed on the target text, when the fluency of the target text meets a preset condition;
and when the fluency of the target text meets the preset condition, further predicting wrong characters or words. When predicting the wrong characters or words of the target text, the time sequence characteristics of the sentence structure need to be considered, the characters or words of the target text need to be converted into vectors which can be processed, and then the wrong characters or words are predicted.
A determining module 504, configured to determine, based on the wrong character or word, an alternative character or word for replacing it;
after the wrong characters or words in the target text are predicted, alternative characters or words for replacing the wrong characters or words are further determined according to the wrong characters or words, and the determined alternative characters or words are used for replacing the corresponding wrong characters or words in the target text.
And the audio splicing module 505 is configured to obtain audio of the candidate character or word, remove the audio of the wrong character or word in the synthesized speech corresponding to the target text, and insert the audio of the candidate character or word at a corresponding position.
Finally, the position of the wrong character or word in the target text is segmented out, and its sound segment is removed from the total synthesized sound segment of the target text, leaving two audio segments: the segment before the wrong character or word and the segment after it, denoted 'original 1' and 'original 2' respectively.
The audio of the alternative character or word is inserted using a database-based audio splicing and synthesis method: the audio is selected from a large recorded speech database whose timbre is the same as that of the training speech database of the end-to-end speech synthesis system, so that the spliced audio blends seamlessly into the target sentence. The selected audio of the alternative character or word is denoted the 'alternative' segment.
The sequence 'original 1' + 'alternative' + 'original 2' is spliced using the sox.Combiner function of the sox tool, and appropriate silence segments are added between segments. Since the spliced speech units all come from natural original pronunciation, the intelligibility and naturalness of the resulting sentence are very high.
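The concatenation-with-silence idea can be illustrated on raw PCM bytes, as a stand-in for the sox-based splicing. The sample rate, sample width, gap length, and the synthetic byte patterns are all assumptions; real audio handling would go through sox or a wave library:

```python
def splice(original1, alternative, original2, sample_rate=16000,
           gap_ms=50, sample_width=2):
    """Concatenate three 16-bit mono PCM byte segments, inserting a short
    silence between consecutive segments (mimicking sox concatenation
    with added silence between segments)."""
    gap_samples = sample_rate * gap_ms // 1000
    silence = b"\x00" * (gap_samples * sample_width)
    return original1 + silence + alternative + silence + original2

# Synthetic stand-ins for "original 1", "alternative", "original 2".
out = splice(b"\x01\x00" * 10, b"\x02\x00" * 5, b"\x03\x00" * 10)
```

With a 50 ms gap at 16 kHz and 2-byte samples, each silence segment contributes 1600 bytes to the spliced result.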
In summary, in the above embodiment, when speech synthesis needs error correction, a target sentence is first obtained, where the target sentence is the target text output by performing speech recognition on the speech synthesized by the end-to-end speech synthesis system; it is then judged whether the fluency of the target text meets a preset condition, and if so, the following is performed: predicting wrong characters or words after embedding vectorization of the target text; determining alternative characters or words for replacing the wrong ones; acquiring the audio of the alternative characters or words, removing the audio of the wrong characters or words from the synthesized speech corresponding to the target text, and inserting the audio of the alternatives at the corresponding positions. The system accurately predicts the wrong characters or words of the target text, replaces them with alternatives, and splices the alternatives' audio back into the synthesized speech, effectively solving the wrong-character and extra-character problems of end-to-end speech synthesis and making the speech interaction process more accurate and fluent.
Specifically, in the above embodiment, one implementation manner of the determining module when performing the determination of whether the fluency of the target text meets the preset condition is shown in fig. 6, and may include:
a word segmentation unit 601, configured to perform word segmentation on the target text to obtain a word segmentation set;
a first calculating unit 602, configured to input the word segmentation set into an n-gram language model to obtain a probability value of each word or word;
the second calculating unit 603 is configured to multiply the probability values of all the characters or words of the target text and take the logarithm to obtain the fluency score of the target text;
the fluency score determining unit 604 is configured to determine whether the fluency score is smaller than a first threshold.
Firstly, segmenting a target text to obtain a segmentation set of the target text, then scanning the target text, inputting the segmentation set into an n-gram language model to obtain a probability value of each word or word, then multiplying the probability values of all the words or words detected by the whole sentence of the target text, and taking a log value of the probability values to obtain a fluency score of the target text. If the fluency score of the target text is larger than or equal to the first threshold, the target text is normal in fluency and does not need to be modified; and if the fluency score of the target text is smaller than the first threshold value, performing error correction processing.
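The fluency check described above can be sketched as: multiply the per-token n-gram probabilities, take the logarithm, and compare the result with the first threshold. The probability values and the threshold below are invented for illustration:

```python
import math

def fluency_score(probabilities):
    # log(p1 * p2 * ... * pn) == sum of the logs; summing the logs
    # avoids numeric underflow from multiplying many small probabilities.
    return sum(math.log(p) for p in probabilities)

def needs_correction(probabilities, first_threshold):
    # Score below the first threshold means the sentence is disfluent
    # and enters the error-correction pipeline.
    return fluency_score(probabilities) < first_threshold

# A single low-probability token drags the whole-sentence score down.
ok = needs_correction([0.2, 0.3, 0.25], first_threshold=-10.0)
bad = needs_correction([0.2, 0.3, 1e-6], first_threshold=-10.0)
```

The log of the product is used rather than the raw product precisely because the product of many token probabilities underflows quickly.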
Specifically, in the above embodiment, one implementation manner of the prediction module when performing embedded vectorization on the target text and then predicting the incorrect word or word is shown in fig. 7, and may include:
an embedding unit 701, configured to embed target text index information into a feature vector by using a fully trained bidirectional GRU model;
a prediction unit 702, configured to predict an output vector for each character or word after word segmentation of the target text;
the output unit 703 is configured to perform binary classification on the vector predicted for each character or word after word segmentation and output 0 or 1, where 0 indicates that the character or word at that position is wrong and 1 indicates that it is correct.
When the position of a wrong character or word needs to be located, the temporal characteristics of the sentence structure must be considered. The invention uses a fully trained bidirectional GRU model to embed the text index information into feature vectors, predicts an output vector for each character or word after word segmentation, and performs binary classification: 1 is output when the character or word at the current position is correct, and 0 when it is wrong.
The method comprises the following specific steps:
1) counting word frequencies and assigning each word a numeric label, producing a 'word'-'index' table;
2) using the 'word'-'index' table: to convert a word into a processable vector, the word is first converted into its corresponding index, and embedding vectorization is then performed. The index-conversion step also allows the input data to be padded, ensuring that the model's input vectors all have the same length.
The target text to be processed is segmented into words, converted into numeric labels, and embedding-vectorized; the vectorized data is then fed into the fully trained bidirectional GRU network. The output-layer node at each time step is taken out and fed into a fully connected layer with two neurons, whose output is 1 or 0, where 0 indicates that the character or word at that position is wrong and 1 indicates that it is correct. The wrong characters or words of the target text can thus be accurately located from the positions whose output is 0.
Specifically, in the above embodiment, one implementation of the determining module, when determining an alternative character or word for replacing the wrong character or word based on the wrong character or word, is shown in fig. 8, and may include:
the query unit 801 is configured to acquire the word preceding the wrong character or word according to the index information of the wrong character or word, and query the collocation table to obtain a candidate word set, where the candidate word set includes a null value;
an alternative unit 802, configured to compute, for the pinyin of the wrong character or word and the pinyin of each character or word in the candidate word set, a weighted score of the longest common substring and the edit distance, and take the first N characters or words whose weighted scores exceed a second threshold as candidates;
a third calculating unit 803, configured to substitute each of the N candidate characters or words for the wrong character or word in the language model, and detect the fluency score of each resulting new text;
a determining unit 804, configured to determine the candidate character or word corresponding to the new text with the highest fluency score as the replacement for the wrong character or word.
The word preceding the wrong character or word is acquired according to the index information of the wrong character or word, and the collocation table is queried to obtain a candidate word set that includes a null value; the pinyin of the wrong character or word and of each character or word in the candidate word set is obtained with a pinyin conversion tool. For the pinyin of the wrong character or word and the pinyin of each candidate, a weighted score of the longest common substring and the edit distance is computed; a corresponding second threshold is set, and the first N characters or words exceeding the second threshold are taken as candidates. N can be set flexibly according to actual requirements.
The N candidate characters or words are substituted in turn for the wrong character or word in the n-gram model, the fluency of each new sentence is detected, the scores are calculated and compared, and the candidate with the highest final score is selected as the output character or word at that position. This step also eliminates false positives on sentences that were in fact fully correct, in which case the optimal solution remains the original sentence.
In summary, the invention evaluates, locates, and corrects errors in the synthesized audio: the wrong character or word in the erroneous sentence is replaced with a new candidate character or word, and the waveform information of the candidate is found in the speech database and spliced with the original sentence, yielding a new, complete synthesized sentence free of wrong or extra characters.
As shown in fig. 9, which is a schematic structural diagram of an embodiment 1 of an end-to-end speech synthesis error correction apparatus disclosed in the present invention, the apparatus includes: a memory 901, a processor 902 and a computer program stored on the memory and executable on the processor, the processor 902 when executing the computer program for performing:
acquiring a target sentence, wherein the target sentence is a target text output by an end-to-end voice synthesis system for performing voice recognition on synthesized voice;
judging whether the fluency of the target text meets preset conditions, if so, performing:
predicting wrong characters or words after embedding vectorization is carried out on the target text;
determining alternative words or words for replacing the wrong words or words based on the wrong words or words;
and acquiring the audio of the alternative characters or words, removing the audio of the wrong characters or words from the synthesized speech corresponding to the target text, and inserting the audio of the alternative characters or words at the corresponding positions.
In the above embodiment, when error correction is required for speech synthesis, the target sentence, that is, the sentence for which error correction is required, is acquired first. The target sentence is a target text which is output by the end-to-end voice synthesis system through voice recognition of the synthesized voice.
When the end-to-end speech synthesis system outputs a target text, firstly, the input text is normalized, divided into words, converted into phonemes and the like to obtain the input information of the end-to-end speech synthesis system, then the input information is input into the end-to-end speech synthesis system, speech synthesis is carried out through a vocoder, and speech recognition is carried out on the synthesized speech to obtain the target text.
After the target sentence is obtained, the fluency of the target text is further judged, and whether the fluency of the target text meets a preset condition is judged; the preset condition is a condition that the target text needs to be corrected.
And when the fluency of the target text meets the preset condition, further predicting wrong characters or words. When predicting the wrong characters or words of the target text, the time sequence characteristics of the sentence structure need to be considered, the characters or words of the target text need to be converted into vectors which can be processed, and then the wrong characters or words are predicted.
After the wrong characters or words in the target text are predicted, alternative characters or words for replacing the wrong characters or words are further determined according to the wrong characters or words, and the determined alternative characters or words are used for replacing the corresponding wrong characters or words in the target text.
Finally, the position of the wrong character or word in the target text is segmented out, and its sound segment is removed from the total synthesized sound segment of the target text, leaving two audio segments: the segment before the wrong character or word and the segment after it, denoted 'original 1' and 'original 2' respectively.
The audio of the alternative character or word is inserted using a database-based audio splicing and synthesis method: the audio is selected from a large recorded speech database whose timbre is the same as that of the training speech database of the end-to-end speech synthesis system, so that the spliced audio blends seamlessly into the target sentence. The selected audio of the alternative character or word is denoted the 'alternative' segment.
The sequence 'original 1' + 'alternative' + 'original 2' is spliced using the sox.Combiner function of the sox tool, and appropriate silence segments are added between segments. Since the spliced speech units all come from natural original pronunciation, the intelligibility and naturalness of the resulting sentence are very high.
In summary, the end-to-end speech synthesis error correction device provided by the invention can accurately predict the wrong characters or words of the target text, replace the predicted wrong characters or words with the alternative characters or words, and splice the audio of the alternative characters or words into the synthesized speech again, thereby effectively solving the problem of wrong characters and multiple characters in the end-to-end speech synthesis, and further ensuring that the speech interaction process is more accurate and smooth.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. An end-to-end speech synthesis error correction method, comprising:
acquiring a target sentence, wherein the target sentence is a target text which is output by an end-to-end voice synthesis system for performing voice recognition on synthesized voice;
judging whether the fluency of the target text meets preset conditions, if so, performing:
predicting wrong characters or words after embedding vectorization is carried out on the target text;
determining, based on the wrong character or word, an alternative character or word for replacing the wrong character or word;
and acquiring the audio of the alternative character or word, removing the audio of the wrong character or word from the synthesized speech corresponding to the target text, and inserting the audio of the alternative character or word at the corresponding position.
2. The method of claim 1, wherein the determining whether the fluency of the target text meets a preset condition comprises:
performing word segmentation on the target text to obtain a word segmentation set;
inputting the word segmentation set into an n-gram language model to obtain the probability value of each word or word;
multiplying probability values of all characters or words of the target text and taking the logarithm of the probability values to obtain fluency scores of the target text;
determining whether the fluency score is less than a first threshold.
3. The method of claim 1, wherein the embedding vectorization of the target text to predict erroneous words or phrases comprises:
embedding the target text index information into a feature vector by utilizing a fully trained bidirectional GRU model;
predicting an output vector for each character or word after word segmentation of the target text;
and performing binary classification on the vector predicted for each character or word after word segmentation, and outputting 0 or 1, wherein 0 indicates that the character or word at that position is wrong and 1 indicates that it is correct.
4. The method of claim 1, wherein determining an alternative word or word for replacing the incorrect word or word based on the incorrect word or word comprises:
acquiring a previous word of the wrong character or word according to the index information of the wrong character or word, and querying a collocation table to obtain a candidate word set, wherein the candidate word set comprises a null value;
computing, for the pinyin of the wrong character or word and the pinyin of each character or word in the candidate word set, a weighted score of the longest common substring and the edit distance, and selecting the first N characters or words whose weighted scores exceed a second threshold as candidates;
substituting each of the N candidate characters or words for the wrong character or word in a language model, and detecting the fluency score of each resulting new text;
and determining the alternative character or word corresponding to the new text with the highest fluency score as the alternative character or word for replacing the wrong character or word.
5. An end-to-end speech synthesis error correction system, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a target sentence, and the target sentence is a target text which is output by an end-to-end voice synthesis system for performing voice recognition on synthesized voice;
the judging module is used for judging whether the fluency of the target text meets a preset condition or not;
the prediction module is used for predicting wrong characters or words after embedding vectorization is carried out on the target text when the fluency of the target text meets a preset condition;
a determining module, configured to determine, based on the wrong character or word, an alternative character or word for replacing the wrong character or word;
and the audio splicing module is used for acquiring the audio of the alternative characters or words, removing the audio of the wrong characters or words in the synthesized voice corresponding to the target text, and inserting the audio of the alternative characters or words at the corresponding position.
6. The system of claim 5, wherein the determining module comprises:
the word segmentation unit is used for segmenting words of the target text to obtain a word segmentation set;
the first calculation unit is used for inputting the word segmentation set into an n-gram language model to obtain the probability of each character or word;
the second calculation unit is used for multiplying the probability values of all the characters or words of the target text and taking the logarithm of the probability values to obtain the fluency score of the target text;
and the fluency score judging unit is used for judging whether the fluency score is smaller than a first threshold value or not.
7. The system of claim 5, wherein the prediction module comprises:
the embedding unit is used for embedding the target text index information into the feature vector by utilizing a fully trained bidirectional GRU model;
the prediction unit is used for predicting an output vector for each character or word after word segmentation of the target text;
and the output unit is used for performing binary classification on the vector predicted for each character or word after word segmentation and outputting 0 or 1, wherein 0 indicates that the character or word at that position is wrong and 1 indicates that it is correct.
8. The system of claim 5, wherein the determining module comprises:
the query unit is used for acquiring a previous word of the wrong character or word according to the index information of the wrong character or word and querying a collocation table to obtain a candidate word set, wherein the candidate word set comprises a null value;
the alternative unit is used for computing, for the pinyin of the wrong character or word and the pinyin of each character or word in the candidate word set, a weighted score of the longest common substring and the edit distance, and selecting the first N characters or words whose weighted scores exceed a second threshold as candidates;
the third calculation unit is used for substituting N optional characters or words and the wrong characters or words into the language model together and respectively detecting the fluency scores of the new texts;
and the determining unit is used for determining the alternative character or word corresponding to the new text with the highest fluency score as the alternative character or word for replacing the wrong character or word.
9. An end-to-end speech synthesis error correction apparatus, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program for performing:
acquiring a target sentence, wherein the target sentence is a target text which is output by an end-to-end voice synthesis system for performing voice recognition on synthesized voice;
judging whether the fluency of the target text meets preset conditions, if so, performing:
predicting wrong characters or words after embedding vectorization is carried out on the target text;
determining, based on the wrong character or word, an alternative character or word for replacing the wrong character or word;
and acquiring the audio of the alternative character or word, removing the audio of the wrong character or word from the synthesized speech corresponding to the target text, and inserting the audio of the alternative character or word at the corresponding position.
CN201910884128.1A 2019-09-18 2019-09-18 End-to-end speech synthesis error correction method, system and device Active CN112530405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910884128.1A CN112530405B (en) 2019-09-18 2019-09-18 End-to-end speech synthesis error correction method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910884128.1A CN112530405B (en) 2019-09-18 2019-09-18 End-to-end speech synthesis error correction method, system and device

Publications (2)

Publication Number Publication Date
CN112530405A true CN112530405A (en) 2021-03-19
CN112530405B CN112530405B (en) 2024-07-23

Family

ID=74975266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884128.1A Active CN112530405B (en) 2019-09-18 2019-09-18 End-to-end speech synthesis error correction method, system and device

Country Status (1)

Country Link
CN (1) CN112530405B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223559A (en) * 2021-05-07 2021-08-06 北京有竹居网络技术有限公司 Evaluation method, device and equipment for synthesized voice
CN114898733A (en) * 2022-05-06 2022-08-12 深圳妙月科技有限公司 AI voice data analysis processing method and system
CN118506764A (en) * 2024-07-17 2024-08-16 成都索贝数码科技股份有限公司 Controllable output method and device based on autoregressive deep learning voice synthesis

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5051924A (en) * 1988-03-31 1991-09-24 Bergeron Larry E Method and apparatus for the generation of reports
GB2423903A (en) * 2005-03-04 2006-09-06 Toshiba Res Europ Ltd Assessing the subjective quality of TTS systems which accounts for variations between synthesised and original speech
US20090006087A1 (en) * 2007-06-28 2009-01-01 Noriko Imoto Synchronization of an input text of a speech with a recording of the speech
US20090083036A1 (en) * 2007-09-20 2009-03-26 Microsoft Corporation Unnatural prosody detection in speech synthesis
JP2010224418A (en) * 2009-03-25 2010-10-07 Kddi Corp Voice synthesizer, method, and program
US20110040554A1 (en) * 2009-08-15 2011-02-17 International Business Machines Corporation Automatic Evaluation of Spoken Fluency
JP2012108378A (en) * 2010-11-18 2012-06-07 Nec Corp Speech synthesizer, speech synthesizing method, and speech synthesizing program
CN103151037A (en) * 2011-09-27 2013-06-12 通用汽车有限责任公司 Correcting unintelligible synthesized speech
US20140222415A1 (en) * 2013-02-05 2014-08-07 Milan Legat Accuracy of text-to-speech synthesis
JP2014197072A (en) * 2013-03-29 2014-10-16 ブラザー工業株式会社 Speech synthesis system and speech synthesis method
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
JP2015118222A (en) * 2013-12-18 2015-06-25 株式会社日立超エル・エス・アイ・システムズ Voice synthesis system and voice synthesis method
CN104882139A (en) * 2015-05-28 2015-09-02 百度在线网络技术(北京)有限公司 Voice synthesis method and device
CN106250364A (en) * 2016-07-20 2016-12-21 科大讯飞股份有限公司 A kind of text modification method and device
US9978359B1 (en) * 2013-12-06 2018-05-22 Amazon Technologies, Inc. Iterative text-to-speech with user feedback


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qiu Zeyu; Qu Dan; Zhang Lianhai: "End-to-end speech synthesis method based on WaveNet", Journal of Computer Applications, no. 05 *


Also Published As

Publication number Publication date
CN112530405B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN110556093B (en) Voice marking method and system
US8818813B2 (en) Methods and system for grammar fitness evaluation as speech recognition error predictor
CN106297800B (en) Self-adaptive voice recognition method and equipment
CN102176310B (en) Speech recognition system with huge vocabulary
CN106570180B (en) Voice search method and device based on artificial intelligence
CN110600002B (en) Voice synthesis method and device and electronic equipment
CN110459202B (en) Rhythm labeling method, device, equipment and medium
JP4968036B2 (en) Prosodic word grouping method and apparatus
CN112530405B (en) End-to-end speech synthesis error correction method, system and device
CN111312209A (en) Text-to-speech conversion processing method and device and electronic equipment
CN110415725B (en) Method and system for evaluating pronunciation quality of second language using first language data
CN103678271A (en) Text correction method and user equipment
CN111369974A (en) Dialect pronunciation labeling method, language identification method and related device
CN112216267B (en) Prosody prediction method, device, equipment and storage medium
CN112669845B (en) Speech recognition result correction method and device, electronic equipment and storage medium
JP6810580B2 (en) Language model learning device and its program
CN109166569B (en) Detection method and device for phoneme mislabeling
CN112634866A (en) Speech synthesis model training and speech synthesis method, apparatus, device and medium
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN106297765A (en) Phoneme synthesizing method and system
JP2018004947A (en) Text correction device, text correction method, and program
CN111105787A (en) Text matching method and device and computer readable storage medium
CN116597809A (en) Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
CN111128181B (en) Recitation question evaluating method, recitation question evaluating device and recitation question evaluating equipment
CN111816171B (en) Training method of voice recognition model, voice recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant