CN111128122B - Method and system for optimizing prosody prediction model

Info

Publication number: CN111128122B
Application number: CN201911421271.3A
Authority: CN
Inventors: 张晴; 张辉
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2022-08-16
Anticipated expiration: 2039-12-31
Also published as: CN111128122A

Abstract

An embodiment of the invention provides a method for optimizing a prosody prediction model. The method comprises the following steps: performing word segmentation on a sentence mispredicted by the prosody prediction model; determining the words without prosody marks as replaceable words, and performing synonym enhancement on the replaceable words in the sentence to generate a first training data set for the sentence; acquiring other sentences similar to the sentence from a corpus pool by text similarity, feeding the other sentences back to a developer, and receiving a second training data set in which the developer has marked the prosody of the words in the other sentences; and generating a third training data set based on the first training data set and the second training data set, and adaptively training the prosody prediction model with the third training data set so as to optimize it. An embodiment of the invention also provides a system for optimizing a prosody prediction model. The embodiments improve the prosody output for problematic sentences, reduce the manual annotation cost, save time, and improve the prediction performance of the prosody prediction model.

Description

Method and system for optimizing prosody prediction model

Technical Field

The invention relates to the field of intelligent voice, in particular to a prosody prediction model optimization method and system.

Background

For current TTS (Text To Speech) systems, there are two main approaches to prosody prediction: one based on statistical rules and one based on neural network models. With either approach, the prediction for a general corpus is unreasonable with a probability of about 20%. In a real application scenario, unreasonable prosody prediction for sentences that are common in that scenario greatly degrades the user experience. How to perform targeted optimization of prosody prediction based on the problem sentences reported in feedback, and thereby improve the synthesis experience in the actual scenario, is therefore very important for a commercial TTS system.

Currently, the existing schemes for optimizing the prosody of mispredicted sentences include:

(1) Manually adding rules: by analyzing and summarizing the characteristics of the error sentences, corresponding rules are added to the system based on information such as the surface form, word characteristics and length of the segmented words and the length of the sentence; the rules match the relevant sentences and specify their prosody levels.

(2) Adding training data: the problem sentences are labeled manually and added to the training data, and the model or the rules are retrained to improve the prediction capability.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

Manually adding rules is fast and highly targeted at the sentences to be optimized, but the rules are not flexible enough. The typical result of such a fix is that the problem sentence itself is predicted correctly while similar sentences still fail, so coverage is incomplete. This is a serious drawback for commercial TTS. For the scheme of adding training data, the choice of the added data is critical: if only the reported error sentences are added, it is difficult to strengthen the model in a targeted way; if too many sentences are added, the manual labeling cost is high and the optimization cycle is too long.

Disclosure of Invention

The invention aims to solve at least the following problems of the prior art: the training coverage of a prosody prediction model is incomplete, the model is difficult to strengthen in a targeted way, the manual annotation cost of the data is high, and the optimization cycle is long.

In a first aspect, an embodiment of the present invention provides a method for optimizing a prosody prediction model, including:

performing word segmentation on a sentence mispredicted by a prosody prediction model, wherein the words with mispredicted prosody in the sentence carry a prosody mark;

determining the words without the prosody mark as replaceable words, determining the words with the prosody mark as non-replaceable words, and performing synonym enhancement on the replaceable words in the sentence to generate a first training data set for the sentence;

acquiring other sentences similar to the sentence from a corpus pool by text similarity, feeding the other sentences back to a developer, and receiving a second training data set in which the developer has marked the prosody of the words in the other sentences;

generating a third training data set based on at least a portion of the first training data set and at least a portion of the second training data set, and adaptively training the prosody prediction model with the third training data set to optimize the prosody prediction model.

In a second aspect, an embodiment of the present invention provides a system for optimizing a prosody prediction model, including:

a sentence segmentation program module, configured to perform word segmentation on a sentence mispredicted by a prosody prediction model, wherein the words with mispredicted prosody in the sentence carry a prosody mark;

a synonym enhancement program module, configured to determine the words without the prosody mark as replaceable words, determine the words with the prosody mark as non-replaceable words, and perform synonym enhancement on the replaceable words in the sentence to generate a first training data set for the sentence;

a similar sentence acquisition program module, configured to acquire other sentences similar to the sentence from a corpus pool by text similarity, feed the other sentences back to a developer, and receive a second training data set in which the developer has marked the prosody of the words in the other sentences;

and a model optimization program module, configured to generate a third training data set based on at least a portion of the first training data set and at least a portion of the second training data set, and to adaptively train the prosody prediction model with the third training data set so as to optimize the prosody prediction model.

In a third aspect, an electronic device is provided, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a prosody prediction model according to any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the method for optimizing a prosody prediction model according to any embodiment of the present invention.

The embodiment of the invention has the beneficial effects that: the method can quickly and comprehensively process the problem of prosody prediction errors fed back by the user, can better optimize the prosody output results of problematic sentences and related sentences by only needing a small amount of manual labeling work in a short period, reduces the manual labeling cost, saves time and improves the prediction effect of the prosody prediction model.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a method for optimizing a prosody prediction model according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of an optimization system of a prosody prediction model according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart illustrating a method for optimizing a prosody prediction model according to an embodiment of the present invention, including the following steps:

S11: performing word segmentation on a sentence mispredicted by a prosody prediction model, wherein the words with mispredicted prosody in the sentence carry a prosody mark;

S12: determining the words without the prosody mark as replaceable words, determining the words with the prosody mark as non-replaceable words, and performing synonym enhancement on the replaceable words in the sentence to generate a first training data set for the sentence;

S13: acquiring other sentences similar to the sentence from a corpus pool by text similarity, feeding the other sentences back to a developer, and receiving a second training data set in which the developer has marked the prosody of the words in the other sentences;

S14: generating a third training data set based on at least a portion of the first training data set and at least a portion of the second training data set, and adaptively training the prosody prediction model with the third training data set to optimize the prosody prediction model.

In this embodiment, some smart devices provide a feedback function through which users can report incorrect prosody prediction or other problems. Collecting this user feedback yields a certain number of sentences that the prosody prediction model mispredicted. Because the number is not particularly large, the developer can label the prosody of these erroneous sentences. To facilitate subsequent processing, the text of these mispredicted sentences may be regularized in a pre-processing step.
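The text does not say what the regularization pre-processing consists of. As a minimal sketch only, assuming it includes at least Unicode and whitespace normalization, the step could look like the following; the function name `regularize_text` is hypothetical, and a production TTS front end would typically do considerably more.

```python
import re
import unicodedata


def regularize_text(sentence: str) -> str:
    """Hypothetical regularization: normalize Unicode forms and whitespace."""
    # NFKC folds full-width letters and digits into their ordinary forms.
    normalized = unicodedata.normalize("NFKC", sentence)
    # Collapse every run of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", normalized).strip()


if __name__ == "__main__":
    print(regularize_text("  ＴＴＳ　２０１９   test "))
```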

With respect to step S11, word segmentation is performed on the collected sentences that the prosody prediction model mispredicted. Take the sentence "we listen to the story Magic Pen Ma Liang together" as an example: because of a prosody prediction error, the whole word "Magic Pen Ma Liang" is wrongly split by a prosodic boundary into "Magic Pen # Ma Liang". The developer marks "Magic Pen # Ma Liang" with the prosody mark.

Word segmentation of "we listen to the story Magic Pen # Ma Liang together" yields "we | together | listen | story | Magic Pen Ma Liang".
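The result of step S11 can be pictured as a list of words, each flagged according to whether it carries a prosody mark. The sketch below assumes that an upstream word segmenter has already produced `|`-separated words and that `#` is the boundary symbol inside a marked word; the `Token` type and `tokenize_marked` function are hypothetical names, and the story title is written as "Magic Pen # Ma Liang", an assumed English rendering of the example above.

```python
from dataclasses import dataclass
from typing import List

BOUNDARY = "#"  # assumed prosodic boundary symbol inside a marked word


@dataclass(frozen=True)
class Token:
    text: str     # the word with boundary symbols removed
    marked: bool  # True if the word carries a prosody mark
    prosody: str  # the word as annotated, boundary symbols included


def tokenize_marked(segmented: str, sep: str = "|") -> List[Token]:
    """Split a pre-segmented sentence and flag the words that carry a prosody mark."""
    tokens = []
    for raw in segmented.split(sep):
        word = raw.strip()
        if not word:
            continue
        marked = BOUNDARY in word
        # Drop the boundary symbol and tidy the spacing to recover the plain word.
        text = " ".join(word.replace(BOUNDARY, " ").split())
        tokens.append(Token(text=text, marked=marked, prosody=word))
    return tokens


if __name__ == "__main__":
    for token in tokenize_marked("we | together | listen | story | Magic Pen # Ma Liang"):
        print(token)
```

Keeping the annotated form alongside the plain word means the prosody mark can be carried into every sentence derived from this one.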

In step S12, the part that the prosody prediction model mispredicts is also where the model is weak for sentences of this structure. Because "Magic Pen # Ma Liang" carries the prosody mark, it is a non-replaceable word. The words "we", "together", "listen" and "story" are taken as replaceable words.

A new training data set is generated by synonym enhancement of the replaceable words. For example, replacing "listen together" with the synonym "listen to" enhances the sentence to "we listen to the story Magic Pen # Ma Liang". Because the non-replaceable words carry the prosody marks, the enhanced sentences carry them as well. Synonym replacement can be applied to the other words in the same way, generating more sentences, which together form the first training data set.
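Synonym enhancement as described here amounts to enumerating sentence variants in which only the replaceable words change. The following is a minimal sketch under stated assumptions: the synonym table is a plain dictionary supplied by the caller (the text does not name a synonym source), `synonym_enhance` is a hypothetical name, and the `limit` of 200 is borrowed from the 50-200 sentence range mentioned elsewhere in the description.

```python
from itertools import islice, product
from typing import Dict, List, Tuple

# (word, marked): marked words carry a prosody mark and are never replaced.
Token = Tuple[str, bool]


def synonym_enhance(
    tokens: List[Token],
    synonyms: Dict[str, List[str]],
    limit: int = 200,
) -> List[List[str]]:
    """Generate sentence variants by swapping replaceable words for synonyms."""
    choices = [
        [word] if marked else [word, *synonyms.get(word, [])]
        for word, marked in tokens
    ]
    # The first combination is the original sentence, so skip it.
    return [list(combo) for combo in islice(product(*choices), 1, limit + 1)]


if __name__ == "__main__":
    sentence = [
        ("we", False),
        ("together", False),
        ("listen", False),
        ("story", False),
        ("Magic Pen # Ma Liang", True),
    ]
    table = {"listen": ["hear"], "story": ["tale"]}  # illustrative synonyms only
    for variant in synonym_enhance(sentence, table):
        print(" ".join(variant))
```

Every variant keeps the marked word unchanged, so the prosody annotation made once by the developer is reused across the whole first training data set.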

For step S13, in a TTS business system the text of TTS calls can be counted periodically, so as to obtain a corpus pool containing millions of high-frequency call sentences.

For a sentence with a prosody prediction error, sentences that are relatively similar to it can be found in the corpus pool by calculating text similarity. According to experiments, these sentences have relatively similar semantics, and many of them are different expressions of the same meaning. Among the 100 sentences with the highest similarity, about 25% exhibit similar prosody prediction errors, so the prosody of these sentences needs to be labeled manually. The prosody marks labeled manually by the developer are received as the second training data set.
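The text does not specify how text similarity is computed. As one possible illustration, the sketch below ranks corpus pool sentences by Jaccard similarity over character bigrams and returns the top 100; the measure, the bigram size and the names `char_ngrams`, `jaccard` and `most_similar` are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Character n-grams of the text with all whitespace removed."""
    compact = "".join(text.split())
    if len(compact) < n:
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(query: str, pool: List[str], top_k: int = 100) -> List[Tuple[float, str]]:
    """Return the top_k pool sentences most similar to the query, best first."""
    query_grams = char_ngrams(query)
    scored = [(jaccard(query_grams, char_ngrams(s)), s) for s in pool if s != query]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    pool = [
        "what is the weather today",
        "play the magic pen story",
        "let us hear the story of the magic pen",
    ]
    for score, sentence in most_similar("we listen to the story of the magic pen", pool, top_k=2):
        print(f"{score:.3f}  {sentence}")
```

Character n-grams need no word segmentation, which makes them a convenient first filter; an embedding-based measure could be substituted without changing the surrounding flow.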

For step S14, part of the data is extracted from the first training data set determined in step S12 and from the second training data set determined in step S13, and the extracted data are mixed to form the data set used for training the prosody prediction model. This data set covers a richer variety of sentences exhibiting the prosody error, so training on it optimizes the prosody prediction model.

According to the embodiment, the problem of prosody prediction errors fed back by the user can be quickly and comprehensively solved, and the prosody output results of problematic sentences and related sentences can be better optimized only by a small amount of manual labeling work in a shorter period, so that the manual labeling cost is reduced, the time is saved, and the prediction effect of the prosody prediction model is improved.

As an implementation manner, in this embodiment, the generating a third training data set based on at least a part of the first training data set and at least a part of the second training data set includes:

extracting one part of the sentences in the first training data set as a first training set, and extracting another part of the sentences as a first check set;

extracting one part of the sentences in the second training data set as a second training set, and extracting another part of the sentences as a second check set;

and mixing the first training set and the second training set to obtain the third training data set.
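The partitioning above can be sketched as follows. The 80/20 split ratio, the fixed random seed and the function names `split_set` and `build_third_sets` are assumptions for illustration; the text leaves the proportions to the practical situation.

```python
import random
from typing import List, Sequence, Tuple


def split_set(
    data: Sequence[str], train_ratio: float, rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Shuffle a data set and cut it into a training part and a check part."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def build_third_sets(
    first: Sequence[str],
    second: Sequence[str],
    train_ratio: float = 0.8,  # assumed ratio, not taken from the text
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Mix the training parts into one set and the held-out check parts into another."""
    rng = random.Random(seed)
    first_train, first_check = split_set(first, train_ratio, rng)
    second_train, second_check = split_set(second, train_ratio, rng)
    third_train = first_train + second_train
    rng.shuffle(third_train)  # interleave augmented and developer-labeled sentences
    third_check = first_check + second_check
    return third_train, third_check


if __name__ == "__main__":
    augmented = [f"aug-{i}" for i in range(10)]  # stands in for the first training data set
    labeled = [f"sim-{i}" for i in range(5)]     # stands in for the second training data set
    train, check = build_third_sets(augmented, labeled)
    print(len(train), len(check))
```

Splitting each source set separately before mixing keeps both kinds of data represented in the training part and in the held-out part.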

In this embodiment, the number of sentences selected from the first training data set and the second training data set may be determined according to the practical situation. Generally, each error sentence can be expanded into a data set of 50-200 sentences. This data is mixed with the original training data, and the original model is adaptively trained with suitable parameters; with high probability the original prosody prediction error is corrected, and the predictions for related sentences improve greatly.

As an implementation manner, in this embodiment, after the adaptively training the prosody prediction model through the third training data set, the method includes:

mixing the first check set and the second check set to obtain a third check data set;

and verifying the trained prosody prediction model with the third check data set.

In the present embodiment, the trained prosody prediction model is verified with the remaining sentences of the first and second training data sets, and the prosody prediction model is further optimized according to the verification result.
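The text does not name a verification metric. One simple possibility, shown here as a sketch, is sentence-level accuracy: the fraction of check sentences whose predicted prosody string equals the annotated one. The `predict` callable stands in for the trained prosody prediction model, and `check_accuracy` is a hypothetical name.

```python
from typing import Callable, Iterable, Tuple


def check_accuracy(
    predict: Callable[[str], str],
    check_set: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of check sentences whose predicted prosody matches the annotation."""
    pairs = list(check_set)
    if not pairs:
        raise ValueError("check set is empty")
    correct = sum(1 for text, gold in pairs if predict(text) == gold)
    return correct / len(pairs)


if __name__ == "__main__":
    # (plain text, annotated prosody) pairs; '#' is an assumed boundary symbol.
    third_check = [("a b", "a # b"), ("c d", "c d"), ("e f", "e # f"), ("g h", "g h")]
    # Stand-in model that never predicts a boundary.
    print(check_accuracy(lambda text: text, third_check))
```

A score below an agreed threshold would indicate that another round of data selection or adaptive training is needed.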

According to this embodiment, the first training data set and the second training data set are partitioned, and the prosody prediction model is both trained and verified, so that the prediction performance of the prosody prediction model is further improved.

Fig. 2 is a schematic structural diagram of a system for optimizing a prosody prediction model according to an embodiment of the present invention, which can execute the method for optimizing a prosody prediction model according to any of the embodiments described above and is configured in a terminal.

The prosody prediction model optimization system provided by the embodiment includes: a sentence segmentation program module 11, a synonym enhancement program module 12, a similar sentence acquisition program module 13 and a model optimization program module 14.

The sentence segmentation program module 11 is configured to perform word segmentation on a sentence mispredicted by the prosody prediction model, where the words with mispredicted prosody in the sentence carry a prosody mark; the synonym enhancement program module 12 is configured to determine the words without the prosody mark as replaceable words, determine the words with the prosody mark as non-replaceable words, and perform synonym enhancement on the replaceable words in the sentence to generate a first training data set for the sentence; the similar sentence acquisition program module 13 is configured to acquire, from the corpus pool, other sentences similar to the sentence by text similarity, feed the other sentences back to the developer, and receive a second training data set in which the developer has marked the prosody of the words in the other sentences; the model optimization program module 14 is configured to generate a third training data set based on at least a portion of the first training data set and at least a portion of the second training data set, and to adaptively train the prosody prediction model with the third training data set so as to optimize the prosody prediction model.

Further, the model optimizer module is configured to:

Further, the model optimizer module is further configured to:

Further, the sentence segmentation program module is further configured to:

and performing text regularization on the sentence with the prediction error.

An embodiment of the invention also provides a non-volatile computer storage medium, wherein the computer storage medium stores computer-executable instructions that can execute the method for optimizing a prosody prediction model in any of the method embodiments above.

As one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

As a non-transitory computer-readable storage medium, it may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform a method of optimizing a prosody prediction model in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the method for optimizing a prosody prediction model according to any embodiment of the present invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: these devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players, handheld game consoles, e-book readers, smart toys, and portable vehicle-mounted navigation devices.

(4) Other electronic devices with data processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A prosody prediction model optimization method comprises the following steps:

2. The method of claim 1, wherein the generating a third training data set based on at least a portion of the first training data set and at least a portion of the second training data set comprises:

3. The method of claim 2, wherein after the adaptively training the prosodic prediction model with the third training dataset, the method comprises:

4. The method of claim 1, wherein before the word segmentation is performed on the sentence mispredicted by the prosody prediction model, the method comprises:

performing text regularization on the mispredicted sentence.

5. A system for optimizing a prosody prediction model, comprising:

a model optimization program module, configured to generate a third training data set based on at least a portion of the first training data set and at least a portion of the second training data set, and perform adaptive training on the prosody prediction model through the third training data set to optimize the prosody prediction model.

6. The system of claim 5, wherein the model optimization program module is to:

7. The system of claim 6, wherein the model optimization program module is further to:

8. The system of claim 5, wherein the sentence segmentation program module is further to:

and performing text regularization on the sentence with the prediction error.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-4.

10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.