CN111949796B - Method and system for analyzing front-end text of voice synthesis of resource-limited language - Google Patents


Info

Publication number
CN111949796B
CN111949796B
Authority
CN
China
Prior art keywords
data
classifier
domain data
neural network
target domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010858597.9A
Other languages
Chinese (zh)
Other versions
CN111949796A (en
Inventor
吴朗
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010858597.9A priority Critical patent/CN111949796B/en
Publication of CN111949796A publication Critical patent/CN111949796A/en
Application granted granted Critical
Publication of CN111949796B publication Critical patent/CN111949796B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F16/35 Information retrieval of unstructured textual data — Clustering; Classification
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Neural networks — Combinations of networks
    • G06N3/047 Neural networks — Probabilistic or stochastic networks
    • G06N3/08 Neural networks — Learning methods
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a method and a system for front-end text analysis in speech synthesis of a resource-limited language. The method comprises the following steps: acquiring training data, wherein the training data comprises source domain data, labeled target domain data, and unlabeled target domain data; training a neural network structure based on mixed data and the unlabeled target domain data; and performing speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure. The method requires only a small amount of labeled data in the resource-limited language, whose quality is easier to control. From a semi-supervised learning point of view, a small amount of labeled resource-limited language data is added in the training process, so that prior knowledge of the feature distribution of the resource-limited language data can be learned while the feature distribution of the resource-rich language data is learned, avoiding the drawback that unsupervised domain adaptation methods cannot take the feature distribution of the resource-limited language data into account.

Description

Method and system for front-end text analysis in speech synthesis of a resource-limited language
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a method and a system for front-end text analysis in speech synthesis of a resource-limited language.
Background
Currently, for resource-limited languages (e.g., Chinese domestic dialects), there are generally two approaches to speech synthesis front-end text analysis. The first is to define label types according to expert knowledge, manually produce a large amount of labeled target domain data, and then feed the target domain data into a designed neural network to train its parameters; this approach suffers from problems of data distribution balance, labeling correctness, consistency, safety, and timeliness. The second is transfer learning: given a neural network A trained on a resource-rich language (e.g., Mandarin Chinese), build a new neural network B for the resource-limited language and train the parameters of network B using local features (parameters) from network A together with a small amount of labeled target domain data, i.e., resource-limited language data; alternatively, fine-tune network A directly with a small amount of labeled target domain data to obtain a neural network for the resource-limited language. This approach performs very poorly across open data sets, which better reflect real scenarios; it does not fully exploit the feature correlation between the target domain data and the source domain data (i.e., the resource-rich language); and fine-tuning a large-parameter neural network with a small amount of labeled target domain data is often difficult to achieve.
In order to solve the problem of data resource shortage in speech synthesis front-end text analysis for resource-limited languages, a method and a system for such analysis are needed.
Disclosure of Invention
The invention provides a method and a system for front-end text analysis in speech synthesis of a resource-limited language, which are used to solve the problem of data resource shortage in speech synthesis front-end text analysis of resource-limited languages.
The invention provides a method for front-end text analysis in speech synthesis of a resource-limited language, comprising the following steps:
step 1: acquiring training data, wherein the training data comprises source domain data, labeled target domain data, and unlabeled target domain data; the source domain data comprises text data of a resource-rich language, and the target domain data comprises text data of the resource-limited language;
step 2: training a neural network structure based on mixed data and the unlabeled target domain data, wherein the mixed data includes the labeled source domain data and the labeled target domain data;
step 3: performing speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
Further, in step 1, the proportion of the source domain data in the training data is 55%-65%, the proportion of the labeled target domain data is 8%-12%, and the proportion of the unlabeled target domain data is 27%-33%.
Further, in step 2, the neural network structure includes a feature extractor and a classifier, wherein the classifier immediately follows the feature extractor.
Further, the feature extractor includes a Transformer encoder.
Further, the classifier includes a fully connected layer, a softmax layer, and a CRF layer.
Further, step 2 (training the neural network structure based on the mixed data and the unlabeled target domain data) performs the following steps:
step S21: inputting the mixed data into the neural network structure to perform supervised learning and synchronously update the network parameters of the feature extractor and the classifier;
step S22: simultaneously inputting the mixed data and the unlabeled target domain data into the neural network structure to perform semi-supervised learning and update only the network parameters of the classifier.
Further, in step S21, when the mixed data is input into the neural network structure, the features output by the feature extractor are input to the classifier once, without a dropout strategy, so that discriminative features are learned.
Further, in step S22, the mixed data and the unlabeled target domain data are input into the neural network structure simultaneously to perform semi-supervised learning and update only the network parameters of the classifier, by the following steps:
step S221: when the mixed data and the unlabeled target domain data are input into the neural network structure simultaneously, the features output by the feature extractor are input to the classifier twice, and a dropout strategy drops different network nodes in each pass, which is equivalent to sampling a first classifier network and a second classifier network;
step S222: updating only the network parameters of the classifier by maximizing the KL divergence between the output probabilities of the first classifier network and the second classifier network.
Further, step 2 (training the neural network structure based on the mixed data and the unlabeled target domain data) further performs the following steps:
step S23: during training of the neural network structure, the classifier and the feature extractor oppose each other, and their network parameters are updated interactively;
step S24: inputting the unlabeled target domain data into the neural network structure and updating the network parameters of the feature extractor by minimizing the KL divergence between the output probabilities of the first classifier network and the second classifier network.
The method for front-end text analysis in speech synthesis of a resource-limited language provided by the embodiment of the invention has the following beneficial effects: only a small amount of labeled data in the resource-limited language is needed, and its quality is easier to control. In addition, from a semi-supervised learning point of view, a small amount of labeled resource-limited language data is added in the training process, so that prior knowledge of the feature distribution of the resource-limited language data can be learned while the feature distribution of the resource-rich language data is learned, avoiding the drawback that unsupervised domain adaptation methods cannot take the feature distribution of the resource-limited language data into account.
The invention also provides a system for front-end text analysis in speech synthesis of a resource-limited language, comprising:
a training data acquisition module, configured to acquire training data, wherein the training data comprises source domain data, labeled target domain data, and unlabeled target domain data; the source domain data comprises text data of a resource-rich language, and the target domain data comprises text data of the resource-limited language;
a neural network training module, configured to train a neural network structure based on mixed data and the unlabeled target domain data, wherein the mixed data comprises the labeled source domain data and the labeled target domain data;
and a front-end text analysis module, configured to perform speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
The system for front-end text analysis in speech synthesis of a resource-limited language provided by the embodiment of the invention has the following beneficial effects: only a small amount of labeled data in the resource-limited language is needed, and its quality is easier to control. In addition, from a semi-supervised learning point of view, the neural network training module adds a small amount of labeled resource-limited language data in the training process, so that prior knowledge of the feature distribution of the resource-limited language data can be learned while the feature distribution of the resource-rich language data is learned. This avoids the drawback that unsupervised domain adaptation cannot take the feature distribution of the resource-limited language into account, and the semi-supervised domain adaptation technique thereby solves the problem of data resource shortage in speech synthesis front-end text analysis for resource-limited languages.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of a method for analyzing a front-end text of a speech synthesis of a resource-constrained language in an embodiment of the invention;
FIG. 2 is a block diagram of a system for front-end text analysis for speech synthesis in a resource-constrained language in accordance with an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
The embodiment of the invention provides a method for front-end text analysis in speech synthesis of a resource-limited language, as shown in fig. 1, comprising the following steps:
step 1: acquiring training data, wherein the training data comprises source domain data, labeled target domain data, and unlabeled target domain data; the source domain data comprises text data of a resource-rich language, and the target domain data comprises text data of the resource-limited language;
step 2: training a neural network structure based on mixed data and the unlabeled target domain data, wherein the mixed data includes the labeled source domain data and the labeled target domain data;
step 3: performing speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
The working principle of the technical scheme is as follows: the resource-rich language may be, for example, Mandarin Chinese, and the resource-limited language may be, for example, a Chinese domestic dialect.
Based on semi-supervised learning, a small amount of labeled resource-limited language data is added to the resource-rich language data, so that the neural network can learn some prior knowledge of the feature distribution of the resource-limited language data and thereby improve label prediction accuracy on that data. Specifically, training data composed of source domain data, labeled target domain data, and unlabeled target domain data is first acquired; the neural network structure is then trained on the mixed data (the labeled source domain data plus the labeled target domain data) together with the unlabeled target domain data; finally, the trained neural network structure performs speech synthesis front-end text analysis on the resource-limited language.
The beneficial effects of the technical scheme are as follows: only a small amount of labeled data in the resource-limited language is needed, and its quality is easier to control. In addition, from a semi-supervised learning point of view, a small amount of labeled resource-limited language data is added in the training process, so that prior knowledge of the feature distribution of the resource-limited language data can be learned while the feature distribution of the resource-rich language data is learned, avoiding the drawback that unsupervised domain adaptation methods cannot take the feature distribution of the resource-limited language into account.
In one embodiment, in step 1, the proportion of the source domain data in the training data is 55%-65%, the proportion of the labeled target domain data is 8%-12%, and the proportion of the unlabeled target domain data is 27%-33%.
The working principle of the technical scheme is as follows: as an example and not by way of limitation, the proportion of source domain data is 60%, that of labeled target domain data is 10%, and that of unlabeled target domain data is 30%.
The beneficial effects of the technical scheme are as follows: these proportions of source domain data, labeled target domain data, and unlabeled target domain data in the training data ensure that a small amount of labeled resource-limited language data is added in the training process, so that prior knowledge of the feature distribution of the resource-limited language data can be learned while the feature distribution of the resource-rich language data is learned.
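As a concrete illustration of the 60%/10%/30% example split, the following minimal sketch assembles a training set at those proportions. This is not part of the patent: the function name, the fixed total, and the choice of sampling with replacement are illustrative assumptions.

```python
import random

def assemble_training_data(source, labeled_target, unlabeled_target,
                           ratios=(0.60, 0.10, 0.30), total=100, seed=0):
    """Sample a training set at the example ratio from the embodiment:
    60% source domain, 10% labeled target domain, 30% unlabeled target domain.
    (Illustrative sketch; the patent does not prescribe this sampling code.)"""
    rng = random.Random(seed)
    n_src = int(total * ratios[0])
    n_lab = int(total * ratios[1])
    n_unl = total - n_src - n_lab  # remainder, here 30% of the total
    # Sampling with replacement, since the resource-limited corpus may be small.
    return (rng.choices(source, k=n_src),
            rng.choices(labeled_target, k=n_lab),
            rng.choices(unlabeled_target, k=n_unl))

src, lab, unl = assemble_training_data(["s1", "s2"], ["t1"], ["u1", "u2"])
```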
In one embodiment, in the step 2, the neural network structure includes a feature extractor and a classifier, wherein the classifier immediately follows the feature extractor.
The working principle of the technical scheme is as follows: the feature extractor includes a Transformer encoder, which can fully extract contextual feature information from the text.
The classifier includes a fully connected layer, a softmax layer, and a CRF layer. The classifier serves two purposes. First, as a conventional classifier, its network parameters are trained on the mixed data. Second, as a discriminator on the unlabeled target domain data: the features of the target domain data, after passing through the feature extractor, are input to the classifier network twice; dropout drops different network nodes in the fully connected layer in each pass, and the similarity between the two resulting output probabilities is measured by the Kullback-Leibler divergence (KL divergence).
The beneficial effects of the technical scheme are as follows: a specific neural network structure is provided. The feature extractor uses a Transformer encoder, which can fully extract contextual feature information from the text, and the classifier, comprising a fully connected layer, a softmax layer, and a CRF layer, can serve both as a conventional classifier and as a discriminator.
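A minimal PyTorch sketch of this structure — a Transformer-encoder feature extractor immediately followed by a classifier head — might look as follows. All sizes are illustrative, and the CRF layer is omitted for brevity (only the fully connected and softmax layers are modeled), so this is a hedged sketch under stated assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Transformer encoder extracting contextual features from token ids."""
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))

class Classifier(nn.Module):
    """Fully connected layer + softmax; the patent's CRF layer is omitted here."""
    def __init__(self, d_model=64, num_labels=8, p_drop=0.3):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(d_model, num_labels)

    def forward(self, features, use_dropout=False):
        # Dropout is switched on only for the discriminator role (step S22).
        h = self.dropout(features) if use_dropout else features
        return torch.softmax(self.fc(h), dim=-1)

extractor, classifier = FeatureExtractor(), Classifier()
tokens = torch.randint(0, 1000, (2, 5))   # batch of 2 sequences, length 5
probs = classifier(extractor(tokens))     # per-token label probabilities
```

The `use_dropout` flag reflects the patent's distinction between the single no-dropout supervised pass and the two dropout passes of the discriminator role.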
In one embodiment, step 2 (training the neural network structure based on the mixed data and the unlabeled target domain data) performs the following steps:
step S21: inputting the mixed data into the neural network structure to perform supervised learning and synchronously update the network parameters of the feature extractor and the classifier;
step S22: simultaneously inputting the mixed data and the unlabeled target domain data into the neural network structure to perform semi-supervised learning and update only the network parameters of the classifier.
The working principle of the technical scheme is as follows: first, the mixed data is input and supervised learning is performed, training the parameters of the feature extractor and the classifier synchronously; the aim is to classify the mixed data accurately and thereby obtain discriminative features. Then, the mixed data (all labeled) and the unlabeled target domain data are input simultaneously, and only the parameters of the classifier are updated.
The beneficial effects of the technical scheme are as follows: specific steps are provided for training a neural network structure based on the hybrid data and the unlabeled target domain data.
In one embodiment, in step S21, when the mixed data is input into the neural network structure, the features output by the feature extractor are input to the classifier once, without a dropout strategy, so that discriminative features are learned.
The working principle of the technical scheme is as follows: when the mixed data is input, the output of the feature extractor is passed to the classifier once, and dropout is not applied, so that discriminative features are learned.
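The supervised pass of step S21 can be sketched as follows. The linear stand-ins for the feature extractor and classifier, the batch shapes, and the use of cross-entropy loss are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
extractor = nn.Linear(16, 8)    # stand-in for the Transformer-encoder extractor
classifier = nn.Linear(8, 4)    # stand-in for the FC+softmax(+CRF) classifier
optimizer = torch.optim.SGD(
    list(extractor.parameters()) + list(classifier.parameters()), lr=0.1)

# Mixed batch: labeled source-domain plus labeled target-domain samples.
x_mixed = torch.randn(32, 16)
y_mixed = torch.randint(0, 4, (32,))

# Step S21: a single pass with no dropout; the parameters of BOTH the
# feature extractor and the classifier are updated synchronously.
logits = classifier(extractor(x_mixed))
loss = F.cross_entropy(logits, y_mixed)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```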
The beneficial effects of the technical scheme are as follows: specific methods of inputting the hybrid data in a neural network structure to supervise learning the neural network structure are provided.
In one embodiment, in step S22, the mixed data and the unlabeled target domain data are input into the neural network structure simultaneously to perform semi-supervised learning and update only the network parameters of the classifier, by the following steps:
step S221: when the mixed data and the unlabeled target domain data are input into the neural network structure simultaneously, the features output by the feature extractor are input to the classifier twice, and a dropout strategy drops different network nodes in each pass, which is equivalent to sampling a first classifier network and a second classifier network;
step S222: updating only the network parameters of the classifier by maximizing the KL divergence between the output probabilities of the first classifier network and the second classifier network.
The working principle of the technical scheme is as follows: the unlabeled target domain data is input, and the features output by the feature extractor are passed to the classifier twice; dropping different network nodes in each pass is equivalent to sampling two classifier networks, a first classifier network C1 and a second classifier network C2. The KL divergence between the two output probabilities is then maximized. In this way the classifier can detect target domain data near the decision boundary, and different neural nodes in the classifier learn more diverse feature representations.
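The divergence measure used by this discriminator role can be illustrated with a small standalone sketch. The probability vectors below are invented for illustration; in practice they would be the softmax outputs of the two dropout passes.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete label distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Two forward passes of the same classifier with different dropout masks behave
# like two sampled networks C1 and C2 (these probabilities are illustrative).
p_c1 = [0.7, 0.2, 0.1]
p_c2 = [0.4, 0.4, 0.2]

divergence = kl_divergence(p_c1, p_c2)
# Updating only the classifier to MAXIMIZE this divergence (i.e., minimize its
# negative) pushes C1 and C2 apart on samples near the decision boundary.
classifier_loss = -divergence
```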
The beneficial effects of the technical scheme are as follows: specific steps are provided for simultaneously inputting hybrid data and unlabeled target domain data in a neural network structure to perform semi-supervised learning of the neural network structure.
In one embodiment, step 2 (training the neural network structure based on the mixed data and the unlabeled target domain data) further performs the following steps:
step S23: during training of the neural network structure, the classifier and the feature extractor oppose each other, and their network parameters are updated interactively;
step S24: inputting the unlabeled target domain data into the neural network structure and updating the network parameters of the feature extractor by minimizing the KL divergence between the output probabilities of the first classifier network and the second classifier network.
The working principle of the technical scheme is as follows: as described above, two classifier networks are sampled and the KL divergence between their output probabilities is maximized, so that the classifier can detect target domain data near the decision boundary and different neural nodes in the classifier learn more diverse feature representations. At the same time, to escape these feature spaces, the feature extractor must generate more discriminative features; thus the classifier and the feature extractor oppose each other and are updated interactively during training.
In addition, the unlabeled target domain data is input, and the parameters of the feature extractor are updated by minimizing the KL divergence between the output probabilities of the first classifier network C1 and the second classifier network C2, yielding feature representations of the unlabeled target domain data that lie far from the decision boundary.
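The adversarial interplay described above can be sketched as one alternating update on an unlabeled batch: first the classifier maximizes the two-pass KL divergence, then the feature extractor minimizes it. The linear stand-in modules, dropout rate, optimizer, and batch are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
extractor = nn.Linear(16, 8)   # stand-in for the Transformer-encoder extractor
head = nn.Linear(8, 4)         # stand-in for the classifier's FC layer
drop = nn.Dropout(0.5)
opt_cls = torch.optim.SGD(head.parameters(), lr=0.1)
opt_ext = torch.optim.SGD(extractor.parameters(), lr=0.1)

def two_pass_kl(features):
    """Two dropout passes sample classifier networks C1 and C2; return the KL
    divergence between their output probabilities."""
    log_p1 = F.log_softmax(head(drop(features)), dim=-1)
    p2 = F.softmax(head(drop(features)), dim=-1)
    return F.kl_div(log_p1, p2, reduction="batchmean")

x_unlabeled = torch.randn(32, 16)   # unlabeled target-domain batch

# Steps S221-S222: update ONLY the classifier by MAXIMIZING the divergence
# (features detached so the extractor receives no gradient).
loss_cls = -two_pass_kl(extractor(x_unlabeled).detach())
opt_cls.zero_grad(); loss_cls.backward(); opt_cls.step()

# Step S24: update ONLY the feature extractor by MINIMIZING the divergence,
# pulling unlabeled target features away from the decision boundary.
loss_ext = two_pass_kl(extractor(x_unlabeled))
opt_ext.zero_grad(); loss_ext.backward(); opt_ext.step()
```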
The beneficial effects of the technical scheme are as follows: specific steps for training the neural network structure based on the hybrid data and the unlabeled target domain data are provided.
As shown in fig. 2, an embodiment of the present invention provides a system for front-end text analysis in speech synthesis of a resource-limited language, including:
a training data acquisition module 201, configured to acquire training data, where the training data includes source domain data, labeled target domain data, and unlabeled target domain data; the source domain data includes text data of a resource-rich language, and the target domain data includes text data of the resource-limited language;
a neural network training module 202, configured to train a neural network structure based on mixed data and the unlabeled target domain data, where the mixed data includes the labeled source domain data and the labeled target domain data;
a front-end text analysis module 203, configured to perform speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
The working principle of the technical scheme is as follows: the resource-rich language may be, for example, Mandarin Chinese, and the resource-limited language may be, for example, a Chinese domestic dialect.
Based on semi-supervised learning, a small amount of labeled resource-limited language data is added to the resource-rich language data, so that the neural network can learn some prior knowledge of the feature distribution of the resource-limited language data and thereby improve label prediction accuracy on that data. Specifically, the training data acquisition module 201 acquires training data composed of source domain data, labeled target domain data, and unlabeled target domain data; the neural network training module 202 trains the neural network structure on the mixed data (the labeled source domain data plus the labeled target domain data) together with the unlabeled target domain data; and the front-end text analysis module 203 performs speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
The beneficial effects of the technical scheme are as follows: only a small amount of labeled data in the resource-limited language is needed, and its quality is easier to control. In addition, from a semi-supervised learning point of view, the neural network training module adds a small amount of labeled resource-limited language data in the training process, so that prior knowledge of the feature distribution of the resource-limited language data can be learned while the feature distribution of the resource-rich language data is learned. This avoids the drawback that unsupervised domain adaptation cannot take the feature distribution of the resource-limited language into account, and the semi-supervised domain adaptation technique thereby solves the problem of data resource shortage in speech synthesis front-end text analysis for resource-limited languages.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (6)

1. A method for front-end text analysis in speech synthesis of a resource-limited language, characterized by comprising the following steps:
step 1: acquiring training data, wherein the training data comprises source domain data, labeled target domain data, and unlabeled target domain data; the source domain data comprises text data of a resource-rich language, and the target domain data comprises text data of the resource-limited language; the proportion of the source domain data in the training data is 55%-65%, the proportion of the labeled target domain data is 8%-12%, and the proportion of the unlabeled target domain data is 27%-33%;
step 2: training a neural network structure based on mixed data and the unlabeled target domain data, comprising: step S21: inputting the mixed data into the neural network structure to perform supervised learning and synchronously update the network parameters of a feature extractor and a classifier; step S22: simultaneously inputting the mixed data and the unlabeled target domain data into the neural network structure to perform semi-supervised learning and update only the network parameters of the classifier, comprising: step S221: when the mixed data and the unlabeled target domain data are input into the neural network structure simultaneously, the features output by the feature extractor are input to the classifier twice, and a dropout strategy drops different network nodes in each pass, which is equivalent to sampling a first classifier network and a second classifier network; step S222: updating only the network parameters of the classifier by maximizing the KL divergence between the output probabilities of the first classifier network and the second classifier network; wherein the mixed data comprises the labeled source domain data and the labeled target domain data; and the neural network structure comprises the feature extractor and the classifier, the classifier immediately following the feature extractor;
step 3: performing speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
2. The method of claim 1, wherein the feature extractor comprises a Transformer-based encoder.
3. The method of claim 1, wherein the classifier comprises a fully connected layer, a softmax layer, and a CRF layer.
4. The method of claim 1, wherein in step S21, when the mixed data is input into the neural network structure, the output features of the feature extractor are fed to the classifier once, without applying a dropout strategy, so as to learn discriminative features.
5. The method according to claim 1, wherein step 2, training the neural network structure based on the mixed data and the unlabeled target domain data, further comprises the following steps:
step S23: in the process of training the neural network structure, the classifier and the feature extractor oppose each other so as to interactively update the network parameters in the classifier and the feature extractor;
step S24: inputting unlabeled target domain data into the neural network structure, and updating the network parameters of the feature extractor by minimizing the KL divergence between the output probabilities of the first classifier network and the second classifier network.
6. A system for speech synthesis front-end text analysis of a resource-limited language, comprising:
a training data acquisition module, configured to acquire training data, wherein the training data comprises source domain data, labeled target domain data and unlabeled target domain data; the source domain data comprises text data of a resource-rich language, and the target domain data comprises text data of a resource-limited language; the source domain data accounts for 55%-65% of the training data, the labeled target domain data accounts for 8%-12%, and the unlabeled target domain data accounts for 27%-33%;
a neural network training module, configured to train a neural network structure based on mixed data and the unlabeled target domain data, wherein the training comprises: step S21: inputting the mixed data into the neural network structure to perform supervised learning and synchronously update the network parameters of a feature extractor and a classifier; step S22: simultaneously inputting the mixed data and the unlabeled target domain data into the neural network structure to perform semi-supervised learning and update only the network parameters of the classifier, comprising: step S221: when the mixed data and the unlabeled target domain data are input into the neural network structure, feeding the output features of the feature extractor into the classifier twice, with a dropout strategy dropping different network nodes on each pass so as to sample a first classifier network and a second classifier network; step S222: updating only the network parameters of the classifier by maximizing the KL divergence between the output probabilities of the first classifier network and the second classifier network; wherein the mixed data comprises the labeled source domain data and the labeled target domain data, and the neural network structure comprises the feature extractor and the classifier, the classifier immediately following the feature extractor;
a front-end text analysis module, configured to perform speech synthesis front-end text analysis on the resource-limited language by using the trained neural network structure.
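Claims 2 and 3 specify the network as a Transformer-based encoder (the feature extractor) followed by a classifier made of a fully connected layer, a softmax layer and a CRF layer. The following is only an illustrative NumPy sketch of that classifier head, not the patented implementation; the dimensions, the random weights and the all-zero CRF transition matrix are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def viterbi_decode(emissions, transitions):
    """CRF-style decoding. emissions: (T, K) per-token tag log-scores;
    transitions: (K, K) log-score of moving from tag i to tag j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # best previous tag i for ending at tag j at time t
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical encoder output: 6 tokens, 16-dim features, 4 tags.
features = rng.standard_normal((6, 16))
W = rng.standard_normal((16, 4)) * 0.1   # fully connected layer weights
b = np.zeros(4)
logits = features @ W + b                # fully connected layer
probs = softmax(logits)                  # softmax layer
transitions = np.zeros((4, 4))           # CRF transition scores (untrained: zeros)
tags = viterbi_decode(np.log(probs), transitions)
print(tags)
```

With an all-zero transition matrix the CRF decode reduces to the per-token argmax; a trained transition matrix is what lets the CRF layer enforce consistent tag sequences over the sentence.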
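Steps S221, S222 and S24 describe an adversarial scheme: dropout samples two classifier sub-networks, the classifier is updated to maximize the KL divergence between their output probabilities, and the feature extractor is updated to minimize it. The sketch below illustrates only the sign of the two updates; it uses hypothetical toy dimensions, fixed dropout masks (so the objective is deterministic) and finite-difference gradients in place of backpropagation, and is not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def discrepancy(h, W, m1, m2):
    """KL divergence between the first and second dropout-sampled
    classifier networks (softmax over masked features times W)."""
    p1 = softmax((h * m1) @ W)  # first classifier network
    p2 = softmax((h * m2) @ W)  # second classifier network
    return kl(p1, p2)

def num_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f() with respect to array x."""
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        x[i] += eps; fp = f()
        x[i] -= 2 * eps; fm = f()
        x[i] += eps  # restore original value
        g[i] = (fp - fm) / (2 * eps)
    return g

h = rng.standard_normal((1, 6))          # stand-in for feature-extractor output
W = rng.standard_normal((6, 3))          # classifier weights, 3 output classes
m1 = (rng.random((1, 6)) >= 0.5) * 2.0   # two fixed inverted-dropout masks (rate 0.5)
m2 = (rng.random((1, 6)) >= 0.5) * 2.0

lr = 0.005
before = discrepancy(h, W, m1, m2)
# Step S222: the classifier ASCENDS the KL divergence (update W only).
W += lr * num_grad(lambda: discrepancy(h, W, m1, m2), W)
after_classifier = discrepancy(h, W, m1, m2)
# Step S24: the feature extractor DESCENDS the KL divergence (update h only).
h -= lr * num_grad(lambda: discrepancy(h, W, m1, m2), h)
after_extractor = discrepancy(h, W, m1, m2)
print(before, after_classifier, after_extractor)
```

In the patent the same tension is realized with backpropagation, and step S23 frames it as the classifier and feature extractor opposing each other, in the spirit of maximum-classifier-discrepancy and adversarial-dropout methods.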
CN202010858597.9A 2020-08-24 2020-08-24 Method and system for analyzing front-end text of voice synthesis of resource-limited language Active CN111949796B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010858597.9A CN111949796B (en) 2020-08-24 2020-08-24 Method and system for analyzing front-end text of voice synthesis of resource-limited language


Publications (2)

Publication Number Publication Date
CN111949796A CN111949796A (en) 2020-11-17
CN111949796B true CN111949796B (en) 2023-10-20

Family

ID=73359690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010858597.9A Active CN111949796B (en) 2020-08-24 2020-08-24 Method and system for analyzing front-end text of voice synthesis of resource-limited language

Country Status (1)

Country Link
CN (1) CN111949796B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239859B (en) * 2022-02-25 2022-07-08 Hangzhou Hikvision Digital Technology Co., Ltd. Power consumption data prediction method and device based on transfer learning and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909101A (en) * 2017-11-10 2018-04-13 Tsinghua University Semi-supervised transfer learning character recognition method and system based on convolutional neural networks
CN108460134A (en) * 2018-03-06 2018-08-28 Yunnan University Text topic classification model and classification method based on multi-source-domain integrated transfer learning
CN109947086A (en) * 2019-04-11 2019-06-28 Tsinghua University Mechanical fault transfer diagnosis method and system based on adversarial learning
CN110148398A (en) * 2019-05-16 2019-08-20 Ping An Technology (Shenzhen) Co., Ltd. Training method, apparatus, device and storage medium for a speech synthesis model
CN110428818A (en) * 2019-08-09 2019-11-08 Institute of Automation, Chinese Academy of Sciences Low-resource multilingual speech recognition model and speech recognition method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147854A1 (en) * 2017-11-16 2019-05-16 Microsoft Technology Licensing, Llc Speech Recognition Source to Target Domain Adaptation


Also Published As

Publication number Publication date
CN111949796A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN106446045B (en) User portrait construction method and system based on dialogue interaction
CN111949796B (en) Method and system for analyzing front-end text of voice synthesis of resource-limited language
CN112800222B Multi-task-assisted extreme multi-label short text classification method using co-occurrence information
CN116416480B (en) Visual classification method and device based on multi-template prompt learning
CN110009025A Semi-supervised additive-noise autoencoder for speech lie detection
CN112364125A (en) Text information extraction system and method combining reading course learning mechanism
CN114254077A (en) Method for evaluating integrity of manuscript based on natural language
CN111984790B (en) Entity relation extraction method
CN117350286A (en) Natural language intention translation method oriented to intention driving data link network
CN117251562A (en) Text abstract generation method based on fact consistency enhancement
CN116595169A (en) Question-answer intention classification method for coal mine production field based on prompt learning
CN112233655A (en) Neural network training method for improving voice command word recognition performance
CN116304064A (en) Text classification method based on extraction
CN111063335B (en) End-to-end tone recognition method based on neural network
CN116306502A (en) Data annotation optimization system and method for BERT classification task
CN112035680B (en) Knowledge graph construction method of intelligent auxiliary learning machine
CN112015921A (en) Natural language processing method based on learning-assisted knowledge graph
CN114281966A (en) Question template generation method, question answering device and electronic equipment
CN112541339A (en) Knowledge extraction method based on random forest and sequence labeling model
CN114564942A (en) Text error correction method, storage medium and device for supervision field
CN113434669A (en) Natural language relation extraction method based on sequence marking strategy
CN116562275B (en) Automatic text summarization method combined with entity attribute diagram
CN114444506B (en) Relation triplet extraction method for fusing entity types
CN116386637B (en) Radar flight command voice instruction generation method and system
CN111599349B (en) Method and system for training language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant