CN114707503A - Front-end text analysis method based on multi-task learning - Google Patents
Front-end text analysis method based on multi-task learning
- Publication number
- CN114707503A (application CN202210132522.1A; granted as CN114707503B)
- Authority
- CN
- China
- Prior art keywords
- word
- task
- polyphone
- speech
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a front-end text analysis method based on multi-task learning, which labels features and results on the same corpus, uses a CNN network as a shared layer to extract features from the corpus, and then feeds the features into two Bi-LSTMs trained in parallel, outputting results for the two tasks. The method comprises the following steps: S1, data annotation; S2, feature preparation; S3, feature fusion; and S4, classification. The invention combines the polyphone prediction and prosody prediction tasks through multi-task learning, realizing a unified end-to-end text processing model, i.e., a unified front-end structure, so that a high-quality Mandarin TTS system can be built more quickly and easily. Training of the unified model can use the same data as input, predict polyphones and prosody directly from the raw text at the same time, and train the two tasks in parallel, which reduces the data annotation workload, saves training cost, outputs the two results simultaneously, and simplifies the training process.
Description
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a front-end text analysis method based on multi-task learning.
Background
Text-to-Speech (TTS), also known as speech synthesis, aims to synthesize intelligible, natural speech from text. It has wide application in human communication and has long been a research topic in artificial intelligence, natural language processing, and speech processing. Developing a TTS system requires knowledge of language and human speech production, involving multiple disciplines including linguistics, acoustics, digital signal processing, and machine learning. With the development of deep learning, neural-network-based TTS has flourished, and a great deal of research has focused on different aspects of neural TTS. As a result, the quality of synthesized speech has improved greatly in recent years.
In Mandarin Chinese text-to-speech synthesis, the front-end text processing module has a large impact on the intelligibility and naturalness of the synthesized speech. The classic Mandarin TTS front end is a pipeline-based system consisting of a series of text processing components, such as text normalization (TN), Chinese word segmentation (CWS), polyphone disambiguation, prosody prediction, and grapheme-to-phoneme conversion (G2P, ZhuYin). This structure makes it possible to divide and conquer the complex front-end task. However, the serial architecture also presents several problems. One is complex feature engineering and data labeling, since each component requires different input and output labels. Another is that the front-end components must be trained and optimized separately, resulting in a very complex training process.
Disclosure of Invention
In view of the problems identified in the background art, the present invention provides a front-end text analysis method based on multi-task learning.
To solve the above technical problems, the invention provides a front-end text analysis method based on multi-task learning, which labels features and results on the same corpus, uses a CNN network as a shared layer to extract features from the corpus, feeds the features into two Bi-LSTMs for parallel training, and outputs results for the two tasks. The specific technical scheme is as follows:
S1, data annotation:
manually annotating corpora of the same source, i.e., marking different labels for different tasks; corpora of the same source are corpora with the same text but different labels;
S2, feature preparation:
S2-1, extracting word segmentation features:
segmenting the corpus with a word segmentation model and labeling the result with [BMES] tags as a one-dimensional feature fea1; the word segmentation model refers to a model that segments text into words;
S2-2, extracting part-of-speech features:
performing part-of-speech analysis on the corpus with a part-of-speech tagging model and marking the result with [POS] tags as a one-dimensional part-of-speech feature fea2;
The [POS] tag is a part-of-speech tag: a noun is marked [POS]n, a verb [POS]v, and so on;
In a text processing task, the machine first simulates understanding the language. To do so, it must grasp the rules of natural language to some extent; above all it must understand the words, in particular the nature of each word. Therefore, the corpus is tagged with the part-of-speech tagging model, each word being labeled with its corresponding part of speech;
S2-3, constructing polyphone features:
for the polyphone task, a polyphone feature tag [POLY] is constructed from a polyphone dictionary and used as a one-dimensional feature fea3 that indicates whether each character is a polyphone: it is marked 1 if the character is a polyphone and 0 otherwise; this feature effectively guides the model in the polyphone disambiguation task;
after the polyphone features are constructed, they are likewise concatenated behind the corresponding text, for example: 北[POS]ns[BMES]B[POLY]0- 京[POS]ns[BMES]E[POLY]0-$ (the two characters of "Beijing"), which reads from left to right: corpus text, part-of-speech feature [POS], word segmentation feature [BMES], polyphone feature [POLY], polyphone label ('-' here, since neither character is a polyphone), and prosody label;
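By way of illustration only (the parsing code and names below are assumptions, not part of the patent), a minimal Python sketch that reads a line in the annotation format just described:

```python
import re

# Hypothetical parser for the per-character annotation format illustrated
# above: text, [POS] part of speech, [BMES] segmentation tag, [POLY]
# polyphone flag, polyphone label ('-' when not a polyphone), and an
# optional prosody label ('$', '#1' or '#3').
LINE = "北[POS]ns[BMES]B[POLY]0- 京[POS]ns[BMES]E[POLY]0-$"

TOKEN = re.compile(
    r"(?P<char>.)"               # corpus character
    r"\[POS\](?P<pos>[a-z]+)"    # part-of-speech feature
    r"\[BMES\](?P<seg>[BMES])"   # word-segmentation feature
    r"\[POLY\](?P<poly>[01])"    # polyphone feature (1 = polyphone)
    r"(?P<label>[^#$\s]+)"       # polyphone label, e.g. '-' or a pinyin
    r"(?P<prosody>\$|#[13])?"    # optional prosody label
)

for token in LINE.split():
    m = TOKEN.fullmatch(token)
    print(m.groupdict())
```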
S3, feature fusion:
S3-1, extracting shared-layer features:
using a CNN as the shared layer, inputting sentences at the character (char) level and extracting deep-level features;
the input of the CNN is the character vectors produced by the word embedding layer, and its output is the feature vectors the CNN extracts from them;
S3-2, concatenation and fusion:
constructing word features and char features through a word embedding layer, then converting the three features fea1, fea2, fea3 and the word into feature vectors whose sizes are, respectively, word segmentation feature [4×20], part-of-speech feature [60×20], polyphone feature [2×20], and word [6048×50]; then concatenating and fusing the feature vectors;
S4, classification:
arranging the deep-level features obtained in step S3 into sentence-level features in time order, and sending the resulting feature vectors to two Bi-LSTM networks to learn contextual time dependence; then completing the polyphone disambiguation task and the prosody prediction task respectively.
Bi-LSTM is the abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. The LSTM is a recurrent neural network that predicts the output at the current moment from the history of the text and the information of the previous time step; together, the forward and backward LSTMs progressively learn information from both directions over time. The polyphone labels comprise 312 labels, and the prosody labels comprise two types, #1 and #3.
Further, in the above solution, before data annotation, step S1 further includes data processing, as follows: each sentence in the corpus is split into characters, and sentences whose length exceeds 250 are filtered out.
Further, in the above scheme, in step S1, marking different labels for different tasks specifically includes: concatenating the polyphone label and the prosody label corresponding to each text.
Further, in the above scheme, in step S2-1, the [BMES] labels are: B: beginning, M: middle, E: ending, S: single (an independent character).
Further, in the above scheme, in step S2-2, the part-of-speech features include: noun n, adjective a, verb v, conjunction c, auxiliary word u, adverb d, numeral m, punctuation mark w, preposition p, onomatopoeia o, classifier q, and pronoun r, among others. There are far more than 60 parts of speech in general; 60 here means only that the corpus used in this method contains 60 part-of-speech tags, not that 60 is the total number of part-of-speech tags.
Further, in the above scheme, step S3-1 is specifically as follows: first, chars are converted into vectors through a word embedding layer (for Chinese, both char and word take a single character as the unit; after each character is extracted, the word embedding layer converts it into a vector), the size of this layer being [6048×30]; the vectors are then fed into the CNN network in batches for feature extraction. Feature extraction by the CNN specifically means that the CNN extracts from the corpus text a numerical vector containing the text's semantics. The CNN network has one convolutional layer with 5 convolution kernels of size 3×3 and a relu activation function; the dimension of the char features equals that of the input, namely 30, and the extracted source-sentence information is used for the feature fusion of the two subsequent tasks.
Because the two tasks use the same corpus, the CNN network can serve as a shared layer to extract deep information from the source sentence; compared with a single-task model, this layer shares the information learned by the two tasks, so the generalization effect (i.e., the model's adaptability to new samples) is better.
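By way of illustration, a minimal PyTorch sketch of this shared layer, assuming the stated [6048×30] char embedding and one convolutional layer with 5 kernels of size 3×3 and relu; the patent does not specify how the 5 feature maps are reduced back to dimension 30, so the channel mean used here is an assumption:

```python
import torch
import torch.nn as nn

class SharedCNN(nn.Module):
    """Shared feature extractor sketch (S3-1): char embedding [6048 x 30],
    one conv layer with 5 kernels of size 3x3, relu. Reducing the 5 feature
    maps back to dimension 30 via a channel mean is an assumption."""
    def __init__(self, vocab_size: int = 6048, char_dim: int = 30):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, char_dim)
        self.conv = nn.Conv2d(in_channels=1, out_channels=5,
                              kernel_size=3, padding=1)  # 5 kernels, 3x3
        self.relu = nn.ReLU()

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.char_embedding(char_ids)  # (batch, seq_len) -> (batch, seq_len, 30)
        x = x.unsqueeze(1)                 # (batch, 1, seq_len, 30)
        x = self.relu(self.conv(x))        # (batch, 5, seq_len, 30)
        return x.mean(dim=1)               # (batch, seq_len, 30), dim preserved

feats = SharedCNN()(torch.randint(0, 6048, (2, 50)))
print(feats.shape)  # torch.Size([2, 50, 30])
```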
Further, in the above scheme, in step S3-2, to ensure that the two tasks can dynamically select features, two feature vectors F1 and F2 are constructed for concatenation and fusion, with dimension 140;
wherein the combined feature vector F1 of the jth word in the ith utterance in task 1 can be expressed as:
F1_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the combined feature vector F2 of the jth word in the ith utterance in task 2 can be expressed as:
F2_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
The two tasks can dynamically adjust the required features for feature fusion according to the actual situation.
The number and order of the 5 features can be adjusted as required, on the following principle: features whose effect is not evident, or which have no positive influence on the result, can be screened out according to the training results and actual requirements.
Since two feature concatenation vectors F1 and F2 are provided, concatenation can be performed according to the number of selected features; for example, three features may be selected for F1 and two or more for F2. The two tasks need not keep the same number of features at the same time.
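A minimal sketch of this concatenation step, assuming the feature sizes given in step S3-2 (word 50, char 30, and 20 each for fea1, fea2, fea3, so that all five features give the stated dimension 140); the particular subset chosen for F2 below is purely illustrative:

```python
import torch

def fuse_features(selected: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate whichever per-character feature vectors a task selects."""
    return torch.cat(selected, dim=-1)

batch, seq = 2, 50
word = torch.randn(batch, seq, 50)   # word embedding, table [6048 x 50]
char = torch.randn(batch, seq, 30)   # shared-layer char features (S3-1)
fea1 = torch.randn(batch, seq, 20)   # segmentation [BMES], table [4 x 20]
fea2 = torch.randn(batch, seq, 20)   # part of speech [POS], table [60 x 20]
fea3 = torch.randn(batch, seq, 20)   # polyphone [POLY], table [2 x 20]

F1 = fuse_features([word, char, fea1, fea2, fea3])  # all 5 -> dimension 140
F2 = fuse_features([word, char, fea2])              # a smaller subset -> 100
print(F1.shape, F2.shape)
```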
Further, in the above solution, in step S4, the polyphone disambiguation task judges whether each character is a polyphone and then disambiguates its phonetic annotation (pronunciation), while the prosody prediction task labels the prosody of each character in the sentence.
Further, in the foregoing solution, in step S4, the Bi-LSTM output layer is followed by a linear layer whose output dimension is the number of labels for each task, with softmax as the activation function.
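By way of illustration, a sketch of one task branch as just described: a Bi-LSTM followed by a linear layer sized to the task's label count, with softmax. The 200 hidden units are taken from the embodiment below; other details are assumptions:

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """One task branch sketch (S4): Bi-LSTM with 1 hidden layer, then a
    linear layer whose output size is the task's number of labels."""
    def __init__(self, input_dim: int = 140, hidden: int = 200,
                 num_labels: int = 312):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, num_labels)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        h, _ = self.bilstm(fused)                     # (batch, seq, 2*hidden)
        return torch.softmax(self.linear(h), dim=-1)  # per-char label distribution

polyphone_head = TaskHead(num_labels=312)  # 312 polyphone labels
prosody_head   = TaskHead(num_labels=2)    # prosody labels #1 and #3
```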
Compared with the prior art, the beneficial effects of the invention are embodied in the following points:
Firstly, the multi-task learning method combines the polyphone prediction and prosody prediction tasks into a unified end-to-end text processing model, i.e., a unified front-end structure, so that a high-quality Mandarin TTS system can be built more quickly and easily.
Secondly, a shared layer is used in the model, which reduces the computing resources the model occupies and correspondingly increases its synthesis speed.
Thirdly, training of the unified model can use the same data as input, predict polyphones and prosody directly from the raw text at the same time, and train the two tasks in parallel, which reduces the data annotation workload, saves training cost, and outputs the two results simultaneously, simplifying the training process.
Drawings
FIG. 1 is a model framework diagram of a front-end text analysis method based on multi-task learning according to the present invention.
Detailed Description
Examples
To validate the invention, verification is performed on a self-built database. The training set contains 971500 sentences, the test/validation set contains 3000 sentences, the polyphone dictionary contains 312 entries (corresponding to the 312 polyphone labels), and the prosody labels comprise #1 and #3. The algorithm flow of the whole system is shown in fig. 1, with reference to which the invention is described in further detail below.
FIG. 1 is a model framework diagram of a front-end text analysis method based on multi-task learning according to the present invention. As shown in fig. 1, the method mainly comprises the following steps:
S1, data annotation:
firstly, data processing is carried out: each sentence in the corpus is split into characters, and sentences whose length exceeds 250 are filtered out;
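A one-function sketch of this data-processing step (the helper name is hypothetical):

```python
# Split each sentence into characters and drop over-long sentences (S1).
def preprocess(sentences: list[str], max_len: int = 250) -> list[list[str]]:
    return [list(s) for s in sentences if len(s) <= max_len]

print(preprocess(["北京欢迎你", "x" * 300]))  # the 300-char sentence is dropped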
then, corpora of the same source are manually annotated, and the polyphone label and the prosody label are concatenated to each text; corpora of the same source are corpora with the same text but different labels;
S2, feature preparation:
S2-1, extracting word segmentation features:
segmenting the corpus with a word segmentation model and labeling the result with [BMES] tags as a one-dimensional feature fea1; the [BMES] labels are: B: beginning, M: middle, E: ending, S: single (an independent character). The word segmentation model refers to a model that segments text into words;
S2-2, extracting part-of-speech features:
performing part-of-speech analysis on the corpus with a part-of-speech tagging model and marking the result with [POS] tags as a one-dimensional part-of-speech feature fea2;
The part-of-speech features include: noun n, adjective a, verb v, conjunction c, auxiliary word u, adverb d, numeral m, punctuation mark w, preposition p, onomatopoeia o, classifier q, and pronoun r, among others. There are far more than 60 parts of speech in general; 60 here means only that the corpus used in this method contains 60 part-of-speech tags, not that 60 is the total number of part-of-speech tags.
The [POS] tag is a part-of-speech tag: a noun is marked [POS]n, a verb [POS]v, and so on;
In a text processing task, the machine first simulates understanding the language. To do so, it must grasp the rules of natural language to some extent; above all it must understand the words, in particular the nature of each word. Therefore, the corpus is tagged with the part-of-speech tagging model, each word being labeled with its corresponding part of speech. The complete part-of-speech tag set is too large to list in full; as an example, tagging a whole sentence meaning roughly "I am afraid that worrying will keep me from getting to Yunnan" yields, word by word: pronoun /r, adverb /d, verb /v, verbal noun /vn, auxiliary /u, directional verb /vf, adverb /d, auxiliary /u, place name /ns;
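The patent names no specific segmentation or part-of-speech tool; purely as an illustration, an off-the-shelf tagger such as jieba can supply the word/POS pairs from which the [BMES] feature fea1 and the [POS] feature fea2 are derived:

```python
import jieba.posseg as pseg  # jieba is an illustrative choice, not the patent's

def bmes_tags(word: str) -> list[str]:
    """Map one segmented word to per-character [BMES] tags (S2-1)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

for p in pseg.cut("我明天去云南"):            # "I am going to Yunnan tomorrow"
    print(p.word, p.flag, bmes_tags(p.word))  # word, POS tag, per-char BMES
```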
S2-3, constructing polyphone features:
for the polyphone task, a polyphone feature tag [POLY] is constructed from a polyphone dictionary and used as a one-dimensional feature fea3 that indicates whether each character is a polyphone: it is marked 1 if the character is a polyphone and 0 otherwise; this feature effectively guides the model in the polyphone disambiguation task;
after the polyphone features are constructed, they are likewise concatenated behind the corresponding text, for example: 北[POS]ns[BMES]B[POLY]0- 京[POS]ns[BMES]E[POLY]0-$ (the two characters of "Beijing"), which reads from left to right: corpus text, part-of-speech feature [POS], word segmentation feature [BMES], polyphone feature [POLY], polyphone label ('-' here, since neither character is a polyphone), and prosody label;
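A minimal sketch of constructing the [POLY] feature from a polyphone dictionary; the three dictionary entries below are illustrative only:

```python
# Per-character 0/1 polyphone flag looked up in a polyphone dictionary (S2-3).
POLYPHONE_DICT = {"行", "长", "重"}  # illustrative entries only

def poly_feature(sentence: str) -> list[int]:
    """1 if the character is a polyphone, 0 otherwise."""
    return [1 if ch in POLYPHONE_DICT else 0 for ch in sentence]

print(poly_feature("银行行长"))  # [0, 1, 1, 1] -> 银 is not in the dictionary
```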
S3, feature fusion:
S3-1, extracting shared-layer features:
using a CNN as the shared layer, inputting sentences at the character (char) level and extracting deep-level features;
first, chars are converted into vectors through a word embedding layer (for Chinese, both char and word take a single character as the unit; after each character is extracted, the word embedding layer converts it into a vector), the size of this layer being [6048×30]; the vectors are then fed into the CNN network in batches for feature extraction. Feature extraction by the CNN specifically means that the CNN extracts from the corpus text a numerical vector containing the text's semantics. The CNN network has one convolutional layer with 5 convolution kernels of size 3×3 and a relu activation function; the dimension of the char features equals that of the input, namely 30, and the extracted source-sentence information is used for the feature fusion of the two subsequent tasks.
Because the two tasks use the same corpus, the CNN network can serve as a shared layer to extract deep information from the source sentence; compared with a single-task model, this layer shares the information learned by the two tasks, so the generalization effect (i.e., the model's adaptability to new samples) is better.
The input of the CNN is the character vectors produced by the word embedding layer, and its output is the feature vectors the CNN extracts from them;
S3-2, concatenation and fusion:
constructing word features and char features through a word embedding layer, then converting the three features fea1, fea2, fea3 and the word into feature vectors whose sizes are, respectively, word segmentation feature [4×20], part-of-speech feature [60×20], polyphone feature [2×20], and word [6048×50]; then concatenating and fusing the feature vectors;
to ensure that the two tasks can dynamically select features, two feature vectors F1 and F2 are constructed for concatenation and fusion, with dimension 140;
wherein the combined feature vector F1 of the jth word in the ith utterance in task 1 can be expressed as:
F1_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the combined feature vector F2 of the jth word in the ith utterance in task 2 can be expressed as:
F2_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
The two tasks can dynamically adjust the required features for feature fusion according to the actual situation.
The number and order of the 5 features can be adjusted as required, on the following principle: features whose effect is not evident, or which have no positive influence on the result, can be screened out according to the training results and actual requirements.
Since two feature concatenation vectors F1 and F2 are provided, concatenation can be performed according to the number of selected features; for example, three features may be selected for F1 and two or more for F2. The two tasks need not keep the same number of features at the same time.
S4, classification:
arranging the deep-level features obtained in step S3 into sentence-level features in time order, and sending the resulting feature vectors to two Bi-LSTM networks to learn contextual time dependence; then completing the polyphone disambiguation task and the prosody prediction task respectively.
The polyphone disambiguation task judges whether each character is a polyphone and then disambiguates its phonetic annotation, while the prosody prediction task labels the prosody of each character in the sentence.
A linear layer follows the Bi-LSTM output layer; its output dimension is the number of labels for each task, and the activation function is softmax.
As shown in fig. 1, to keep the model size and decoding speed in check, the network has only one hidden layer with 200 hidden units; the two LSTM networks in fig. 1 have the same structure, with hidden layers of 200, an input layer of 140 (the input-layer size varies with the concatenation dimension, e.g., 120 for a concatenation of two features), and an output layer of 200.
Bi-LSTM is the abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. The LSTM is a recurrent neural network that predicts the output at the current moment from the history of the text and the information of the previous time step; together, the forward and backward LSTMs progressively learn information from both directions over time. The polyphone labels comprise 312 labels, and the prosody labels comprise two types, #1 and #3.
The invention is based on a unified CNN-BLSTM model, configured as follows:
The CNN has one convolutional layer with n1 convolution kernels of size k1×k1. Feature fusion is then carried out and the result is fed into two BLSTMs, each with 1 hidden layer of u hidden units. Fully connected layers with s1 and s2 hidden units then map the features to the label1 and label2 dimensions, respectively.
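By way of illustration, the following sketch wires the pieces above into the unified model of fig. 1, reusing the SharedCNN and TaskHead sketches given earlier; the embedding table sizes follow step S3-2, and all other wiring details are assumptions:

```python
import torch
import torch.nn as nn

class UnifiedFrontEnd(nn.Module):
    """End-to-end sketch of the unified CNN-BLSTM model of fig. 1, reusing
    the SharedCNN and TaskHead sketches above. Embedding table sizes follow
    S3-2 ([6048x50] word, [4x20] BMES, [60x20] POS, [2x20] POLY)."""
    def __init__(self):
        super().__init__()
        self.shared   = SharedCNN(vocab_size=6048, char_dim=30)  # shared layer
        self.word_emb = nn.Embedding(6048, 50)
        self.seg_emb  = nn.Embedding(4, 20)    # [BMES]
        self.pos_emb  = nn.Embedding(60, 20)   # [POS]
        self.poly_emb = nn.Embedding(2, 20)    # [POLY]
        self.polyphone_head = TaskHead(input_dim=140, hidden=200, num_labels=312)
        self.prosody_head   = TaskHead(input_dim=140, hidden=200, num_labels=2)

    def forward(self, chars, words, seg, pos, poly):
        char_f = self.shared(chars)                              # (B, T, 30)
        fused = torch.cat([self.word_emb(words), char_f,
                           self.seg_emb(seg), self.pos_emb(pos),
                           self.poly_emb(poly)], dim=-1)         # (B, T, 140)
        # both tasks are trained in parallel on the same fused features
        return self.polyphone_head(fused), self.prosody_head(fused)
```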
Table 1 shows the effect obtained by the front-end text analysis method based on multitask learning.
TABLE 1 Effect of the method of the invention on polyphone disambiguation and prosody prediction
As can be seen from Table 1: 1) the unified model achieves test accuracy comparable to the single models when tested on the same test set; 2) it is superior to the serial single models in model size and decoding speed; 3) its training time and complexity are much reduced compared with the single models. The method can therefore build a high-quality TTS system more quickly and efficiently, which demonstrates its effectiveness.
Claims (9)
1. A front-end text analysis method based on multi-task learning, characterized by comprising the following steps:
S1, data annotation:
manually annotating corpora of the same source, i.e., marking different labels for different tasks; corpora of the same source are corpora with the same text but different labels;
S2, feature preparation:
S2-1, extracting word segmentation features:
segmenting the corpus with a word segmentation model and labeling the result with [BMES] tags as a one-dimensional feature fea1;
S2-2, extracting part-of-speech features:
performing part-of-speech analysis on the corpus with a part-of-speech tagging model and marking the result with [POS] tags as a one-dimensional part-of-speech feature fea2;
S2-3, constructing polyphone features:
for the polyphone task, constructing a polyphone feature tag [POLY] from a polyphone dictionary as a one-dimensional feature fea3 indicating whether each character is a polyphone: marked 1 if the character is a polyphone and 0 otherwise;
S3, feature fusion:
S3-1, extracting shared-layer features:
using a CNN as the shared layer, inputting sentences at the char level and extracting deep-level features;
S3-2, concatenation and fusion:
constructing word features and char features through a word embedding layer, then converting the three features fea1, fea2, fea3 and the word into feature vectors, and then concatenating and fusing the feature vectors;
S4, classification:
arranging the deep-level features obtained in step S3 into sentence-level features in time order, and sending the resulting feature vectors to two Bi-LSTM networks to learn contextual time dependence; then completing the polyphone disambiguation task and the prosody prediction task respectively.
2. The front-end text analysis method based on multi-task learning according to claim 1, wherein step S1 further includes data processing before data annotation, as follows: each sentence in the corpus is split into characters, and sentences whose length exceeds 250 are filtered out.
3. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S1, marking different labels for different tasks specifically includes: concatenating the polyphone label and the prosody label corresponding to each text.
4. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S2-1, the [BMES] labels are: B: beginning, M: middle, E: ending, S: single (an independent character).
5. The front-end text analysis method based on multi-task learning according to claim 4, wherein in step S2-2, the part-of-speech features include: noun n, adjective a, verb v, conjunction c, auxiliary word u, adverb d, numeral m, punctuation mark w, preposition p, onomatopoeia o, classifier q, and pronoun r.
6. The front-end text analysis method based on multi-task learning according to claim 1, wherein step S3-1 is specifically: first converting chars into vectors through a word embedding layer of size [6048×30], then feeding the vectors into the CNN network in batches for feature extraction; the extracted source-sentence information is used for the feature fusion of the two subsequent tasks.
7. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S3-2, two feature vectors F1 and F2 are constructed for concatenation and fusion, with dimension 140;
wherein the combined feature vector F1 of the jth word in the ith utterance in task 1 can be expressed as:
F1_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the combined feature vector F2 of the jth word in the ith utterance in task 2 can be expressed as:
F2_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the two tasks can dynamically adjust the required features for feature fusion according to the actual situation.
8. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S4, the polyphone disambiguation task judges whether each character is a polyphone and then disambiguates its phonetic annotation, and the prosody prediction task labels the prosody of each character in the sentence.
9. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S4, the Bi-LSTM output layer is followed by a linear layer whose output dimension is the number of labels for each task, and the activation function is softmax.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210132522.1A CN114707503B (en) | 2022-02-14 | 2022-02-14 | Front-end text analysis method based on multi-task learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210132522.1A CN114707503B (en) | 2022-02-14 | 2022-02-14 | Front-end text analysis method based on multi-task learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114707503A true CN114707503A (en) | 2022-07-05 |
CN114707503B CN114707503B (en) | 2023-04-07 |
Family
ID=82167283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210132522.1A Active CN114707503B (en) | 2022-02-14 | 2022-02-14 | Front-end text analysis method based on multi-task learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114707503B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951547A (en) * | 2024-03-26 | 2024-04-30 | 紫金诚征信有限公司 | Bid and tendered data processing method and device based on artificial intelligence |
CN118332121A (en) * | 2024-04-24 | 2024-07-12 | 江苏侯曦信息科技有限公司 | Front-end text analysis method based on multitask learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030135356A1 (en) * | 2002-01-16 | 2003-07-17 | Zhiwei Ying | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN104867491A (en) * | 2015-06-17 | 2015-08-26 | 百度在线网络技术(北京)有限公司 | Training method and device for prosody model used for speech synthesis |
CN111951779A (en) * | 2020-08-19 | 2020-11-17 | 广州华多网络科技有限公司 | Front-end processing method for speech synthesis and related equipment |
CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
- 2022-02-14: CN CN202210132522.1A patent/CN114707503B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030135356A1 (en) * | 2002-01-16 | 2003-07-17 | Zhiwei Ying | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN104867491A (en) * | 2015-06-17 | 2015-08-26 | 百度在线网络技术(北京)有限公司 | Training method and device for prosody model used for speech synthesis |
CN111951779A (en) * | 2020-08-19 | 2020-11-17 | 广州华多网络科技有限公司 | Front-end processing method for speech synthesis and related equipment |
CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
周开来 (Zhou Kailai): "基于语音数据库的文语转换系统过程分析" (Process analysis of a text-to-speech conversion system based on a speech database) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951547A (en) * | 2024-03-26 | 2024-04-30 | 紫金诚征信有限公司 | Bid and tendered data processing method and device based on artificial intelligence |
CN117951547B (en) * | 2024-03-26 | 2024-06-21 | 紫金诚征信有限公司 | Bid and tendered data processing method and device based on artificial intelligence |
CN118332121A (en) * | 2024-04-24 | 2024-07-12 | 江苏侯曦信息科技有限公司 | Front-end text analysis method based on multitask learning |
Also Published As
Publication number | Publication date |
---|---|
CN114707503B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112352275A (en) | Neural text-to-speech synthesis with multi-level textual information | |
CN114707503B (en) | Front-end text analysis method based on multi-task learning | |
Poostchi et al. | BiLSTM-CRF for Persian named-entity recognition ArmanPersoNERCorpus: the first entity-annotated Persian dataset | |
CN112818089B (en) | Text phonetic notation method, electronic equipment and storage medium | |
Sangeetha et al. | Speech translation system for english to dravidian languages | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
Rebai et al. | Text-to-speech synthesis system with Arabic diacritic recognition system | |
Fashwan et al. | SHAKKIL: an automatic diacritization system for modern standard Arabic texts | |
CN111951781A (en) | Chinese prosody boundary prediction method based on graph-to-sequence | |
Li et al. | Chinese prosody phrase break prediction based on maximum entropy model. | |
Belay et al. | Impacts of homophone normalization on semantic models for amharic | |
Altıntaş et al. | Improving the performance of graph based dependency parsing by guiding bi-affine layer with augmented global and local features | |
Mahata et al. | JUNLP@ Dravidian-CodeMix-FIRE2020: Sentiment classification of code-mixed tweets using bi-directional RNN and language tags | |
Wang et al. | Investigation of using continuous representation of various linguistic units in neural network based text-to-speech synthesis | |
Ali et al. | Gemination prediction using DNN for Arabic text-to-speech synthesis | |
CN115374784A (en) | Chinese named entity recognition method based on multi-mode information selective fusion | |
Park et al. | Jejueo datasets for machine translation and speech synthesis | |
Jauk et al. | Expressive speech synthesis using sentiment embeddings | |
Yadav et al. | Different Models of Transliteration-A Comprehensive Review | |
KR100202292B1 (en) | Text analyzer | |
Kalita et al. | NMT for a Low Resource Language Bodo: Preprocessing and Resource Modelling | |
Liu et al. | Phonologically aware bilstm model for mongolian phrase break prediction with attention mechanism | |
Anto et al. | Text to speech synthesis system for English to Malayalam translation | |
Rahate et al. | An experimental technique on text normalization and its role in speech synthesis | |
Reddy et al. | Creation of GIF dataset and implementation of a speech-to-sign language translator in Telugu |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |