CN111354333A - Chinese prosody hierarchy prediction method and system based on self-attention - Google Patents

Chinese prosody hierarchy prediction method and system based on self-attention

Info

Publication number
CN111354333A
CN111354333A (application CN201811571546.7A; granted as CN111354333B)
Authority
CN
China
Prior art keywords
word
prosody
prosodic
sequence
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811571546.7A
Other languages
Chinese (zh)
Other versions
CN111354333B (en)
Inventor
张鹏远 (Zhang Pengyuan)
卢春晖 (Lu Chunhui)
颜永红 (Yan Yonghong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201811571546.7A
Publication of CN111354333A
Application granted
Publication of CN111354333B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a self-attention-based Chinese prosody hierarchy prediction method, comprising the following steps: learning character vectors for individual characters from a large amount of unlabeled text, converting the text to be predicted into a character-vector sequence using these character vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text. The method performs Chinese prosody hierarchy prediction with a single prosody hierarchy prediction model. While maintaining prediction performance, it takes character-granularity features as input, avoiding dependence on a word segmentation system and the negative effects that dependence may cause. The model directly models the relationship between any two characters in the text with a self-attention mechanism, enabling parallel computation; its performance is further improved by pre-training on additional data, so that all prosody levels of the text to be processed are predicted accurately and simultaneously, avoiding error propagation.

Description

Chinese prosody hierarchy prediction method and system based on self-attention
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a Chinese prosody hierarchy prediction method and system based on self-attention.
Background
In a speech synthesis system, predicting the prosodic hierarchy of the input text to be synthesized is a crucial step, and the prediction result is used as part of the linguistic features for modeling acoustic features and duration. The accuracy of prosody hierarchy prediction therefore largely determines the naturalness of the synthesized speech, so accurate prosody hierarchy prediction is of great significance.
The current mainstream method uses a bidirectional long short-term memory network (BLSTM) with word vectors as input and models each prosody level separately, i.e., it trains one model each for prosodic words, prosodic phrases, and intonation phrases, and feeds the prediction result of the lower level as input to the higher level, realizing step-by-step prosody prediction.
However, this approach has the following problems: 1) as an RNN structure, the LSTM requires the output of the previous time step for every prediction; this sequential computation hinders parallelization and makes the path length between any two characters O(n); 2) training and predicting a prosody model at word granularity means the input text must first undergo word segmentation, and the segmentation result directly affects the performance of prosody hierarchy prediction. Moreover, the number of Chinese vocabulary entries is huge, so storing their word vectors occupies a large amount of storage space and computational resources, which is clearly impractical for offline speech synthesis; 3) step-by-step prosody prediction passes erroneous results onward, causing errors in subsequent predictions.
Prosody hierarchy prediction is thus an essential step in a speech synthesis system, but the current mainstream method relies on word-level features, and hence on the performance of a word segmentation system, and its step-by-step prediction continuously propagates erroneous results.
Disclosure of Invention
The invention aims to solve, at least to some extent, the problems in the related art, and provides a prosody hierarchy prediction method that takes characters as the basic unit of the model, avoiding dependence on a word segmentation system and reducing the storage requirement; it further realizes simultaneous prediction of multiple prosody levels with a single model, solving the error-propagation problem.
In order to achieve the above object, the present invention provides a self-attention-based Chinese prosody hierarchy prediction method, comprising:
learning character vectors for individual characters from a large amount of unlabeled text, converting the text to be predicted into a character-vector sequence using these character vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text.
As an improvement of the above method, the training of the prosody hierarchy prediction model comprises:
step 1) learning character vectors for individual characters from a large amount of unlabeled text;
step 2) converting the text corresponding to the word segmentation data into a character-vector sequence using the character vectors obtained in step 1), and deriving a word-position tag sequence from the segmentation result;
step 3) constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training it with the character-vector sequence and word-position tag sequence of the segmentation data obtained in step 2) as input and output, respectively;
step 4) converting the text corresponding to the prosody-annotated data into a character-vector sequence using the character vectors obtained in step 1), deriving a word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody hierarchy prediction model again with the character-vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody hierarchy prediction model.
As an improvement of the above method, step 1) is specifically: based on the continuous bag-of-words model (CBOW), setting the character-vector dimension to d, training on a large amount of unlabeled text to obtain initial character vectors for all individual characters in the text, and building a character table from the character-to-vector mapping.
As an improvement of the above method, step 2) further comprises:
step 2-1) looking up the character vector of each character in the character table according to the text of the word segmentation data, thereby determining the character-vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the segmented text from each character's position within its word, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively.
As an improvement of the above method, step 3) further comprises:
step 3-1) constructing a prosody hierarchy prediction model with N layers, each layer containing a feedforward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively; the prediction model has four output layers: three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions to perform word segmentation of the text;
the feedforward neural network sublayer consists of two linear projections connected by a rectified linear unit activation:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d × d_f and d_f × d, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, V, which then undergo a scaled dot-product attention operation to yield a matrix M; the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where Softmax() is the normalized exponential function and d_k is the dimension of K;
step 3-2) encoding the different positions of the input sequence with sine and cosine functions of different frequencies:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the positional encoding has the same dimension d as the input character vector, and the two are summed to form the input of the prosody hierarchy prediction model;
step 3-3) pre-training the prosody hierarchy prediction model;
iteration proceeds with the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:
C = -(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
where y ∈ {0,1} is the expected output, a ∈ [0,1] is the actual output value, x ranges over the nodes of the output layer, and n is the number of output-layer nodes; the model parameters are updated by backpropagation with stochastic gradient descent.
As an improvement of the above method, step 4) further comprises:
step 4-1) looking up the character vector of each character in the character table according to the text of the prosody-annotated data, thereby determining the character-vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody-data text from its corresponding segmentation result, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively;
step 4-3) determining the label sequence for each prosody level (prosodic word, prosodic phrase, intonation phrase) from the prosody annotations, where B denotes that a character is a prosodic boundary and NB that it is not.
As an improvement of the above method, step 5) is specifically: starting from the model pre-trained in step 3), the character-vector sequence of the prosody data is taken as model input, and the word-position tag sequence and the prosody label sequence of each level as model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers, and the model parameters are updated by backpropagation with stochastic gradient descent, yielding the trained prosody hierarchy prediction model.
The invention also provides a self-attention-based Chinese prosody hierarchy prediction system, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the above method.
The invention has the following advantages:
1. while maintaining prediction performance, the prosody hierarchy prediction model of the invention takes character-granularity features as input, avoiding dependence on a word segmentation system and the negative effects that dependence may cause, while also reducing the model size;
2. the prosody hierarchy prediction model directly models the relationship between any two characters in the text with a self-attention mechanism, enabling parallel computation; its performance is further improved by pre-training on additional data, realizing accurate prediction of the prosody hierarchy of the text to be processed;
3. the method predicts multiple prosody levels simultaneously with a single model, avoiding error propagation.
Drawings
Fig. 1 is a flow chart of the self-attention-based Chinese prosody hierarchy prediction method according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a self-attention-based Chinese prosody prediction method. The method takes character vectors as input features, models the dependency relationships among the characters of a text through a self-attention mechanism, and sets an independent output layer for each prosody level, thereby predicting all prosody levels simultaneously. It avoids dependence on a word segmentation system while achieving accurate prediction of the text's prosody hierarchy.
The invention provides a self-attention-based method for constructing a Chinese prosody hierarchy prediction model, comprising: learning character vectors for individual characters from a large amount of unlabeled text; obtaining the character-vector sequence and word-position tag sequence of the text from the character vectors and the word segmentation data; constructing a prosody prediction model based on a self-attention mechanism and pre-training it on the character-vector sequence and word-position tag sequence of the segmentation data; obtaining the character-vector sequence, word-position tag sequence, and per-level prosody label sequences of the text from the character vectors and the prosody-annotated data carrying word segmentation information; and continuing training from the pre-trained prosody hierarchy prediction model on those sequences of the prosody data. The method is based on character-level features, models the relationship between any two characters in the text directly through a self-attention mechanism, and improves model performance by pre-training on additional data, thereby achieving accurate prediction of the prosody hierarchy of the text to be processed.
The method of the invention comprises the following steps:
Step 1) constructing and training the prosody hierarchy prediction model; as shown in Fig. 1, this step specifically includes:
Step 101), learning character vectors for individual characters from a large amount of unlabeled text.
Unlabeled text is collected from corpora in various domains, and the characters in the text serve as the basic training units. Based on the continuous bag-of-words model (CBOW), the character-vector dimension is set to d, and an initial character vector is obtained for each character by training. A character table is built from the character-to-vector mapping.
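For illustration only, such character vectors could be trained with any CBOW implementation; the minimal sketch below uses gensim's Word2Vec with sg=0 (CBOW) on character-tokenized text. The corpus file name and the dimension d = 256 are assumptions, not values from the patent.

    # Hedged sketch: training per-character CBOW vectors with gensim (assumed setup).
    from gensim.models import Word2Vec

    def char_sentences(path):
        """Yield each line of the corpus as a list of single characters."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield list(line)  # one token per character; no word segmentation needed

    d = 256  # the patent's character-vector dimension d; the value 256 is assumed
    model = Word2Vec(
        sentences=list(char_sentences("corpus.txt")),  # "corpus.txt" is hypothetical
        vector_size=d,  # embedding dimension (gensim 4.x parameter name)
        window=5,
        min_count=1,
        sg=0,           # sg=0 selects CBOW, as the patent specifies
    )
    # The "character table": a mapping from each character to its initial vector.
    char_table = {ch: model.wv[ch] for ch in model.wv.index_to_key}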
Step 102), obtaining the character-vector sequence and word-position tag sequence of the text from the character vectors and the word segmentation data.
The character-vector feature sequence is obtained by looking up each character of the segmented text in the character table and retrieving the corresponding character vector.
The word-position tag sequence is determined from each character's position within its word in the segmented text: B, M, E, S indicate a character at the beginning of a word, in the middle of a word, at the end of a word, and a single-character word, respectively.
Specifically, for the segmented text "Alibaba is completely different from Walmart", the word-position tag sequence is: [B, M, M, E, S, B, M, E, B, E, S, S].
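As an illustrative sketch (the helper name and the example segmentation are assumptions inferred from the tag sequence above, not code from the patent), the B/M/E/S word-position tags can be derived from a segmented sentence as follows:

    # Hedged sketch: deriving B/M/E/S word-position tags from segmented words.
    from typing import List

    def bmes_tags(words: List[str]) -> List[str]:
        """Return one tag per character: B(egin), M(iddle), E(nd), S(ingle)."""
        tags = []
        for w in words:
            if len(w) == 1:
                tags.append("S")
            else:
                tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
        return tags

    # Segmentation assumed from the tag sequence quoted above:
    # 阿里巴巴 / 和 / 沃尔玛 / 完全 / 不 / 同
    print(bmes_tags(["阿里巴巴", "和", "沃尔玛", "完全", "不", "同"]))
    # -> ['B', 'M', 'M', 'E', 'S', 'B', 'M', 'E', 'B', 'E', 'S', 'S']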
Step 103), constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training the model with the character-vector feature sequence and word-position tag sequence of the segmentation data obtained in step 102).
The constructed prosody hierarchy prediction model consists of N layers, each containing a feedforward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively. The model has four output layers. Three perform prosody level prediction, i.e., they predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, realizing simultaneous multi-level prosody prediction within one model. The fourth output layer performs a word segmentation task: since prosody hierarchy boundaries are built on top of grammatical words, the segmentation task is introduced to obtain word-level information and thereby improve the accuracy of prosody hierarchy prediction.
Specifically, the feedforward neural network sublayer consists of two linear projections connected by a rectified linear unit activation:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d × d_f and d_f × d, and b_1 and b_2 are bias vectors.
The self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, V, which then undergo a scaled dot-product attention operation to yield a matrix M; the M of all heads are concatenated and linearly projected to give the sublayer output. M is computed as:
M = Softmax(QK^T / √d_k)V
where Softmax() is the normalized exponential function and d_k is the dimension of K.
Because the model contains no sequence model such as an RNN, it cannot by itself capture ordering information; therefore sine and cosine functions of different frequencies are used to encode the different positions of the input sequence, introducing the ordering relationship among the characters to some extent. The encoding functions are:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index. The positional encoding has the same dimension d as the input character vector, and the two are summed to form the model input.
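A direct transcription of this encoding (illustrative only; the function name is an assumption, and an even dimension d is assumed):

    # Hedged sketch: sinusoidal positional encoding as defined above.
    import numpy as np

    def positional_encoding(seq_len: int, d: int) -> np.ndarray:
        t = np.arange(seq_len)[:, None]            # positions t = 0 .. seq_len-1
        two_i = np.arange(0, d, 2)[None, :]        # the even dimension indices 2i
        angles = t / np.power(10000.0, two_i / d)  # t / 10000^(2i/d)
        pe = np.zeros((seq_len, d))
        pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
        pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
        return pe

    # Summed with the character-vector sequence to form the model input:
    # model_input = char_vectors + positional_encoding(len(text), d)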
When the model is pre-trained, iteration proceeds with the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:
C = -(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
where y ∈ {0,1} is the expected output, a ∈ [0,1] is the actual output value of the network, x ranges over the nodes of the output layer, and n is the number of output-layer nodes.
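A direct transcription of this cost function (illustrative only; the values in the example are made up):

    # Hedged sketch: the cross-entropy cost C defined above, computed directly.
    import numpy as np

    def cross_entropy(y: np.ndarray, a: np.ndarray) -> float:
        """y: expected outputs in {0,1}; a: actual outputs in (0,1); one entry per node."""
        n = len(y)
        return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n

    # Example: three output nodes with expected [1, 0, 1] and actual [0.9, 0.2, 0.8]
    print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.8])))  # ~0.184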
Step 104), obtaining the character-vector sequence, word-position tag sequence, and per-level prosody label sequences of the text from the character vectors and the prosody-annotated data carrying word segmentation information.
The character-vector sequence and word-position tag sequence are obtained as in step 102). The label sequence of each prosody level (prosodic word, prosodic phrase, intonation phrase) is determined from the prosody annotations: B denotes that a character is a prosodic boundary at that level, and NB that it is not.
Specifically, for the prosody-annotated text "Alibaba #1 and #1 Walmart #2 completely #1 different #3" (the same sentence as above, with #1, #2, #3 marking prosodic word, prosodic phrase, and intonation phrase boundaries, respectively), the prosodic word label sequence is [NB, NB, NB, B, B, NB, NB, B, NB, B, NB, B].
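A sketch of this derivation (the helper name and the exact marker convention are assumptions consistent with the example above):

    # Hedged sketch: deriving per-level B/NB label sequences from #1/#2/#3 markers.
    import re

    def prosody_labels(annotated: str):
        """Tag each character B or NB per level; a boundary of rank r is also a
        boundary at every lower level (e.g. #3 implies #2 and #1)."""
        pieces = re.split(r"#([1-3])", annotated)          # alternating text / rank
        chunks = list(zip(pieces[0::2], pieces[1::2]))
        levels = {1: [], 2: [], 3: []}                     # 1=PW, 2=PPH, 3=IPH
        for text, rank in chunks:
            for level, tags in levels.items():
                tags.extend(["NB"] * (len(text) - 1))      # non-final characters
                tags.append("B" if int(rank) >= level else "NB")
        return levels

    # Example with the sentence above (Chinese original assumed):
    print(prosody_labels("阿里巴巴#1和#1沃尔玛#2完全#1不同#3")[1])
    # -> ['NB', 'NB', 'NB', 'B', 'B', 'NB', 'NB', 'B', 'NB', 'B', 'NB', 'B']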
Step 105), continuing training from the model pre-trained in step 103), using the character-vector sequence, word-position tag sequence, and per-level prosody label sequences obtained in step 104).
The character-vector sequence serves as the model input, and the word-position tag sequence and per-level prosody label sequences serve as the model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers.
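A sketch of this multi-task criterion (assuming the four-head model sketched earlier; names are illustrative):

    # Hedged sketch: fine-tuning loss = sum of cross entropies over the four heads.
    import torch.nn.functional as F

    def multitask_loss(logits, targets):
        """logits: 4-tuple from the seg/PW/PPH/IPH heads, each (batch, seq, classes);
        targets: 4-tuple of class-index tensors, each (batch, seq)."""
        return sum(
            F.cross_entropy(l.transpose(1, 2), t)  # expects (batch, classes, seq)
            for l, t in zip(logits, targets)
        )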
Step 2), converting the text to be predicted into a character-vector sequence using the character vectors of step 101), inputting it into the trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A self-attention-based Chinese prosody hierarchy prediction method, the method comprising:
learning character vectors for individual characters from a large amount of unlabeled text, converting the text to be predicted into a character-vector sequence using these character vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text.
2. The method of claim 1, wherein the training of the prosody hierarchy prediction model comprises:
step 1) learning character vectors for individual characters from a large amount of unlabeled text;
step 2) converting the text corresponding to the word segmentation data into a character-vector sequence using the character vectors obtained in step 1), and deriving a word-position tag sequence from the segmentation result;
step 3) constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training it with the character-vector sequence and word-position tag sequence of the segmentation data obtained in step 2) as input and output, respectively;
step 4) converting the text corresponding to the prosody-annotated data into a character-vector sequence using the character vectors obtained in step 1), deriving a word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody hierarchy prediction model again with the character-vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody hierarchy prediction model.
3. The self-attention-based Chinese prosody hierarchy prediction method of claim 2, wherein step 1) is specifically: based on the continuous bag-of-words model (CBOW), setting the character-vector dimension to d, training on a large amount of unlabeled text to obtain initial character vectors for all individual characters in the text, and building a character table from the character-to-vector mapping.
4. The self-attention-based Chinese prosody hierarchy prediction method of claim 3, wherein step 2) further comprises:
step 2-1) looking up the character vector of each character in the character table according to the text of the word segmentation data, thereby determining the character-vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the segmented text from each character's position within its word, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively.
5. The self-attention-based Chinese prosody hierarchy prediction method of claim 4, wherein step 3) further comprises:
step 3-1) constructing a prosody hierarchy prediction model with N layers, each layer containing a feedforward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively; the prediction model has four output layers: three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions to perform word segmentation of the text;
the feedforward neural network sublayer consists of two linear projections connected by a rectified linear unit activation:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d × d_f and d_f × d, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, V, which then undergo a scaled dot-product attention operation to yield a matrix M; the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where Softmax() is the normalized exponential function and d_k is the dimension of K;
step 3-2) encoding the different positions of the input sequence with sine and cosine functions of different frequencies:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the positional encoding has the same dimension d as the input character vector, and the two are summed to form the input of the prosody hierarchy prediction model;
step 3-3) pre-training the prosody hierarchy prediction model;
iteration proceeds with the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:
C = -(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
where y ∈ {0,1} is the expected output, a ∈ [0,1] is the actual output value, x ranges over the nodes of the output layer, and n is the number of output-layer nodes; the model parameters are updated by backpropagation with stochastic gradient descent.
6. The self-attention-based Chinese prosody hierarchy prediction method of claim 5, wherein step 4) further comprises:
step 4-1) looking up the character vector of each character in the character table according to the text of the prosody-annotated data, thereby determining the character-vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody-data text from its corresponding segmentation result, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively;
step 4-3) determining the label sequence for each prosody level (prosodic word, prosodic phrase, intonation phrase) from the prosody annotations, where B denotes that a character is a prosodic boundary and NB that it is not.
7. The self-attention-based Chinese prosody hierarchy prediction method of claim 6, wherein step 5) is specifically: starting from the model pre-trained in step 3), the character-vector sequence of the prosody data is taken as model input, and the word-position tag sequence and the prosody label sequence of each level as model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers, and the model parameters are updated by backpropagation with stochastic gradient descent, yielding the trained prosody hierarchy prediction model.
8. A self-attention-based Chinese prosody hierarchy prediction system, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
CN201811571546.7A 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system Active CN111354333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Publications (2)

Publication Number Publication Date
CN111354333A 2020-06-30
CN111354333B CN111354333B (en) 2023-11-10

Family

ID=71195629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571546.7A Active CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Country Status (1)

Country Link
CN (1) CN111354333B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030149558A1 (en) * 2000-04-12 2003-08-07 Martin Holsapfel Method and device for determination of prosodic markers
US20080147405A1 (en) * 2006-12-13 2008-06-19 Fujitsu Limited Chinese prosodic words forming method and apparatus
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914551A (en) * 2020-07-29 2020-11-10 北京字节跳动网络技术有限公司 Language representation model system, pre-training method, device, equipment and medium
CN112309368A (en) * 2020-11-23 2021-02-02 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112580361A (en) * 2020-12-18 2021-03-30 蓝舰信息科技南京有限公司 Formula based on unified attention mechanism and character recognition model method
CN112863484A (en) * 2021-01-25 2021-05-28 中国科学技术大学 Training method of prosodic phrase boundary prediction model and prosodic phrase boundary prediction method
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113129862A (en) * 2021-04-22 2021-07-16 合肥工业大学 World-tacontron-based voice synthesis method and system and server
CN113129862B (en) * 2021-04-22 2024-03-12 合肥工业大学 Voice synthesis method, system and server based on world-tacotron
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113657118A (en) * 2021-08-16 2021-11-16 北京好欣晴移动医疗科技有限公司 Semantic analysis method, device and system based on call text
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text

Also Published As

Publication number Publication date
CN111354333B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN111354333B (en) Self-attention-based Chinese prosody level prediction method and system
US11797822B2 (en) Neural network having input and hidden layers of equal units
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN106502985B (en) neural network modeling method and device for generating titles
KR101950985B1 (en) Systems and methods for human inspired simple question answering (hisqa)
US11113479B2 (en) Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
KR102116518B1 (en) Apparatus for answering a question based on maching reading comprehension and method for answering a question using thereof
US20190266246A1 (en) Sequence modeling via segmentations
CN112329465A (en) Named entity identification method and device and computer readable storage medium
BR112019004524B1 (en) NEURAL NETWORK SYSTEM, ONE OR MORE NON-TRAINER COMPUTER READABLE STORAGE MEDIA AND METHOD FOR AUTOREGRESSIVELY GENERATING AN AUDIO DATA OUTPUT SEQUENCE
CN108153864A (en) Method based on neural network generation text snippet
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
US11886813B2 (en) Efficient automatic punctuation with robust inference
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN111178036B (en) Text similarity matching model compression method and system for knowledge distillation
CN114860915A (en) Model prompt learning method and device, electronic equipment and storage medium
US20220383119A1 (en) Granular neural network architecture search over low-level primitives
US11907661B2 (en) Method and apparatus for sequence labeling on entity text, and non-transitory computer-readable recording medium
US20240005131A1 (en) Attention neural networks with tree attention mechanisms
CN111026848B (en) Chinese word vector generation method based on similar context and reinforcement learning
CN113468883A (en) Fusion method and device of position information and computer readable storage medium
Heymann et al. Improving ctc using stimulated learning for sequence modeling
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system
US20230153522A1 (en) Image captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant