CN111354333B - Self-attention-based Chinese prosody level prediction method and system - Google Patents

Self-attention-based Chinese prosody level prediction method and system

Info

Publication number
CN111354333B
CN111354333B (application CN201811571546.7A)
Authority
CN
China
Prior art keywords
word
prosody
sequence
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811571546.7A
Other languages
Chinese (zh)
Other versions
CN111354333A (en)
Inventor
张鹏远 (Zhang Pengyuan)
卢春晖 (Lu Chunhui)
颜永红 (Yan Yonghong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201811571546.7A
Publication of CN111354333A
Application granted
Publication of CN111354333B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a self-attention-based Chinese prosody level prediction method, which comprises the following steps: learning character vectors of individual characters from a large amount of unlabeled text, converting the text to be predicted into a character vector sequence using the character vectors, inputting the character vector sequence into a trained prosody level prediction model, and outputting the word positions and prosody levels of the text. The method performs Chinese prosody level prediction with a prosody level prediction model that takes character-granularity features as input while preserving prediction performance, thereby avoiding dependence on a word segmentation system and the negative effects that dependence can introduce. The model directly models the relation between any two characters of the text with a self-attention mechanism, so the computation can be parallelized; pre-training on additional data improves model performance, so that all prosody levels of the text to be processed are predicted simultaneously and accurately, and error propagation is avoided.

Description

Self-attention-based Chinese prosody level prediction method and system
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a self-attention-based Chinese prosody level prediction method and system.
Background
In speech synthesis systems, predicting the prosody hierarchy of the input text to be synthesized is a crucial step: the prediction serves as part of the linguistic features used to model acoustic features and duration. The accuracy of prosody level prediction therefore largely determines the naturalness of the synthesized speech, and achieving accurate prosody level prediction is of real significance.
The current mainstream method uses a bidirectional long short-term memory network (BLSTM) with word vectors as input and models the prosody levels separately, i.e., it trains one model each for prosodic words, prosodic phrases, and intonation phrases, feeding the lower-level prediction result into the higher level to predict prosody step by step.
However, this approach has the following problems: 1) LSTM is an RNN structure: predicting the output at the current time step requires the output of the previous time step. This sequential computation prevents parallelization and makes the path length between any two words O(n). 2) Training and applying the prosody prediction model at word granularity means the input text must first be word-segmented, and the segmentation result directly affects prosody level prediction performance. In addition, the Chinese vocabulary is huge, and storing all the word vectors occupies substantial memory and computational resources, which is clearly impractical for offline speech synthesis. 3) Step-by-step prosody prediction can pass erroneous results onward, causing errors in the subsequent predictions.
Prosody level prediction is thus an indispensable step in a speech synthesis system, but the current mainstream method relies on word-level features and hence on the performance of a word segmentation system, and its step-by-step prosody prediction lets erroneous results propagate.
Disclosure of Invention
The present invention aims to solve the above problems of the related art at least to some extent, and proposes a prosody level prediction method that uses characters as the basic units of the model, which reduces the required storage space while avoiding reliance on a word segmentation system, and that predicts all prosody levels simultaneously with a single model, which solves the error propagation problem.
To achieve the above object, the present invention proposes a self-attention-based Chinese prosody level prediction method, the method comprising:
learning character vectors of individual characters from a large amount of unlabeled text, converting the text to be predicted into a character vector sequence using the character vectors, inputting the character vector sequence into a trained prosody level prediction model, and outputting the word positions and prosody levels of the text.
As an improvement of the above method, the training step of the prosody level prediction model includes:
step 1) learning character vectors of individual characters from a large amount of unlabeled text;
step 2) converting the text of the word-segmentation data into a character vector sequence using the character vectors obtained in step 1), and deriving the word-position tag sequence from the segmentation result;
step 3) constructing a prosody level prediction model based on the self-attention mechanism, and pre-training it with the character vector sequence and word-position tag sequence of the word-segmentation data obtained in step 2) as its input and output, respectively;
step 4) converting the text of the prosody annotation data into a character vector sequence using the character vectors obtained in step 1), deriving the word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody level prediction model again with the character vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody level prediction model.
As an improvement of the above method, step 1) specifically comprises: based on the continuous bag-of-words model (CBOW), setting the character vector dimension to d, training on a large amount of unlabeled text to obtain an initial character vector for every individual character in the text, and building a character table of character-to-vector initial values.
As an improvement of the above method, the step 2) further includes:
step 2-1) looking up the character vector of each character of the word-segmentation data text in the character table, thereby determining the character vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the word-segmentation data text from each character's position within its word, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word.
As an improvement of the above method, the step 3) further includes:
step 3-1) constructing a prosody level prediction model of N layers, each layer containing a feed-forward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y are the input and output of the sublayer, respectively; the prediction model has four output layers, of which three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions and thus performs word segmentation of the text;
the feed-forward neural network sublayer consists of two linear projections connected by a rectified linear unit as the activation function:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d×d_f and d_f×d respectively, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, and V; a scaled dot-product attention operation over the three matrices then yields a matrix M; and the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where d_k is the column dimension of K and Q, and Softmax() is the normalized exponential function;
step 3-2) encoding different positions of the input sequence using sine and cosine functions of different frequencies, the encoding functions being as follows:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the position encoding and the input character vectors both have dimension d, and their sum is taken as the input of the prosody level prediction model;
step 3-3) pre-training the prosody level prediction model:
iterating to minimize the cross entropy between the actual output and the expected output of the word segmentation task, with the cost function
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where y is the expected output, y ∈ {0, 1}; a is the actual output value, a ∈ [0, 1]; x ranges over the nodes of the output layer; and n is the number of nodes of the output layer; the model parameters are updated by the back-propagation algorithm with stochastic gradient descent.
As an improvement of the above method, the step 4) further includes:
step 4-1) looking up the character vector of each character of the prosody annotation data text in the character table, thereby determining the character vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody data text from the segmentation result of the prosody data, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word;
step 4-3) determining a label sequence for each prosody level (prosodic words, prosodic phrases, and intonation phrases) from the annotations of the prosody data, where B marks a character at a prosodic boundary and NB marks a character not at a prosodic boundary.
As an improvement of the above method, step 5) specifically comprises: starting from the model pre-trained in step 3), taking the character vector sequence of the prosody data as the model input and the word-position tag sequence together with the prosody label sequences of all levels as the model outputs; taking the sum of the cross entropies between the actual and expected outputs of all output layers as the training criterion and updating the model parameters with the back-propagation algorithm with stochastic gradient descent, to obtain the trained prosody level prediction model.
In addition, the invention also provides a self-attention-based Chinese prosody hierarchy prediction system comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the program.
The invention has the advantages that:
1. the prosody level prediction model takes character-granularity features as input while preserving prediction performance, which avoids dependence on a word segmentation system and its possible negative effects and reduces the model size;
2. the prosody level prediction model of the invention directly models the relationship between any two characters of the text with a self-attention mechanism, enabling parallel computation, and pre-training on additional data improves model performance, achieving accurate prediction of the prosody levels of the text to be processed;
3. the method predicts all prosody levels simultaneously with a single model, which avoids error propagation.
Drawings
Fig. 1 is a flowchart of the self-attention-based Chinese prosody level prediction method of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a self-attention-based Chinese prosody prediction method. The method takes character vectors as input features, models the dependencies among the characters of a text with a self-attention mechanism, and sets up an independent output layer for each prosody level so that all prosody levels are predicted simultaneously. The method achieves accurate prediction of the text's prosody levels while avoiding dependence on a word segmentation system.
The invention provides a method for constructing the self-attention-based Chinese prosody level prediction model, comprising: learning character vectors of individual characters from a large amount of unlabeled text; obtaining the character vector sequence and word-position tag sequence of the corresponding text from the character vectors and the word-segmentation data; constructing a prosody prediction model based on the self-attention mechanism and pre-training it on the character vector sequence and word-position tag sequence of the word-segmentation data; obtaining the character vector sequence, word-position tag sequence, and label sequence of each prosody level of the corresponding text from the character vectors and the prosody annotation data carrying segmentation information; and continuing training from the pre-trained prosody level prediction model with the character vector sequence, word-position tag sequence, and prosody label sequences of the prosody data. Working from character-level features, the method directly models the relationship between any two characters of the text through the self-attention mechanism and uses additional data for pre-training to improve model performance, thereby achieving accurate prediction of the prosody levels of the text to be processed.
The method of the invention comprises the following steps:
Step 1) constructing and training the prosody level prediction model, as shown in Fig. 1; this step specifically includes:
Step 101) learning character vectors of individual characters from a large amount of unlabeled text.
The unlabeled text is collected from corpus texts of various domains. With the characters of the text as the basic training units and the character vector dimension set to d, an initial character vector is trained for every character based on the continuous bag-of-words model (CBOW), and a character table of character-to-vector initial values is built.
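For illustration, this step can be sketched as follows. The sketch assumes the gensim toolkit (the patent names no toolkit), where sg=0 selects CBOW training; the corpus path corpus.txt and the value of d are hypothetical placeholders.

```python
# A minimal sketch of step 101: training initial character vectors with CBOW.
from gensim.models import Word2Vec

d = 256  # character-vector dimension d (illustrative value)

# Treat every character as one token, so no word segmenter is needed.
sentences = []
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        chars = [c for c in line.strip() if not c.isspace()]
        if chars:
            sentences.append(chars)

# sg=0 selects the continuous bag-of-words (CBOW) training objective.
cbow = Word2Vec(sentences, vector_size=d, sg=0, window=5, min_count=1)

# Character table: character -> initial character vector.
char_table = {ch: cbow.wv[ch] for ch in cbow.wv.index_to_key}
```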
Step 102), acquiring a character vector sequence and a word position mark sequence of the corresponding text according to the character vector and the word segmentation data.
The character vector feature sequence is obtained by looking up the character vector of each character of the segmented text in the character table.
The word-position tag sequence is determined from each character's position within its word in the segmented text: B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word.
Specifically, for the segmented text 阿里巴巴和沃尔玛完全不同 ("Alibaba and Walmart are completely different"), the word-position tag sequence is: [B, M, M, E, S, B, M, E, B, E, S, S].
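Such a tag sequence can be derived mechanically from a segmented sentence; the following is a small sketch (the function name bmes_tags is illustrative, not from the patent):

```python
# Derive the B/M/E/S word-position tags of step 102 from a list of words.
def bmes_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # begin/middle/end
    return tags

print(bmes_tags(["阿里巴巴", "和", "沃尔玛"]))  # ['B', 'M', 'M', 'E', 'S', 'B', 'M', 'E']
```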
Step 103), constructing a prosody level prediction model based on a self-attention mechanism, and pre-training the model by utilizing the character vector feature sequence and the word position mark sequence of the word segmentation data obtained in the step 102).
The constructed prosody level prediction model consists of N layers, each containing a feed-forward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y are the input and output of the sublayer, respectively. The model has four output layers. Three of them predict the prosody levels, i.e., prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries respectively, so that all prosody levels are predicted simultaneously within one model. The fourth output layer performs the word segmentation task: since prosody level boundaries are built on grammatical words, introducing the segmentation task supplies word-level information that improves the accuracy of prosody level prediction.
Specifically, the feed-forward neural network sublayer consists of two linear projections connected by a rectified linear unit as the activation function:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d×d_f and d_f×d respectively, and b_1 and b_2 are bias vectors.
The self-attention sublayer uses multi-head self-attention. For each head, the input matrix is first linearly projected into three matrices Q, K, and V; a scaled dot-product attention operation over the three matrices then yields a matrix M; and the M of all heads are concatenated and linearly projected to give the sublayer output. M is computed as:
M = Softmax(QK^T / √d_k)V
where d_k is the column dimension of K and Q, and Softmax() is the normalized exponential function.
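Read literally, the two sublayers can be sketched in PyTorch as below. This is an illustrative reading of the description rather than the patent's reference implementation: layer normalization is omitted because the text specifies only the residual connection Y = X + SubLayer(X), and every hyperparameter value is an assumption.

```python
# Sketch of one encoder layer: multi-head self-attention and feed-forward
# sublayers, each wrapped in a residual connection Y = X + SubLayer(X).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0
        self.h, self.d_k = h, d // h
        self.proj_q = nn.Linear(d, d)  # linear projections producing Q, K, V
        self.proj_k = nn.Linear(d, d)
        self.proj_v = nn.Linear(d, d)
        self.proj_o = nn.Linear(d, d)  # projection after concatenating heads

    def forward(self, x):  # x: (batch, seq_len, d)
        b, t, d = x.shape
        def split(m):  # (b, t, d) -> (b, h, t, d_k)
            return m.view(b, t, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.proj_q(x)), split(self.proj_k(x)), split(self.proj_v(x))
        # Scaled dot-product attention: M = Softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        m = torch.softmax(scores, dim=-1) @ v
        m = m.transpose(1, 2).reshape(b, t, d)  # concatenate all heads
        return self.proj_o(m)

class EncoderLayer(nn.Module):
    def __init__(self, d, d_f, h):
        super().__init__()
        self.attn = SelfAttention(d, h)
        self.w1, self.w2 = nn.Linear(d, d_f), nn.Linear(d_f, d)

    def forward(self, x):
        x = x + self.attn(x)  # residual around the self-attention sublayer
        # FFN(X) = max(X W1 + b1, 0) W2 + b2, residual around the FFN sublayer
        return x + self.w2(F.relu(self.w1(x)))
```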
The model contains no sequential structure such as an RNN and thus cannot capture ordering information by itself, so the positions of the input sequence are encoded with sine and cosine functions of different frequencies to introduce, to some extent, the ordering relation among the characters; the encoding functions are:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index. The position encoding and the input character vectors both have dimension d, and their sum is taken as the model input.
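A NumPy sketch of this encoding (here the loop index j runs over the even dimensions, playing the role of 2i in the formula):

```python
# Sinusoidal position encoding; returns a (seq_len, d) matrix that is
# added element-wise to the character-vector sequence.
import numpy as np

def position_encoding(seq_len, d):
    pe = np.zeros((seq_len, d))
    for t in range(seq_len):
        for j in range(0, d, 2):  # j corresponds to 2i in the formula
            angle = t / (10000 ** (j / d))
            pe[t, j] = np.sin(angle)
            if j + 1 < d:
                pe[t, j + 1] = np.cos(angle)
    return pe
```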
During pre-training, the model iterates to minimize the cross entropy between the actual output and the expected output of the word segmentation task, with the cost function
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where y is the expected output, y ∈ {0, 1}; a is the actual output value of the network, a ∈ [0, 1]; x ranges over the nodes of the output layer; and n is the number of nodes of the output layer. The parameters of the neural network are updated by the back-propagation algorithm with stochastic gradient descent.
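Continuing the PyTorch sketch above, the following shows the four-output-layer model and one pre-training update on the segmentation task; the class counts (four word-position tags, two boundary labels), layer count, and learning rate are illustrative assumptions.

```python
# Sketch of the full model: N stacked encoder layers shared by four
# per-character softmax output layers (uses EncoderLayer from above).
class ProsodyModel(nn.Module):
    def __init__(self, d=256, d_f=1024, h=4, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(EncoderLayer(d, d_f, h) for _ in range(n_layers))
        self.seg_head = nn.Linear(d, 4)  # word positions: B/M/E/S
        self.pw_head = nn.Linear(d, 2)   # prosodic word boundary: B/NB
        self.pph_head = nn.Linear(d, 2)  # prosodic phrase boundary: B/NB
        self.iph_head = nn.Linear(d, 2)  # intonation phrase boundary: B/NB

    def forward(self, x):  # x: character vectors plus position encoding
        for layer in self.layers:
            x = layer(x)
        return self.seg_head(x), self.pw_head(x), self.pph_head(x), self.iph_head(x)

model = ProsodyModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent

def pretrain_step(char_vecs, seg_tags):
    # Pre-training: only the word-segmentation head contributes to the loss.
    seg_logits, _, _, _ = model(char_vecs)
    loss = F.cross_entropy(seg_logits.reshape(-1, 4), seg_tags.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()  # back-propagation update
    return loss.item()
```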
Step 104) obtaining the character vector sequence, word-position tag sequence, and prosody label sequences of all levels of the corresponding text from the character vectors and the prosody annotation data carrying segmentation information.
The character vector sequence and word-position tag sequence are obtained in the same way as in step 102). The label sequences of the prosody levels (prosodic words, prosodic phrases, and intonation phrases) are determined by the prosody annotations: B marks a character at a prosodic boundary and NB marks a character not at a prosodic boundary.
Specifically, for the prosody-annotated text 阿里巴巴#1和#1沃尔玛#2完全#1不同#3 ("Alibaba #1 and #1 Walmart #2 completely #1 different #3"), the prosodic word label sequence is [NB, NB, NB, B, B, NB, NB, B, NB, B, NB, B], the prosodic phrase label sequence is [NB, NB, NB, NB, NB, NB, NB, B, NB, NB, NB, B], and the intonation phrase label sequence is [NB, NB, NB, NB, NB, NB, NB, NB, NB, NB, NB, B].
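A sketch of deriving the three label sequences from such "#k" annotations, assuming the levels are nested (#3 implies #2 implies #1), which is consistent with the example above:

```python
# Turn '#k' prosody annotations into per-character B/NB label sequences for
# prosodic words (pw), prosodic phrases (pph), and intonation phrases (iph).
import re

def prosody_labels(annotated):
    pw, pph, iph = [], [], []
    for piece, level in re.findall(r"([^#]+)#([123])", annotated):
        k = int(level)
        for seq, min_level in ((pw, 1), (pph, 2), (iph, 3)):
            seq.extend(["NB"] * (len(piece) - 1))        # inner characters
            seq.append("B" if k >= min_level else "NB")  # boundary character
    return pw, pph, iph

pw, pph, iph = prosody_labels("阿里巴巴#1和#1沃尔玛#2完全#1不同#3")
```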
Step 105) continuing training from the model pre-trained in step 103) with the character vector sequence, word-position tag sequence, and prosody label sequences of all levels obtained in step 104).
The character vector sequence is the model input; the word-position tag sequence and the prosody label sequences of all levels are the model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers.
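Continuing the sketch above, one fine-tuning update with the summed cross entropy of the four output layers might look like:

```python
# Fine-tuning of step 105: sum of the cross entropies of all four heads.
def finetune_step(char_vecs, seg_t, pw_t, pph_t, iph_t):
    seg, pw, pph, iph = model(char_vecs)
    loss = (F.cross_entropy(seg.reshape(-1, 4), seg_t.reshape(-1))
            + F.cross_entropy(pw.reshape(-1, 2), pw_t.reshape(-1))
            + F.cross_entropy(pph.reshape(-1, 2), pph_t.reshape(-1))
            + F.cross_entropy(iph.reshape(-1, 2), iph_t.reshape(-1)))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```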
Step 2) converting the text to be predicted into a character vector sequence with the character vectors of step 101), inputting it into the trained prosody level prediction model, and outputting the word positions and prosody levels of the text.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (6)

1. A self-attention-based Chinese prosody level prediction method, the method comprising:
learning character vectors of individual characters from a large amount of unlabeled text, converting the text to be predicted into a character vector sequence using the character vectors, inputting the character vector sequence into a trained prosody level prediction model, and outputting the word positions and prosody levels of the text;
the training step of the prosody level prediction model comprises the following steps:
step 1) learning character vectors of individual characters from a large amount of unlabeled text;
step 2) converting the text of the word-segmentation data into a character vector sequence using the character vectors obtained in step 1), and deriving the word-position tag sequence from the segmentation result;
step 3) constructing a prosody level prediction model based on the self-attention mechanism, and pre-training it with the character vector sequence and word-position tag sequence of the word-segmentation data obtained in step 2) as its input and output, respectively;
step 4) converting the text of the prosody annotation data into a character vector sequence using the character vectors obtained in step 1), deriving the word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody level prediction model again with the character vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody level prediction model;
the step 3) further comprises:
step 3-1) constructing a prosody level prediction model of N layers, each layer containing a feed-forward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y are the input and output of the sublayer, respectively; the prediction model has four output layers, of which three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions and thus performs word segmentation of the text;
the feed-forward neural network sublayer consists of two linear projections connected by a rectified linear unit as the activation function:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d×d_f and d_f×d respectively, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, and V; a scaled dot-product attention operation over the three matrices then yields a matrix M; and the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where d_k is the column dimension of K and Q, and Softmax() is the normalized exponential function;
step 3-2) encoding different positions of the input sequence using sine and cosine functions of different frequencies, the encoding functions being as follows:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the position encoding and the input character vectors both have dimension d, and their sum is taken as the input of the prosody level prediction model;
step 3-3) pre-training the prosody level prediction model:
iterating to minimize the cross entropy between the actual output and the expected output of the word segmentation task, with the cost function
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where y is the expected output, y ∈ {0, 1}; a is the actual output value, a ∈ [0, 1]; x ranges over the nodes of the output layer; and n is the number of nodes of the output layer; the model parameters are updated by the back-propagation algorithm with stochastic gradient descent.
2. The method according to claim 1, wherein step 1) specifically comprises: based on the continuous bag-of-words model (CBOW), setting the character vector dimension to d, training on a large amount of unlabeled text to obtain an initial character vector for every individual character in the text, and building a character table of character-to-vector initial values.
3. The method according to claim 2, wherein said step 2) further comprises:
step 2-1) looking up the character vector of each character of the word-segmentation data text in the character table, thereby determining the character vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the word-segmentation data text from each character's position within its word, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word.
4. The method of claim 3, wherein the step 4) further comprises:
step 4-1) looking up the character vector of each character of the prosody annotation data text in the character table, thereby determining the character vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody data text from the segmentation result of the prosody data, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word;
step 4-3) determining a label sequence for each prosody level (prosodic words, prosodic phrases, and intonation phrases) from the annotations of the prosody data, where B marks a character at a prosodic boundary and NB marks a character not at a prosodic boundary.
5. The method according to claim 4, wherein step 5) specifically comprises: starting from the model pre-trained in step 3), taking the character vector sequence of the prosody data as the model input and the word-position tag sequence together with the prosody label sequences of all levels as the model outputs; taking the sum of the cross entropies between the actual and expected outputs of all output layers as the training criterion and updating the model parameters with the back-propagation algorithm with stochastic gradient descent, to obtain the trained prosody level prediction model.
6. A self-attention-based Chinese prosody hierarchy prediction system comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the program.
CN201811571546.7A 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system Active CN111354333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Publications (2)

Publication Number Publication Date
CN111354333A CN111354333A (en) 2020-06-30
CN111354333B true CN111354333B (en) 2023-11-10

Family

ID=71195629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571546.7A Active CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Country Status (1)

Country Link
CN (1) CN111354333B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914551B (en) * 2020-07-29 2022-05-20 北京字节跳动网络技术有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112309368A (en) * 2020-11-23 2021-02-02 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112580361A (en) * 2020-12-18 2021-03-30 蓝舰信息科技南京有限公司 Formula based on unified attention mechanism and character recognition model method
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113129862B (en) * 2021-04-22 2024-03-12 合肥工业大学 Voice synthesis method, system and server based on world-tacotron
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10018134A1 (en) * 2000-04-12 2001-10-18 Siemens Ag Determining prosodic markings for text-to-speech systems - using neural network to determine prosodic markings based on linguistic categories such as number, verb, verb particle, pronoun, preposition etc.
CN101202041B (en) * 2006-12-13 2011-01-05 富士通株式会社 Method and device for making words using Chinese rhythm words

Also Published As

Publication number Publication date
CN111354333A (en) 2020-06-30

Similar Documents

Publication Publication Date Title
CN111354333B (en) Self-attention-based Chinese prosody level prediction method and system
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11797822B2 (en) Neural network having input and hidden layers of equal units
CN106502985B (en) neural network modeling method and device for generating titles
De Mori Spoken language understanding: A survey
JP6222821B2 (en) Error correction model learning device and program
CN111144110B (en) Pinyin labeling method, device, server and storage medium
CN106910497B (en) Chinese word pronunciation prediction method and device
JP2020505650A (en) Voice recognition system and voice recognition method
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
US11886813B2 (en) Efficient automatic punctuation with robust inference
KR20180001889A (en) Language processing method and apparatus
KR20190101567A (en) Apparatus for answering a question based on maching reading comprehension and method for answering a question using thereof
JP2019159654A (en) Time-series information learning system, method, and neural network model
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN114860915A (en) Model prompt learning method and device, electronic equipment and storage medium
JP7466784B2 (en) Training Neural Networks Using Graph-Based Temporal Classification
JP6973192B2 (en) Devices, methods and programs that utilize the language model
Krantz et al. Language-agnostic syllabification with neural sequence labeling
US20240005131A1 (en) Attention neural networks with tree attention mechanisms
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion
KR102436900B1 (en) Apparatus and method for evaluating sentense by using bidirectional language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant