CN111951781A - Chinese prosody boundary prediction method based on graph-to-sequence - Google Patents

Chinese prosody boundary prediction method based on graph-to-sequence

Info

Publication number
CN111951781A
CN111951781A (application CN202010845400.8A)
Authority
CN
China
Prior art keywords
sequence
graph
information
text
prosodic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010845400.8A
Other languages
Chinese (zh)
Inventor
陈帅婷
王龙标
本多清志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010845400.8A priority Critical patent/CN111951781A/en
Publication of CN111951781A publication Critical patent/CN111951781A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a graph-to-sequence based Chinese prosody boundary prediction method comprising the following four steps: (1) word embedding feature representation: features are converted into a numerical form, the technique of mapping words into real-valued vectors being called word embedding; (2) text temporal feature extraction model: prosodic boundary labeling is treated as sequence labeling in the time dimension; (3) text spatial information: the input text sequence is processed into a graph structure, and the dependency relationships between prosodic boundaries are handled by adding spatial information; (4) spatio-temporal feature combination: the temporal information of the text is combined with its spatial information as a new feature, increasing the accuracy of the final prosodic boundary prediction.

Description

Chinese prosody boundary prediction method based on graph-to-sequence
Technical Field
The invention belongs to the field of speech synthesis and mainly relates to a technique for improving the accuracy of text prosodic boundary prediction in speech synthesis, providing better conditions for the subsequent synthesis of natural-sounding speech.
Background
Speech synthesis, or text-to-speech (TTS), converts text into speech and aims to give machines an artificial voice. Classical statistical parametric speech synthesis (SPSS) systems typically consist of three modules: a front end (which converts text into linguistic features), an acoustic model (which maps linguistic features to acoustic features), and a vocoder (which generates speech waveforms from acoustic features). For decades, statistical parametric speech synthesis was cumbersome to use because of its hand-designed features and complex inter-module communication. With the development of deep learning in recent years, however, end-to-end neural architectures have replaced the traditional modules, simplifying model design and generating speech whose clarity approaches human pronunciation. Although such synthesized speech has good sound quality, many studies show that it still sounds unnatural, overly flat, and stiff. This is mainly because, beyond clear and accurate pronunciation, the prosodic rhythm of natural speech helps listeners understand what the speaker is expressing and what emotion the speaker conveys.
In Chinese speech synthesis, unlike English where adjacent words are separated by spaces, a Chinese word can consist of one or more characters with no explicit separator between adjacent words; prosodic structure is therefore used to handle the rhythm of a sentence. In a typical Chinese speech synthesis system the prosodic structure is divided into three levels, prosodic words (PW), prosodic phrases (PP), and intonation phrases (IP), which correspond to pauses of successively increasing duration within a sentence. Typical prosody prediction methods are rule-based models and statistical models such as conditional random fields (CRF) and RNNs. In recent years, multi-task learning (MTL) structures have also been applied to prosody prediction. To date, research on Chinese prosodic boundary prediction for speech synthesis has not combined the temporal and spatial information in the text.
Disclosure of Invention
Aiming at the problem of prosodic boundary prediction accuracy in speech synthesis, the invention seeks to improve the boundary accuracy of the prosody prediction module, improve the fluency and naturalness of synthesized speech, increase its realism, and explore characteristics of the synthesis pipeline that advance speech synthesis technology.
With the development of neural networks, the combination of the bidirectional long short-term memory (BiLSTM) network and the conditional random field (CRF) has achieved good results in prosody prediction. The graph-to-sequence based Chinese prosody boundary prediction method proposed by the invention therefore takes BiLSTM-CRF as the basic framework, uses pretrained BERT for text embedding, extracts temporal information with the BiLSTM and spatial information represented by a graph structure, and fuses the spatio-temporal information through a graph-based attention network to predict Chinese prosodic boundaries.
The technical scheme is as follows: a graph-to-sequence based Chinese prosody boundary prediction method comprises the following four steps:
(1) pretrained text embedding: BERT
There are roughly 3,500 commonly used Chinese characters, yet they combine into a virtually unlimited number of sentences, so the same character often carries different meanings in different contexts. BERT is a recently proposed unsupervised pretraining method for general NLP tasks and is essentially a language model. First, BERT is based on the Transformer, which provides more structured memory for handling long-range dependencies in text. Second, as a deep bidirectional model, BERT is more powerful than left-to-right or right-to-left models and can represent the input text as word embeddings that contain contextual information;
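As a toy illustration of why a contextual encoder matters, the sketch below implements only a static character-embedding lookup: every occurrence of a character receives the same vector, which is exactly the limitation BERT removes. The vocabulary string, embedding width, and random table are invented for the example and are not part of the patent.

```python
import numpy as np

# Static, non-contextual character embedding lookup. A real system would
# use a pretrained BERT encoder, so that the same character receives
# different vectors in different contexts.
vocab = {ch: i for i, ch in enumerate("我们一起去公园玩滑梯")}  # toy vocabulary
rng = np.random.default_rng(0)
emb_table = rng.normal(size=(len(vocab), 8))  # illustrative 8-dim embeddings

def embed(sentence):
    """Map each character to its (fixed) embedding row."""
    return np.stack([emb_table[vocab[ch]] for ch in sentence])

x = embed("我们去公园")
print(x.shape)  # (5, 8): one vector per character
```

Because the lookup is static, a repeated character always yields an identical vector, whereas BERT would condition each occurrence on its surrounding text.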
(2) text temporal features
Prosodic boundaries in speech synthesis form a time series whose prediction depends on context rather than being context-independent, so temporal information is extracted with a bidirectional long short-term memory network. Combining a forward and a backward LSTM into a BiLSTM captures contextual features more effectively and extracts the context information of the input text;
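The forward/backward combination described above can be sketched in plain NumPy. The cell below is a minimal single-layer LSTM with illustrative sizes (D=8 input, H=16 hidden) and random weights, not the trained network of the invention; it only shows how the two directional passes are concatenated per time step.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16  # illustrative input and hidden sizes

def make_params():
    # Gate weights stacked as [input, forget, cell, output].
    return (rng.normal(scale=0.1, size=(4 * H, D + H)), np.zeros(4 * H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_run(xs, params):
    """Run a single-direction LSTM over xs of shape (T, D); return (T, H)."""
    W, b = params
    h, c, out = np.zeros(H), np.zeros(H), []
    for x in xs:
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out.append(h)
    return np.stack(out)

fwd, bwd = make_params(), make_params()
xs = rng.normal(size=(5, D))            # e.g. 5 character embeddings
h_f = lstm_run(xs, fwd)                 # left-to-right pass
h_b = lstm_run(xs[::-1], bwd)[::-1]     # right-to-left pass, realigned
h = np.concatenate([h_f, h_b], axis=1)  # (5, 2H) contextual features
print(h.shape)  # (5, 32)
```

Each position in `h` thus carries information from both its left and right context, which is the property the invention relies on for boundary prediction.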
(3) text spatial information
The input text sequence can be processed into a graph structure, with graph nodes representing the text content and graph edges representing syntactic and semantic connections. The input text sequence is converted into a directed graph whose nodes are the elements of the sequence, and the nodes and adjacency matrix of the graph structure are constructed;
(4) spatio-temporal feature combination
The temporal features extracted by the BiLSTM are combined with the spatial information of the text using a graph-based attention mechanism;
in the estimation stage, the prosodic boundaries are predicted from the spatio-temporal content of step (4) by a statistical model, the conditional random field.
Advantageous effects
The method not only provides a way to obtain spatial information from text in speech synthesis but also combines the temporal and spatial information of the text into a new feature, increasing the accuracy of the final prosodic boundary prediction.
The invention opens a new line of thought for the prosody prediction module in subsequent speech synthesis and contributes to research on prosody prediction in existing speech synthesis systems.
Drawings
FIG. 1 is a diagram illustrating an example prosodic structure of the sentence "Grandpa #2 accompanies #1 his grandson #2 to play on the slide #3";
PW, PP, IP, S represent prosodic words, prosodic phrases, intonation phrases and sentences, respectively;
FIG. 2 is a model framework of the present invention;
FIG. 3 is a sequence-graph structure conversion diagram;
FIG. 4 illustrates BiLSTM temporal feature extraction.
Detailed Description
The present invention will be described and demonstrated in further detail below with reference to experimental procedures and experimental results.
Building on the BiLSTM-CRF framework that is currently standard for sequence prediction, the invention introduces a representation of text spatial information from the perspective of text analysis and, on that basis, is the first to combine the temporal and spatial information of the text to improve prosodic boundary prediction in speech synthesis. The key points of the technical scheme fall into the following three parts:
(1) sequence prediction infrastructure
Currently, the most widely used method in industry for the prosody prediction module of speech synthesis is BiLSTM-CRF. The BiLSTM takes embedded text vectors as input and outputs features extracted in the time domain. The BiLSTM output is in turn the CRF input, and the CRF outputs a prediction based on these temporal features.
In the invention, the BiLSTM is mainly used to extract temporal features of the input text: its input is the BERT embedding vector with context information and its output is a feature vector carrying temporal information, as shown in FIG. 4. Each neuron passes its output to the neuron at the next time step and simultaneously emits a hidden state that the layer reuses when processing the next sample; the network can thus be viewed as a fully connected neural network with self-recurrent feedback. In tasks where temporal information matters, the long short-term memory network can therefore capture relationships between samples across long sequences and obtain the contextual features of the input text.
Given an observation sequence, the CRF has strong inference capability: it can train and infer with complex, overlapping, non-independent features, fully exploit context information as features, and freely incorporate other external features, so the model has access to very rich information. In the invention, the CRF observes the fused spatio-temporal features and finds the optimal path among all possible label sequences. During training, the model is optimized by maximizing the score of the correct tag sequence while minimizing the scores of incorrect sequences.
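The CRF's search for the optimal path among all label sequences is typically carried out with Viterbi decoding. The sketch below is a minimal NumPy version; the label set and all scores are invented for the example and do not come from the trained model.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label path for one sentence.

    emissions:   (T, K) per-position label scores from the network
    transitions: (K, K) score of moving from label i to label j
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: score of ending at label j coming from label i
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):  # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# 4 hypothetical labels: 0 = none, 1 = #1 (PW), 2 = #2 (PP), 3 = #3 (IP)
emis = np.array([[2., 0., 0., 0.],
                 [0., 3., 0., 0.],
                 [1., 0., 0., 0.]])
trans = np.zeros((4, 4))
print(viterbi_decode(emis, trans))  # [0, 1, 0]
```

With non-zero transition scores the CRF can penalize implausible label bigrams, which is exactly how sentence-level dependency between consecutive boundary labels is enforced.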
This model can effectively complete the prosody prediction task in speech synthesis and forms the basic framework of the invention.
(2) Text space information representation
In prosodic boundary prediction for speech synthesis the only input information is text; the BiLSTM extracts temporal features, but the text also carries spatial information. Converting the input text into a graph structure can be viewed as a spatial mapping that projects a time-ordered input into a spatial domain carrying syntactic information.
The input text sequence is processed into a graph structure: nodes represent the text content and edges represent the adjacency relations between words, so the prosody prediction task in speech synthesis can be modeled analogously as a graph-to-sequence process. A graph structure thus consists of a node set and edge information, and the edges can be assigned different categories as required. The invention does not derive edges from character-word or word-word relations, for the following reasons:
first, drawing graph edges from character-word and word-word relations places high demands on word segmentation accuracy for the input text; current automatic segmentation tools for speech synthesis are not perfectly accurate, so extra errors would be introduced and would negatively affect the experiment;
second, annotating the graph with character-word and word-word relations by hand would consume a great deal of time and effort;
for these two reasons the invention uses a single type of edge: the adjacency relation between words in the text sequence, where the value between two adjacent words is set to 1 and between non-adjacent words to 0. This reflects the most basic relative positional relationship within a sentence and constitutes the most basic spatial information. All edges together form the adjacency matrix.
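The adjacency construction described above reduces to marking neighbouring tokens with 1 and everything else with 0. A minimal sketch (the token list and the `chain_adjacency` helper name are illustrative, not from the patent):

```python
import numpy as np

def chain_adjacency(tokens, directed=True):
    """Adjacency matrix of the text graph: 1 between adjacent tokens, else 0."""
    n = len(tokens)
    A = np.zeros((n, n), dtype=int)
    for i in range(n - 1):
        A[i, i + 1] = 1          # edge from token i to its right neighbour
        if not directed:
            A[i + 1, i] = 1      # symmetric edge for an undirected graph
    return A

tokens = list("我们去公园")       # 5 single-character nodes
A = chain_adjacency(tokens)       # directed chain: ones on the superdiagonal
print(A)
```

The directed variant matches the invention's description of the text graph as a directed structure aligned with reading order.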
When extracting spatial features, graph state transitions are realized through message passing between connected nodes so as to capture features across the whole sentence; the invention uses a BiLSTM to avoid vanishing gradients and breaks during this recurrence, thereby realizing the transition of node states in the graph.
(3) Spatio-temporal feature combination
Temporal features are obtained by running the BiLSTM over the input text sequence, and the graph structure of the text represents its spatial features. In the invention the spatial representation of the text is a directed graph, so an alignment exists between the time domain and the spatial domain; the two kinds of features are therefore combined with an attention mechanism.
The attention mechanism mimics the internal process of biological observation: it aligns internal experience with external perception to increase the fineness with which a local region is observed. Using attention reduces the computational burden of processing high-dimensional input data by structurally selecting a subset of the input, lowers the data dimension, and lets the system focus on the salient, useful parts of the input related to the current output, improving output quality; the model thus selectively attends to useful parts of the input sequence and learns the "alignment" between them.
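A single-head, graph-masked attention step of the kind used in graph attention networks can be sketched as follows. The sizes, random weights, and the `gat_layer` helper are illustrative assumptions, not the invention's exact network; the point is that attention weights are computed only along edges of the adjacency matrix, so the temporal node features are mixed according to the text's spatial structure.

```python
import numpy as np

def gat_layer(H_nodes, A, W, a):
    """One simplified graph-attention step (single head).

    H_nodes: (N, F) node features (e.g. BiLSTM outputs per token)
    A:       (N, N) adjacency matrix (1 = edge; self-loops included)
    W:       (F, F2) shared linear map
    a:       (2 * F2,) attention vector
    """
    Z = H_nodes @ W                                   # (N, F2)
    N = Z.shape[0]
    e = np.full((N, N), -np.inf)                      # -inf masks non-edges
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else 0.2 * s     # leaky ReLU
    e = e - e.max(axis=1, keepdims=True)              # stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # rows sum to 1
    return alpha @ Z                                  # attended node features

rng = np.random.default_rng(2)
N, F, F2 = 5, 32, 16
H_nodes = rng.normal(size=(N, F))                     # per-token features
A = np.eye(N, dtype=int)                              # self-loops
for i in range(N - 1):                                # adjacent-token edges
    A[i, i + 1] = A[i + 1, i] = 1
out = gat_layer(H_nodes, A, rng.normal(size=(F, F2)), rng.normal(size=2 * F2))
print(out.shape)  # (5, 16)
```

Each output row is a convex combination of the projected features of the node and its neighbours, which is how spatial dependencies are embedded into the feature space before CRF decoding.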
The fused spatio-temporal features are then fed into the statistical model, a conditional random field (CRF), to obtain the final prediction result.
The data used in the invention comprise 82,900 sentences in total, split into training, test, and validation sets at a ratio of 8:1:1. The predicted prosodic boundaries are prosodic words (PW), prosodic phrases (PP), and intonation phrases (IP); their distribution over the training, validation, and test sets is shown in Table 1:
TABLE 1 Experimental database partitioning and basic statistics

Boundary    Training set    Validation set    Test set
#1 (PW)     272475          2581              1964
#2 (PP)     153355          2505              1696
#3 (IP)     189920          2923              2001
The specific model training parameter settings are shown in Table 2. The experiments used one K40m GPU for model training and decoding.
TABLE 2 Model architecture and training parameters (provided as an image in the original publication)
The baseline experiment used a BiLSTM-CRF model with pretrained BERT word embeddings, and the invention adopts this BERT-based BiLSTM-CRF model as its initial model. The experimental comparison shows that accuracy improves at every prosodic boundary level; the specific results are given in Table 3 below.
Comparing the accuracy of the experimental results shows that the proposed graph-to-sequence based Chinese prosody boundary prediction method improves the boundary accuracy for prosodic words (PW), prosodic phrases (PP), and intonation phrases (IP) by 1.73%, 2.16%, and 1.24% respectively, demonstrating a positive effect on prosodic boundary prediction.
TABLE 3 Results of the baseline and the invention

Accuracy      #1 (%)    #2 (%)    #3 (%)
Baseline      91.64     71.85     78.17
Invention     93.37     74.01     79.41
While the invention has been described in connection with the drawings, it is not limited to the specific embodiments above, which are illustrative rather than limiting; those skilled in the art may make many modifications without departing from the spirit of the invention, and such modifications fall within the scope of the appended claims.

Claims (5)

1. A graph-to-sequence based Chinese prosody boundary prediction method, characterized by comprising the following four steps:
(1) word embedding feature representation:
converting features into a numerical representation, the technique of mapping words into real-valued vectors being called word embedding;
(2) text temporal feature extraction model:
treating prosodic boundary labeling as sequence labeling in the time dimension;
(3) text spatial information:
processing the input text sequence into a graph structure and handling the dependency relationships between prosodic boundaries by adding spatial information;
(4) spatio-temporal feature combination:
capturing temporal and spatial information, performing recursive aggregation, learning high-level node representations, and using a GAT attention mechanism to capture spatial dependencies and embed context information into the embedding space.
2. The graph-to-sequence based Chinese prosody boundary prediction method of claim 1, wherein the specific strategy of step (3) is: the input text sequence is converted into a graph with the words of the sequence as nodes and the relations between words as edges, so that syntactic and semantic information is represented by the graph's edge information.
3. The graph-to-sequence based Chinese prosody boundary prediction method of claim 1, wherein step (4) specifically comprises feeding the temporal features extracted by the BiLSTM and the graph structure containing the spatial information into a graph-based attention network, which spatially captures the syntactic and semantic information within the sequence and fuses it with the temporal information.
4. The method of claim 1, wherein, for the prosodic boundary prediction task, dependencies exist between consecutive labels, and each sentence is modeled and decoded jointly.
5. The method of claim 4, wherein a conditional random field (CRF) layer is added at the output of the model structure and, given a set of input sequences, performs prosodic boundary prediction using sentence-level information, allowing the network to find the optimal path among all possible sequences as the final prediction result.
CN202010845400.8A 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence Pending CN111951781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845400.8A CN111951781A (en) 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010845400.8A CN111951781A (en) 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence

Publications (1)

Publication Number Publication Date
CN111951781A true CN111951781A (en) 2020-11-17

Family

ID=73358697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845400.8A Pending CN111951781A (en) 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence

Country Status (1)

Country Link
CN (1) CN111951781A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967728A (en) * 2021-05-19 2021-06-15 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and device combined with acoustic transfer function
CN114091444A (en) * 2021-11-15 2022-02-25 北京声智科技有限公司 Text processing method and device, computer equipment and storage medium
WO2022121179A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, device, and storage medium
CN116055651A (en) * 2023-01-06 2023-05-02 广东电网有限责任公司 Shared access method, device, equipment and medium for multi-center energy economic data

Citations (6)

Publication number Priority date Publication date Assignee Title
US20070112569A1 (en) * 2005-11-14 2007-05-17 Nien-Chih Wang Method for text-to-pronunciation conversion
US20140358547A1 (en) * 2013-05-28 2014-12-04 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
US20070112569A1 (en) * 2005-11-14 2007-05-17 Nien-Chih Wang Method for text-to-pronunciation conversion
US20140358547A1 (en) * 2013-05-28 2014-12-04 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium

Non-Patent Citations (2)

Title
"基于图神经网络的中文韵律边界预测" ("Chinese prosodic boundary prediction based on graph neural networks"), Wanfang database *
ILONA KOUTNY: "Prosody Prediction from Text in Hungarian and its Realization in TTS Conversion", International Journal of Speech Technology *

Cited By (6)

Publication number Priority date Publication date Assignee Title
WO2022121179A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, device, and storage medium
CN112967728A (en) * 2021-05-19 2021-06-15 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and device combined with acoustic transfer function
CN112967728B (en) * 2021-05-19 2021-07-30 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and device combined with acoustic transfer function
CN114091444A (en) * 2021-11-15 2022-02-25 北京声智科技有限公司 Text processing method and device, computer equipment and storage medium
CN116055651A (en) * 2023-01-06 2023-05-02 广东电网有限责任公司 Shared access method, device, equipment and medium for multi-center energy economic data
CN116055651B (en) * 2023-01-06 2023-11-10 广东电网有限责任公司 Shared access method, device, equipment and medium for multi-center energy economic data

Similar Documents

Publication Publication Date Title
CN111954903B (en) Multi-speaker neuro-text-to-speech synthesis
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
CN111951781A (en) Chinese prosody boundary prediction method based on graph-to-sequence
CN110060690A (en) Multi-to-multi voice conversion method based on STARGAN and ResNet
CN112509563B (en) Model training method and device and electronic equipment
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN112352275A (en) Neural text-to-speech synthesis with multi-level textual information
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN110767213A (en) Rhythm prediction method and device
CN112016271A (en) Language style conversion model training method, text processing method and device
CN111986687A (en) Bilingual emotion dialogue generation system based on interactive decoding
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
Zhao et al. Tibetan Multi-Dialect Speech and Dialect Identity Recognition.
Wang [Retracted] Research on Open Oral English Scoring System Based on Neural Network
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN113327572B (en) Controllable emotion voice synthesis method and system based on emotion type label
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion
KR102395702B1 (en) Method for providing english education service using step-by-step expanding sentence structure unit
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN114707503A (en) Front-end text analysis method based on multi-task learning
CN113257225A (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Weweler Single-Speaker End-To-End Neural Text-To-Speech Synthesis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination