CN111951781A - Chinese prosody boundary prediction method based on graph-to-sequence - Google Patents

Chinese prosody boundary prediction method based on graph-to-sequence

Info

Publication number
CN111951781A
CN111951781A (application CN202010845400.8A)
Authority
CN
China
Prior art keywords
sequence
graph
information
text
prosodic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010845400.8A
Other languages
Chinese (zh)
Inventor
陈帅婷
王龙标
本多清志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010845400.8A priority Critical patent/CN111951781A/en
Publication of CN111951781A publication Critical patent/CN111951781A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a graph-to-sequence based Chinese prosody boundary prediction method comprising the following four steps: (1) word embedding feature representation: features are converted into a numerical form, the technique of mapping words into real-valued vectors being called word embedding; (2) text temporal feature extraction model: prosodic boundary labeling is treated as sequence labeling in the time dimension; (3) text spatial information: the input text sequence is processed into a graph structure, and the dependency relationships between prosodic boundaries are handled by adding spatial information; (4) spatio-temporal feature combination: the temporal information of the text is combined with its spatial information as a new feature, increasing the accuracy of the final prosodic boundary prediction.

Description

Chinese prosody boundary prediction method based on graph-to-sequence
Technical Field
The invention belongs to the field of speech synthesis and mainly relates to a technique for improving the accuracy of text prosodic boundary prediction in speech synthesis, providing better conditions for the subsequent synthesis of natural-sounding speech.
Background
Speech synthesis, or text-to-speech (TTS), converts text into speech and aims to give machines an artificial voice. Classical statistical parametric speech synthesis (SPSS) systems typically consist of three modules: a front end (which converts text into linguistic features), an acoustic model (which maps linguistic features to acoustic features), and a vocoder (which generates speech waveforms from acoustic features). For decades, statistical parametric speech synthesis was cumbersome to use because of its hand-designed features and complex inter-module communication. With the development of deep learning in recent years, however, end-to-end neural architectures have replaced the traditional modules, simplifying model design and generating speech whose clarity approaches human pronunciation. Although such synthesized speech has good sound quality, many studies show that it still sounds unnatural, overly flat, and stiff. This is mainly because, beyond clear and accurate pronunciation, the prosodic rhythm of natural speech helps listeners understand what the speaker is expressing and what emotion the speaker conveys.
In Chinese speech synthesis, unlike English where adjacent words are separated by spaces, a Chinese word can consist of one or more characters with no explicit separator between adjacent words; prosodic structure is therefore used to handle the rhythm of a sentence. In a typical Chinese speech synthesis system the prosodic structure is divided into three levels, prosodic words (PW), prosodic phrases (PP), and intonation phrases (IP), which correspond to pauses of successively increasing duration within a sentence. Typical prosody prediction methods are rule-based models and statistical models such as conditional random fields (CRF) and RNNs. In recent years, multi-task learning (MTL) structures have also been applied to prosody prediction. To date, research on Chinese prosodic boundary prediction for speech synthesis has not combined the temporal and spatial information in the text.
Disclosure of Invention
Aiming at the problem of prosodic boundary prediction accuracy in speech synthesis, the invention seeks to improve the boundary accuracy of the prosody prediction module, improve the fluency and naturalness of synthesized speech, increase its realism, and explore characteristics of the synthesis pipeline that advance speech synthesis technology.
With the development of neural networks, the combination of the bidirectional long short-term memory (BiLSTM) network and the conditional random field (CRF) has achieved good results in prosody prediction. The graph-to-sequence based Chinese prosody boundary prediction method proposed by the invention therefore takes BiLSTM-CRF as the basic framework, uses pretrained BERT for text embedding, extracts temporal information with the BiLSTM and spatial information represented by a graph structure, and fuses the spatio-temporal information through a graph-based attention network to predict Chinese prosodic boundaries.
The technical scheme is as follows: a graph-to-sequence based Chinese prosody boundary prediction method comprises the following four steps:
(1) pretrained text embedding: BERT
There are roughly 3,500 commonly used Chinese characters, yet they combine into a virtually unlimited number of sentences, so the same character often carries different meanings in different contexts. BERT is a recently proposed unsupervised pretraining method for general NLP tasks and is essentially a language model. First, BERT is based on the Transformer, which provides more structured memory for handling long-range dependencies in text. Second, as a deep bidirectional model, BERT is more powerful than left-to-right or right-to-left models and can represent the input text as word embeddings that contain contextual information;
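As a toy illustration of why a contextual encoder matters, the sketch below implements only a static character-embedding lookup: every occurrence of a character receives the same vector, which is exactly the limitation BERT removes. The vocabulary string, embedding width, and random table are invented for the example and are not part of the patent.

```python
import numpy as np

# Static, non-contextual character embedding lookup. A real system would
# use a pretrained BERT encoder, so that the same character receives
# different vectors in different contexts.
vocab = {ch: i for i, ch in enumerate("我们一起去公园玩滑梯")}  # toy vocabulary
rng = np.random.default_rng(0)
emb_table = rng.normal(size=(len(vocab), 8))  # illustrative 8-dim embeddings

def embed(sentence):
    """Map each character to its (fixed) embedding row."""
    return np.stack([emb_table[vocab[ch]] for ch in sentence])

x = embed("我们去公园")
print(x.shape)  # (5, 8): one vector per character
```

Because the lookup is static, a repeated character always yields an identical vector, whereas BERT would condition each occurrence on its surrounding text.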
(2) text temporal features
Prosodic boundaries in speech synthesis form a time series whose prediction depends on context rather than being context-independent, so temporal information is extracted with a bidirectional long short-term memory network. Combining a forward and a backward LSTM into a BiLSTM captures contextual features more effectively and extracts the context information of the input text;
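The forward/backward combination described above can be sketched in plain NumPy. The cell below is a minimal single-layer LSTM with illustrative sizes (D=8 input, H=16 hidden) and random weights, not the trained network of the invention; it only shows how the two directional passes are concatenated per time step.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 8, 16  # illustrative input and hidden sizes

def make_params():
    # Gate weights stacked as [input, forget, cell, output].
    return (rng.normal(scale=0.1, size=(4 * H, D + H)), np.zeros(4 * H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_run(xs, params):
    """Run a single-direction LSTM over xs of shape (T, D); return (T, H)."""
    W, b = params
    h, c, out = np.zeros(H), np.zeros(H), []
    for x in xs:
        z = W @ np.concatenate([x, h]) + b
        i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out.append(h)
    return np.stack(out)

fwd, bwd = make_params(), make_params()
xs = rng.normal(size=(5, D))            # e.g. 5 character embeddings
h_f = lstm_run(xs, fwd)                 # left-to-right pass
h_b = lstm_run(xs[::-1], bwd)[::-1]     # right-to-left pass, realigned
h = np.concatenate([h_f, h_b], axis=1)  # (5, 2H) contextual features
print(h.shape)  # (5, 32)
```

Each position in `h` thus carries information from both its left and right context, which is the property the invention relies on for boundary prediction.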
(3) text spatial information
The input text sequence can be processed into a graph structure, with graph nodes representing the text content and graph edges representing syntactic and semantic connections. The input text sequence is converted into a directed graph whose nodes are the elements of the sequence, and the nodes and adjacency matrix of the graph structure are constructed;
(4) spatio-temporal feature combination
The temporal features extracted by the BiLSTM are combined with the spatial information of the text using a graph-based attention mechanism;
in the estimation stage, the prosodic boundaries are predicted from the spatio-temporal content of step (4) by a statistical model, the conditional random field.
Advantageous effects
The method not only provides a way to obtain spatial information from text in speech synthesis but also combines the temporal and spatial information of the text into a new feature, increasing the accuracy of the final prosodic boundary prediction.
The invention opens a new line of thought for the prosody prediction module in subsequent speech synthesis and contributes to research on prosody prediction in existing speech synthesis systems.
Drawings
FIG. 1 is a diagram illustrating an example prosodic structure of the sentence "Grandpa #2 accompanies #1 his grandson #2 to play on the slide #3";
PW, PP, IP, S represent prosodic words, prosodic phrases, intonation phrases and sentences, respectively;
FIG. 2 is a model framework of the present invention;
FIG. 3 is a sequence-graph structure conversion diagram;
FIG. 4 illustrates BiLSTM temporal feature extraction.
Detailed Description
The present invention will be described and demonstrated in further detail below with reference to experimental procedures and experimental results.
Building on the BiLSTM-CRF framework that is currently standard for sequence prediction, the invention introduces a representation of text spatial information from the perspective of text analysis and, on that basis, is the first to combine the temporal and spatial information of the text to improve prosodic boundary prediction in speech synthesis. The key points of the technical scheme fall into the following three parts:
(1) sequence prediction infrastructure
Currently, the most widely used method in industry for the prosody prediction module of speech synthesis is BiLSTM-CRF. The BiLSTM takes embedded text vectors as input and outputs features extracted in the time domain. The BiLSTM output is in turn the CRF input, and the CRF outputs a prediction based on these temporal features.
In the invention, the BiLSTM is mainly used to extract temporal features of the input text: its input is the BERT embedding vector with context information and its output is a feature vector carrying temporal information, as shown in FIG. 4. Each neuron passes its output to the neuron at the next time step and simultaneously emits a hidden state that the layer reuses when processing the next sample; the network can thus be viewed as a fully connected neural network with self-recurrent feedback. In tasks where temporal information matters, the long short-term memory network can therefore capture relationships between samples across long sequences and obtain the contextual features of the input text.
Given an observation sequence, the CRF has strong inference capability: it can train and infer with complex, overlapping, non-independent features, fully exploit context information as features, and freely incorporate other external features, so the model has access to very rich information. In the invention, the CRF observes the fused spatio-temporal features and finds the optimal path among all possible label sequences. During training, the model is optimized by maximizing the score of the correct tag sequence while minimizing the scores of incorrect sequences.
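The CRF's search for the optimal path among all label sequences is typically carried out with Viterbi decoding. The sketch below is a minimal NumPy version; the label set and all scores are invented for the example and do not come from the trained model.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label path for one sentence.

    emissions:   (T, K) per-position label scores from the network
    transitions: (K, K) score of moving from label i to label j
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: score of ending at label j coming from label i
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):  # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# 4 hypothetical labels: 0 = none, 1 = #1 (PW), 2 = #2 (PP), 3 = #3 (IP)
emis = np.array([[2., 0., 0., 0.],
                 [0., 3., 0., 0.],
                 [1., 0., 0., 0.]])
trans = np.zeros((4, 4))
print(viterbi_decode(emis, trans))  # [0, 1, 0]
```

With non-zero transition scores the CRF can penalize implausible label bigrams, which is exactly how sentence-level dependency between consecutive boundary labels is enforced.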
This model can effectively complete the prosody prediction task in speech synthesis and forms the basic framework of the invention.
(2) Text space information representation
In prosodic boundary prediction for speech synthesis the only input information is text; the BiLSTM extracts temporal features, but the text also carries spatial information. Converting the input text into a graph structure can be viewed as a spatial mapping that projects a time-ordered input into a spatial domain carrying syntactic information.
The input text sequence is processed into a graph structure: nodes represent the text content and edges represent the adjacency relations between words, so the prosody prediction task in speech synthesis can be modeled analogously as a graph-to-sequence process. A graph structure thus consists of a node set and edge information, and the edges can be assigned different categories as required. The invention does not derive edges from character-word or word-word relations, for the following reasons:
first, drawing graph edges from character-word and word-word relations places high demands on word segmentation accuracy for the input text; current automatic segmentation tools for speech synthesis are not perfectly accurate, so extra errors would be introduced and would negatively affect the experiment;
second, annotating the graph with character-word and word-word relations by hand would consume a great deal of time and effort;
for these two reasons the invention uses a single type of edge: the adjacency relation between words in the text sequence, where the value between two adjacent words is set to 1 and between non-adjacent words to 0. This reflects the most basic relative positional relationship within a sentence and constitutes the most basic spatial information. All edges together form the adjacency matrix.
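The adjacency construction described above reduces to marking neighbouring tokens with 1 and everything else with 0. A minimal sketch (the token list and the `chain_adjacency` helper name are illustrative, not from the patent):

```python
import numpy as np

def chain_adjacency(tokens, directed=True):
    """Adjacency matrix of the text graph: 1 between adjacent tokens, else 0."""
    n = len(tokens)
    A = np.zeros((n, n), dtype=int)
    for i in range(n - 1):
        A[i, i + 1] = 1          # edge from token i to its right neighbour
        if not directed:
            A[i + 1, i] = 1      # symmetric edge for an undirected graph
    return A

tokens = list("我们去公园")       # 5 single-character nodes
A = chain_adjacency(tokens)       # directed chain: ones on the superdiagonal
print(A)
```

The directed variant matches the invention's description of the text graph as a directed structure aligned with reading order.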
When extracting spatial features, graph state transitions are realized through message passing between connected nodes so as to capture features across the whole sentence; the invention uses a BiLSTM to avoid vanishing gradients and breaks during this recurrence, thereby realizing the transition of node states in the graph.
(3) Spatio-temporal feature combination
Temporal features are obtained by running the BiLSTM over the input text sequence, and the graph structure of the text represents its spatial features. In the invention the spatial representation of the text is a directed graph, so an alignment exists between the time domain and the spatial domain; the two kinds of features are therefore combined with an attention mechanism.
The attention mechanism mimics the internal process of biological observation: it aligns internal experience with external perception to increase the fineness with which a local region is observed. Using attention reduces the computational burden of processing high-dimensional input data by structurally selecting a subset of the input, lowers the data dimension, and lets the system focus on the salient, useful parts of the input related to the current output, improving output quality; the model thus selectively attends to useful parts of the input sequence and learns the "alignment" between them.
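A single-head, graph-masked attention step of the kind used in graph attention networks can be sketched as follows. The sizes, random weights, and the `gat_layer` helper are illustrative assumptions, not the invention's exact network; the point is that attention weights are computed only along edges of the adjacency matrix, so the temporal node features are mixed according to the text's spatial structure.

```python
import numpy as np

def gat_layer(H_nodes, A, W, a):
    """One simplified graph-attention step (single head).

    H_nodes: (N, F) node features (e.g. BiLSTM outputs per token)
    A:       (N, N) adjacency matrix (1 = edge; self-loops included)
    W:       (F, F2) shared linear map
    a:       (2 * F2,) attention vector
    """
    Z = H_nodes @ W                                   # (N, F2)
    N = Z.shape[0]
    e = np.full((N, N), -np.inf)                      # -inf masks non-edges
    for i in range(N):
        for j in range(N):
            if A[i, j]:
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else 0.2 * s     # leaky ReLU
    e = e - e.max(axis=1, keepdims=True)              # stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # rows sum to 1
    return alpha @ Z                                  # attended node features

rng = np.random.default_rng(2)
N, F, F2 = 5, 32, 16
H_nodes = rng.normal(size=(N, F))                     # per-token features
A = np.eye(N, dtype=int)                              # self-loops
for i in range(N - 1):                                # adjacent-token edges
    A[i, i + 1] = A[i + 1, i] = 1
out = gat_layer(H_nodes, A, rng.normal(size=(F, F2)), rng.normal(size=2 * F2))
print(out.shape)  # (5, 16)
```

Each output row is a convex combination of the projected features of the node and its neighbours, which is how spatial dependencies are embedded into the feature space before CRF decoding.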
The fused spatio-temporal features are then fed into the statistical model, a conditional random field (CRF), to obtain the final prediction result.
The data used in the invention comprise 82,900 sentences in total, split into training, test, and validation sets at a ratio of 8:1:1. The predicted prosodic boundaries are prosodic words (PW), prosodic phrases (PP), and intonation phrases (IP); their distribution over the training, validation, and test sets is shown in Table 1:
TABLE 1 Experimental database partitioning and basic statistics

Boundary    Training set    Validation set    Test set
#1 (PW)     272475          2581              1964
#2 (PP)     153355          2505              1696
#3 (IP)     189920          2923              2001
The specific model training parameter settings are shown in Table 2. The experiments used one K40m GPU for model training and decoding.
TABLE 2 Model architecture and training parameters (provided as an image in the original publication)
The baseline experiment used a BiLSTM-CRF model with pretrained BERT word embeddings, and the invention adopts this BERT-based BiLSTM-CRF model as its initial model. The experimental comparison shows that accuracy improves at every prosodic boundary level; the specific results are given in Table 3 below.
Comparing the accuracy of the experimental results shows that the proposed graph-to-sequence based Chinese prosody boundary prediction method improves the boundary accuracy for prosodic words (PW), prosodic phrases (PP), and intonation phrases (IP) by 1.73%, 2.16%, and 1.24% respectively, demonstrating a positive effect on prosodic boundary prediction.
TABLE 3 Results of the baseline and the invention

Accuracy      #1 (%)    #2 (%)    #3 (%)
Baseline      91.64     71.85     78.17
Invention     93.37     74.01     79.41
While the invention has been described in connection with the drawings, it is not limited to the specific embodiments above, which are illustrative rather than limiting; those skilled in the art may make many modifications without departing from the spirit of the invention, and such modifications fall within the scope of the appended claims.

Claims (5)

1. A graph-to-sequence based Chinese prosody boundary prediction method, characterized by comprising the following four steps:
(1) word embedding feature representation:
converting features into a numerical representation, the technique of mapping words into real-valued vectors being called word embedding;
(2) text temporal feature extraction model:
treating prosodic boundary labeling as sequence labeling in the time dimension;
(3) text spatial information:
processing the input text sequence into a graph structure and handling the dependency relationships between prosodic boundaries by adding spatial information;
(4) spatio-temporal feature combination:
capturing temporal and spatial information, performing recursive aggregation, learning high-level node representations, and using a GAT attention mechanism to capture spatial dependencies and embed context information into the embedding space.
2. The graph-to-sequence based Chinese prosody boundary prediction method of claim 1, wherein the specific strategy of step (3) is: the input text sequence is converted into a graph with the words of the sequence as nodes and the relations between words as edges, so that syntactic and semantic information is represented by the graph's edge information.
3. The graph-to-sequence based Chinese prosody boundary prediction method of claim 1, wherein step (4) specifically comprises feeding the temporal features extracted by the BiLSTM and the graph structure containing the spatial information into a graph-based attention network, which spatially captures the syntactic and semantic information within the sequence and fuses it with the temporal information.
4. The method of claim 1, wherein, for the prosodic boundary prediction task, dependencies exist between consecutive labels, and each sentence is modeled and decoded jointly.
5. The method of claim 4, wherein a conditional random field (CRF) layer is added at the output of the model structure and, given a set of input sequences, performs prosodic boundary prediction using sentence-level information, allowing the network to find the optimal path among all possible sequences as the final prediction result.
CN202010845400.8A 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence Pending CN111951781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010845400.8A CN111951781A (en) 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010845400.8A CN111951781A (en) 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence

Publications (1)

Publication Number Publication Date
CN111951781A true CN111951781A (en) 2020-11-17

Family

ID=73358697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010845400.8A Pending CN111951781A (en) 2020-08-20 2020-08-20 Chinese prosody boundary prediction method based on graph-to-sequence

Country Status (1)

Country Link
CN (1) CN111951781A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967728A (en) * 2021-05-19 2021-06-15 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and device combined with acoustic transfer function
CN114091444A (en) * 2021-11-15 2022-02-25 北京声智科技有限公司 Text processing method and device, computer equipment and storage medium
WO2022121179A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, device, and storage medium
CN116055651A (en) * 2023-01-06 2023-05-02 广东电网有限责任公司 Shared access method, device, equipment and medium for multi-center energy economic data

Citations (6)

Publication number Priority date Publication date Assignee Title
US20070112569A1 (en) * 2005-11-14 2007-05-17 Nien-Chih Wang Method for text-to-pronunciation conversion
US20140358547A1 (en) * 2013-05-28 2014-12-04 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
US20070112569A1 (en) * 2005-11-14 2007-05-17 Nien-Chih Wang Method for text-to-pronunciation conversion
US20140358547A1 (en) * 2013-05-28 2014-12-04 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium

Non-Patent Citations (2)

Title
"基于图神经网络的中文韵律边界预测" ("Chinese prosodic boundary prediction based on graph neural networks"), Wanfang database *
ILONA KOUTNY: "Prosody Prediction from Text in Hungarian and its Realization in TTS Conversion", International Journal of Speech Technology *

Cited By (6)

Publication number Priority date Publication date Assignee Title
WO2022121179A1 (en) * 2020-12-11 2022-06-16 平安科技(深圳)有限公司 Speech synthesis method and apparatus, device, and storage medium
CN112967728A (en) * 2021-05-19 2021-06-15 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and device combined with acoustic transfer function
CN112967728B (en) * 2021-05-19 2021-07-30 北京世纪好未来教育科技有限公司 End-to-end speech synthesis method and device combined with acoustic transfer function
CN114091444A (en) * 2021-11-15 2022-02-25 北京声智科技有限公司 Text processing method and device, computer equipment and storage medium
CN116055651A (en) * 2023-01-06 2023-05-02 广东电网有限责任公司 Shared access method, device, equipment and medium for multi-center energy economic data
CN116055651B (en) * 2023-01-06 2023-11-10 广东电网有限责任公司 Shared access method, device, equipment and medium for multi-center energy economic data

Similar Documents

Publication Publication Date Title
CN111954903B (en) Multi-speaker neuro-text-to-speech synthesis
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
CN111951781A (en) Chinese prosody boundary prediction method based on graph-to-sequence
CN110060690A (en) Multi-to-multi voice conversion method based on STARGAN and ResNet
CN112509563B (en) Model training method and device and electronic equipment
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN112352275A (en) Neural text-to-speech synthesis with multi-level textual information
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN110767213A (en) Rhythm prediction method and device
CN112016271A (en) Language style conversion model training method, text processing method and device
CN111986687A (en) Bilingual emotion dialogue generation system based on interactive decoding
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
Zhao et al. Tibetan Multi-Dialect Speech and Dialect Identity Recognition.
Wang [Retracted] Research on Open Oral English Scoring System Based on Neural Network
CN114333762B (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN113327572B (en) Controllable emotion voice synthesis method and system based on emotion type label
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion
KR102395702B1 (en) Method for providing english education service using step-by-step expanding sentence structure unit
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN114707503A (en) Front-end text analysis method based on multi-task learning
CN113257225A (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Weweler Single-Speaker End-To-End Neural Text-To-Speech Synthesis

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination