CN111354333A - Chinese prosody hierarchy prediction method and system based on self-attention - Google Patents

Chinese prosody hierarchy prediction method and system based on self-attention

Info

Publication number
CN111354333A
CN111354333A (application CN201811571546.7A; granted as CN111354333B)
Authority
CN
China
Prior art keywords
word
prosody
prosodic
sequence
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811571546.7A
Other languages
Chinese (zh)
Other versions
CN111354333B (en)
Inventor
张鹏远 (Zhang Pengyuan)
卢春晖 (Lu Chunhui)
颜永红 (Yan Yonghong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201811571546.7A
Publication of CN111354333A
Application granted
Publication of CN111354333B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a self-attention-based Chinese prosody hierarchy prediction method, comprising the following steps: learning character vectors for individual characters from a large amount of unlabeled text, converting the text to be predicted into a character-vector sequence using these character vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text. The method performs Chinese prosody hierarchy prediction with a single prosody hierarchy prediction model. While maintaining prediction performance, it takes character-granularity features as input, avoiding dependence on a word segmentation system and the negative effects that dependence may cause. The model directly models the relationship between any two characters in the text with a self-attention mechanism, enabling parallel computation; its performance is further improved by pre-training on additional data, so that all prosody levels of the text to be processed are predicted accurately and simultaneously, avoiding error propagation.

Description

Chinese prosody hierarchy prediction method and system based on self-attention
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a Chinese prosody hierarchy prediction method and system based on self-attention.
Background
In a speech synthesis system, predicting the prosodic hierarchy of the input text to be synthesized is a crucial step, and the prediction result is used as part of the linguistic features for modeling acoustic features and duration. The accuracy of prosody hierarchy prediction therefore largely determines the naturalness of the synthesized speech, so accurate prosody hierarchy prediction is of great significance.
The current mainstream method uses a bidirectional long short-term memory network (BLSTM) with word vectors as input and models each prosody level separately, i.e., it trains one model each for prosodic words, prosodic phrases, and intonation phrases, and feeds the prediction result of the lower level as input to the higher level, realizing step-by-step prosody prediction.
However, this approach has the following problems: 1) as an RNN structure, the LSTM requires the output of the previous time step for every prediction; this sequential computation hinders parallelization and makes the path length between any two characters O(n); 2) training and predicting a prosody model at word granularity means the input text must first undergo word segmentation, and the segmentation result directly affects the performance of prosody hierarchy prediction. Moreover, the number of Chinese vocabulary entries is huge, so storing their word vectors occupies a large amount of storage space and computational resources, which is clearly impractical for offline speech synthesis; 3) step-by-step prosody prediction passes erroneous results onward, causing errors in subsequent predictions.
Prosody hierarchy prediction is thus an essential step in a speech synthesis system, but the current mainstream method relies on word-level features, and hence on the performance of a word segmentation system, and its step-by-step prediction continuously propagates erroneous results.
Disclosure of Invention
The invention aims to solve, at least to some extent, the problems in the related art, and provides a prosody hierarchy prediction method that takes characters as the basic unit of the model, avoiding dependence on a word segmentation system and reducing the storage requirement; it further realizes simultaneous prediction of multiple prosody levels with a single model, solving the error-propagation problem.
In order to achieve the above object, the present invention provides a self-attention-based Chinese prosody hierarchy prediction method, comprising:
learning character vectors for individual characters from a large amount of unlabeled text, converting the text to be predicted into a character-vector sequence using these character vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text.
As an improvement of the above method, the training of the prosody hierarchy prediction model comprises:
step 1) learning character vectors for individual characters from a large amount of unlabeled text;
step 2) converting the text corresponding to the word segmentation data into a character-vector sequence using the character vectors obtained in step 1), and deriving a word-position tag sequence from the segmentation result;
step 3) constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training it with the character-vector sequence and word-position tag sequence of the segmentation data obtained in step 2) as input and output, respectively;
step 4) converting the text corresponding to the prosody-annotated data into a character-vector sequence using the character vectors obtained in step 1), deriving a word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody hierarchy prediction model again with the character-vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody hierarchy prediction model.
As an improvement of the above method, step 1) is specifically: based on the continuous bag-of-words model (CBOW), setting the character-vector dimension to d, training on a large amount of unlabeled text to obtain initial character vectors for all individual characters in the text, and building a character table from the character-to-vector mapping.
As an improvement of the above method, step 2) further comprises:
step 2-1) looking up the character vector of each character in the character table according to the text of the word segmentation data, thereby determining the character-vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the segmented text from each character's position within its word, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively.
As an improvement of the above method, step 3) further comprises:
step 3-1) constructing a prosody hierarchy prediction model with N layers, each layer containing a feedforward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively; the prediction model has four output layers: three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions to perform word segmentation of the text;
the feedforward neural network sublayer consists of two linear projections connected by a rectified linear unit activation:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d × d_f and d_f × d, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, V, which then undergo a scaled dot-product attention operation to yield a matrix M; the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where Softmax() is the normalized exponential function and d_k is the dimension of K;
step 3-2) encoding the different positions of the input sequence with sine and cosine functions of different frequencies:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the positional encoding has the same dimension d as the input character vector, and the two are summed to form the input of the prosody hierarchy prediction model;
step 3-3) pre-training the prosody hierarchy prediction model;
iteration proceeds with the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:
C = -(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
where y ∈ {0,1} is the expected output, a ∈ [0,1] is the actual output value, x ranges over the nodes of the output layer, and n is the number of output-layer nodes; the model parameters are updated by backpropagation with stochastic gradient descent.
As an improvement of the above method, step 4) further comprises:
step 4-1) looking up the character vector of each character in the character table according to the text of the prosody-annotated data, thereby determining the character-vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody-data text from its corresponding segmentation result, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively;
step 4-3) determining the label sequence for each prosody level (prosodic word, prosodic phrase, intonation phrase) from the prosody annotations, where B denotes that a character is a prosodic boundary and NB that it is not.
As an improvement of the above method, step 5) is specifically: starting from the model pre-trained in step 3), the character-vector sequence of the prosody data is taken as model input, and the word-position tag sequence and the prosody label sequence of each level as model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers, and the model parameters are updated by backpropagation with stochastic gradient descent, yielding the trained prosody hierarchy prediction model.
The invention also provides a self-attention-based Chinese prosody hierarchy prediction system, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the above method.
The invention has the following advantages:
1. while maintaining prediction performance, the prosody hierarchy prediction model of the invention takes character-granularity features as input, avoiding dependence on a word segmentation system and the negative effects that dependence may cause, while also reducing the model size;
2. the prosody hierarchy prediction model directly models the relationship between any two characters in the text with a self-attention mechanism, enabling parallel computation; its performance is further improved by pre-training on additional data, realizing accurate prediction of the prosody hierarchy of the text to be processed;
3. the method predicts multiple prosody levels simultaneously with a single model, avoiding error propagation.
Drawings
Fig. 1 is a flow chart of the self-attention-based Chinese prosody hierarchy prediction method according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a self-attention-based Chinese prosody prediction method. The method takes character vectors as input features, models the dependency relationships among the characters of a text through a self-attention mechanism, and sets an independent output layer for each prosody level, thereby predicting all prosody levels simultaneously. It avoids dependence on a word segmentation system while achieving accurate prediction of the text's prosody hierarchy.
The invention provides a self-attention-based method for constructing a Chinese prosody hierarchy prediction model, comprising: learning character vectors for individual characters from a large amount of unlabeled text; obtaining the character-vector sequence and word-position tag sequence of the text from the character vectors and the word segmentation data; constructing a prosody prediction model based on a self-attention mechanism and pre-training it on the character-vector sequence and word-position tag sequence of the segmentation data; obtaining the character-vector sequence, word-position tag sequence, and per-level prosody label sequences of the text from the character vectors and the prosody-annotated data carrying word segmentation information; and continuing training from the pre-trained prosody hierarchy prediction model on those sequences of the prosody data. The method is based on character-level features, models the relationship between any two characters in the text directly through a self-attention mechanism, and improves model performance by pre-training on additional data, thereby achieving accurate prediction of the prosody hierarchy of the text to be processed.
The method of the invention comprises the following steps:
Step 1) constructing and training the prosody hierarchy prediction model; as shown in Fig. 1, this step specifically includes:
Step 101), learning character vectors for individual characters from a large amount of unlabeled text.
Unlabeled text is collected from corpora in various domains, and the characters in the text serve as the basic training units. Based on the continuous bag-of-words model (CBOW), the character-vector dimension is set to d, and an initial character vector is obtained for each character by training. A character table is built from the character-to-vector mapping.
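For illustration only, such character vectors could be trained with any CBOW implementation; the minimal sketch below uses gensim's Word2Vec with sg=0 (CBOW) on character-tokenized text. The corpus file name and the dimension d = 256 are assumptions, not values from the patent.

    # Hedged sketch: training per-character CBOW vectors with gensim (assumed setup).
    from gensim.models import Word2Vec

    def char_sentences(path):
        """Yield each line of the corpus as a list of single characters."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield list(line)  # one token per character; no word segmentation needed

    d = 256  # the patent's character-vector dimension d; the value 256 is assumed
    model = Word2Vec(
        sentences=list(char_sentences("corpus.txt")),  # "corpus.txt" is hypothetical
        vector_size=d,  # embedding dimension (gensim 4.x parameter name)
        window=5,
        min_count=1,
        sg=0,           # sg=0 selects CBOW, as the patent specifies
    )
    # The "character table": a mapping from each character to its initial vector.
    char_table = {ch: model.wv[ch] for ch in model.wv.index_to_key}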
Step 102), obtaining the character-vector sequence and word-position tag sequence of the text from the character vectors and the word segmentation data.
The character-vector feature sequence is obtained by looking up each character of the segmented text in the character table and retrieving the corresponding character vector.
The word-position tag sequence is determined from each character's position within its word in the segmented text: B, M, E, S indicate a character at the beginning of a word, in the middle of a word, at the end of a word, and a single-character word, respectively.
Specifically, for the segmented text "Alibaba is completely different from Walmart", the word-position tag sequence is: [B, M, M, E, S, B, M, E, B, E, S, S].
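As an illustrative sketch (the helper name and the example segmentation are assumptions inferred from the tag sequence above, not code from the patent), the B/M/E/S word-position tags can be derived from a segmented sentence as follows:

    # Hedged sketch: deriving B/M/E/S word-position tags from segmented words.
    from typing import List

    def bmes_tags(words: List[str]) -> List[str]:
        """Return one tag per character: B(egin), M(iddle), E(nd), S(ingle)."""
        tags = []
        for w in words:
            if len(w) == 1:
                tags.append("S")
            else:
                tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
        return tags

    # Segmentation assumed from the tag sequence quoted above:
    # 阿里巴巴 / 和 / 沃尔玛 / 完全 / 不 / 同
    print(bmes_tags(["阿里巴巴", "和", "沃尔玛", "完全", "不", "同"]))
    # -> ['B', 'M', 'M', 'E', 'S', 'B', 'M', 'E', 'B', 'E', 'S', 'S']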
Step 103), constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training the model with the character-vector feature sequence and word-position tag sequence of the segmentation data obtained in step 102).
The constructed prosody hierarchy prediction model consists of N layers, each containing a feedforward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively. The model has four output layers. Three perform prosody level prediction, i.e., they predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, realizing simultaneous multi-level prosody prediction within one model. The fourth output layer performs a word segmentation task: since prosody hierarchy boundaries are built on top of grammatical words, the segmentation task is introduced to obtain word-level information and thereby improve the accuracy of prosody hierarchy prediction.
Specifically, the feedforward neural network sublayer consists of two linear projections connected by a rectified linear unit activation:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d × d_f and d_f × d, and b_1 and b_2 are bias vectors.
The self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, V, which then undergo a scaled dot-product attention operation to yield a matrix M; the M of all heads are concatenated and linearly projected to give the sublayer output. M is computed as:
M = Softmax(QK^T / √d_k)V
where Softmax() is the normalized exponential function and d_k is the dimension of K.
Because the model contains no sequence model such as an RNN, it cannot by itself capture ordering information; therefore sine and cosine functions of different frequencies are used to encode the different positions of the input sequence, introducing the ordering relationship among the characters to some extent. The encoding functions are:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index. The positional encoding has the same dimension d as the input character vector, and the two are summed to form the model input.
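A direct transcription of this encoding (illustrative only; the function name is an assumption, and an even dimension d is assumed):

    # Hedged sketch: sinusoidal positional encoding as defined above.
    import numpy as np

    def positional_encoding(seq_len: int, d: int) -> np.ndarray:
        t = np.arange(seq_len)[:, None]            # positions t = 0 .. seq_len-1
        two_i = np.arange(0, d, 2)[None, :]        # the even dimension indices 2i
        angles = t / np.power(10000.0, two_i / d)  # t / 10000^(2i/d)
        pe = np.zeros((seq_len, d))
        pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
        pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
        return pe

    # Summed with the character-vector sequence to form the model input:
    # model_input = char_vectors + positional_encoding(len(text), d)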
When the model is pre-trained, iteration proceeds with the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:
C = -(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
where y ∈ {0,1} is the expected output, a ∈ [0,1] is the actual output value of the network, x ranges over the nodes of the output layer, and n is the number of output-layer nodes.
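A direct transcription of this cost function (illustrative only; the values in the example are made up):

    # Hedged sketch: the cross-entropy cost C defined above, computed directly.
    import numpy as np

    def cross_entropy(y: np.ndarray, a: np.ndarray) -> float:
        """y: expected outputs in {0,1}; a: actual outputs in (0,1); one entry per node."""
        n = len(y)
        return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / n

    # Example: three output nodes with expected [1, 0, 1] and actual [0.9, 0.2, 0.8]
    print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.8])))  # ~0.184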
Step 104), obtaining the character-vector sequence, word-position tag sequence, and per-level prosody label sequences of the text from the character vectors and the prosody-annotated data carrying word segmentation information.
The character-vector sequence and word-position tag sequence are obtained as in step 102). The label sequence of each prosody level (prosodic word, prosodic phrase, intonation phrase) is determined from the prosody annotations: B denotes that a character is a prosodic boundary at that level, and NB that it is not.
Specifically, for the prosody-annotated text "Alibaba #1 and #1 Walmart #2 completely #1 different #3" (the same sentence as above, with #1, #2, #3 marking prosodic word, prosodic phrase, and intonation phrase boundaries, respectively), the prosodic word label sequence is [NB, NB, NB, B, B, NB, NB, B, NB, B, NB, B].
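A sketch of this derivation (the helper name and the exact marker convention are assumptions consistent with the example above):

    # Hedged sketch: deriving per-level B/NB label sequences from #1/#2/#3 markers.
    import re

    def prosody_labels(annotated: str):
        """Tag each character B or NB per level; a boundary of rank r is also a
        boundary at every lower level (e.g. #3 implies #2 and #1)."""
        pieces = re.split(r"#([1-3])", annotated)          # alternating text / rank
        chunks = list(zip(pieces[0::2], pieces[1::2]))
        levels = {1: [], 2: [], 3: []}                     # 1=PW, 2=PPH, 3=IPH
        for text, rank in chunks:
            for level, tags in levels.items():
                tags.extend(["NB"] * (len(text) - 1))      # non-final characters
                tags.append("B" if int(rank) >= level else "NB")
        return levels

    # Example with the sentence above (Chinese original assumed):
    print(prosody_labels("阿里巴巴#1和#1沃尔玛#2完全#1不同#3")[1])
    # -> ['NB', 'NB', 'NB', 'B', 'B', 'NB', 'NB', 'B', 'NB', 'B', 'NB', 'B']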
Step 105), continuing training from the model pre-trained in step 103), using the character-vector sequence, word-position tag sequence, and per-level prosody label sequences obtained in step 104).
The character-vector sequence serves as the model input, and the word-position tag sequence and per-level prosody label sequences serve as the model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers.
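A sketch of this multi-task criterion (assuming the four-head model sketched earlier; names are illustrative):

    # Hedged sketch: fine-tuning loss = sum of cross entropies over the four heads.
    import torch.nn.functional as F

    def multitask_loss(logits, targets):
        """logits: 4-tuple from the seg/PW/PPH/IPH heads, each (batch, seq, classes);
        targets: 4-tuple of class-index tensors, each (batch, seq)."""
        return sum(
            F.cross_entropy(l.transpose(1, 2), t)  # expects (batch, classes, seq)
            for l, t in zip(logits, targets)
        )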
Step 2), converting the text to be predicted into a character-vector sequence using the character vectors of step 101), inputting it into the trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A self-attention-based Chinese prosody hierarchy prediction method, the method comprising:
learning character vectors for individual characters from a large amount of unlabeled text, converting the text to be predicted into a character-vector sequence using these character vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word-position tags and prosody hierarchy of the text.
2. The method of claim 1, wherein the training of the prosody hierarchy prediction model comprises:
step 1) learning character vectors for individual characters from a large amount of unlabeled text;
step 2) converting the text corresponding to the word segmentation data into a character-vector sequence using the character vectors obtained in step 1), and deriving a word-position tag sequence from the segmentation result;
step 3) constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training it with the character-vector sequence and word-position tag sequence of the segmentation data obtained in step 2) as input and output, respectively;
step 4) converting the text corresponding to the prosody-annotated data into a character-vector sequence using the character vectors obtained in step 1), deriving a word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody hierarchy prediction model again with the character-vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody hierarchy prediction model.
3. The self-attention-based Chinese prosody hierarchy prediction method of claim 2, wherein step 1) is specifically: based on the continuous bag-of-words model (CBOW), setting the character-vector dimension to d, training on a large amount of unlabeled text to obtain initial character vectors for all individual characters in the text, and building a character table from the character-to-vector mapping.
4. The self-attention-based Chinese prosody hierarchy prediction method of claim 3, wherein step 2) further comprises:
step 2-1) looking up the character vector of each character in the character table according to the text of the word segmentation data, thereby determining the character-vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the segmented text from each character's position within its word, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively.
5. The self-attention-based Chinese prosody hierarchy prediction method of claim 4, wherein step 3) further comprises:
step 3-1) constructing a prosody hierarchy prediction model with N layers, each layer containing a feedforward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively; the prediction model has four output layers: three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions to perform word segmentation of the text;
the feedforward neural network sublayer consists of two linear projections connected by a rectified linear unit activation:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d × d_f and d_f × d, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, V, which then undergo a scaled dot-product attention operation to yield a matrix M; the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where Softmax() is the normalized exponential function and d_k is the dimension of K;
step 3-2) encoding the different positions of the input sequence with sine and cosine functions of different frequencies:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the positional encoding has the same dimension d as the input character vector, and the two are summed to form the input of the prosody hierarchy prediction model;
step 3-3) pre-training the prosody hierarchy prediction model;
iteration proceeds with the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:
C = -(1/n) Σ_x [ y ln a + (1 − y) ln(1 − a) ]
where y ∈ {0,1} is the expected output, a ∈ [0,1] is the actual output value, x ranges over the nodes of the output layer, and n is the number of output-layer nodes; the model parameters are updated by backpropagation with stochastic gradient descent.
6. The self-attention-based Chinese prosody hierarchy prediction method of claim 5, wherein step 4) further comprises:
step 4-1) looking up the character vector of each character in the character table according to the text of the prosody-annotated data, thereby determining the character-vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody-data text from its corresponding segmentation result, where B, M, E, S indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word, respectively;
step 4-3) determining the label sequence for each prosody level (prosodic word, prosodic phrase, intonation phrase) from the prosody annotations, where B denotes that a character is a prosodic boundary and NB that it is not.
7. The self-attention-based Chinese prosody hierarchy prediction method of claim 6, wherein step 5) is specifically: starting from the model pre-trained in step 3), the character-vector sequence of the prosody data is taken as model input, and the word-position tag sequence and the prosody label sequence of each level as model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers, and the model parameters are updated by backpropagation with stochastic gradient descent, yielding the trained prosody hierarchy prediction model.
8. A self-attention-based Chinese prosody hierarchy prediction system, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
CN201811571546.7A 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system Active CN111354333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Publications (2)

Publication Number Publication Date
CN111354333A 2020-06-30
CN111354333B CN111354333B (en) 2023-11-10

Family

ID=71195629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571546.7A Active CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Country Status (1)

Country Link
CN (1) CN111354333B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030149558A1 (en) * 2000-04-12 2003-08-07 Martin Holsapfel Method and device for determination of prosodic markers
US20080147405A1 (en) * 2006-12-13 2008-06-19 Fujitsu Limited Chinese prosodic words forming method and apparatus
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914551A (en) * 2020-07-29 2020-11-10 北京字节跳动网络技术有限公司 Language representation model system, pre-training method, device, equipment and medium
CN112309368A (en) * 2020-11-23 2021-02-02 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112580361A (en) * 2020-12-18 2021-03-30 蓝舰信息科技南京有限公司 Formula based on unified attention mechanism and character recognition model method
CN112863484A (en) * 2021-01-25 2021-05-28 中国科学技术大学 Training method of prosodic phrase boundary prediction model and prosodic phrase boundary prediction method
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113129862A (en) * 2021-04-22 2021-07-16 合肥工业大学 World-tacontron-based voice synthesis method and system and server
CN113129862B (en) * 2021-04-22 2024-03-12 合肥工业大学 Voice synthesis method, system and server based on world-tacotron
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113657118A (en) * 2021-08-16 2021-11-16 北京好欣晴移动医疗科技有限公司 Semantic analysis method, device and system based on call text
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text

Also Published As

Publication number Publication date
CN111354333B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN111354333B (en) Self-attention-based Chinese prosody level prediction method and system
US11797822B2 (en) Neural network having input and hidden layers of equal units
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN106502985B (en) neural network modeling method and device for generating titles
KR101950985B1 (en) Systems and methods for human inspired simple question answering (hisqa)
US11113479B2 (en) Utilizing a gated self-attention memory network model for predicting a candidate answer match to a query
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
KR102116518B1 (en) Apparatus for answering a question based on maching reading comprehension and method for answering a question using thereof
US20190266246A1 (en) Sequence modeling via segmentations
CN112329465A (en) Named entity identification method and device and computer readable storage medium
BR112019004524B1 (en) NEURAL NETWORK SYSTEM, ONE OR MORE NON-TRAINER COMPUTER READABLE STORAGE MEDIA AND METHOD FOR AUTOREGRESSIVELY GENERATING AN AUDIO DATA OUTPUT SEQUENCE
CN108153864A (en) Method based on neural network generation text snippet
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
US11886813B2 (en) Efficient automatic punctuation with robust inference
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN111178036B (en) Text similarity matching model compression method and system for knowledge distillation
CN114860915A (en) Model prompt learning method and device, electronic equipment and storage medium
US20220383119A1 (en) Granular neural network architecture search over low-level primitives
US11907661B2 (en) Method and apparatus for sequence labeling on entity text, and non-transitory computer-readable recording medium
US20240005131A1 (en) Attention neural networks with tree attention mechanisms
CN111026848B (en) Chinese word vector generation method based on similar context and reinforcement learning
CN113468883A (en) Fusion method and device of position information and computer readable storage medium
Heymann et al. Improving ctc using stimulated learning for sequence modeling
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system
US20230153522A1 (en) Image captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant