CN111354333A - Chinese prosody hierarchy prediction method and system based on self-attention - Google Patents
- Publication number
- CN111354333A (application CN201811571546.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese prosody hierarchy prediction method based on self-attention, comprising the following steps: learning word vectors of single characters from a large amount of unlabeled text, converting the text to be predicted into a word-vector sequence using those vectors, inputting the sequence into a trained prosody hierarchy prediction model, and outputting the word positions and prosody hierarchy of the text. The method performs Chinese prosody hierarchy prediction with a prosody hierarchy prediction model that takes character-granularity features as input, which guarantees prediction performance while avoiding dependence on a word segmentation system and the negative effects that dependence may cause. The model directly models the relationship between any two characters in the text through a self-attention mechanism, enabling parallel computation, and its performance is improved by pre-training on extra data, so that all prosody levels of the text to be processed are predicted accurately and simultaneously, avoiding error propagation.
Description
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a Chinese prosody hierarchy prediction method and system based on self-attention.
Background
In a speech synthesis system, predicting the prosodic hierarchy of the input text to be synthesized is a crucial step; the prediction result is used as part of the linguistic features for modeling acoustic features and duration. The accuracy of prosody hierarchy prediction therefore largely determines the naturalness of the synthesized speech, so realizing accurate prosody hierarchy prediction is of great significance.
The current mainstream method uses a bidirectional long short-term memory network (BLSTM) with word vectors as input and models each prosodic level separately: one model each is trained for prosodic words, prosodic phrases, and intonation phrases, and the prediction of the lower level serves as input to the higher level, realizing step-by-step prosody prediction.
However, this approach has the following problems: 1) the LSTM, as an RNN structure, needs the output of the previous time step at every prediction step; this sequential computation prevents parallelization and makes the interaction distance between any two words O(n); 2) training and predicting a prosody model at word granularity means the input text must first be word-segmented, so the segmentation result directly affects prosody prediction performance; moreover, the Chinese vocabulary is huge, and storing its word vectors consumes large amounts of storage and computation, which is clearly impractical for offline speech synthesis; 3) step-by-step prosody prediction continuously propagates erroneous results, causing errors in subsequent predictions.
In short, prosody hierarchy prediction is an essential step in a speech synthesis system, but the current mainstream method relies on word-level features and hence on the performance of a word segmentation system, and its step-by-step prediction propagates errors.
Disclosure of Invention
The invention aims to solve, at least to some extent, the problems in the related art, and provides a prosody hierarchy prediction method that takes characters as the basic unit of the model, avoiding dependence on a word segmentation system and reducing the storage requirement, while a single model predicts all prosody levels simultaneously, solving the error propagation problem.
To this end, the present invention provides a Chinese prosody hierarchy prediction method based on self-attention, comprising:
learning a large amount of unlabelled texts to obtain word vectors of single characters, converting the texts to be predicted into word vector sequences by using the word vectors, inputting the word vector sequences into a trained prosody level prediction model, and outputting the word positions and prosody levels of the texts.
As an improvement of the above method, the training step of the prosody-level prediction model includes:
step 1) learning a large amount of unlabeled texts to obtain word vectors of single words;
step 2) converting the text corresponding to the word segmentation data into a word vector sequence by using the word vectors obtained in the step 1), and obtaining a word position marking sequence according to the word segmentation result;
step 3) constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training the prediction model by taking the word vector sequence and the word position marking sequence of the word segmentation data obtained in step 2) as input and output, respectively;
step 4) converting the text corresponding to the prosody labeling data into a character vector sequence by using the character vector obtained in the step 1), obtaining a word position marking sequence according to a corresponding word segmentation result, and obtaining a labeling sequence corresponding to each prosody level according to prosody labeling;
and 5) on the basis of the model obtained by pre-training in the step 3), training the prosody hierarchy prediction model again according to the word vector sequence, the word position marking sequence and the prosody marking sequence of the prosody data obtained in the step 4), so as to obtain the trained prosody hierarchy prediction model.
As an improvement of the above method, the step 1) is specifically: based on the continuous bag-of-words model (CBOW), setting the word-vector dimension to d, training on a large amount of unlabeled text to obtain initial word vectors for all single characters in the text, and constructing a character table from the character-to-vector initial values.
As a modification of the above method, the step 2) further comprises:
step 2-1) searching a word vector of a corresponding word in a word table searching manner according to the text information of the word segmentation data, so as to determine a word vector characteristic sequence of the corresponding text;
step 2-2) determining the word position marking sequence corresponding to the word segmentation data text according to the position of each character within its word, wherein B, M, E, S respectively indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word.
As a modification of the above method, the step 3) further comprises:
step 3-1) constructing a prosody hierarchy prediction model with N layers, wherein each layer comprises a feedforward neural network sublayer and a self-attention sublayer, each sublayer adopting a residual connection, as follows:
Y=X+SubLayer(X)
wherein X and Y denote the input and output of the sub-layer, respectively; the prediction model has four output layers: three of them respectively predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, while the fourth predicts word positions to realize word segmentation of the text;
the feedforward neural network sublayer consists of two linear projections connected in the middle by a rectified linear unit activation, with the formula:

FFN(X) = max(X·W1 + b1, 0)·W2 + b2

wherein W1 and W2 are the weight matrices of the two linear projections, with dimensions d×df and df×d respectively, and b1, b2 are bias vectors;
the self-attention sublayer adopts multi-head self-attention: for each head, the input matrix is first linearly projected to obtain three matrices Q, K, V, a scaled dot-product attention operation is then applied to them to obtain a matrix M, and the M of all heads are concatenated and linearly projected to obtain the output of the sublayer; M is calculated by the formula:

M = Softmax(Q·K^T / √dk)·V

wherein Softmax() is the normalized exponential function and dk is the dimension of K;
step 3-2) encoding the different positions of the input sequence with sine and cosine functions of different frequencies, the encoding functions being:

PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))

wherein t is the position and i is the dimension index; the position encoding has the same dimension d as the input word vector, and the two are added together as the input of the prosody hierarchy prediction model;
step 3-3) pre-training a prosodic hierarchy prediction model;
iterating with the criterion of minimizing the cross entropy between the actual output and the expected output of the word segmentation task, the cost function being:

C = −(1/n) Σx [ y·ln(a) + (1 − y)·ln(1 − a) ]

wherein y is the expected output, y ∈ {0,1}; a is the actual output value, a ∈ [0,1]; x ranges over the nodes of the output layer and n is the number of output-layer nodes; the parameters of the model are updated through a back-propagation algorithm with stochastic gradient descent.
As a modification of the above method, the step 4) further comprises:
step 4-1) searching a word vector of a corresponding word in a word table searching manner according to the text information of the prosody labeling data, so as to determine a word vector characteristic sequence of the corresponding text;
step 4-2) determining a word position marking sequence corresponding to the prosodic data text according to the corresponding word segmentation result of the prosodic data; b, M, E, S respectively indicates that the character is at the beginning of the word, the character is in the middle of the word, the character is at the end of the word and the single word;
step 4-3) determining the labeling sequence of each prosodic level (prosodic words, prosodic phrases, and intonation phrases) according to the prosodic annotations, a character being labeled B at a prosodic boundary and NB at a non-boundary.
As an improvement of the above method, the step 5) is specifically: on the basis of the model pre-trained in step 3), taking the word vector sequence of the prosody data as the model input, and the lexeme marking sequence and the prosody labeling sequence of each level as the model outputs; minimizing the sum of the cross entropies between the actual and expected outputs of every output layer as the training criterion, and updating the model parameters by a back-propagation algorithm with stochastic gradient descent, so as to obtain the trained prosody hierarchy prediction model.
The invention also provides a system for self-attention-based Chinese prosody hierarchy prediction, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the above method.
The invention has the advantages that:
1. While guaranteeing prediction performance, the prosody hierarchy prediction model of the invention takes character-granularity features as input, avoiding dependence on a word segmentation system and the negative effects that dependence may cause, while also reducing the model size;
2. the prosody hierarchy prediction model directly models the relation between any two characters in the text through a self-attention mechanism and therefore admits parallel computation; pre-training on extra data improves the model's performance and yields accurate prediction of the prosody hierarchy of the text to be processed;
3. the method adopts one model to predict multiple prosody levels simultaneously, avoiding error propagation.
Drawings
Fig. 1 is a flow chart of the Chinese prosody hierarchy prediction method based on self-attention according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a Chinese prosody prediction method based on self-attention. The method takes character vectors as input features, models the dependency relationships among the characters of a text through a self-attention mechanism, and sets an independent output layer for each prosody level, thereby predicting all prosody levels simultaneously. The method avoids dependence on a segmentation system while achieving accurate prediction of the text's prosody hierarchy.
The invention provides a self-attention-based method for constructing a Chinese prosody hierarchy prediction model, comprising: learning character vectors of single characters from a large amount of unlabeled text; obtaining the character-vector sequence and word position marking sequence of the corresponding text from the character vectors and word segmentation data; constructing a prosody prediction model based on a self-attention mechanism and pre-training it on the character-vector and word position marking sequences of the segmentation data; obtaining the character-vector sequence, word position marking sequence, and each prosody level's marking sequence of the corresponding text from the character vectors and prosody annotation data carrying segmentation information; and continuing training the pre-trained prosody hierarchy prediction model on the character-vector sequence, lexeme marking sequence, and each prosody level's marking sequence of the prosody data. The method is based on character-level features, directly models the relation between any two characters in the text through self-attention, and uses pre-training on extra data to improve model performance, thereby achieving accurate prediction of the prosody hierarchy of the text to be processed.
The method of the invention comprises the following steps:
step 1) constructing and training a prosodic hierarchy prediction model, as shown in fig. 1, the steps specifically include:
step 101), learning a large amount of unlabeled texts to obtain word vectors of single words.
Unlabeled texts are collected from corpus texts in various fields, and the characters in the texts are taken as the basic training units. Based on the continuous bag-of-words model CBOW, the dimension of the character vector is set to d, and an initial character vector is trained for each character. A character table is then constructed from the initial character-vector values.
Step 102), obtaining a word vector sequence and a word position marking sequence of the corresponding text according to the word vector and the word segmentation data.
The character-vector feature sequence is obtained by mapping each character of the segmented text to its character vector through a table-lookup operation.
The word position marking sequence is determined from the position of each character within its word in the segmentation result: B, M, E, S indicate a character at the beginning of a word, in the middle of a word, at the end of a word, or a single-character word, respectively.
Specifically, for the segmented text "Alibaba is completely different from Walmart", the lexeme tag sequence is: [B, M, M, E, S, B, M, E, B, E, S, S].
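As an illustration (not code from the patent), the lexeme tags can be derived from a word segmentation as follows; the Chinese segmentation below is one that is consistent with the tag sequence given in the text:

```python
def bmes_tags(words):
    """Map segmented words to per-character lexeme tags:
    B = word-initial, M = word-medial, E = word-final, S = single-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# A segmentation consistent with the example's tag sequence (assumed, for illustration):
segmentation = ["阿里巴巴", "与", "沃尔玛", "完全", "不", "同"]
print(bmes_tags(segmentation))
# → ['B', 'M', 'M', 'E', 'S', 'B', 'M', 'E', 'B', 'E', 'S', 'S']
```

The four-character word yields B, M, M, E; single-character words yield S, reproducing the sequence above.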
Step 103), constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training the model by utilizing the word vector characteristic sequence and the lexeme mark sequence of the participle data obtained in the step 102).
The constructed prosody hierarchy prediction model consists of N layers, each containing a feedforward neural network sublayer and a self-attention sublayer; each sublayer carries a residual connection, as in the following formula:
Y=X+SubLayer(X)
where X and Y denote the input and output of the sublayer, respectively. The model has four output layers. Three of them perform prosody hierarchy prediction, predicting prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries respectively, which realizes simultaneous multi-level prosody prediction within one model. The remaining output layer performs a word segmentation task: because prosodic boundaries are built on grammatical words, the segmentation task is introduced to supply word-level information and improve the accuracy of prosody hierarchy prediction.
Specifically, the feedforward neural network sublayer consists of two linear projections connected in the middle by a rectified linear unit activation, with the formula:

FFN(X) = max(X·W1 + b1, 0)·W2 + b2

where W1 and W2 are the weight matrices of the two linear projections, with dimensions d×df and df×d respectively, and b1, b2 are bias vectors.
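A minimal sketch of this sublayer, with its residual connection, may look as follows (illustrative only; the dimensions d, df and the random weights are made up for the example):

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    # FFN(X) = max(X·W1 + b1, 0)·W2 + b2 : two linear projections
    # joined by a rectified linear unit (ReLU) activation.
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

d, d_f, seq_len = 8, 32, 5               # model dim, inner dim, sequence length
rng = np.random.default_rng(0)
X = rng.standard_normal((seq_len, d))
W1, b1 = rng.standard_normal((d, d_f)), np.zeros(d_f)
W2, b2 = rng.standard_normal((d_f, d)), np.zeros(d)
Y = X + ffn(X, W1, b1, W2, b2)           # residual connection: Y = X + SubLayer(X)
print(Y.shape)                           # (5, 8)
```

Note the sublayer maps d back to d, so the residual addition is well defined.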
The self-attention sublayer adopts multi-head self-attention. For each head, the input matrix is first linearly projected to obtain three matrices Q, K, V; a scaled dot-product attention operation is then applied to them to obtain a matrix M; finally, the M of all heads are concatenated and linearly projected to give the output of the sublayer. M is calculated by the formula:

M = Softmax(Q·K^T / √dk)·V

where Softmax() is the normalized exponential function and dk is the dimension of K.
Because the model uses no sequential structure such as an RNN, it cannot by itself account for temporal order; therefore, sine and cosine functions of different frequencies are used to encode the positions of the input sequence, introducing the ordering relation among characters to some extent. The encoding functions are:

PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))

where t is the position and i is the dimension index. The position encoding has the same dimension d as the input word vector, and the two are added together as the model input.
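A sketch of this positional encoding, following the standard sine/cosine formulation (even dimensions take the sine term, odd dimensions the cosine term; d is assumed even):

```python
import numpy as np

def positional_encoding(seq_len, d):
    t = np.arange(seq_len)[:, None]            # positions, shape (seq_len, 1)
    i = np.arange(d // 2)[None, :]             # dimension-pair indices
    angles = t / np.power(10000.0, 2 * i / d)  # t / 10000^(2i/d)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)               # PE(t, 2i)
    pe[:, 1::2] = np.cos(angles)               # PE(t, 2i+1)
    return pe

pe = positional_encoding(50, 8)
print(pe[0, :4])   # at position 0: sine terms are 0, cosine terms are 1
```

The resulting matrix has the same dimension d as the character vectors and is simply added to them before the first layer.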
When pre-training the model, iteration proceeds under the criterion of minimizing the cross entropy between the actual and expected outputs of the word segmentation task, with cost function:

C = −(1/n) Σx [ y·ln(a) + (1 − y)·ln(1 − a) ]

where y is the expected output, y ∈ {0,1}; a is the actual output value of the network, a ∈ [0,1]; x ranges over the nodes of the output layer; and n is the number of output-layer nodes.
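The cost above is the standard binary cross-entropy averaged over output nodes; a sketch (illustrative; the eps clip guards against log(0) and is an implementation detail not stated in the source):

```python
import numpy as np

def cross_entropy_cost(y, a, eps=1e-12):
    # C = -(1/n) * sum_x [ y*ln(a) + (1-y)*ln(1-a) ]
    a = np.clip(a, eps, 1.0 - eps)   # keep log arguments strictly positive
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

y = np.array([1.0, 0.0, 1.0, 0.0])   # expected outputs
a = np.array([0.9, 0.1, 0.8, 0.2])   # actual network outputs
print(float(cross_entropy_cost(y, a)))
```

The cost approaches zero as the actual outputs approach the expected ones, and its gradients drive the back-propagation updates.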
And step 104), obtaining a word vector sequence, a word position marking sequence and each level prosody marking sequence of the corresponding text according to the word vector and the prosody marking data with the word segmentation information.
The method for obtaining the character-vector sequence and the lexeme marking sequence is the same as in step 102). The labeling sequence of each prosodic level (prosodic words, prosodic phrases, and intonation phrases) is determined from the prosodic annotations: a character is labeled B at a prosodic boundary and NB at a non-boundary.
Specifically, for the prosody-annotated text "Alibaba #1 is #1 completely different from #1 Walmart #2 #3", the prosodic-word annotation sequence is [NB, B, NB, B, NB, …].
Step 105), continuing training on the basis of the pre-training prosody level prediction model in the step 103) by utilizing the word vector sequence, the lexeme mark sequence and each level prosody mark sequence obtained in the step 104).
The character-vector sequence serves as the model input, while the lexeme marking sequence and the prosody marking sequence of each level serve as the model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers.
Step 2) converting the text to be predicted into a character-vector sequence using the character vectors from step 101), inputting the sequence into the trained prosody hierarchy prediction model, and outputting the lexemes and prosody hierarchy of the text.
Finally, it should be noted that the above embodiments are only intended to illustrate, not limit, the technical solutions of the present invention. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that various changes may be made and equivalents substituted without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (8)
1. A method for Chinese prosody hierarchy prediction based on self-attention, the method comprising:
learning a large amount of unlabelled texts to obtain word vectors of single characters, converting the texts to be predicted into word vector sequences by using the word vectors, inputting the word vector sequences into a trained prosody level prediction model, and outputting the word positions and prosody levels of the texts.
2. The method of claim 1, wherein the training of the prosodic hierarchy prediction model comprises:
step 1) learning a large amount of unlabeled texts to obtain word vectors of single words;
step 2) converting the text corresponding to the word segmentation data into a word vector sequence by using the word vectors obtained in the step 1), and obtaining a word position marking sequence according to the word segmentation result;
step 3) constructing a prosody hierarchy prediction model based on a self-attention mechanism, and pre-training the prediction model by taking the word vector sequence and the word position marking sequence of the word segmentation data obtained in step 2) as input and output, respectively;
step 4) converting the text corresponding to the prosody labeling data into a character vector sequence by using the character vector obtained in the step 1), obtaining a word position marking sequence according to a corresponding word segmentation result, and obtaining a labeling sequence corresponding to each prosody level according to prosody labeling;
and 5) on the basis of the model obtained by pre-training in the step 3), training the prosody hierarchy prediction model again according to the word vector sequence, the word position marking sequence and the prosody marking sequence of the prosody data obtained in the step 4), so as to obtain the trained prosody hierarchy prediction model.
3. The method for predicting Chinese prosody hierarchy based on self-attention as claimed in claim 2, wherein the step 1) is specifically: based on the continuous bag-of-words model (CBOW), setting the word-vector dimension to d, training on a large amount of unlabeled text to obtain initial word vectors for all single characters in the text, and constructing a character table from the character-to-vector initial values.
4. The method for Chinese prosodic hierarchy prediction based on self-attention as claimed in claim 3, wherein said step 2) further comprises:
step 2-1) searching a word vector of a corresponding word in a word table searching manner according to the text information of the word segmentation data, so as to determine a word vector characteristic sequence of the corresponding text;
step 2-2) determining the word position marking sequence corresponding to the word segmentation data text according to the position of each character within its word, wherein B, M, E, S respectively indicate that the character is at the beginning of a word, in the middle of a word, at the end of a word, or is a single-character word.
5. The method for Chinese prosodic hierarchy prediction based on self-attention as claimed in claim 4, wherein said step 3) further comprises:
step 3-1) constructing a prosody hierarchy prediction model with N layers, wherein each layer comprises a feedforward neural network sublayer and a self-attention sublayer, each sublayer adopting a residual connection, as follows:
Y=X+SubLayer(X)
wherein X and Y denote the input and output of the sub-layer, respectively; the prediction model has four output layers: three of them respectively predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, while the fourth predicts word positions to realize word segmentation of the text;
the feedforward neural network sublayer consists of two linear projections connected in the middle by a rectified linear unit activation, with the formula:

FFN(X) = max(X·W1 + b1, 0)·W2 + b2

wherein W1 and W2 are the weight matrices of the two linear projections, with dimensions d×df and df×d respectively, and b1, b2 are bias vectors;
the self-attention sublayer adopts multi-head self-attention: for each head, the input matrix is first linearly projected to obtain three matrices Q, K, V, a scaled dot-product attention operation is then applied to them to obtain a matrix M, and the M of all heads are concatenated and linearly projected to obtain the output of the sublayer; M is calculated by the formula:

M = Softmax(Q·K^T / √dk)·V

wherein Softmax() is the normalized exponential function and dk is the dimension of K;
step 3-2) encoding the different positions of the input sequence with sine and cosine functions of different frequencies, the encoding functions being:

PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))

wherein t is the position and i is the dimension index; the position encoding has the same dimension d as the input word vector, and the two are added together as the input of the prosody hierarchy prediction model;
step 3-3) pre-training a prosodic hierarchy prediction model;
iterating with the criterion of minimizing the cross entropy between the actual output and the expected output of the word segmentation task, the cost function being:

C = −(1/n) Σx [ y·ln(a) + (1 − y)·ln(1 − a) ]

wherein y is the expected output, y ∈ {0,1}; a is the actual output value, a ∈ [0,1]; x ranges over the nodes of the output layer and n is the number of output-layer nodes; the parameters of the model are updated through a back-propagation algorithm with stochastic gradient descent.
6. The method for Chinese prosodic hierarchy prediction based on self-attention as claimed in claim 5, wherein said step 4) further comprises:
step 4-1) searching a word vector of a corresponding word in a word table searching manner according to the text information of the prosody labeling data, so as to determine a word vector characteristic sequence of the corresponding text;
step 4-2) determining a word position marking sequence corresponding to the prosodic data text according to the corresponding word segmentation result of the prosodic data; b, M, E, S respectively indicates that the character is at the beginning of the word, the character is in the middle of the word, the character is at the end of the word and the single word;
step 4-3) determining the labeling sequence of each prosodic level (prosodic words, prosodic phrases, and intonation phrases) according to the prosodic annotations, a character being labeled B at a prosodic boundary and NB at a non-boundary.
7. The method for Chinese prosody hierarchy prediction based on self-attention as claimed in claim 6, wherein the step 5) is specifically: on the basis of the model pre-trained in step 3), taking the word vector sequence of the prosody data as the model input, and the lexeme marking sequence and the prosody labeling sequence of each level as the model outputs; minimizing the sum of the cross entropies between the actual and expected outputs of every output layer as the training criterion, and updating the model parameters by a back-propagation algorithm with stochastic gradient descent, so as to obtain the trained prosody hierarchy prediction model.
8. A system for Chinese prosody hierarchy prediction based on self-attention, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811571546.7A CN111354333B (en) | 2018-12-21 | 2018-12-21 | Self-attention-based Chinese prosody level prediction method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111354333A true CN111354333A (en) | 2020-06-30 |
CN111354333B CN111354333B (en) | 2023-11-10 |
Family
ID=71195629
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030149558A1 (en) * | 2000-04-12 | 2003-08-07 | Martin Holsapfel | Method and device for determination of prosodic markers |
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN105185374A (en) * | 2015-09-11 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy annotation method and device |
CN105244020A (en) * | 2015-09-24 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN107451115A (en) * | 2017-07-11 | 2017-12-08 | 中国科学院自动化研究所 | The construction method and system of Chinese Prosodic Hierarchy forecast model end to end |
CN107464559A (en) * | 2017-07-11 | 2017-12-12 | 中国科学院自动化研究所 | Joint forecast model construction method and system based on Chinese rhythm structure and stress |
CN108595590A (en) * | 2018-04-19 | 2018-09-28 | 中国科学院电子学研究所苏州研究院 | A kind of Chinese Text Categorization based on fusion attention model |
CN108874790A (en) * | 2018-06-29 | 2018-11-23 | 中译语通科技股份有限公司 | A kind of cleaning parallel corpora method and system based on language model and translation model |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914551A (en) * | 2020-07-29 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Language representation model system, pre-training method, device, equipment and medium |
CN112309368A (en) * | 2020-11-23 | 2021-02-02 | 北京有竹居网络技术有限公司 | Prosody prediction method, device, equipment and storage medium |
CN112580361A (en) * | 2020-12-18 | 2021-03-30 | 蓝舰信息科技南京有限公司 | Formula and character recognition model method based on a unified attention mechanism
CN112863484A (en) * | 2021-01-25 | 2021-05-28 | 中国科学技术大学 | Training method of prosodic phrase boundary prediction model and prosodic phrase boundary prediction method |
CN112863484B (en) * | 2021-01-25 | 2024-04-09 | 中国科学技术大学 | Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method |
CN113129862A (en) * | 2021-04-22 | 2021-07-16 | 合肥工业大学 | World-tacotron-based voice synthesis method and system and server
CN113129862B (en) * | 2021-04-22 | 2024-03-12 | 合肥工业大学 | Voice synthesis method, system and server based on world-tacotron |
CN113421550A (en) * | 2021-06-25 | 2021-09-21 | 北京有竹居网络技术有限公司 | Speech synthesis method, device, readable medium and electronic equipment |
CN113657118A (en) * | 2021-08-16 | 2021-11-16 | 北京好欣晴移动医疗科技有限公司 | Semantic analysis method, device and system based on call text |
CN113657118B (en) * | 2021-08-16 | 2024-05-14 | 好心情健康产业集团有限公司 | Semantic analysis method, device and system based on call text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||