CN111354333B - Self-attention-based Chinese prosody level prediction method and system - Google Patents

Self-attention-based Chinese prosody level prediction method and system

Info

Publication number
CN111354333B
CN111354333B (application CN201811571546.7A)
Authority
CN
China
Prior art keywords
word
prosody
sequence
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811571546.7A
Other languages
Chinese (zh)
Other versions
CN111354333A (en)
Inventor
张鹏远 (Zhang Pengyuan)
卢春晖 (Lu Chunhui)
颜永红 (Yan Yonghong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd filed Critical Institute of Acoustics CAS
Priority to CN201811571546.7A
Publication of CN111354333A
Application granted
Publication of CN111354333B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a self-attention-based Chinese prosody level prediction method, which comprises the following steps: learning character vectors of individual characters from a large amount of unlabeled text, converting the text to be predicted into a character vector sequence using the character vectors, inputting the character vector sequence into a trained prosody level prediction model, and outputting the word positions and prosody levels of the text. The method performs Chinese prosody level prediction with a prosody level prediction model that takes character-granularity features as input while preserving prediction performance, thereby avoiding dependence on a word segmentation system and the negative effects that dependence can introduce. The model directly models the relation between any two characters of the text with a self-attention mechanism, so the computation can be parallelized; pre-training on additional data improves model performance, so that all prosody levels of the text to be processed are predicted simultaneously and accurately, and error propagation is avoided.

Description

Self-attention-based Chinese prosody level prediction method and system
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a self-attention-based Chinese prosody level prediction method and system.
Background
In speech synthesis systems, predicting the prosody hierarchy of the input text to be synthesized is a crucial step: the prediction serves as part of the linguistic features used to model acoustic features and duration. The accuracy of prosody level prediction therefore largely determines the naturalness of the synthesized speech, and achieving accurate prosody level prediction is of real significance.
The current mainstream method uses a bidirectional long short-term memory network (BLSTM) with word vectors as input and models the prosody levels separately, i.e., it trains one model each for prosodic words, prosodic phrases, and intonation phrases, feeding the lower-level prediction result into the higher level to predict prosody step by step.
However, this approach has the following problems: 1) LSTM is an RNN structure: predicting the output at the current time step requires the output of the previous time step. This sequential computation prevents parallelization and makes the path length between any two words O(n). 2) Training and applying the prosody prediction model at word granularity means the input text must first be word-segmented, and the segmentation result directly affects prosody level prediction performance. In addition, the Chinese vocabulary is huge, and storing all the word vectors occupies substantial memory and computational resources, which is clearly impractical for offline speech synthesis. 3) Step-by-step prosody prediction can pass erroneous results onward, causing errors in the subsequent predictions.
Prosody level prediction is thus an indispensable step in a speech synthesis system, but the current mainstream method relies on word-level features and hence on the performance of a word segmentation system, and its step-by-step prosody prediction lets erroneous results propagate.
Disclosure of Invention
The present invention aims to solve the above problems of the related art at least to some extent, and proposes a prosody level prediction method that uses characters as the basic units of the model, which reduces the required storage space while avoiding reliance on a word segmentation system, and that predicts all prosody levels simultaneously with a single model, which solves the error propagation problem.
To achieve the above object, the present invention proposes a self-attention-based Chinese prosody level prediction method, the method comprising:
learning character vectors of individual characters from a large amount of unlabeled text, converting the text to be predicted into a character vector sequence using the character vectors, inputting the character vector sequence into a trained prosody level prediction model, and outputting the word positions and prosody levels of the text.
As an improvement of the above method, the training step of the prosody level prediction model includes:
step 1) learning character vectors of individual characters from a large amount of unlabeled text;
step 2) converting the text of the word-segmentation data into a character vector sequence using the character vectors obtained in step 1), and deriving the word-position tag sequence from the segmentation result;
step 3) constructing a prosody level prediction model based on the self-attention mechanism, and pre-training it with the character vector sequence and word-position tag sequence of the word-segmentation data obtained in step 2) as its input and output, respectively;
step 4) converting the text of the prosody annotation data into a character vector sequence using the character vectors obtained in step 1), deriving the word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody level prediction model again with the character vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody level prediction model.
As an improvement of the above method, step 1) specifically comprises: based on the continuous bag-of-words model (CBOW), setting the character vector dimension to d, training on a large amount of unlabeled text to obtain an initial character vector for every individual character in the text, and building a character table of character-to-vector initial values.
As an improvement of the above method, the step 2) further includes:
step 2-1) looking up the character vector of each character of the word-segmentation data text in the character table, thereby determining the character vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the word-segmentation data text from each character's position within its word, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word.
As an improvement of the above method, the step 3) further includes:
step 3-1) constructing a prosody level prediction model of N layers, each layer containing a feed-forward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y are the input and output of the sublayer, respectively; the prediction model has four output layers, of which three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions and thus performs word segmentation of the text;
the feed-forward neural network sublayer consists of two linear projections connected by a rectified linear unit as the activation function:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d×d_f and d_f×d respectively, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, and V; a scaled dot-product attention operation over the three matrices then yields a matrix M; and the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where d_k is the column dimension of K and Q, and Softmax() is the normalized exponential function;
step 3-2) encoding different positions of the input sequence using sine and cosine functions of different frequencies, the encoding functions being as follows:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the position encoding and the input character vectors both have dimension d, and their sum is taken as the input of the prosody level prediction model;
step 3-3) pre-training the prosody level prediction model:
iterating to minimize the cross entropy between the actual output and the expected output of the word segmentation task, with the cost function
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where y is the expected output, y ∈ {0, 1}; a is the actual output value, a ∈ [0, 1]; x ranges over the nodes of the output layer; and n is the number of nodes of the output layer; the model parameters are updated by the back-propagation algorithm with stochastic gradient descent.
As an improvement of the above method, the step 4) further includes:
step 4-1) looking up the character vector of each character of the prosody annotation data text in the character table, thereby determining the character vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody data text from the segmentation result of the prosody data, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word;
step 4-3) determining a label sequence for each prosody level (prosodic words, prosodic phrases, and intonation phrases) from the annotations of the prosody data, where B marks a character at a prosodic boundary and NB marks a character not at a prosodic boundary.
As an improvement of the above method, step 5) specifically comprises: starting from the model pre-trained in step 3), taking the character vector sequence of the prosody data as the model input and the word-position tag sequence together with the prosody label sequences of all levels as the model outputs; taking the sum of the cross entropies between the actual and expected outputs of all output layers as the training criterion and updating the model parameters with the back-propagation algorithm with stochastic gradient descent, to obtain the trained prosody level prediction model.
In addition, the invention also provides a self-attention-based Chinese prosody hierarchy prediction system comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the program.
The invention has the advantages that:
1. the prosody level prediction model takes character-granularity features as input while preserving prediction performance, which avoids dependence on a word segmentation system and its possible negative effects and reduces the model size;
2. the prosody level prediction model of the invention directly models the relationship between any two characters of the text with a self-attention mechanism, enabling parallel computation, and pre-training on additional data improves model performance, achieving accurate prediction of the prosody levels of the text to be processed;
3. the method predicts all prosody levels simultaneously with a single model, which avoids error propagation.
Drawings
Fig. 1 is a flowchart of the self-attention-based Chinese prosody level prediction method of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a self-attention-based Chinese prosody prediction method. The method takes character vectors as input features, models the dependencies among the characters of a text with a self-attention mechanism, and sets up an independent output layer for each prosody level so that all prosody levels are predicted simultaneously. The method achieves accurate prediction of the text's prosody levels while avoiding dependence on a word segmentation system.
The invention provides a method for constructing the self-attention-based Chinese prosody level prediction model, comprising: learning character vectors of individual characters from a large amount of unlabeled text; obtaining the character vector sequence and word-position tag sequence of the corresponding text from the character vectors and the word-segmentation data; constructing a prosody prediction model based on the self-attention mechanism and pre-training it on the character vector sequence and word-position tag sequence of the word-segmentation data; obtaining the character vector sequence, word-position tag sequence, and label sequence of each prosody level of the corresponding text from the character vectors and the prosody annotation data carrying segmentation information; and continuing training from the pre-trained prosody level prediction model with the character vector sequence, word-position tag sequence, and prosody label sequences of the prosody data. Working from character-level features, the method directly models the relationship between any two characters of the text through the self-attention mechanism and uses additional data for pre-training to improve model performance, thereby achieving accurate prediction of the prosody levels of the text to be processed.
The method of the invention comprises the following steps:
Step 1) constructing and training the prosody level prediction model, as shown in Fig. 1; this step specifically includes:
Step 101) learning character vectors of individual characters from a large amount of unlabeled text.
The unlabeled text is collected from corpus texts of various domains. With the characters of the text as the basic training units and the character vector dimension set to d, an initial character vector is trained for every character based on the continuous bag-of-words model (CBOW), and a character table of character-to-vector initial values is built.
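For illustration, this step can be sketched as follows. The sketch assumes the gensim toolkit (the patent names no toolkit), where sg=0 selects CBOW training; the corpus path corpus.txt and the value of d are hypothetical placeholders.

```python
# A minimal sketch of step 101: training initial character vectors with CBOW.
from gensim.models import Word2Vec

d = 256  # character-vector dimension d (illustrative value)

# Treat every character as one token, so no word segmenter is needed.
sentences = []
with open("corpus.txt", encoding="utf-8") as f:  # hypothetical corpus file
    for line in f:
        chars = [c for c in line.strip() if not c.isspace()]
        if chars:
            sentences.append(chars)

# sg=0 selects the continuous bag-of-words (CBOW) training objective.
cbow = Word2Vec(sentences, vector_size=d, sg=0, window=5, min_count=1)

# Character table: character -> initial character vector.
char_table = {ch: cbow.wv[ch] for ch in cbow.wv.index_to_key}
```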
Step 102), acquiring a character vector sequence and a word position mark sequence of the corresponding text according to the character vector and the word segmentation data.
The character vector feature sequence is obtained by looking up the character vector of each character of the segmented text in the character table.
The word-position tag sequence is determined from each character's position within its word in the segmented text: B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word.
Specifically, for the segmented text 阿里巴巴和沃尔玛完全不同 ("Alibaba and Walmart are completely different"), the word-position tag sequence is: [B, M, M, E, S, B, M, E, B, E, S, S].
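Such a tag sequence can be derived mechanically from a segmented sentence; the following is a small sketch (the function name bmes_tags is illustrative, not from the patent):

```python
# Derive the B/M/E/S word-position tags of step 102 from a list of words.
def bmes_tags(words):
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])  # begin/middle/end
    return tags

print(bmes_tags(["阿里巴巴", "和", "沃尔玛"]))  # ['B', 'M', 'M', 'E', 'S', 'B', 'M', 'E']
```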
Step 103), constructing a prosody level prediction model based on a self-attention mechanism, and pre-training the model by utilizing the character vector feature sequence and the word position mark sequence of the word segmentation data obtained in the step 102).
The constructed prosody level prediction model consists of N layers, each containing a feed-forward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y are the input and output of the sublayer, respectively. The model has four output layers. Three of them predict the prosody levels, i.e., prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries respectively, so that all prosody levels are predicted simultaneously within one model. The fourth output layer performs the word segmentation task: since prosody level boundaries are built on grammatical words, introducing the segmentation task supplies word-level information that improves the accuracy of prosody level prediction.
Specifically, the feed-forward neural network sublayer consists of two linear projections connected by a rectified linear unit as the activation function:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d×d_f and d_f×d respectively, and b_1 and b_2 are bias vectors.
The self-attention sublayer uses multi-head self-attention. For each head, the input matrix is first linearly projected into three matrices Q, K, and V; a scaled dot-product attention operation over the three matrices then yields a matrix M; and the M of all heads are concatenated and linearly projected to give the sublayer output. M is computed as:
M = Softmax(QK^T / √d_k)V
where d_k is the column dimension of K and Q, and Softmax() is the normalized exponential function.
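Read literally, the two sublayers can be sketched in PyTorch as below. This is an illustrative reading of the description rather than the patent's reference implementation: layer normalization is omitted because the text specifies only the residual connection Y = X + SubLayer(X), and every hyperparameter value is an assumption.

```python
# Sketch of one encoder layer: multi-head self-attention and feed-forward
# sublayers, each wrapped in a residual connection Y = X + SubLayer(X).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0
        self.h, self.d_k = h, d // h
        self.proj_q = nn.Linear(d, d)  # linear projections producing Q, K, V
        self.proj_k = nn.Linear(d, d)
        self.proj_v = nn.Linear(d, d)
        self.proj_o = nn.Linear(d, d)  # projection after concatenating heads

    def forward(self, x):  # x: (batch, seq_len, d)
        b, t, d = x.shape
        def split(m):  # (b, t, d) -> (b, h, t, d_k)
            return m.view(b, t, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.proj_q(x)), split(self.proj_k(x)), split(self.proj_v(x))
        # Scaled dot-product attention: M = Softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        m = torch.softmax(scores, dim=-1) @ v
        m = m.transpose(1, 2).reshape(b, t, d)  # concatenate all heads
        return self.proj_o(m)

class EncoderLayer(nn.Module):
    def __init__(self, d, d_f, h):
        super().__init__()
        self.attn = SelfAttention(d, h)
        self.w1, self.w2 = nn.Linear(d, d_f), nn.Linear(d_f, d)

    def forward(self, x):
        x = x + self.attn(x)  # residual around the self-attention sublayer
        # FFN(X) = max(X W1 + b1, 0) W2 + b2, residual around the FFN sublayer
        return x + self.w2(F.relu(self.w1(x)))
```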
The model contains no sequential structure such as an RNN and thus cannot capture ordering information by itself, so the positions of the input sequence are encoded with sine and cosine functions of different frequencies to introduce, to some extent, the ordering relation among the characters; the encoding functions are:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index. The position encoding and the input character vectors both have dimension d, and their sum is taken as the model input.
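A NumPy sketch of this encoding (here the loop index j runs over the even dimensions, playing the role of 2i in the formula):

```python
# Sinusoidal position encoding; returns a (seq_len, d) matrix that is
# added element-wise to the character-vector sequence.
import numpy as np

def position_encoding(seq_len, d):
    pe = np.zeros((seq_len, d))
    for t in range(seq_len):
        for j in range(0, d, 2):  # j corresponds to 2i in the formula
            angle = t / (10000 ** (j / d))
            pe[t, j] = np.sin(angle)
            if j + 1 < d:
                pe[t, j + 1] = np.cos(angle)
    return pe
```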
During pre-training, the model iterates to minimize the cross entropy between the actual output and the expected output of the word segmentation task, with the cost function
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where y is the expected output, y ∈ {0, 1}; a is the actual output value of the network, a ∈ [0, 1]; x ranges over the nodes of the output layer; and n is the number of nodes of the output layer. The parameters of the neural network are updated by the back-propagation algorithm with stochastic gradient descent.
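Continuing the PyTorch sketch above, the following shows the four-output-layer model and one pre-training update on the segmentation task; the class counts (four word-position tags, two boundary labels), layer count, and learning rate are illustrative assumptions.

```python
# Sketch of the full model: N stacked encoder layers shared by four
# per-character softmax output layers (uses EncoderLayer from above).
class ProsodyModel(nn.Module):
    def __init__(self, d=256, d_f=1024, h=4, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(EncoderLayer(d, d_f, h) for _ in range(n_layers))
        self.seg_head = nn.Linear(d, 4)  # word positions: B/M/E/S
        self.pw_head = nn.Linear(d, 2)   # prosodic word boundary: B/NB
        self.pph_head = nn.Linear(d, 2)  # prosodic phrase boundary: B/NB
        self.iph_head = nn.Linear(d, 2)  # intonation phrase boundary: B/NB

    def forward(self, x):  # x: character vectors plus position encoding
        for layer in self.layers:
            x = layer(x)
        return self.seg_head(x), self.pw_head(x), self.pph_head(x), self.iph_head(x)

model = ProsodyModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # stochastic gradient descent

def pretrain_step(char_vecs, seg_tags):
    # Pre-training: only the word-segmentation head contributes to the loss.
    seg_logits, _, _, _ = model(char_vecs)
    loss = F.cross_entropy(seg_logits.reshape(-1, 4), seg_tags.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()  # back-propagation update
    return loss.item()
```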
Step 104) obtaining the character vector sequence, word-position tag sequence, and prosody label sequences of all levels of the corresponding text from the character vectors and the prosody annotation data carrying segmentation information.
The character vector sequence and word-position tag sequence are obtained in the same way as in step 102). The label sequences of the prosody levels (prosodic words, prosodic phrases, and intonation phrases) are determined by the prosody annotations: B marks a character at a prosodic boundary and NB marks a character not at a prosodic boundary.
Specifically, for the prosody-annotated text 阿里巴巴#1和#1沃尔玛#2完全#1不同#3 ("Alibaba #1 and #1 Walmart #2 completely #1 different #3"), the prosodic word label sequence is [NB, NB, NB, B, B, NB, NB, B, NB, B, NB, B], the prosodic phrase label sequence is [NB, NB, NB, NB, NB, NB, NB, B, NB, NB, NB, B], and the intonation phrase label sequence is [NB, NB, NB, NB, NB, NB, NB, NB, NB, NB, NB, B].
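A sketch of deriving the three label sequences from such "#k" annotations, assuming the levels are nested (#3 implies #2 implies #1), which is consistent with the example above:

```python
# Turn '#k' prosody annotations into per-character B/NB label sequences for
# prosodic words (pw), prosodic phrases (pph), and intonation phrases (iph).
import re

def prosody_labels(annotated):
    pw, pph, iph = [], [], []
    for piece, level in re.findall(r"([^#]+)#([123])", annotated):
        k = int(level)
        for seq, min_level in ((pw, 1), (pph, 2), (iph, 3)):
            seq.extend(["NB"] * (len(piece) - 1))        # inner characters
            seq.append("B" if k >= min_level else "NB")  # boundary character
    return pw, pph, iph

pw, pph, iph = prosody_labels("阿里巴巴#1和#1沃尔玛#2完全#1不同#3")
```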
Step 105) continuing training from the model pre-trained in step 103) with the character vector sequence, word-position tag sequence, and prosody label sequences of all levels obtained in step 104).
The character vector sequence is the model input; the word-position tag sequence and the prosody label sequences of all levels are the model outputs; training minimizes the sum of the cross entropies between the actual and expected outputs of all output layers.
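Continuing the sketch above, one fine-tuning update with the summed cross entropy of the four output layers might look like:

```python
# Fine-tuning of step 105: sum of the cross entropies of all four heads.
def finetune_step(char_vecs, seg_t, pw_t, pph_t, iph_t):
    seg, pw, pph, iph = model(char_vecs)
    loss = (F.cross_entropy(seg.reshape(-1, 4), seg_t.reshape(-1))
            + F.cross_entropy(pw.reshape(-1, 2), pw_t.reshape(-1))
            + F.cross_entropy(pph.reshape(-1, 2), pph_t.reshape(-1))
            + F.cross_entropy(iph.reshape(-1, 2), iph_t.reshape(-1)))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```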
Step 2) converting the text to be predicted into a character vector sequence with the character vectors of step 101), inputting it into the trained prosody level prediction model, and outputting the word positions and prosody levels of the text.
Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the present invention has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the appended claims.

Claims (6)

1. A self-attention-based Chinese prosody level prediction method, the method comprising:
learning character vectors of individual characters from a large amount of unlabeled text, converting the text to be predicted into a character vector sequence using the character vectors, inputting the character vector sequence into a trained prosody level prediction model, and outputting the word positions and prosody levels of the text;
the training step of the prosody level prediction model comprises the following steps:
step 1) learning character vectors of individual characters from a large amount of unlabeled text;
step 2) converting the text of the word-segmentation data into a character vector sequence using the character vectors obtained in step 1), and deriving the word-position tag sequence from the segmentation result;
step 3) constructing a prosody level prediction model based on the self-attention mechanism, and pre-training it with the character vector sequence and word-position tag sequence of the word-segmentation data obtained in step 2) as its input and output, respectively;
step 4) converting the text of the prosody annotation data into a character vector sequence using the character vectors obtained in step 1), deriving the word-position tag sequence from the corresponding segmentation result, and deriving a label sequence for each prosody level from the prosody annotations;
step 5) starting from the model pre-trained in step 3), training the prosody level prediction model again with the character vector sequence, word-position tag sequence, and prosody label sequences of the prosody data obtained in step 4), to obtain the trained prosody level prediction model;
the step 3) further comprises:
step 3-1) constructing a prosody level prediction model of N layers, each layer containing a feed-forward neural network sublayer and a self-attention sublayer, with a residual connection around each sublayer:
Y = X + SubLayer(X)
where X and Y are the input and output of the sublayer, respectively; the prediction model has four output layers, of which three predict prosodic word boundaries, prosodic phrase boundaries, and intonation phrase boundaries, respectively, and the fourth predicts word positions and thus performs word segmentation of the text;
the feed-forward neural network sublayer consists of two linear projections connected by a rectified linear unit as the activation function:
FFN(X) = max(XW_1 + b_1, 0)W_2 + b_2
where W_1 and W_2 are the weight matrices of the two linear projections, with dimensions d×d_f and d_f×d respectively, and b_1 and b_2 are bias vectors;
the self-attention sublayer uses multi-head self-attention: for each head, the input matrix is first linearly projected into three matrices Q, K, and V; a scaled dot-product attention operation over the three matrices then yields a matrix M; and the M of all heads are concatenated and linearly projected to give the sublayer output; M is computed as:
M = Softmax(QK^T / √d_k)V
where d_k is the column dimension of K and Q, and Softmax() is the normalized exponential function;
step 3-2) encoding different positions of the input sequence using sine and cosine functions of different frequencies, the encoding functions being as follows:
PE(t, 2i) = sin(t / 10000^(2i/d))
PE(t, 2i+1) = cos(t / 10000^(2i/d))
where t is the position and i is the dimension index; the position encoding and the input character vectors both have dimension d, and their sum is taken as the input of the prosody level prediction model;
step 3-3) pre-training the prosody level prediction model:
iterating to minimize the cross entropy between the actual output and the expected output of the word segmentation task, with the cost function
C = -(1/n) Σ_x [ y ln a + (1 - y) ln(1 - a) ]
where y is the expected output, y ∈ {0, 1}; a is the actual output value, a ∈ [0, 1]; x ranges over the nodes of the output layer; and n is the number of nodes of the output layer; the model parameters are updated by the back-propagation algorithm with stochastic gradient descent.
2. The method according to claim 1, wherein step 1) specifically comprises: based on the continuous bag-of-words model (CBOW), setting the character vector dimension to d, training on a large amount of unlabeled text to obtain an initial character vector for every individual character in the text, and building a character table of character-to-vector initial values.
3. The method according to claim 2, wherein said step 2) further comprises:
step 2-1) looking up the character vector of each character of the word-segmentation data text in the character table, thereby determining the character vector feature sequence of the text;
step 2-2) determining the word-position tag sequence of the word-segmentation data text from each character's position within its word, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word.
4. The method of claim 3, wherein the step 4) further comprises:
step 4-1) looking up the character vector of each character of the prosody annotation data text in the character table, thereby determining the character vector feature sequence of the text;
step 4-2) determining the word-position tag sequence of the prosody data text from the segmentation result of the prosody data, where B, M, E, and S respectively denote a word-initial character, a word-internal character, a word-final character, and a single-character word;
step 4-3) determining a label sequence for each prosody level (prosodic words, prosodic phrases, and intonation phrases) from the annotations of the prosody data, where B marks a character at a prosodic boundary and NB marks a character not at a prosodic boundary.
5. The method according to claim 4, wherein step 5) specifically comprises: starting from the model pre-trained in step 3), taking the character vector sequence of the prosody data as the model input and the word-position tag sequence together with the prosody label sequences of all levels as the model outputs; taking the sum of the cross entropies between the actual and expected outputs of all output layers as the training criterion and updating the model parameters with the back-propagation algorithm with stochastic gradient descent, to obtain the trained prosody level prediction model.
6. A self-attention-based Chinese prosody hierarchy prediction system comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the program.
CN201811571546.7A 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system Active CN111354333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811571546.7A CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Publications (2)

Publication Number Publication Date
CN111354333A CN111354333A (en) 2020-06-30
CN111354333B true CN111354333B (en) 2023-11-10

Family

ID=71195629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811571546.7A Active CN111354333B (en) 2018-12-21 2018-12-21 Self-attention-based Chinese prosody level prediction method and system

Country Status (1)

Country Link
CN (1) CN111354333B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914551B (en) * 2020-07-29 2022-05-20 北京字节跳动网络技术有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112309368A (en) * 2020-11-23 2021-02-02 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112580361A (en) * 2020-12-18 2021-03-30 蓝舰信息科技南京有限公司 Formula based on unified attention mechanism and character recognition model method
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN113129862B (en) * 2021-04-22 2024-03-12 合肥工业大学 Voice synthesis method, system and server based on world-tacotron
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN107451115A (en) * 2017-07-11 2017-12-08 中国科学院自动化研究所 The construction method and system of Chinese Prosodic Hierarchy forecast model end to end
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10018134A1 (en) * 2000-04-12 2001-10-18 Siemens Ag Determining prosodic markings for text-to-speech systems - using neural network to determine prosodic markings based on linguistic categories such as number, verb, verb particle, pronoun, preposition etc.
CN101202041B (en) * 2006-12-13 2011-01-05 富士通株式会社 Method and device for making words using Chinese rhythm words

Also Published As

Publication number Publication date
CN111354333A (en) 2020-06-30

Similar Documents

Publication Publication Date Title
CN111354333B (en) Self-attention-based Chinese prosody level prediction method and system
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11797822B2 (en) Neural network having input and hidden layers of equal units
CN106502985B (en) neural network modeling method and device for generating titles
De Mori Spoken language understanding: A survey
JP6222821B2 (en) Error correction model learning device and program
CN111144110B (en) Pinyin labeling method, device, server and storage medium
CN106910497B (en) Chinese word pronunciation prediction method and device
JP2020505650A (en) Voice recognition system and voice recognition method
CN111145718B (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
US11886813B2 (en) Efficient automatic punctuation with robust inference
KR20180001889A (en) Language processing method and apparatus
KR20190101567A (en) Apparatus for answering a question based on maching reading comprehension and method for answering a question using thereof
JP2019159654A (en) Time-series information learning system, method, and neural network model
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN114860915A (en) Model prompt learning method and device, electronic equipment and storage medium
JP7466784B2 (en) Training Neural Networks Using Graph-Based Temporal Classification
JP6973192B2 (en) Devices, methods and programs that utilize the language model
Krantz et al. Language-agnostic syllabification with neural sequence labeling
US20240005131A1 (en) Attention neural networks with tree attention mechanisms
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion
KR102436900B1 (en) Apparatus and method for evaluating sentense by using bidirectional language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant