CN114707503A - Front-end text analysis method based on multi-task learning - Google Patents
Front-end text analysis method based on multi-task learning
- Publication number
- CN114707503A (application CN202210132522.1A; granted as CN114707503B)
- Authority
- CN
- China
- Prior art keywords
- word
- task
- polyphone
- speech
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a front-end text analysis method based on multi-task learning, which labels features and results on the same corpus, uses a CNN network as a shared layer to extract features from the corpus, and then feeds the features into two Bi-LSTMs trained in parallel, outputting results for the two tasks. The method comprises the following steps: S1, data annotation; S2, feature preparation; S3, feature fusion; and S4, classification. The invention combines the polyphone prediction and prosody prediction tasks through multi-task learning, realizing a unified end-to-end text processing model, i.e., a unified front-end structure, so that a high-quality Mandarin TTS system can be built more quickly and easily. Training of the unified model can use the same data as input, predict polyphones and prosody directly from the raw text at the same time, and train the two tasks in parallel, which reduces the data annotation workload, saves training cost, outputs the two results simultaneously, and simplifies the training process.
Description
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a front-end text analysis method based on multi-task learning.
Background
Text-to-Speech (TTS), also known as speech synthesis, aims to synthesize intelligible, natural speech from text. It has wide application in human communication and has long been a research topic in artificial intelligence, natural language processing, and speech processing. Developing a TTS system requires knowledge of language and human speech production, involving multiple disciplines including linguistics, acoustics, digital signal processing, and machine learning. With the development of deep learning, neural-network-based TTS has flourished, and a great deal of research has focused on different aspects of neural TTS. As a result, the quality of synthesized speech has improved greatly in recent years.
In Mandarin Chinese text-to-speech synthesis, the front-end text processing module has a large impact on the intelligibility and naturalness of the synthesized speech. The classic Mandarin TTS front end is a pipeline-based system consisting of a series of text processing components, such as text normalization (TN), Chinese word segmentation (CWS), polyphone disambiguation, prosody prediction, and grapheme-to-phoneme conversion (G2P, ZhuYin). This structure makes it possible to divide and conquer the complex front-end task. However, the serial architecture also presents several problems. One is complex feature engineering and data labeling, since each component requires different input and output labels. Another is that the front-end components must be trained and optimized separately, resulting in a very complex training process.
Disclosure of Invention
In view of the problems identified in the background art, the present invention provides a front-end text analysis method based on multi-task learning.
To solve the above technical problems, the invention provides a front-end text analysis method based on multi-task learning, which labels features and results on the same corpus, uses a CNN network as a shared layer to extract features from the corpus, feeds the features into two Bi-LSTMs for parallel training, and outputs results for the two tasks. The specific technical scheme is as follows:
S1, data annotation:
manually annotating corpora of the same source, i.e., marking different labels for different tasks; corpora of the same source are corpora with the same text but different labels;
S2, feature preparation:
S2-1, extracting word segmentation features:
segmenting the corpus with a word segmentation model and labeling the result with [BMES] tags as a one-dimensional feature fea1; the word segmentation model refers to a model that segments text into words;
S2-2, extracting part-of-speech features:
performing part-of-speech analysis on the corpus with a part-of-speech tagging model and marking the result with [POS] tags as a one-dimensional part-of-speech feature fea2;
The [POS] tag is a part-of-speech tag: a noun is marked [POS]n, a verb [POS]v, and so on;
In a text processing task, the machine first simulates understanding the language. To do so, it must grasp the rules of natural language to some extent; above all it must understand the words, in particular the nature of each word. Therefore, the corpus is tagged with the part-of-speech tagging model, each word being labeled with its corresponding part of speech;
S2-3, constructing polyphone features:
for the polyphone task, a polyphone feature tag [POLY] is constructed from a polyphone dictionary and used as a one-dimensional feature fea3 that indicates whether each character is a polyphone: it is marked 1 if the character is a polyphone and 0 otherwise; this feature effectively guides the model in the polyphone disambiguation task;
after the polyphone features are constructed, they are likewise concatenated behind the corresponding text, for example: 北[POS]ns[BMES]B[POLY]0- 京[POS]ns[BMES]E[POLY]0-$ (the two characters of "Beijing"), which reads from left to right: corpus text, part-of-speech feature [POS], word segmentation feature [BMES], polyphone feature [POLY], polyphone label ('-' here, since neither character is a polyphone), and prosody label;
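By way of illustration only (the parsing code and names below are assumptions, not part of the patent), a minimal Python sketch that reads a line in the annotation format just described:

```python
import re

# Hypothetical parser for the per-character annotation format illustrated
# above: text, [POS] part of speech, [BMES] segmentation tag, [POLY]
# polyphone flag, polyphone label ('-' when not a polyphone), and an
# optional prosody label ('$', '#1' or '#3').
LINE = "北[POS]ns[BMES]B[POLY]0- 京[POS]ns[BMES]E[POLY]0-$"

TOKEN = re.compile(
    r"(?P<char>.)"               # corpus character
    r"\[POS\](?P<pos>[a-z]+)"    # part-of-speech feature
    r"\[BMES\](?P<seg>[BMES])"   # word-segmentation feature
    r"\[POLY\](?P<poly>[01])"    # polyphone feature (1 = polyphone)
    r"(?P<label>[^#$\s]+)"       # polyphone label, e.g. '-' or a pinyin
    r"(?P<prosody>\$|#[13])?"    # optional prosody label
)

for token in LINE.split():
    m = TOKEN.fullmatch(token)
    print(m.groupdict())
```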
S3, feature fusion:
S3-1, extracting shared-layer features:
using a CNN as the shared layer, inputting sentences at the character (char) level and extracting deep-level features;
the input of the CNN is the character vectors produced by the word embedding layer, and its output is the feature vectors the CNN extracts from them;
S3-2, concatenation and fusion:
constructing word features and char features through a word embedding layer, then converting the three features fea1, fea2, fea3 and the word into feature vectors whose sizes are, respectively, word segmentation feature [4×20], part-of-speech feature [60×20], polyphone feature [2×20], and word [6048×50]; then concatenating and fusing the feature vectors;
S4, classification:
arranging the deep-level features obtained in step S3 into sentence-level features in time order, and sending the resulting feature vectors to two Bi-LSTM networks to learn contextual time dependence; then completing the polyphone disambiguation task and the prosody prediction task respectively.
Bi-LSTM is the abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. The LSTM is a recurrent neural network that predicts the output at the current moment from the history of the text and the information of the previous time step; together, the forward and backward LSTMs progressively learn information from both directions over time. The polyphone labels comprise 312 labels, and the prosody labels comprise two types, #1 and #3.
Further, in the above solution, before data annotation, step S1 further includes data processing, as follows: each sentence in the corpus is split into characters, and sentences whose length exceeds 250 are filtered out.
Further, in the above scheme, in step S1, marking different labels for different tasks specifically includes: concatenating the polyphone label and the prosody label corresponding to each text.
Further, in the above scheme, in step S2-1, the [BMES] labels are: B: beginning, M: middle, E: ending, S: single (an independent character).
Further, in the above scheme, in step S2-2, the part-of-speech features include: noun n, adjective a, verb v, conjunction c, auxiliary word u, adverb d, numeral m, punctuation mark w, preposition p, onomatopoeia o, classifier q, and pronoun r, among others. There are far more than 60 parts of speech in general; 60 here means only that the corpus used in this method contains 60 part-of-speech tags, not that 60 is the total number of part-of-speech tags.
Further, in the above scheme, step S3-1 is specifically as follows: first, chars are converted into vectors through a word embedding layer (for Chinese, both char and word take a single character as the unit; after each character is extracted, the word embedding layer converts it into a vector), the size of this layer being [6048×30]; the vectors are then fed into the CNN network in batches for feature extraction. Feature extraction by the CNN specifically means that the CNN extracts from the corpus text a numerical vector containing the text's semantics. The CNN network has one convolutional layer with 5 convolution kernels of size 3×3 and a relu activation function; the dimension of the char features equals that of the input, namely 30, and the extracted source-sentence information is used for the feature fusion of the two subsequent tasks.
Because the two tasks use the same corpus, the CNN network can serve as a shared layer to extract deep information from the source sentence; compared with a single-task model, this layer shares the information learned by the two tasks, so the generalization effect (i.e., the model's adaptability to new samples) is better.
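By way of illustration, a minimal PyTorch sketch of this shared layer, assuming the stated [6048×30] char embedding and one convolutional layer with 5 kernels of size 3×3 and relu; the patent does not specify how the 5 feature maps are reduced back to dimension 30, so the channel mean used here is an assumption:

```python
import torch
import torch.nn as nn

class SharedCNN(nn.Module):
    """Shared feature extractor sketch (S3-1): char embedding [6048 x 30],
    one conv layer with 5 kernels of size 3x3, relu. Reducing the 5 feature
    maps back to dimension 30 via a channel mean is an assumption."""
    def __init__(self, vocab_size: int = 6048, char_dim: int = 30):
        super().__init__()
        self.char_embedding = nn.Embedding(vocab_size, char_dim)
        self.conv = nn.Conv2d(in_channels=1, out_channels=5,
                              kernel_size=3, padding=1)  # 5 kernels, 3x3
        self.relu = nn.ReLU()

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        x = self.char_embedding(char_ids)  # (batch, seq_len) -> (batch, seq_len, 30)
        x = x.unsqueeze(1)                 # (batch, 1, seq_len, 30)
        x = self.relu(self.conv(x))        # (batch, 5, seq_len, 30)
        return x.mean(dim=1)               # (batch, seq_len, 30), dim preserved

feats = SharedCNN()(torch.randint(0, 6048, (2, 50)))
print(feats.shape)  # torch.Size([2, 50, 30])
```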
Further, in the above scheme, in step S3-2, to ensure that the two tasks can dynamically select features, two feature vectors F1 and F2 are constructed for concatenation and fusion, with dimension 140;
wherein the combined feature vector F1 of the jth word in the ith utterance in task 1 can be expressed as:
F1_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the combined feature vector F2 of the jth word in the ith utterance in task 2 can be expressed as:
F2_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
The two tasks can dynamically adjust the required features for feature fusion according to the actual situation.
The number and order of the 5 features can be adjusted as required, on the following principle: features whose effect is not evident, or which have no positive influence on the result, can be screened out according to the training results and actual requirements.
Since two feature concatenation vectors F1 and F2 are provided, concatenation can be performed according to the number of selected features; for example, three features may be selected for F1 and two or more for F2. The two tasks need not keep the same number of features at the same time.
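A minimal sketch of this concatenation step, assuming the feature sizes given in step S3-2 (word 50, char 30, and 20 each for fea1, fea2, fea3, so that all five features give the stated dimension 140); the particular subset chosen for F2 below is purely illustrative:

```python
import torch

def fuse_features(selected: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate whichever per-character feature vectors a task selects."""
    return torch.cat(selected, dim=-1)

batch, seq = 2, 50
word = torch.randn(batch, seq, 50)   # word embedding, table [6048 x 50]
char = torch.randn(batch, seq, 30)   # shared-layer char features (S3-1)
fea1 = torch.randn(batch, seq, 20)   # segmentation [BMES], table [4 x 20]
fea2 = torch.randn(batch, seq, 20)   # part of speech [POS], table [60 x 20]
fea3 = torch.randn(batch, seq, 20)   # polyphone [POLY], table [2 x 20]

F1 = fuse_features([word, char, fea1, fea2, fea3])  # all 5 -> dimension 140
F2 = fuse_features([word, char, fea2])              # a smaller subset -> 100
print(F1.shape, F2.shape)
```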
Further, in the above solution, in step S4, the polyphone disambiguation task judges whether each character is a polyphone and then disambiguates its phonetic annotation (pronunciation), while the prosody prediction task labels the prosody of each character in the sentence.
Further, in the foregoing solution, in step S4, the Bi-LSTM output layer is followed by a linear layer whose output dimension is the number of labels for each task, with softmax as the activation function.
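By way of illustration, a sketch of one task branch as just described: a Bi-LSTM followed by a linear layer sized to the task's label count, with softmax. The 200 hidden units are taken from the embodiment below; other details are assumptions:

```python
import torch
import torch.nn as nn

class TaskHead(nn.Module):
    """One task branch sketch (S4): Bi-LSTM with 1 hidden layer, then a
    linear layer whose output size is the task's number of labels."""
    def __init__(self, input_dim: int = 140, hidden: int = 200,
                 num_labels: int = 312):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, num_labels)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        h, _ = self.bilstm(fused)                     # (batch, seq, 2*hidden)
        return torch.softmax(self.linear(h), dim=-1)  # per-char label distribution

polyphone_head = TaskHead(num_labels=312)  # 312 polyphone labels
prosody_head   = TaskHead(num_labels=2)    # prosody labels #1 and #3
```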
Compared with the prior art, the beneficial effects of the invention are embodied in the following points:
Firstly, the multi-task learning method combines the polyphone prediction and prosody prediction tasks into a unified end-to-end text processing model, i.e., a unified front-end structure, so that a high-quality Mandarin TTS system can be built more quickly and easily.
Secondly, a shared layer is used in the model, which reduces the computing resources the model occupies and correspondingly increases its synthesis speed.
Thirdly, training of the unified model can use the same data as input, predict polyphones and prosody directly from the raw text at the same time, and train the two tasks in parallel, which reduces the data annotation workload, saves training cost, and outputs the two results simultaneously, simplifying the training process.
Drawings
FIG. 1 is a model framework diagram of a front-end text analysis method based on multi-task learning according to the present invention.
Detailed Description
Examples
To validate the invention, verification is performed on a self-built database. The training set contains 971500 sentences, the test/validation set contains 3000 sentences, the polyphone dictionary contains 312 entries (corresponding to the 312 polyphone labels), and the prosody labels comprise #1 and #3. The algorithm flow of the whole system is shown in fig. 1, with reference to which the invention is described in further detail below.
FIG. 1 is a model framework diagram of a front-end text analysis method based on multi-task learning according to the present invention. As shown in fig. 1, the method mainly comprises the following steps:
S1, data annotation:
firstly, data processing is carried out: each sentence in the corpus is split into characters, and sentences whose length exceeds 250 are filtered out;
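A one-function sketch of this data-processing step (the helper name is hypothetical):

```python
# Split each sentence into characters and drop over-long sentences (S1).
def preprocess(sentences: list[str], max_len: int = 250) -> list[list[str]]:
    return [list(s) for s in sentences if len(s) <= max_len]

print(preprocess(["北京欢迎你", "x" * 300]))  # the 300-char sentence is dropped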
then, corpora of the same source are manually annotated, and the polyphone label and the prosody label are concatenated to each text; corpora of the same source are corpora with the same text but different labels;
S2, feature preparation:
S2-1, extracting word segmentation features:
segmenting the corpus with a word segmentation model and labeling the result with [BMES] tags as a one-dimensional feature fea1; the [BMES] labels are: B: beginning, M: middle, E: ending, S: single (an independent character). The word segmentation model refers to a model that segments text into words;
S2-2, extracting part-of-speech features:
performing part-of-speech analysis on the corpus with a part-of-speech tagging model and marking the result with [POS] tags as a one-dimensional part-of-speech feature fea2;
The part-of-speech features include: noun n, adjective a, verb v, conjunction c, auxiliary word u, adverb d, numeral m, punctuation mark w, preposition p, onomatopoeia o, classifier q, and pronoun r, among others. There are far more than 60 parts of speech in general; 60 here means only that the corpus used in this method contains 60 part-of-speech tags, not that 60 is the total number of part-of-speech tags.
The [POS] tag is a part-of-speech tag: a noun is marked [POS]n, a verb [POS]v, and so on;
In a text processing task, the machine first simulates understanding the language. To do so, it must grasp the rules of natural language to some extent; above all it must understand the words, in particular the nature of each word. Therefore, the corpus is tagged with the part-of-speech tagging model, each word being labeled with its corresponding part of speech. The complete part-of-speech tag set is too large to list in full; as an example, tagging a whole sentence meaning roughly "I am afraid that worrying will keep me from getting to Yunnan" yields, word by word: pronoun /r, adverb /d, verb /v, verbal noun /vn, auxiliary /u, directional verb /vf, adverb /d, auxiliary /u, place name /ns;
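The patent names no specific segmentation or part-of-speech tool; purely as an illustration, an off-the-shelf tagger such as jieba can supply the word/POS pairs from which the [BMES] feature fea1 and the [POS] feature fea2 are derived:

```python
import jieba.posseg as pseg  # jieba is an illustrative choice, not the patent's

def bmes_tags(word: str) -> list[str]:
    """Map one segmented word to per-character [BMES] tags (S2-1)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

for p in pseg.cut("我明天去云南"):            # "I am going to Yunnan tomorrow"
    print(p.word, p.flag, bmes_tags(p.word))  # word, POS tag, per-char BMES
```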
S2-3, constructing polyphone features:
for the polyphone task, a polyphone feature tag [POLY] is constructed from a polyphone dictionary and used as a one-dimensional feature fea3 that indicates whether each character is a polyphone: it is marked 1 if the character is a polyphone and 0 otherwise; this feature effectively guides the model in the polyphone disambiguation task;
after the polyphone features are constructed, they are likewise concatenated behind the corresponding text, for example: 北[POS]ns[BMES]B[POLY]0- 京[POS]ns[BMES]E[POLY]0-$ (the two characters of "Beijing"), which reads from left to right: corpus text, part-of-speech feature [POS], word segmentation feature [BMES], polyphone feature [POLY], polyphone label ('-' here, since neither character is a polyphone), and prosody label;
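A minimal sketch of constructing the [POLY] feature from a polyphone dictionary; the three dictionary entries below are illustrative only:

```python
# Per-character 0/1 polyphone flag looked up in a polyphone dictionary (S2-3).
POLYPHONE_DICT = {"行", "长", "重"}  # illustrative entries only

def poly_feature(sentence: str) -> list[int]:
    """1 if the character is a polyphone, 0 otherwise."""
    return [1 if ch in POLYPHONE_DICT else 0 for ch in sentence]

print(poly_feature("银行行长"))  # [0, 1, 1, 1] -> 银 is not in the dictionary
```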
S3, feature fusion:
S3-1, extracting shared-layer features:
using a CNN as the shared layer, inputting sentences at the character (char) level and extracting deep-level features;
first, chars are converted into vectors through a word embedding layer (for Chinese, both char and word take a single character as the unit; after each character is extracted, the word embedding layer converts it into a vector), the size of this layer being [6048×30]; the vectors are then fed into the CNN network in batches for feature extraction. Feature extraction by the CNN specifically means that the CNN extracts from the corpus text a numerical vector containing the text's semantics. The CNN network has one convolutional layer with 5 convolution kernels of size 3×3 and a relu activation function; the dimension of the char features equals that of the input, namely 30, and the extracted source-sentence information is used for the feature fusion of the two subsequent tasks.
Because the two tasks use the same corpus, the CNN network can serve as a shared layer to extract deep information from the source sentence; compared with a single-task model, this layer shares the information learned by the two tasks, so the generalization effect (i.e., the model's adaptability to new samples) is better.
The input of the CNN is the character vectors produced by the word embedding layer, and its output is the feature vectors the CNN extracts from them;
S3-2, concatenation and fusion:
constructing word features and char features through a word embedding layer, then converting the three features fea1, fea2, fea3 and the word into feature vectors whose sizes are, respectively, word segmentation feature [4×20], part-of-speech feature [60×20], polyphone feature [2×20], and word [6048×50]; then concatenating and fusing the feature vectors;
to ensure that the two tasks can dynamically select features, two feature vectors F1 and F2 are constructed for concatenation and fusion, with dimension 140;
wherein the combined feature vector F1 of the jth word in the ith utterance in task 1 can be expressed as:
F1_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the combined feature vector F2 of the jth word in the ith utterance in task 2 can be expressed as:
F2_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
The two tasks can dynamically adjust the required features for feature fusion according to the actual situation.
The number and order of the 5 features can be adjusted as required, on the following principle: features whose effect is not evident, or which have no positive influence on the result, can be screened out according to the training results and actual requirements.
Since two feature concatenation vectors F1 and F2 are provided, concatenation can be performed according to the number of selected features; for example, three features may be selected for F1 and two or more for F2. The two tasks need not keep the same number of features at the same time.
S4, classification:
arranging the deep-level features obtained in step S3 into sentence-level features in time order, and sending the resulting feature vectors to two Bi-LSTM networks to learn contextual time dependence; then completing the polyphone disambiguation task and the prosody prediction task respectively.
The polyphone disambiguation task judges whether each character is a polyphone and then disambiguates its phonetic annotation, while the prosody prediction task labels the prosody of each character in the sentence.
A linear layer follows the Bi-LSTM output layer; its output dimension is the number of labels for each task, and the activation function is softmax.
As shown in fig. 1, to keep the model size and decoding speed in check, the network has only one hidden layer with 200 hidden units; the two LSTM networks in fig. 1 have the same structure, with hidden layers of 200, an input layer of 140 (the input-layer size varies with the concatenation dimension, e.g., 120 for a concatenation of two features), and an output layer of 200.
Bi-LSTM is the abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM and a backward LSTM. The LSTM is a recurrent neural network that predicts the output at the current moment from the history of the text and the information of the previous time step; together, the forward and backward LSTMs progressively learn information from both directions over time. The polyphone labels comprise 312 labels, and the prosody labels comprise two types, #1 and #3.
The invention is based on a unified CNN-BLSTM model, configured as follows:
The CNN has one convolutional layer with n1 convolution kernels of size k1×k1. Feature fusion is then carried out and the result is fed into two BLSTMs, each with 1 hidden layer of u hidden units. Fully connected layers with s1 and s2 hidden units then map the features to the label1 and label2 dimensions, respectively.
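By way of illustration, the following sketch wires the pieces above into the unified model of fig. 1, reusing the SharedCNN and TaskHead sketches given earlier; the embedding table sizes follow step S3-2, and all other wiring details are assumptions:

```python
import torch
import torch.nn as nn

class UnifiedFrontEnd(nn.Module):
    """End-to-end sketch of the unified CNN-BLSTM model of fig. 1, reusing
    the SharedCNN and TaskHead sketches above. Embedding table sizes follow
    S3-2 ([6048x50] word, [4x20] BMES, [60x20] POS, [2x20] POLY)."""
    def __init__(self):
        super().__init__()
        self.shared   = SharedCNN(vocab_size=6048, char_dim=30)  # shared layer
        self.word_emb = nn.Embedding(6048, 50)
        self.seg_emb  = nn.Embedding(4, 20)    # [BMES]
        self.pos_emb  = nn.Embedding(60, 20)   # [POS]
        self.poly_emb = nn.Embedding(2, 20)    # [POLY]
        self.polyphone_head = TaskHead(input_dim=140, hidden=200, num_labels=312)
        self.prosody_head   = TaskHead(input_dim=140, hidden=200, num_labels=2)

    def forward(self, chars, words, seg, pos, poly):
        char_f = self.shared(chars)                              # (B, T, 30)
        fused = torch.cat([self.word_emb(words), char_f,
                           self.seg_emb(seg), self.pos_emb(pos),
                           self.poly_emb(poly)], dim=-1)         # (B, T, 140)
        # both tasks are trained in parallel on the same fused features
        return self.polyphone_head(fused), self.prosody_head(fused)
```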
Table 1 shows the effect obtained by the front-end text analysis method based on multitask learning.
TABLE 1 Effect of the method of the invention on polyphone disambiguation and prosody prediction
As can be seen from Table 1: 1) the unified model achieves test accuracy comparable to the single models when tested on the same test set; 2) it is superior to the serial single models in model size and decoding speed; 3) its training time and complexity are much reduced compared with the single models. The method can therefore build a high-quality TTS system more quickly and efficiently, which demonstrates its effectiveness.
Claims (9)
1. A front-end text analysis method based on multi-task learning, characterized by comprising the following steps:
S1, data annotation:
manually annotating corpora of the same source, i.e., marking different labels for different tasks; corpora of the same source are corpora with the same text but different labels;
S2, feature preparation:
S2-1, extracting word segmentation features:
segmenting the corpus with a word segmentation model and labeling the result with [BMES] tags as a one-dimensional feature fea1;
S2-2, extracting part-of-speech features:
performing part-of-speech analysis on the corpus with a part-of-speech tagging model and marking the result with [POS] tags as a one-dimensional part-of-speech feature fea2;
S2-3, constructing polyphone features:
for the polyphone task, constructing a polyphone feature tag [POLY] from a polyphone dictionary as a one-dimensional feature fea3 indicating whether each character is a polyphone: marked 1 if the character is a polyphone and 0 otherwise;
S3, feature fusion:
S3-1, extracting shared-layer features:
using a CNN as the shared layer, inputting sentences at the char level and extracting deep-level features;
S3-2, concatenation and fusion:
constructing word features and char features through a word embedding layer, then converting the three features fea1, fea2, fea3 and the word into feature vectors, and then concatenating and fusing the feature vectors;
S4, classification:
arranging the deep-level features obtained in step S3 into sentence-level features in time order, and sending the resulting feature vectors to two Bi-LSTM networks to learn contextual time dependence; then completing the polyphone disambiguation task and the prosody prediction task respectively.
2. The front-end text analysis method based on multi-task learning according to claim 1, wherein step S1 further includes data processing before data annotation, as follows: each sentence in the corpus is split into characters, and sentences whose length exceeds 250 are filtered out.
3. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S1, marking different labels for different tasks specifically includes: concatenating the polyphone label and the prosody label corresponding to each text.
4. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S2-1, the [BMES] labels are: B: beginning, M: middle, E: ending, S: single (an independent character).
5. The front-end text analysis method based on multi-task learning according to claim 4, wherein in step S2-2, the part-of-speech features include: noun n, adjective a, verb v, conjunction c, auxiliary word u, adverb d, numeral m, punctuation mark w, preposition p, onomatopoeia o, classifier q, and pronoun r.
6. The front-end text analysis method based on multi-task learning according to claim 1, wherein step S3-1 is specifically: first converting chars into vectors through a word embedding layer of size [6048×30], then feeding the vectors into the CNN network in batches for feature extraction; the extracted source-sentence information is used for the feature fusion of the two subsequent tasks.
7. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S3-2, two feature vectors F1 and F2 are constructed for concatenation and fusion, with dimension 140;
wherein the combined feature vector F1 of the jth word in the ith utterance in task 1 can be expressed as:
F1_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the combined feature vector F2 of the jth word in the ith utterance in task 2 can be expressed as:
F2_ij = [Word_ij, Char_ij, Fea1_ij, Fea2_ij, Fea3_ij]
and the two tasks can dynamically adjust the required features for feature fusion according to the actual situation.
8. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S4, the polyphone disambiguation task judges whether each character is a polyphone and then disambiguates its phonetic annotation, and the prosody prediction task labels the prosody of each character in the sentence.
9. The front-end text analysis method based on multi-task learning according to claim 1, wherein in step S4, the Bi-LSTM output layer is followed by a linear layer whose output dimension is the number of labels for each task, and the activation function is softmax.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210132522.1A CN114707503B (en) | 2022-02-14 | 2022-02-14 | Front-end text analysis method based on multi-task learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210132522.1A CN114707503B (en) | 2022-02-14 | 2022-02-14 | Front-end text analysis method based on multi-task learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114707503A true CN114707503A (en) | 2022-07-05 |
CN114707503B CN114707503B (en) | 2023-04-07 |
Family
ID=82167283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210132522.1A Active CN114707503B (en) | 2022-02-14 | 2022-02-14 | Front-end text analysis method based on multi-task learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114707503B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951547A (en) * | 2024-03-26 | 2024-04-30 | 紫金诚征信有限公司 | Bid and tendered data processing method and device based on artificial intelligence |
CN118332121A (en) * | 2024-04-24 | 2024-07-12 | 江苏侯曦信息科技有限公司 | Front-end text analysis method based on multitask learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030135356A1 (en) * | 2002-01-16 | 2003-07-17 | Zhiwei Ying | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN104867491A (en) * | 2015-06-17 | 2015-08-26 | 百度在线网络技术(北京)有限公司 | Training method and device for prosody model used for speech synthesis |
CN111951779A (en) * | 2020-08-19 | 2020-11-17 | 广州华多网络科技有限公司 | Front-end processing method for speech synthesis and related equipment |
CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
- 2022-02-14: CN CN202210132522.1A patent/CN114707503B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030135356A1 (en) * | 2002-01-16 | 2003-07-17 | Zhiwei Ying | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN104867491A (en) * | 2015-06-17 | 2015-08-26 | 百度在线网络技术(北京)有限公司 | Training method and device for prosody model used for speech synthesis |
CN111951779A (en) * | 2020-08-19 | 2020-11-17 | 广州华多网络科技有限公司 | Front-end processing method for speech synthesis and related equipment |
CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
周开来 (Zhou Kailai): "基于语音数据库的文语转换系统过程分析" (Process analysis of a text-to-speech conversion system based on a speech database) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951547A (en) * | 2024-03-26 | 2024-04-30 | 紫金诚征信有限公司 | Bid and tendered data processing method and device based on artificial intelligence |
CN117951547B (en) * | 2024-03-26 | 2024-06-21 | 紫金诚征信有限公司 | Bid and tendered data processing method and device based on artificial intelligence |
CN118332121A (en) * | 2024-04-24 | 2024-07-12 | 江苏侯曦信息科技有限公司 | Front-end text analysis method based on multitask learning |
Also Published As
Publication number | Publication date |
---|---|
CN114707503B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112352275A (en) | Neural text-to-speech synthesis with multi-level textual information | |
CN114707503B (en) | Front-end text analysis method based on multi-task learning | |
Poostchi et al. | BiLSTM-CRF for Persian named-entity recognition ArmanPersoNERCorpus: the first entity-annotated Persian dataset | |
CN112818089B (en) | Text phonetic notation method, electronic equipment and storage medium | |
Sangeetha et al. | Speech translation system for english to dravidian languages | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
Rebai et al. | Text-to-speech synthesis system with Arabic diacritic recognition system | |
Fashwan et al. | SHAKKIL: an automatic diacritization system for modern standard Arabic texts | |
CN111951781A (en) | Chinese prosody boundary prediction method based on graph-to-sequence | |
Li et al. | Chinese prosody phrase break prediction based on maximum entropy model. | |
Belay et al. | Impacts of homophone normalization on semantic models for amharic | |
Altıntaş et al. | Improving the performance of graph based dependency parsing by guiding bi-affine layer with augmented global and local features | |
Mahata et al. | JUNLP@ Dravidian-CodeMix-FIRE2020: Sentiment classification of code-mixed tweets using bi-directional RNN and language tags | |
Wang et al. | Investigation of using continuous representation of various linguistic units in neural network based text-to-speech synthesis | |
Ali et al. | Gemination prediction using DNN for Arabic text-to-speech synthesis | |
CN115374784A (en) | Chinese named entity recognition method based on multi-mode information selective fusion | |
Park et al. | Jejueo datasets for machine translation and speech synthesis | |
Jauk et al. | Expressive speech synthesis using sentiment embeddings | |
Yadav et al. | Different Models of Transliteration-A Comprehensive Review | |
KR100202292B1 (en) | Text analyzer | |
Kalita et al. | NMT for a Low Resource Language Bodo: Preprocessing and Resource Modelling | |
Liu et al. | Phonologically aware bilstm model for mongolian phrase break prediction with attention mechanism | |
Anto et al. | Text to speech synthesis system for English to Malayalam translation | |
Rahate et al. | An experimental technique on text normalization and its role in speech synthesis | |
Reddy et al. | Creation of GIF dataset and implementation of a speech-to-sign language translator in Telugu |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |