CN111723584A - Punctuation prediction method based on consideration of domain information - Google Patents
Punctuation prediction method based on consideration of domain information
- Publication number
- CN111723584A CN111723584A CN202010590707.8A CN202010590707A CN111723584A CN 111723584 A CN111723584 A CN 111723584A CN 202010590707 A CN202010590707 A CN 202010590707A CN 111723584 A CN111723584 A CN 111723584A
- Authority
- CN
- China
- Prior art keywords
- punctuation
- model
- domain
- layer
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 32
- 238000011156 evaluation Methods 0.000 claims abstract description 8
- 239000013598 vector Substances 0.000 claims abstract description 8
- 238000013528 artificial neural network Methods 0.000 claims abstract description 5
- 238000007781 pre-processing Methods 0.000 claims abstract description 4
- 230000002457 bidirectional effect Effects 0.000 claims description 5
- 230000015654 memory Effects 0.000 claims description 5
- 238000012360 testing method Methods 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000013145 classification model Methods 0.000 claims description 3
- 230000006403 short-term memory Effects 0.000 claims description 2
- 230000007787 long-term memory Effects 0.000 claims 1
- 238000012549 training Methods 0.000 description 9
- 238000002474 experimental method Methods 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 4
- 238000003058 natural language processing Methods 0.000 description 4
- 230000002354 daily effect Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 210000001072 colon Anatomy 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000003203 everyday effect Effects 0.000 description 1
- 239000004744 fabric Substances 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000012805 post-processing Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a punctuation prediction method that takes domain information into account, mainly comprising the following steps: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features: converting words into 300-dimensional word vectors using the GloVe tool; 3) constructing the model: using a multi-task learning method, the model simultaneously predicts punctuation and the domain classification of the text, so that domain information is incorporated into the model; 4) selecting the evaluation indexes. This increases the robustness of the system and at the same time improves the accuracy of the overall prediction.
Description
Technical Field
The invention relates to the punctuation prediction task in the field of natural language processing, and in particular to a punctuation prediction method that takes domain information into account, addressing the fact that texts from different domains carry different domain information.
Background
In recent years, with the remarkable improvement of computing power and the continued efforts of algorithm researchers, the performance and accuracy of automatic speech recognition technology have improved markedly, meeting the basic requirements of daily life. Automatic speech recognition is becoming increasingly popular in industry and everyday life. Speech recognition technology is widely applied in fields such as smart home systems, conversation transcription, voice dictation, and simultaneous interpretation, bringing great convenience to people's daily life and work. In most cases, speech recognition transcribes the speech signal into text, which is then subjected to further analysis and post-processing. The quality of the transcribed text therefore directly affects the execution of subsequent tasks, and in turn product performance and user experience. However, most automatic speech recognition systems cannot recognize punctuation marks and only generate text sequences without any punctuation, because punctuation is inaudible when people speak. Punctuation marks, however, are an integral part of text: punctuation marks pauses and tone in sentences, and often emphasizes certain words or phrases to better express the meaning of a sentence. The absence of punctuation causes problems, such as confusing human readers and degrading the performance of existing natural language processing algorithms for tasks such as machine translation, summarization, and human-machine dialogue. Therefore, automatically punctuating text is a very important task.
Up to now, there have been many studies on automatic punctuation prediction. Before deep learning became a trend, the main approach was manual rules. As the amount of data increased, statistical approaches became mainstream, such as training on punctuated text with an N-gram language model, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields. With the development of deep learning, many researchers began applying it to the punctuation prediction task. Human communication typically spans many domains, and each domain has its own vocabulary and writing style; taking domain information into account therefore facilitates punctuation prediction. Previous research has mainly used acoustic and textual features such as part-of-speech tags, word vectors, inter-word pause duration, and pitch. However, few studies consider the differences between domains. For these reasons, the invention provides a method that uses multi-task learning to incorporate domain information into punctuation prediction, giving the model good robustness and better performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a punctuation prediction method based on field-considered information.
The invention provides a method for incorporating domain information into the punctuation prediction task, taking the THUCNews text data set as the experimental object. It mainly involves four aspects: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features; 3) building the model; 4) selecting the evaluation indexes.
1) Preprocessing punctuation of text
This step begins by reading the data. The present invention is primarily directed at predicting the four most important and most common punctuation marks: commas, periods, question marks, and exclamation marks. Following previous studies, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus.
Because Chinese words have no explicit boundaries, the invention segments the character strings in the text into reasonable word sequences before any further analysis. Because the Chinese vocabulary is large, the input vocabulary used in this experiment consists of the 100,000 Chinese words with the highest frequency in the corpus plus two special symbols: one indicating unknown words (words not in the vocabulary) and the other indicating the end of the input.
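The vocabulary construction described above (top 100,000 words plus two special symbols) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the symbol names `<unk>` and `<eos>` and the function names are assumptions.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for words outside the vocabulary
EOS = "<eos>"  # hypothetical symbol marking the end of the input

def build_vocab(segmented_sentences, max_words=100_000):
    """Build the input vocabulary: the `max_words` most frequent words
    in the word-segmented corpus, plus the two special symbols."""
    counts = Counter(w for sent in segmented_sentences for w in sent)
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(max_words))}
    vocab[UNK] = len(vocab)
    vocab[EOS] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map a word sequence to vocabulary indices, using UNK for
    out-of-vocabulary words and appending the end-of-input symbol."""
    return [vocab.get(w, vocab[UNK]) for w in sentence] + [vocab[EOS]]
```

In practice `segmented_sentences` would be the THULAC-segmented corpus; here any list of word lists works.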
The punctuation prediction task is considered herein as a classification problem: for each word, predict which punctuation mark follows it. A special symbol "O" is defined to indicate that no punctuation mark follows the word. For example, for the sentence "I like her humor, and you?", the input is the unpunctuated word sequence and the output is the corresponding punctuation label sequence "O O O O , ?", as shown in Table 1:
TABLE 1 example punctuation sequence annotation of a sentence
| … | I (我) | like (喜欢) | her (她) | 's (的) | humor (幽默) | you (呢) | … |
| … | O | O | O | O | , | ? | … |
2) Selection of neural network front-end features
Conventional machine learning and deep learning methods can only process numerical data, so character data must first be converted into numerical form. GloVe is an unsupervised word representation method that converts words into word vectors and can, to some extent, represent different words and the relations between them. The present invention uses GloVe to convert words into 300-dimensional vectors as input to the model.
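Pre-trained GloVe vectors are distributed as plain text, one word per line followed by its vector components. A minimal loader for that format might look as follows; this is a sketch under the assumption of the standard GloVe text format, not code from the patent.

```python
def load_glove(lines, dim=300):
    """Parse word vectors in the GloVe text format: each line holds a
    word followed by `dim` space-separated float components. Lines with
    the wrong number of components are skipped."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) == dim:
            vectors[word] = values
    return vectors
```

In the experiments the patent uses 300-dimensional vectors; the example below uses `dim=3` only to keep the illustration short.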
3) Construction of models
In the invention, in order to take domain information into account during punctuation prediction, two methods are used:
One is to convert the domain label into a one-hot encoding and combine it with the word vectors as input, so that domain information is incorporated into the model, as shown in FIG. 1.
The model mainly comprises one bidirectional long short-term memory (BiLSTM) layer. We encode the word sequence as word embeddings X = (x1, …, xt) and combine them with the one-hot encoding Dtag of the sentence's domain label as input to the bidirectional long short-term memory layer:

Nt = {xt, Dtag} (1)
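Equation (1) is a simple concatenation of the word embedding with the one-hot domain encoding. A minimal sketch (function names are illustrative, not from the patent):

```python
def one_hot(index, num_domains):
    """One-hot encode a domain label index over `num_domains` domains."""
    vec = [0.0] * num_domains
    vec[index] = 1.0
    return vec

def combine_input(word_vec, domain_index, num_domains):
    """Nt = {xt, Dtag}, per equation (1): concatenate the word
    embedding with the one-hot domain encoding to form the BiLSTM input."""
    return word_vec + one_hot(domain_index, num_domains)
```

With a 300-dimensional word vector and, say, 14 domains, the combined input would be 314-dimensional.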
The bidirectional long short-term memory layer consists of two LSTM layers: the forward LSTM layer processes the sequence in order, and the backward LSTM layer processes it in reverse. The two LSTM layers process information through a weight-sharing layer. The hidden state of the forward LSTM layer at time step t is

h→t = LSTM(Nt, h→t-1) (2)

The hidden state h←t of the backward LSTM layer is computed in the same way over the reversed sequence. The BiLSTM output ht is then constructed by concatenating the hidden states of the forward and backward LSTM layers at time step t:

ht = [h→t; h←t] (3)

Thus, the bidirectional LSTM can learn a representation of each input word from both the preceding and the following context (e.g., "I like her humor"), which helps the model recognize punctuation marks, which often depend on context. The output layer then generates the punctuation probability yt at time step t as follows:
yt=Softmax(htWy+by) (4)
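The output layer of equation (4) is a linear transform of the BiLSTM hidden state followed by a softmax over punctuation classes. A pure-Python sketch (names and the toy dimensions are illustrative):

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def punct_probs(h_t, W_y, b_y):
    """yt = Softmax(ht Wy + by), per equation (4). `h_t` is the BiLSTM
    hidden state, `W_y` a (hidden x classes) weight matrix, `b_y` the bias."""
    logits = [sum(h * w for h, w in zip(h_t, col)) + b
              for col, b in zip(zip(*W_y), b_y)]
    return softmax(logits)
```

The resulting vector is a probability distribution over the punctuation classes (O, comma, period, question mark, exclamation mark).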
The other method uses multi-task learning: the model jointly performs two tasks, punctuation prediction and domain classification of the text, with the model structure shown in FIG. 2. This model uses a network structure similar to that of the domain-label model, taking the word-embedding sequence X = (x1, …, xt) as input. Dtag is the output of the text domain classification task; the main flow is as follows:
yt=Softmax(htWy+by) (7)
f = flatten{h1, h2, …, ht} (8)
Dtag = Sigmoid(f WD + bD) (9)
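Equations (8)–(9) flatten the hidden states across time and pass them through a sigmoid layer to produce the domain classification output. A minimal sketch (function names and toy dimensions are illustrative, not from the patent):

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def domain_output(hidden_states, W_D, b_D):
    """Equations (8)-(9): f = flatten{h1, ..., ht}, then
    Dtag = Sigmoid(f WD + bD) for the domain classification head."""
    f = [v for h in hidden_states for v in h]  # flatten across time steps
    return [sigmoid(sum(x * w for x, w in zip(f, col)) + b)
            for col, b in zip(zip(*W_D), b_D)]
```

In a real implementation the sentence length would have to be fixed or padded so that the flattened vector `f` has a constant dimension.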
4) evaluation index selection
To evaluate the performance of the punctuation prediction task, the evaluation indexes used in the present invention are precision, recall, and the F1 score, an index used in statistics to measure classification accuracy that takes both the precision and the recall of the classification model into account. The metrics for the four punctuation marks (comma, period, question mark, and exclamation mark) are reported separately on the test set. The equations are defined as follows:

Recall: R = TP / (TP + FN)

Precision: P = TP / (TP + FP)

F1 score: F1 = 2 · P · R / (P + R)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives for a given punctuation mark.
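The three metrics above follow directly from the per-class counts; a small helper makes the computation concrete (the function name is illustrative):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 for one punctuation class, from
    true-positive, false-positive, and false-negative counts.
    Guards against division by zero when a class never occurs."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, 8 correct commas with 2 spurious and 2 missed gives precision, recall, and F1 all equal to 0.8.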
advantageous effects
The invention addresses the punctuation prediction task in the field of natural language processing and aims to take domain information into account during punctuation prediction. Compared with the prior art, it has the following beneficial effects:
1) the system only trains the model by using the text data, so that punctuations of the corresponding text can be predicted only by using the text as input during model prediction.
2) The system uses multi-domain text, taking domain information into account, thereby avoiding poor performance in any single domain, increasing the robustness of the system, and at the same time improving the accuracy of the overall prediction.
Drawings
FIG. 1 is a diagram of a domain labeled punctuation prediction model architecture.
FIG. 2 is a diagram of a punctuation prediction model architecture based on multi-task learning incorporating domain information.
Detailed Description
The operation and effect of the present invention will be described in detail with reference to the accompanying drawings and tables.
The invention provides a punctuation prediction system that takes domain information into account, using the THUCNews corpus, collected from the Sina News RSS subscription channel between 2005 and 2011, as the processing object.
The method comprises the following specific steps:
1) experimental data set and data processing
To take domain information into account, the experiment uses the THUCNews corpus, collected from the Sina News RSS subscription channel between 2005 and 2011 and compiled by Tsinghua University. It contains 740,000 news documents (2.19 GB) in UTF-8 plain-text format, covering domains such as science and technology, politics, games, home furnishing, and education, and provides the data for training the punctuation prediction models. In our experiment we use part of the data: two training sets of similar size, "sports domain" and "multi-domain", and one test set. The "sports domain" training set contains only text from the sports domain (53,768 paragraphs), while the "multi-domain" training set contains text from multiple domains, with 5,500 paragraphs per domain. The test set contains text from multiple domains, with 500 paragraphs per domain, as shown in Table 2.
The present invention is primarily directed at predicting the four most important and most common punctuation marks: comma, period, question mark, and exclamation mark. Following previous research, semicolons and colons are replaced with commas and other punctuation marks are deleted from the corpus. In the experiment, the text is word-segmented using THULAC (THU Lexical Analyzer for Chinese), a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University.
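The punctuation normalization described above (keep the four target marks, map semicolons and colons to commas, delete the rest) can be sketched as follows. The drop list below is a minimal illustrative set, not an exhaustive inventory from the patent.

```python
KEEP = {",", "。", "?", "!"}       # the four marks to predict
TO_COMMA = {";", ":", ";", ":"}     # semicolons and colons become commas
DROP = {"、", "(", ")", "《", "》", "…", "—"}  # illustrative subset of marks to delete

def normalize_punct(tokens):
    """Normalize a token sequence per the preprocessing step: replace
    semicolons/colons with commas, keep the four target marks, and
    delete the other punctuation marks listed in DROP."""
    out = []
    for tok in tokens:
        if tok in TO_COMMA:
            out.append(",")
        elif tok in DROP:
            continue
        else:
            out.append(tok)
    return out
```

A production version would drop every non-target punctuation mark, e.g. by Unicode category, rather than relying on a fixed list.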
Table 2 experimental database description
2) Text feature extraction
300-dimensional word vectors are extracted using the GloVe tool; see the description above for details.
3) Experimental setup
In the present invention, we train all the deep neural networks with the back-propagation algorithm. The network updates its weights using the Adam optimizer with a learning rate of 0.001. In our neural network model, each LSTM layer has 256 hidden units, and we use an embedding layer between the word input and the LSTM layers whose dimension is empirically set to 128. During training, the order of the training sentences is first shuffled, and 128 sentences are then randomly selected as a training batch. The number of data iterations (epochs) is 30.
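The shuffle-then-batch scheme described above can be sketched as a small helper; the function name and the fixed seed are illustrative, not from the patent.

```python
import random

def make_batches(sentences, batch_size=128, seed=0):
    """Shuffle the training sentences, then split them into consecutive
    batches of `batch_size` (the batch size used in the experiments).
    The final batch may be smaller than `batch_size`."""
    order = list(sentences)
    random.Random(seed).shuffle(order)  # shuffle the sentence order first
    return [order[i:i + batch_size]
            for i in range(0, len(order), batch_size)]
```

Repeating this for 30 epochs (reshuffling each time) matches the iteration count stated above.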
4) Analysis of results
In Table 3 we report the precision, recall, and F1 score for each punctuation mark. We found that BiLSTM-Multi, the model that incorporates domain information through multi-task learning, achieved the best results on all evaluation criteria among the models that do and do not take domain information into account.
Table 4 shows the results of the two different methods of incorporating domain information into the punctuation prediction model. From Table 4 we find that, owing to the consideration of domain information, the error rate of the domain-label model Add Domain-tag is 7.0% lower than that of the model BLSTM-Multi, which does not consider domain information, and the error rate of the multi-task model MTL-BiLSTM is 16.8% lower than that of the label model Add Domain-tag. This indicates that the multi-task structure integrates domain information into the punctuation prediction model more effectively.
Table 3 Experimental results for each punctuation mark
TABLE 4 Overall Experimental results
Claims (2)
1. A punctuation prediction method that takes domain information into account, characterized by mainly comprising the following steps:
1) preprocessing punctuation marks of the text;
2) selection of neural network front-end features: converting the words into 300-dimensional word vectors using a GloVe tool;
3) constructing a model: the model simultaneously predicts the domain classification of punctuation and text by using a multi-task learning method, so that the model is integrated with domain information;
4) selecting the evaluation indexes: to evaluate the performance of the punctuation prediction task, the evaluation indexes used are precision, recall, and the F1 score, an index used in statistics to measure classification accuracy;
The step 3) has two specific methods:
a method of converting the domain label into a one-hot encoding and combining it with the word vectors as input, so that domain information is incorporated into the model;
the model mainly comprises one bidirectional long short-term memory (BiLSTM) layer; the word sequence is encoded as word embeddings X = (x1, …, xt) and combined with the one-hot encoding Dtag of the sentence's domain label as input to the bidirectional long short-term memory layer:

Nt = {xt, Dtag} (1)
the bidirectional long short-term memory layer consists of two LSTM layers, where the forward LSTM layer processes the sequence in order and the backward LSTM layer processes it in reverse, the two LSTM layers processing information through a weight-sharing layer; the hidden state of the forward LSTM layer at time step t is

h→t = LSTM(Nt, h→t-1) (2)

the hidden state h←t of the backward LSTM layer is computed in the same way over the reversed sequence; the BiLSTM output ht is then constructed by concatenating the hidden states of the forward and backward LSTM layers at time step t:

ht = [h→t; h←t] (3)

thus, the bidirectional LSTM can learn a representation of each input word from both the preceding and the following context, recognizing punctuation that often depends on context; the output layer then generates the punctuation probability yt at time step t as follows:
yt=Softmax(htWy+by) (4)
the other method uses multi-task learning, in which the model performs two tasks: one is punctuation prediction and the other is domain classification of the text; the model uses a network structure similar to that of the domain-label model, taking the word-embedding sequence X = (x1, …, xt) as input;

Dtag is the output of the text domain classification task; the main flow is as follows:
yt=Softmax(htWy+by) (7)
f = flatten{h1, h2, …, ht} (8)
Dtag = Sigmoid(f WD + bD) (9)
2. The punctuation prediction method based on domain information as claimed in claim 1, characterized in that the evaluation indexes used in step 4) are precision, recall, and the F1 score, an index used in statistics to measure classification accuracy, and that the metrics for the four punctuation marks (comma, period, question mark, and exclamation mark) are reported separately on the test set, with the equations defined as follows:

Recall: R = TP / (TP + FN)

Precision: P = TP / (TP + FP)

F1 score: F1 = 2 · P · R / (P + R)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010590707.8A CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010590707.8A CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111723584A true CN111723584A (en) | 2020-09-29 |
CN111723584B CN111723584B (en) | 2024-05-07 |
Family
ID=72568858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010590707.8A Active CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111723584B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114880990A (en) * | 2022-05-16 | 2022-08-09 | 马上消费金融股份有限公司 | Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012039686A1 (en) * | 2010-09-24 | 2012-03-29 | National University Of Singapore | Methods and systems for automated text correction |
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text emotion classification method based on the joint deep learning model |
GB201814860D0 (en) * | 2017-11-14 | 2018-10-31 | Adobe Systems Inc | Predicting style breaches within textual content |
CN109741604A (en) * | 2019-03-05 | 2019-05-10 | 南通大学 | Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow |
CN111090981A (en) * | 2019-12-06 | 2020-05-01 | 中国人民解放军战略支援部队信息工程大学 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
-
2020
- 2020-06-24 CN CN202010590707.8A patent/CN111723584B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012039686A1 (en) * | 2010-09-24 | 2012-03-29 | National University Of Singapore | Methods and systems for automated text correction |
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text emotion classification method based on the joint deep learning model |
GB201814860D0 (en) * | 2017-11-14 | 2018-10-31 | Adobe Systems Inc | Predicting style breaches within textual content |
CN109741604A (en) * | 2019-03-05 | 2019-05-10 | 南通大学 | Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow |
CN111090981A (en) * | 2019-12-06 | 2020-05-01 | 中国人民解放军战略支援部队信息工程大学 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
Non-Patent Citations (3)
Title |
---|
万静;郭雅志;: "基于多段落排序的机器阅读理解研究", 北京化工大学学报(自然科学版), no. 03 * |
李雅昆;潘晴;EVERETT X.WANG;: "基于改进的多层BLSTM的中文分词和标点预测", 计算机应用, no. 05 * |
段大高;梁少虎;赵振东;韩忠明;: "基于自注意力机制的中文标点符号预测模型", 计算机工程, no. 05 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114880990A (en) * | 2022-05-16 | 2022-08-09 | 马上消费金融股份有限公司 | Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device |
Also Published As
Publication number | Publication date |
---|---|
CN111723584B (en) | 2024-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ghosh et al. | Fracking sarcasm using neural network | |
US11693894B2 (en) | Conversation oriented machine-user interaction | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN109214003B (en) | The method that Recognition with Recurrent Neural Network based on multilayer attention mechanism generates title | |
Peng et al. | Topic-enhanced emotional conversation generation with attention mechanism | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN114580382A (en) | Text error correction method and device | |
CN112084334B (en) | Label classification method and device for corpus, computer equipment and storage medium | |
CN115146629A (en) | News text and comment correlation analysis method based on comparative learning | |
CN112016320A (en) | English punctuation adding method, system and equipment based on data enhancement | |
CN113178193A (en) | Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip | |
El Janati et al. | Adaptive e-learning AI-powered chatbot based on multimedia indexing | |
CN113407711B (en) | Gibbs limited text abstract generation method by using pre-training model | |
CN114722832A (en) | Abstract extraction method, device, equipment and storage medium | |
CN111723584B (en) | Punctuation prediction method based on consideration field information | |
KR102297480B1 (en) | System and method for structured-paraphrasing the unstructured query or request sentence | |
Van Enschot et al. | Taming our wild data: On intercoder reliability in discourse research | |
Younes et al. | A deep learning approach for the Romanized Tunisian dialect identification. | |
CN116881446A (en) | Semantic classification method, device, equipment and storage medium thereof | |
CN110750967A (en) | Pronunciation labeling method and device, computer equipment and storage medium | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
Yang | [Retracted] Design of Service Robot Based on User Emotion Recognition and Environmental Monitoring | |
Chanda et al. | Is Meta Embedding better than pre-trained word embedding to perform Sentiment Analysis for Dravidian Languages in Code-Mixed Text? | |
CN114841143A (en) | Voice room quality evaluation method and device, equipment, medium and product thereof | |
Chen et al. | Extractive spoken document summarization for information retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |