CN111723584B - Punctuation prediction method based on consideration field information - Google Patents
- Publication number
- CN111723584B CN111723584B CN202010590707.8A CN202010590707A CN111723584B CN 111723584 B CN111723584 B CN 111723584B CN 202010590707 A CN202010590707 A CN 202010590707A CN 111723584 B CN111723584 B CN 111723584B
- Authority
- CN
- China
- Prior art keywords
- punctuation
- model
- layer
- word
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a punctuation prediction method that takes domain information into account, which mainly comprises the following steps: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features: converting each word into a 300-dimensional word vector using the GloVe tool; 3) constructing the model: the model uses multi-task learning to predict punctuation and the text's domain class simultaneously, so that domain information is integrated into the model; 4) selecting the evaluation indexes. The method increases the robustness of the system and at the same time improves the overall prediction accuracy.
Description
Technical Field
The invention relates to the punctuation prediction task in the field of natural language processing, and in particular to a punctuation prediction method that takes domain information into account, motivated by the fact that texts from different domains carry different domain-specific information relevant to punctuation prediction.
Background
In recent years, with the remarkable improvement of computing power and the continuous efforts of algorithm researchers, the performance and accuracy of automatic speech recognition have improved significantly, meeting the basic requirements of people's daily life. Automatic speech recognition is becoming increasingly popular in industry and everyday life. Speech recognition technology is widely applied in smart home systems, conversation transcription, voice dictation, simultaneous interpretation and other fields, bringing great convenience to people's daily life and work. In most cases, speech recognition transcribes the speech signal into text, which is then analyzed and post-processed. The quality of the transcribed text therefore directly affects the performance of downstream tasks, and in turn product performance and user experience. However, most automatic speech recognition systems cannot recognize punctuation marks and only generate text sequences without any punctuation, because punctuation is inaudible when people communicate and speak. Punctuation, however, is an integral part of text: it marks pauses and intonation within sentences and often emphasizes certain words or phrases to better convey the meaning of a sentence. Its absence hinders human readers in understanding sentences and degrades the performance of existing natural language processing algorithms such as machine translation, summarization and human-machine dialogue. Therefore, automatically restoring punctuation in text is a very important task.
To date, there have been many studies on automatic punctuation prediction. Before deep learning became mainstream, the main approach was hand-crafted rules. As the amount of available data increased, statistics-based approaches became dominant, such as training an N-gram language model on punctuated text, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields. With the progress of deep learning, many researchers began applying it to punctuation prediction. Human communication generally spans many domains, and each domain has its own vocabulary and writing style, so taking domain information into account facilitates punctuation prediction. In prior research, the main approach has been to use acoustic and textual features, such as part-of-speech tags, word vectors, inter-word pause duration, and pitch. However, few studies consider the particularities of different domains. For these reasons, the invention proposes a method that merges domain information via multi-task learning to predict punctuation, giving the model good robustness and better performance.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a punctuation prediction method that takes domain information into account.
The invention provides a method for integrating domain information into the punctuation prediction task, using the THUCNews text dataset as the subject. It mainly involves four aspects: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features; 3) building the model; 4) selecting the evaluation indexes.
1) Preprocessing punctuation of text
This part first reads the data and is primarily directed at predicting the four most important and common punctuation marks: comma, period, question mark and exclamation mark. Following previous work, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus.
Because Chinese has no explicit boundaries between words, the invention first segments the character strings in the text into reasonable word sequences and performs further analysis on that basis. Since the Chinese vocabulary is large, the input vocabulary used in this experiment consists of the 100,000 most frequent Chinese words in the corpus plus two special symbols: one representing unknown words (words not in the vocabulary) and one representing the end of the input.
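The vocabulary construction described above can be sketched as follows. This is a minimal illustration, assuming already-segmented tokens; the special-symbol names `<UNK>` and `<EOS>` are illustrative stand-ins for the two symbols the text mentions.

```python
from collections import Counter

def build_vocab(corpus_tokens, max_words=100_000,
                unk="<UNK>", eos="<EOS>"):
    """Build an input vocabulary from the most frequent words, plus two
    special symbols: one for unknown words and one for the end of input."""
    counts = Counter(corpus_tokens)
    most_common = [w for w, _ in counts.most_common(max_words)]
    vocab = {unk: 0, eos: 1}          # special symbols take the first indices
    for w in most_common:
        vocab.setdefault(w, len(vocab))
    return vocab

# Toy corpus; any word outside the vocabulary maps to the <UNK> index.
tokens = ["我", "喜欢", "她", "的", "幽默", "我", "喜欢"]
vocab = build_vocab(tokens)
print(vocab.get("不存在", vocab["<UNK>"]))
```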
The punctuation prediction task is treated here as classifying which punctuation mark follows each word. A special symbol "O" is also defined to indicate that no punctuation mark (i.e., a space) follows the word. For example, for the sentence "我喜欢她的幽默，你呢？" ("I like her humor, and you?"), the input is the word sequence "我 喜欢 她 的 幽默 你呢" and the output is the corresponding sequence of punctuation labels "O O O O ， ？", as shown in Table 1:
TABLE 1 example of punctuation sequence labeling of a sentence
… | 我 | 喜欢 | 她 | 的 | 幽默 | 你呢 | … |
… | O | O | O | O | ， | ？ | … |
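The labeling scheme of Table 1 can be sketched as a small conversion routine. It is a minimal sketch assuming segmented tokens; the punctuation set shown covers the marks kept after preprocessing.

```python
PUNCTS = {"，", "。", "？", "！", ",", ".", "?", "!"}

def to_label_sequence(tokens):
    """Convert a segmented, punctuated token sequence into parallel
    (word, label) lists: each word is labelled with the punctuation mark
    that follows it, or "O" when no punctuation follows."""
    words, labels = [], []
    for tok in tokens:
        if tok in PUNCTS:
            if labels:                 # attach the mark to the previous word
                labels[-1] = tok
        else:
            words.append(tok)
            labels.append("O")
    return words, labels

# The example from Table 1: "我 喜欢 她 的 幽默 ， 你呢 ？"
words, labels = to_label_sequence(
    ["我", "喜欢", "她", "的", "幽默", "，", "你呢", "？"])
print(words)   # ['我', '喜欢', '她', '的', '幽默', '你呢']
print(labels)  # ['O', 'O', 'O', 'O', '，', '？']
```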
2) Selection of neural network front-end features
Conventional machine learning and deep learning methods can only process numerical data, so character data must first be converted into numerical form. GloVe is an unsupervised word representation method that converts words into word vectors which, to a certain extent, capture the differences and relationships between words. The invention uses GloVe to convert each word into a 300-dimensional vector that serves as the input to the model.
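The lookup step can be sketched as below. GloVe itself trains the vectors offline; here random vectors stand in for pretrained 300-dimensional GloVe embeddings (real ones are loaded from a GloVe text file, one "word v1 … v300" line per word), and the zero vector for unknown words is an illustrative choice.

```python
import numpy as np

DIM = 300
rng = np.random.default_rng(0)

# Stand-in for pretrained GloVe vectors.
glove = {w: rng.standard_normal(DIM) for w in ["我", "喜欢", "幽默"]}
unk_vec = np.zeros(DIM)  # unknown words map to a fixed vector

def embed(tokens):
    """Map a token sequence to a (len(tokens), 300) matrix of word vectors."""
    return np.stack([glove.get(t, unk_vec) for t in tokens])

X = embed(["我", "喜欢", "不在词表里"])
print(X.shape)  # (3, 300)
```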
3) Construction of a model
In the present invention, in order to consider domain information at the time of punctuation prediction, two methods are used:
The first method converts the domain label into a one-hot code and combines it with the word vectors as input, so that the model incorporates the domain information, as shown in FIG. 1.
The model mainly comprises one bidirectional long short-term memory (BiLSTM) layer. The sequence X = (x_1, …, x_t) of word-embedding-encoded words, combined with the one-hot code D_tag of the sentence's domain, is used as the input of the bidirectional LSTM layer:

N_t = {x_t, D_tag} (1)

The bidirectional LSTM layer consists of two LSTM layers: the forward LSTM layer processes the sequence in the forward direction, and the backward LSTM layer processes it in the reverse direction; the two LSTM layers share a weighting layer to process the information. The hidden unit of the forward LSTM layer at time step t is

h_t^fwd = LSTM(N_t, h_{t-1}^fwd) (2)

The hidden state h_t^bwd of the backward LSTM layer is computed in the same way as for the forward LSTM layer, over the reversed sequence.

The hidden state h_t of the BiLSTM layer is then constructed by concatenating the hidden unit of the forward LSTM layer and the hidden unit of the backward LSTM layer at time step t:

h_t = [h_t^fwd ; h_t^bwd] (3)

In this way the bidirectional LSTM can learn a representation of each input word using the sentence context in both directions (e.g., "我喜欢她的幽默"), which helps the model identify punctuation marks, which are usually context dependent. The output layer then generates the punctuation probabilities y_t at time step t as follows:

y_t = Softmax(h_t W_y + b_y) (4)
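Equations (1), (3) and (4) of the domain-tag method can be sketched in numpy as below. This is a shape-level sketch, not the full model: the LSTM recurrences of equation (2) are elided, and random matrices stand in for the learned forward/backward hidden states and weights; all dimensions follow the text (300-d word vectors, 256 units per LSTM direction, 5 output labels: O plus the four marks).

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_WORD, N_DOMAINS, H, N_PUNCT = 6, 300, 5, 256, 5

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# Eq (1): N_t = {x_t, D_tag} -- concatenate each word vector with the
# sentence's domain one-hot code.
X = rng.standard_normal((T, D_WORD))          # word embeddings (stand-in)
D_tag = one_hot(2, N_DOMAINS)                 # e.g. domain index 2
N = np.hstack([X, np.tile(D_tag, (T, 1))])    # shape (T, 305)

# Eq (3): h_t = [h_t^fwd ; h_t^bwd] -- LSTM recurrences elided; random
# states stand in for the two directions' hidden units.
h_fwd = rng.standard_normal((T, H))
h_bwd = rng.standard_normal((T, H))
h = np.hstack([h_fwd, h_bwd])                 # shape (T, 512)

# Eq (4): y_t = Softmax(h_t W_y + b_y)
W_y = rng.standard_normal((2 * H, N_PUNCT)) * 0.01
b_y = np.zeros(N_PUNCT)
logits = h @ W_y + b_y
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(N.shape, y.shape)  # (6, 305) (6, 5)
```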
The other method uses multi-task learning: the model has two tasks, punctuation prediction and text domain classification; the structure of this model is shown in FIG. 2. In this model, a network structure similar to that of the domain-label model uses the sequence X = (x_1, …, x_t) of word-embedding-encoded words as input. D_tag is the output of the text domain classification task, and the main flow is as follows:

y_t = Softmax(h_t W_y + b_y) (7)

f = flatten{h_1, h_2, …, h_t} (8)

D_tag = Sigmoid(f W_D + b_D) (9)
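Equations (7)-(9) of the multi-task head can be sketched as below: a per-step softmax for punctuation, plus a flatten-and-sigmoid head over all hidden states for the domain label. As in the previous sketch, random matrices stand in for the BiLSTM hidden states and the learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H2, N_PUNCT, N_DOMAINS = 6, 512, 5, 5   # H2 = 2 * 256 (BiLSTM output)

h = rng.standard_normal((T, H2))           # BiLSTM hidden states (stand-in)

# Eq (7): per-step punctuation probabilities.
W_y = rng.standard_normal((H2, N_PUNCT)) * 0.01
logits = h @ W_y
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Eq (8): f = flatten{h_1, ..., h_t}.
f = h.reshape(-1)                          # shape (T * H2,)

# Eq (9): D_tag = Sigmoid(f W_D + b_D) -- one score per domain.
W_D = rng.standard_normal((T * H2, N_DOMAINS)) * 0.01
b_D = np.zeros(N_DOMAINS)
D_tag = 1.0 / (1.0 + np.exp(-(f @ W_D + b_D)))
print(y.shape, f.shape, D_tag.shape)  # (6, 5) (3072,) (5,)
```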
4) Evaluation index selection
To evaluate the performance of the punctuation prediction task, the evaluation indexes used in the invention are precision, recall, and the F_1 score, an index used in statistics to measure the accuracy of a classification model that combines its precision and recall. The metrics for comma, period, question mark and exclamation mark are reported separately on the test set. The equations are defined as follows:

Recall: R = TP / (TP + FN)

Precision: P = TP / (TP + FP)

F_1 score: F_1 = 2 · P · R / (P + R)

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
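The per-mark metrics above can be computed as in this sketch, where TP/FP/FN are counted position by position against gold labels:

```python
def prf(gold, pred, mark):
    """Precision, recall and F1 for one punctuation mark, following the
    definitions above (TP/FP/FN counted per label position)."""
    tp = sum(g == mark and p == mark for g, p in zip(gold, pred))
    fp = sum(g != mark and p == mark for g, p in zip(gold, pred))
    fn = sum(g == mark and p != mark for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["O", "，", "O", "。", "O", "，"]
pred = ["O", "，", "，", "。", "O", "O"]
print(prf(gold, pred, "，"))  # (0.5, 0.5, 0.5)
```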
Advantageous effects
For the punctuation prediction task in the field of natural language processing, the invention takes domain information into account during punctuation prediction and, compared with the prior art, has the following beneficial effects:

1) The system trains the model using text data only; at prediction time, only text is needed as input to predict the punctuation of the corresponding text.

2) The system uses multi-domain text and takes domain information into account, which avoids poor performance in any particular domain, increases the robustness of the system, and at the same time improves the overall prediction accuracy.
Drawings
FIG. 1 is a block diagram of a domain labeled punctuation prediction model.
FIG. 2 is a block diagram of a punctuation prediction model based on multi-tasking learning integration domain information.
Detailed Description
The operation and effects of the present invention will be described in detail with reference to the accompanying drawings and tables.
The invention provides a punctuation prediction system that takes domain information into account. Taking as the processing object the THUCNews corpus, collected from the Sina News RSS subscription channels between 2005 and 2011, the overall system flow comprises: the experimental dataset and data processing, text feature extraction, the experimental setup, and result analysis.
The method comprises the following specific steps:
1) Experimental data set and data processing
To take domain information into account, the experiments use the THUCNews corpus, built by Tsinghua University from the Sina News RSS subscription channels between 2005 and 2011. It contains 740,000 news documents (2.19 GB) of various categories, all in UTF-8 plain-text format, including texts from domains such as technology, current affairs, games, home, and education, which are used to train the punctuation prediction model. In our experiments we use part of this data: there are two training sets of similar size, "sports domain" and "multi-domain", and one test set. The "sports domain" training set contains only sports-domain text (53,768 paragraphs), while the "multi-domain" training set contains text from multiple domains, with 5,500 paragraphs per domain. The test set contains 500 paragraphs from each of the multiple domains, as shown in Table 2.
The invention is primarily directed at predicting the four most important and common punctuation marks: comma, period, question mark and exclamation mark. Following previous work, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus. In this experiment, a Chinese lexical analysis toolkit developed by the Tsinghua University Natural Language Processing and Social Humanities Computing Laboratory is used to segment the texts into words.
Table 2 description of the experimental database
2) Text feature extraction
The GloVe tool is used to extract 300-dimensional word vectors; the details are as described above.
3) Experimental setup
In the present invention, all deep neural networks are trained with the back-propagation algorithm. The network updates its weights using the Adam optimizer with a learning rate of 0.001. In our neural network model, each LSTM layer has 256 LSTM units, and an embedding layer is used between the word input and the LSTM layer, whose dimension is empirically set to 128. During training, the order of the training sentences is first shuffled, and then 128 sentences are randomly selected as a training batch. The number of data iterations (epochs) is 30.
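The data handling of the training loop (shuffle each epoch, then draw batches of 128) can be sketched as:

```python
import random

def batches(sentences, batch_size=128, epochs=30, seed=0):
    """Yield shuffled mini-batches: the sentence order is reshuffled at
    the start of each epoch, then consumed in groups of batch_size."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in range(0, len(order), batch_size):
            yield [sentences[j] for j in order[i:i + batch_size]]

data = [f"sent-{i}" for i in range(300)]
first = next(batches(data))
print(len(first))  # 128
```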
4) Analysis of results
In Table 3 we report the precision, recall, and F1 score for each punctuation mark. We find that, compared with the models that do not consider domain information, the model that incorporates domain information through multi-task learning, MTL-BiLSTM, achieves the best results on all evaluation criteria.
Table 4 shows the results of the two different methods of incorporating domain information into the punctuation prediction model. From Table 4 we find that the error rate of the domain-tag model, which takes domain information into account, is 7.0% lower than that of BLSTM-Multi, which does not; and that, using the multi-task method, the error rate of MTL-BiLSTM is a further 16.8% lower than that of the domain-tag model. This indicates that the multi-task structure integrates domain information into the punctuation prediction model more effectively.
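The 7.0% and 16.8% figures are relative error-rate reductions, computed as in this one-line sketch (the error rates shown are illustrative, not the values behind Table 4):

```python
def rel_reduction(base_err, new_err):
    """Relative error-rate reduction: (base - new) / base."""
    return (base_err - new_err) / base_err

# Illustrative numbers only: a drop from 20.0% to 18.6% error
# is a 7.0% relative reduction.
print(round(rel_reduction(0.200, 0.186), 3))  # 0.07
```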
TABLE 3 results of experiments at each punctuation
TABLE 4 results of the overall experiments
Claims (2)
1. The punctuation prediction method based on the consideration of the field information is characterized by mainly comprising the following steps:
1) Preprocessing punctuation marks of the text;
2) Selection of neural network front-end characteristics: converting the word into a 300-dimensional word vector using GloVe tools;
3) Model construction: the model uses a multitask learning method, and simultaneously predicts punctuation and field classification of texts, so that the model is integrated with field information;
4) Selecting the evaluation indexes: to evaluate the performance of the punctuation prediction task, the evaluation indexes used are precision, recall, and the F_1 score, an index used in statistics to measure the accuracy of a classification model;
The step 3) comprises two specific methods:
converting the domain label into a one-hot code combined with the word vectors as input, so that the model incorporates the domain information;
the model mainly comprises one bidirectional long short-term memory (BiLSTM) layer, which uses the sequence X = (x_1, …, x_t) of word-embedding-encoded words combined with the one-hot code D_tag of the sentence's domain as the input of the bidirectional LSTM layer,

N_t = {x_t, D_tag} (1)

the bidirectional LSTM layer consists of two LSTM layers, wherein the forward LSTM layer processes the sequence in the forward direction, the backward LSTM layer processes it in the reverse direction, the two LSTM layers share a weighting layer to process the information, and the hidden unit of the forward LSTM layer at time step t is

h_t^fwd = LSTM(N_t, h_{t-1}^fwd) (2)

the hidden state h_t^bwd of the backward LSTM layer is computed in the same way as for the forward LSTM layer, over the reversed sequence;

the hidden state h_t of the BiLSTM layer is then constructed by concatenating the hidden unit of the forward LSTM layer and the hidden unit of the backward LSTM layer at time step t:

h_t = [h_t^fwd ; h_t^bwd] (3)

thus, the bidirectional LSTM can learn a representation of each input word using the sentence context in both directions, identifying punctuation marks, which are usually context dependent; the output layer then generates the punctuation probabilities y_t at time step t as follows:

y_t = Softmax(h_t W_y + b_y) (4)
the other method uses multi-task learning: the model has two tasks, punctuation prediction and text domain classification, with a network structure similar to that of the domain-label model, using the sequence X = (x_1, …, x_t) of word-embedding-encoded words as input;

D_tag is the output of the text domain classification task, and the main flow is as follows:

y_t = Softmax(h_t W_y + b_y) (7)

f = flatten{h_1, h_2, …, h_t} (8)

D_tag = Sigmoid(f W_D + b_D) (9).
2. The punctuation prediction method based on consideration of field information according to claim 1, wherein the evaluation indexes used in step 4) are precision, recall, and the F_1 score, an index used in statistics to measure the accuracy of a classification model; the metrics for comma, period, question mark and exclamation mark are reported separately on the test set, and the equations are defined as follows:

Recall: R = TP / (TP + FN)

Precision: P = TP / (TP + FP)

F_1 score: F_1 = 2 · P · R / (P + R),

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010590707.8A CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111723584A CN111723584A (en) | 2020-09-29 |
CN111723584B true CN111723584B (en) | 2024-05-07 |
Family
ID=72568858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010590707.8A Active CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111723584B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012039686A1 (en) * | 2010-09-24 | 2012-03-29 | National University Of Singapore | Methods and systems for automated text correction |
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text emotion classification method based on the joint deep learning model |
GB201814860D0 (en) * | 2017-11-14 | 2018-10-31 | Adobe Systems Inc | Predicting style breaches within textual content |
CN109741604A (en) * | 2019-03-05 | 2019-05-10 | 南通大学 | Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow |
CN111090981A (en) * | 2019-12-06 | 2020-05-01 | 中国人民解放军战略支援部队信息工程大学 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
Non-Patent Citations (3)
Title |
---|
Research on Machine Reading Comprehension Based on Multi-Paragraph Ranking; Wan Jing; Guo Yazhi; Journal of Beijing University of Chemical Technology (Natural Science Edition), No. 03; full text *
Chinese Word Segmentation and Punctuation Prediction Based on an Improved Multi-Layer BLSTM; Li Yakun; Pan Qing; Everett X. WANG; Journal of Computer Applications, No. 05; full text *
A Chinese Punctuation Prediction Model Based on the Self-Attention Mechanism; Duan Dagao; Liang Shaohu; Zhao Zhendong; Han Zhongming; Computer Engineering, No. 05; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111723584A (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717031B (en) | Intelligent conference summary generation method and system | |
CN109241255B (en) | Intention identification method based on deep learning | |
US11693894B2 (en) | Conversation oriented machine-user interaction | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN113435203B (en) | Multi-modal named entity recognition method and device and electronic equipment | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN109508441B (en) | Method and device for realizing data statistical analysis through natural language and electronic equipment | |
CN112016320A (en) | English punctuation adding method, system and equipment based on data enhancement | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
Graja et al. | Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect | |
CN113469163B (en) | Medical information recording method and device based on intelligent paper pen | |
Nithya et al. | Deep learning based analysis on code-mixed tamil text for sentiment classification with pre-trained ulmfit | |
CN110866087A (en) | Entity-oriented text emotion analysis method based on topic model | |
CN107562907B (en) | Intelligent lawyer expert case response device | |
KR102297480B1 (en) | System and method for structured-paraphrasing the unstructured query or request sentence | |
CN115481313A (en) | News recommendation method based on text semantic mining | |
CN117591648A (en) | Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception | |
CN111723584B (en) | Punctuation prediction method based on consideration field information | |
Younes et al. | A deep learning approach for the Romanized Tunisian dialect identification. | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
Fernández-Martínez et al. | An approach to intent detection and classification based on attentive recurrent neural networks | |
CN113342964B (en) | Recommendation type determination method and system based on mobile service | |
CN114841143A (en) | Voice room quality evaluation method and device, equipment, medium and product thereof | |
CN114896966A (en) | Method, system, equipment and medium for positioning grammar error of Chinese text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||