CN111723584B - Punctuation prediction method considering domain information - Google Patents


Info

Publication number
CN111723584B
CN111723584B
Authority
CN
China
Prior art keywords
punctuation
model
layer
word
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010590707.8A
Other languages
Chinese (zh)
Other versions
CN111723584A (en)
Inventor
王龙标
魏文青
党建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010590707.8A priority Critical patent/CN111723584B/en
Publication of CN111723584A publication Critical patent/CN111723584A/en
Application granted granted Critical
Publication of CN111723584B publication Critical patent/CN111723584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a punctuation prediction method considering domain information, which mainly comprises the following steps: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features: converting each word into a 300-dimensional word vector using the GloVe tool; 3) constructing the model: the model uses a multi-task learning method that predicts punctuation and the text's domain classification simultaneously, so that domain information is integrated into the model; 4) selecting the evaluation indices. The method increases the robustness of the system and at the same time improves the overall prediction accuracy.

Description

Punctuation prediction method considering domain information
Technical Field
The invention relates to the punctuation prediction task in the field of natural language processing, and in particular to a punctuation prediction method that takes domain information into account, motivated by the differences in domain information contained in texts from different domains.
Background
In recent years, with the remarkable improvement of computing power and the continuous efforts of researchers on algorithms, the performance and accuracy of automatic speech recognition have improved significantly and now meet the basic requirements of daily life. Automatic speech recognition is becoming increasingly popular in industry and everyday life. Speech recognition technology is widely applied in smart home systems, conversation transcription, voice dictation, simultaneous interpretation and other fields, bringing great convenience to people's daily life and work. In most cases, speech recognition transcribes the speech signal into text, which is then analyzed and post-processed. The quality of the transcribed text therefore directly affects the performance of subsequent tasks, and in turn product performance and user experience. However, most automatic speech recognition systems cannot recognize punctuation marks and only generate text sequences without any punctuation, because punctuation is inaudible when people speak in daily communication. Punctuation, however, is an integral part of text: it marks pauses and intonation in sentences, and it often emphasizes certain words or phrases to better convey the meaning of a sentence. The absence of punctuation causes problems such as hindering human readers' understanding of sentences and degrading the performance of existing natural language processing algorithms such as machine translation, summarization and human-machine dialogue. Therefore, automatically punctuating text is a very important task.
To date, there have been many studies on automatic punctuation prediction. Before deep learning became dominant, the main approach was hand-crafted rules. As the amount of available data increased, statistics-based approaches became mainstream, such as training N-gram language models on punctuated text, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields. With the progress of deep learning, many researchers began applying it to punctuation prediction. Human communication generally spans many domains, and each domain has its own vocabulary and writing style; taking domain information into account therefore facilitates punctuation prediction. Prior research has mainly used acoustic and textual features, such as part-of-speech tags, word vectors, inter-word pause duration and pitch, but few studies consider the particularities of different domains. For these reasons, the invention proposes a method that merges domain information through multi-task learning to predict punctuation, giving the model good robustness and better performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a punctuation prediction method that takes domain information into account.
The invention provides a method for integrating domain information into the punctuation prediction task, using the THUCNews text dataset as the subject. It mainly involves four aspects: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features; 3) building the model; 4) selecting the evaluation indices.
1) Preprocessing punctuation of text
This step first reads the data and is primarily directed at predicting the four most important and common punctuation marks: commas, periods, question marks and exclamation marks. Following previous studies, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus.
Because there is no explicit boundary between Chinese words, the invention first segments the character strings in the text into reasonable word sequences and performs all further analysis on that basis. Since the number of Chinese words is large, the input vocabulary used in this experiment consists of the 100,000 Chinese words with the highest frequency of occurrence in the corpus plus two special symbols, one representing unknown words (words not appearing in the vocabulary) and the other representing the end of the input.
The punctuation prediction task is treated here as a classification problem: which punctuation mark follows each word. A special symbol "O" is defined to indicate that no punctuation mark (i.e., a space) follows the word. For example, for the sentence "I like her humor, what about you?" ("我喜欢她的幽默，你呢？"), the input is the unpunctuated word sequence and the output is the corresponding sequence of punctuation labels "O O O O , O ?", as shown in Table 1.
TABLE 1 Example of punctuation sequence labeling for a sentence
Word:  我 (I) | 喜欢 (like) | 她 (her) | 的 (particle) | 幽默 (humor) | 你 (you) | 呢 (particle)
Label: O | O | O | O | ， | O | ？
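The labeling scheme can be illustrated with a short sketch. The following Python snippet is not part of the patent; it is a minimal, hypothetical illustration of how a segmented, punctuated sentence could be converted into (word, label) pairs under the scheme of Table 1.

```python
# Illustrative sketch (not from the patent text): convert a segmented, punctuated
# sentence into (word, label) pairs, where each label is the punctuation mark
# that follows the word, or "O" when no punctuation follows.
PUNCT = {",", ".", "?", "!", "，", "。", "？", "！"}

def to_word_label_pairs(tokens):
    """tokens: list of words and punctuation marks in their original order."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in PUNCT:
            continue  # a punctuation token becomes the label of the preceding word
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        label = nxt if nxt in PUNCT else "O"
        pairs.append((tok, label))
    return pairs

# Example from Table 1: "我喜欢她的幽默，你呢？" ("I like her humor, what about you?")
tokens = ["我", "喜欢", "她", "的", "幽默", "，", "你", "呢", "？"]
print(to_word_label_pairs(tokens))
# [('我', 'O'), ('喜欢', 'O'), ('她', 'O'), ('的', 'O'), ('幽默', '，'), ('你', 'O'), ('呢', '？')]
```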
2) Selection of neural network front-end features
Conventional machine learning and deep learning methods can only process numerical data, so character data must first be converted into a numerical representation. GloVe is an unsupervised word representation method that converts words into word vectors and can, to a certain extent, represent different words and the relationships between them. The present invention uses GloVe to convert each word into a 300-dimensional vector, which serves as the input to the model.
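As an illustration of this front end, the following Python sketch shows one way to load pretrained 300-dimensional GloVe vectors from the standard text format and build an embedding matrix for the vocabulary described above; the file path handling and the random initialization for out-of-vocabulary words are assumptions, not details taken from the patent.

```python
import numpy as np

EMB_DIM = 300  # dimension of the GloVe vectors used as model input

def load_glove(path):
    """Load vectors from a standard 'word v1 v2 ... v300' GloVe text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != EMB_DIM + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(vocab, vectors):
    """vocab: dict word -> index, including the <unk> and end-of-input symbols."""
    # Assumption: out-of-vocabulary words keep a small random initialization.
    matrix = np.random.normal(scale=0.1, size=(len(vocab), EMB_DIM)).astype(np.float32)
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```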
3) Construction of a model
In the present invention, in order to consider domain information at the time of punctuation prediction, two methods are used:
One is to convert the domain label into a one-hot encoding and combine it with the word vectors as input, so that the model incorporates the domain information, as shown in FIG. 1.
The model mainly comprises one bidirectional long short-term memory (BiLSTM) layer. The word-embedding-encoded word sequence X = (x_1, …, x_t), combined with the one-hot encoding D_tag of the sentence's domain label, is used as the input to the bidirectional long short-term memory layer:
N_t = {x_t, D_tag}   (1)
The bidirectional long short-term memory layer is composed of two LSTM layers: the forward LSTM layer processes the sequence in the forward direction and the backward LSTM layer processes it in the reverse direction, and the two LSTM layers process information through a weight-sharing layer. The hidden state of the forward LSTM layer at time step t is
h_t^{fwd} = LSTM(N_t, h_{t-1}^{fwd})   (2)
The hidden state of the backward LSTM layer, h_t^{bwd}, is computed in the same way as for the forward LSTM layer, but over the reversed sequence.
The hidden state h_t of the BiLSTM layer is then constructed by combining the hidden state of the forward LSTM layer and the hidden state of the backward LSTM layer at time step t:
h_t = [h_t^{fwd}, h_t^{bwd}]   (3)
In this way, the bidirectional LSTM can learn a representation of each input word using both the preceding and following context of the sentence (e.g., "I like her humor"), which helps the model identify punctuation marks that are often context dependent. The output layer then generates the punctuation probabilities y_t at time step t as follows:
y_t = Softmax(h_t W_y + b_y)   (4)
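A minimal PyTorch sketch of the domain-tag variant described by equations (1)-(4) is given below. It is an illustrative reconstruction, not the patent's implementation: the five punctuation classes (O, comma, period, question mark, exclamation mark) and the 256-unit hidden size follow the text, while the number of domains and other details are assumptions.

```python
import torch
import torch.nn as nn

class DomainTagPunctuator(nn.Module):
    """Sketch of the domain-tag model of FIG. 1: word embeddings concatenated
    with a one-hot domain encoding feed a BiLSTM, and a softmax output layer
    predicts the punctuation label after each word (equations (1)-(4))."""

    def __init__(self, vocab_size, emb_dim=300, num_domains=14,  # num_domains is illustrative
                 hidden=256, num_punct=5):
        super().__init__()
        self.num_domains = num_domains
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim + num_domains, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_punct)   # W_y, b_y in eq. (4)

    def forward(self, word_ids, domain_ids):
        # word_ids: (batch, T) word indices; domain_ids: (batch,) domain indices
        x = self.embed(word_ids)                                    # (B, T, emb_dim)
        d = nn.functional.one_hot(domain_ids, self.num_domains).float()
        d = d.unsqueeze(1).expand(-1, x.size(1), -1)                # repeat D_tag per word
        n = torch.cat([x, d], dim=-1)                               # N_t = {x_t, D_tag}, eq. (1)
        h, _ = self.bilstm(n)                                       # h_t = [h_fwd, h_bwd], eqs. (2)-(3)
        return torch.log_softmax(self.out(h), dim=-1)               # eq. (4), per-word log-probs
```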
The other approach uses multi-task learning: the model has two tasks, punctuation prediction and domain classification of the text; the model structure is shown in FIG. 2. The network structure is similar to that of the domain-label model, but only the word-embedding-encoded word sequence X = (x_1, …, x_t) is used as input, and D_tag is the output of the text domain classification task. The main flow is as follows:
y_t = Softmax(h_t W_y + b_y)   (7)
f = flatten{h_1, h_2, …, h_t}   (8)
D_tag = Sigmoid(f W_D + b_D)   (9)
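Analogously, a sketch of the multi-task variant (equations (7)-(9)) might look as follows; the flatten step in equation (8) implies a fixed maximum sequence length, which is assumed here, and the number of domains is again illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskPunctuator(nn.Module):
    """Sketch of the multi-task model of FIG. 2: a shared BiLSTM feeds a
    per-word punctuation softmax (eq. (7)) and a domain classifier built on
    the flattened hidden states (eqs. (8)-(9)). A fixed sequence length
    seq_len is assumed so that the flatten step has a constant size."""

    def __init__(self, vocab_size, seq_len, emb_dim=300, hidden=256,
                 num_punct=5, num_domains=14):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.punct_out = nn.Linear(2 * hidden, num_punct)               # eq. (7)
        self.domain_out = nn.Linear(seq_len * 2 * hidden, num_domains)  # eqs. (8)-(9)

    def forward(self, word_ids):
        h, _ = self.bilstm(self.embed(word_ids))         # (B, T, 2*hidden)
        punct_log_probs = torch.log_softmax(self.punct_out(h), dim=-1)
        f = h.reshape(h.size(0), -1)                     # f = flatten{h_1, ..., h_t}
        domain_prob = torch.sigmoid(self.domain_out(f))  # D_tag = Sigmoid(f W_D + b_D)
        return punct_log_probs, domain_prob
```

During training, a cross-entropy loss on the per-word punctuation output and a loss on the sigmoid domain output would be combined; how the two losses are weighted is not stated in the patent text.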
4) Evaluation index selection
In order to evaluate the performance of the punctuation prediction task, the evaluation indices used in the present invention are precision, recall, and the F1 score, a measure used in statistics for the accuracy of a classification model that combines its precision and recall. The metrics for comma, period, question mark and exclamation mark are reported separately on the test set. The equations are defined as follows:
Recall: Recall = TP / (TP + FN)
Precision: Precision = TP / (TP + FP)
F1 value: F1 = 2 × Precision × Recall / (Precision + Recall)
where TP, FP and FN denote the numbers of true positives, false positives and false negatives for the punctuation class in question.
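For concreteness, the per-class metrics can be computed as in the following sketch (illustrative code, not part of the patent), where "O" is excluded from the scored classes.

```python
from collections import Counter

def prf_per_class(y_true, y_pred, classes=("，", "。", "？", "！")):
    """Per-punctuation precision, recall and F1 from flat label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        for c in classes:
            if p == c and t == c:
                tp[c] += 1      # predicted c and reference is c
            elif p == c and t != c:
                fp[c] += 1      # predicted c but reference is something else
            elif t == c and p != c:
                fn[c] += 1      # reference is c but prediction missed it
    scores = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = {"precision": prec, "recall": rec, "F1": f1}
    return scores
```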
Advantageous effects
Aiming at the punctuation prediction task in the field of natural language processing, the invention takes domain information into account during punctuation prediction and, compared with the prior art, has the following beneficial effects:
1) The system trains the model using only text data, and at prediction time only the text is needed as input to predict the punctuation of that text.
2) The system uses multi-domain text and takes domain information into account, which avoids poor performance in particular domains, increases the robustness of the system, and at the same time improves the overall prediction accuracy.
Drawings
FIG. 1 is a block diagram of the domain-label punctuation prediction model.
FIG. 2 is a block diagram of the punctuation prediction model that integrates domain information through multi-task learning.
Detailed Description
The operation and effects of the present invention will be described in detail with reference to the accompanying drawings and tables.
The invention provides a punctuation prediction system that takes domain information into account. Taking the THUCNews corpus, generated from historical data of the Sina News RSS subscription channels from 2005 to 2011, as the processing object, the overall system flow comprises the experimental dataset and data processing, text feature extraction, experimental setup, and result analysis.
The method comprises the following specific steps:
1) Experimental data set and data processing
In order to take domain information into account, the THUCNews corpus, built by Tsinghua University from historical data of the Sina News RSS subscription channels from 2005 to 2011, is used. It comprises 740,000 news documents (2.19 GB) of various categories, all in UTF-8 plain-text format, and texts in domains such as science and technology, current politics, games, home and education are used to train the punctuation prediction models. In our experiments we used part of this data: the experiment has two training sets of similar size with different domain compositions, "sports domain" and "multi-domain", and one test set. The "sports domain" training set contains only sports-domain text, 53,768 paragraphs in total, while the "multi-domain" training set contains text from multiple domains, 5,500 paragraphs per domain. The test set contains 500 paragraphs of text from each of the multiple domains, as shown in Table 2.
The present invention is primarily directed at predicting the four most important and common punctuation marks: commas, periods, question marks and exclamation marks. Following previous studies, semicolons and colons are replaced with commas and all other punctuation marks are deleted from the corpus. In this experiment, a Chinese lexical analysis toolkit developed by the Tsinghua University Natural Language Processing and Social Humanities Computing Laboratory is used to segment the texts into words.
TABLE 2 Description of the experimental database
2) Text feature extraction
The GloVe tool is used to extract 300-dimensional word vectors; the details are as described above.
3) Experimental setup
In the present invention we train all deep neural networks with the back-propagation algorithm. The network weights are updated with an Adam optimizer at a learning rate of 0.001. In our neural network models each LSTM layer has 256 LSTM units, and an embedding layer is used between the word input and the LSTM layer, whose dimension is empirically set to 128. During training, the order of the training sentences is first shuffled, and 128 sentences are then selected at random as a training batch. The number of training epochs is 30.
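A training-loop sketch matching the stated settings (Adam, learning rate 0.001, shuffled batches of 128 sentences, 30 epochs) is shown below; the model interface and the dataset object are assumptions carried over from the earlier sketches, not details given in the patent.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, num_epochs=30, batch_size=128, lr=1e-3):
    """Sketch of the stated training setup: Adam (lr 0.001), shuffled batches
    of 128 sentences, 30 epochs. The dataset is assumed to yield
    (word_ids, domain_ids, punct_labels) tensors per sentence."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.NLLLoss()   # the model sketches return log-probabilities
    model.train()
    for epoch in range(num_epochs):
        for word_ids, domain_ids, punct_labels in loader:
            optimizer.zero_grad()
            log_probs = model(word_ids, domain_ids)          # (B, T, num_punct)
            loss = criterion(log_probs.transpose(1, 2),      # NLLLoss expects (B, C, T)
                             punct_labels)                   # labels: (B, T)
            loss.backward()
            optimizer.step()
```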
4) Analysis of results
In Table 3 we report the precision, recall, and F1 score for each punctuation mark. We find that, compared with the model BLSTM-Multi, which does not take domain information into account, the model MTL-BiLSTM, which integrates domain information through multi-task learning, achieves the best results on all evaluation criteria.
Table 4 shows the results of the two different methods of incorporating domain information into the punctuation prediction model. From Table 4 we find that the error rate of the domain-tag model, which takes domain information into account, is reduced by 7.0% compared with the model BLSTM-Multi that does not, and that using the multi-task method, the error rate of the model MTL-BiLSTM is reduced by a further 16.8% compared with the domain-tag model. This indicates that the multi-task structure integrates domain information into the punctuation prediction model more effectively.
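For clarity, and assuming the reported reductions are relative error-rate reductions, the comparison can be read as:

$$\text{relative reduction} = \frac{E_{\text{baseline}} - E_{\text{new}}}{E_{\text{baseline}}} \times 100\%$$

so that, for a purely hypothetical baseline error rate of 30%, a 16.8% relative reduction would correspond to roughly 30% × (1 − 0.168) ≈ 25%.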
TABLE 3 Results of the experiments for each punctuation mark
TABLE 4 Overall experimental results

Claims (2)

1. A punctuation prediction method considering domain information, characterized by mainly comprising the following steps:
1) preprocessing the punctuation marks of the text;
2) selecting the neural network front-end features: converting each word into a 300-dimensional word vector using the GloVe tool;
3) constructing the model: the model uses a multi-task learning method that predicts punctuation and the text's domain classification simultaneously, so that domain information is integrated into the model;
4) selecting the evaluation indices: to evaluate the performance of the punctuation prediction task, the evaluation indices used are precision, recall, and the F1 score, a measure used in statistics for the accuracy of a classification model;
The step 3) comprises two specific methods:
converting the domain label into a one-hot encoding and combining it with the word vectors as input, so that the model incorporates the domain information;
the model mainly comprises one bidirectional long short-term memory (BiLSTM) layer, which uses the word-embedding-encoded word sequence X = (x_1, …, x_t) combined with the one-hot encoding D_tag of the sentence's domain label as the input to the bidirectional long short-term memory layer,
N_t = {x_t, D_tag}   (1)
the bidirectional long short-term memory layer consists of two LSTM layers, wherein the forward LSTM layer processes the forward sequence, the backward LSTM layer processes the reversed sequence, and the two LSTM layers process information through a weight-sharing layer; the hidden state of the forward LSTM layer at time step t is
h_t^{fwd} = LSTM(N_t, h_{t-1}^{fwd})   (2)
the hidden state of the backward LSTM layer, h_t^{bwd}, is computed in the same way as for the forward LSTM layer, but over the reversed sequence;
the hidden state h_t of the BiLSTM layer is then constructed by combining the hidden state of the forward LSTM layer and the hidden state of the backward LSTM layer at time step t:
h_t = [h_t^{fwd}, h_t^{bwd}]   (3)
in this way, the bidirectional LSTM can learn a representation of each input word using its preceding and following context, identifying punctuation marks that are often context dependent; the output layer then generates the punctuation probabilities y_t at time step t as follows:
y_t = Softmax(h_t W_y + b_y)   (4)
the other method uses multi-task learning: the model has two tasks, one being punctuation prediction and the other being domain classification of the text, with a network structure similar to that of the domain-label model, using the word-embedding-encoded word sequence X = (x_1, …, x_t) as input;
D_tag is the output of the text domain classification task, and the main flow is as follows:
y_t = Softmax(h_t W_y + b_y)   (7)
f = flatten{h_1, h_2, …, h_t}   (8)
D_tag = Sigmoid(f W_D + b_D)   (9).
2. The punctuation prediction method considering domain information according to claim 1, wherein the evaluation indices used in step 4) are precision, recall, and the F1 score, a measure used in statistics for the accuracy of a classification model, and the metrics for comma, period, question mark and exclamation mark are reported separately on the test set, the equations being defined as follows:
Recall: Recall = TP / (TP + FN)
Precision: Precision = TP / (TP + FP)
F1 value: F1 = 2 × Precision × Recall / (Precision + Recall)
where TP, FP and FN denote the numbers of true positives, false positives and false negatives for the punctuation class in question.
CN202010590707.8A 2020-06-24 2020-06-24 Punctuation prediction method considering domain information Active CN111723584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590707.8A CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method considering domain information

Publications (2)

Publication Number Publication Date
CN111723584A CN111723584A (en) 2020-09-29
CN111723584B true CN111723584B (en) 2024-05-07

Family

ID=72568858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590707.8A Active CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method considering domain information

Country Status (1)

Country Link
CN (1) CN111723584B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
GB201814860D0 (en) * 2017-11-14 2018-10-31 Adobe Systems Inc Predicting style breaches within textual content
CN109741604A (en) * 2019-03-05 2019-05-10 南通大学 Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on machine reading comprehension based on multi-paragraph ranking; 万静, 郭雅志; Journal of Beijing University of Chemical Technology (Natural Science Edition), No. 03; full text *
Chinese word segmentation and punctuation prediction based on an improved multi-layer BLSTM; 李雅昆, 潘晴, Everett X. WANG; Journal of Computer Applications, No. 05; full text *
A Chinese punctuation prediction model based on the self-attention mechanism; 段大高, 梁少虎, 赵振东, 韩忠明; Computer Engineering, No. 05; full text *

Also Published As

Publication number Publication date
CN111723584A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN110717031B (en) Intelligent conference summary generation method and system
CN109241255B (en) Intention identification method based on deep learning
US11693894B2 (en) Conversation oriented machine-user interaction
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN113435203B (en) Multi-modal named entity recognition method and device and electronic equipment
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN109508441B (en) Method and device for realizing data statistical analysis through natural language and electronic equipment
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
Graja et al. Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect
CN113469163B (en) Medical information recording method and device based on intelligent paper pen
Nithya et al. Deep learning based analysis on code-mixed tamil text for sentiment classification with pre-trained ulmfit
CN110866087A (en) Entity-oriented text emotion analysis method based on topic model
CN107562907B (en) Intelligent lawyer expert case response device
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
CN115481313A (en) News recommendation method based on text semantic mining
CN117591648A (en) Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception
CN111723584B (en) Punctuation prediction method considering domain information
Younes et al. A deep learning approach for the Romanized Tunisian dialect identification.
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Fernández-Martínez et al. An approach to intent detection and classification based on attentive recurrent neural networks
CN113342964B (en) Recommendation type determination method and system based on mobile service
CN114841143A (en) Voice room quality evaluation method and device, equipment, medium and product thereof
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant