CN111723584A - Punctuation prediction method based on consideration of domain information - Google Patents

Punctuation prediction method based on consideration of domain information Download PDF

Info

Publication number
CN111723584A
Authority
CN
China
Prior art keywords
punctuation
model
domain
layer
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010590707.8A
Other languages
Chinese (zh)
Other versions
CN111723584B (en)
Inventor
王龙标 (Wang Longbiao)
魏文青 (Wei Wenqing)
党建武 (Dang Jianwu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010590707.8A priority Critical patent/CN111723584B/en
Publication of CN111723584A publication Critical patent/CN111723584A/en
Application granted granted Critical
Publication of CN111723584B publication Critical patent/CN111723584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a punctuation prediction method that takes domain information into account. It mainly comprises the following steps: 1) preprocessing the punctuation marks of the text; 2) selecting the front-end features of the neural network: converting words into 300-dimensional word vectors using the GloVe tool; 3) constructing the model: using a multi-task learning method, the model simultaneously predicts punctuation and the domain classification of the text, so that domain information is integrated into the model; 4) selecting the evaluation indexes. The method increases the robustness of the system and at the same time improves the overall prediction accuracy.

Description

Punctuation prediction method based on consideration of domain information
Technical Field
The invention relates to the punctuation prediction task in the field of natural language processing, and in particular to a punctuation prediction method that takes domain information into account, addressing the fact that in punctuation prediction, texts from different domains carry different domain-specific information.
Background
In recent years, with the remarkable improvement of computing power and the continuous efforts of algorithm researchers, the performance and accuracy of automatic speech recognition technology have improved significantly and now meet people's basic needs in daily life. Automatic speech recognition is becoming increasingly popular in industry and everyday use. Speech recognition technology is widely applied in smart home systems, conversation transcription, voice dictation, simultaneous interpretation and other fields, bringing great convenience to people's daily life and work. In most cases, speech recognition transcribes the speech signal into text, which is then analyzed and post-processed. The quality of the transcribed text therefore directly affects the execution of subsequent tasks, and in turn product performance and user experience. However, most automatic speech recognition systems cannot recognize punctuation marks and only generate text sequences without any punctuation, because punctuation is inaudible when people communicate and speak in daily life. Punctuation, however, is an integral part of text: it marks pauses and tone within sentences and often emphasizes certain words or phrases to better express the meaning of a sentence. The absence of punctuation causes problems, such as making sentences harder for human readers to understand and degrading the performance of existing natural language processing algorithms such as machine translation, summarization and human-machine dialogue. Therefore, automatically restoring punctuation in text is a very important task.
Up to now, there have been many studies on automatic punctuation prediction. Before deep learning became a trend, the main approach was hand-crafted rules. As the amount of available data increased, statistics-based approaches became mainstream, such as training on punctuated text with an N-gram language model, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields. With the development of deep learning, many researchers began applying it to punctuation prediction. Human communication typically spans many domains, and each domain has its own vocabulary and writing style, so taking domain information into account should facilitate punctuation prediction. In previous research, the main approach has been to use acoustic and textual features, such as part-of-speech tags, word vectors, inter-word pause duration and pitch. However, few studies consider the differences between domains. For these reasons, the invention proposes a method that uses multi-task learning to integrate domain information into punctuation prediction, giving the model good robustness and better performance.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a punctuation prediction method that takes domain information into account.
The invention provides a method for integrating domain information into the punctuation prediction task, using the THUCNews text data set as the experimental object. It mainly involves four aspects: 1) preprocessing the punctuation marks of the text; 2) selecting the front-end features of the neural network; 3) building the model; 4) selecting the evaluation indexes.
1) Preprocessing punctuation of text
This step starts by reading the data. The present invention is primarily directed at predicting the four most important and most common punctuation marks: commas, periods, question marks and exclamation marks. Following previous studies, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus.
Because Chinese words have no explicit boundaries, the invention first segments the character strings in the text into reasonable word sequences and then performs further analysis and processing on that basis. Because the Chinese vocabulary is large, the input vocabulary used in this experiment consists of the 100,000 Chinese words with the highest frequency of occurrence in the corpus plus two special symbols, one indicating unknown words (words not in the vocabulary) and the other indicating the end of the input.
The punctuation prediction task is treated here as a classification problem: which punctuation mark follows each word. A special symbol "O" is also defined to indicate that no punctuation mark follows the word. For example, for the sentence "我喜欢她的幽默，你呢？" ("I like her humor, and you?"), the input is the unpunctuated word sequence "我 喜欢 她 的 幽默 你 呢" and the output is the corresponding label sequence "O O O O ， O ？", as shown in Table 1:
TABLE 1 Example of punctuation sequence annotation for a sentence
Word:  我   喜欢   她   的   幽默   你   呢
Label: O    O     O    O    ，    O    ？
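As an illustration of this preprocessing and labeling scheme, a minimal Python sketch is given below. It assumes whitespace-segmented input and uses illustrative helper names that are not part of the patent; it replaces semicolons and colons with commas, drops all other marks, and attaches each of the four target punctuation marks as the label of the preceding word.

```python
# Minimal preprocessing sketch (assumption: the sentence is already word-segmented
# and whitespace-separated). "O" means "no punctuation follows this word".
TARGET = {"，", "。", "？", "！"}           # the four predicted punctuation classes
REMAP = {"；": "，", "：": "，"}            # semicolons and colons become commas
DROP = set("“”‘’（）《》、·…")              # all remaining marks are removed

def label_sequence(tokens):
    """Convert a segmented sentence into (word, label) pairs."""
    words, labels = [], []
    for tok in tokens:
        tok = REMAP.get(tok, tok)
        if tok in TARGET:
            if labels:                      # attach the mark to the previous word
                labels[-1] = tok
            continue
        if tok in DROP:
            continue
        words.append(tok)
        labels.append("O")
    return list(zip(words, labels))

print(label_sequence("我 喜欢 她 的 幽默 ， 你 呢 ？".split()))
# [('我', 'O'), ('喜欢', 'O'), ('她', 'O'), ('的', 'O'), ('幽默', '，'), ('你', 'O'), ('呢', '？')]
```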
2) Selection of neural network front-end features
Conventional machine learning methods and deep learning methods can only process numerical data, so character data must first be converted into numerical form. GloVe is an unsupervised word representation method that converts words into word vectors and can, to some extent, represent different words and the relations between them. The present invention uses GloVe to convert each word into a 300-dimensional vector as input to the model.
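As a rough illustration (not from the patent), the sketch below loads GloVe-style 300-dimensional vectors from a plain-text file of the usual "word v1 ... v300" form and maps a segmented sentence to a matrix of word vectors; the file name and the zero-vector fallback for unknown words are assumptions.

```python
import numpy as np

def load_word_vectors(path, dim=300):
    """Load GloVe-style vectors from a text file with one 'word v1 ... v300' line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue                              # skip malformed lines
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sentence_to_matrix(words, vectors, dim=300):
    """Stack the 300-dimensional vectors of a word sequence into a (len, dim) matrix."""
    unk = np.zeros(dim, dtype=np.float32)             # unknown words fall back to zeros here
    return np.stack([vectors.get(w, unk) for w in words])

# vectors = load_word_vectors("glove.zh.300d.txt")    # illustrative file name
# x = sentence_to_matrix(["我", "喜欢", "她", "的", "幽默", "你", "呢"], vectors)  # shape (7, 300)
```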
3) Construction of models
In the invention, two methods are used to incorporate domain information into punctuation prediction:
The first converts the domain label into a one-hot encoding and concatenates it with the word vectors as input, so that domain information is merged into the model, as shown in FIG. 1.
The model mainly consists of one bidirectional long short-term memory (BiLSTM) layer. The word-embedding sequence X = (x_1, …, x_t) is combined with the one-hot encoding D_tag of the sentence's domain label to form the input of the bidirectional long short-term memory layer:
N_t = {x_t, D_tag}   (1)
The bidirectional long short-term memory layer consists of two LSTM layers: a forward LSTM layer that processes the sequence in the forward direction and a backward LSTM layer that processes it in the backward direction, the two layers feeding a shared weighted output layer. The hidden unit of the forward LSTM layer at time step t is
\overrightarrow{h_t} = \mathrm{LSTM}(N_t, \overrightarrow{h_{t-1}})   (2)
The hidden state \overleftarrow{h_t} of the backward LSTM layer is computed in the same way over the reversed sequence. The BiLSTM hidden state h_t is then constructed by combining the hidden unit \overrightarrow{h_t} of the forward LSTM layer and the hidden unit \overleftarrow{h_t} of the backward LSTM layer at time step t:
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]   (3)
In this way, the bidirectional LSTM can learn a representation of each input word from both its preceding and following context (e.g., "我喜欢她的幽默"), which helps the model recognize punctuation marks that often depend on context. The output layer then generates the punctuation probability y_t at time step t as follows:
y_t = \mathrm{Softmax}(h_t W_y + b_y)   (4)
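A minimal PyTorch sketch of this domain-label variant (FIG. 1) is shown below, under the assumptions that the word vectors are 300-dimensional, the domain label arrives as a one-hot vector, and there are five output classes (O, comma, period, question mark, exclamation mark); the class name, layer sizes and domain count are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class DomainTagPunctuator(nn.Module):
    """BiLSTM tagger whose per-step input concatenates the word vector with a
    one-hot domain encoding, i.e. N_t = {x_t, D_tag} (sketch of the FIG. 1 model)."""
    def __init__(self, word_dim=300, num_domains=10, hidden=256, num_punct=5):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim + num_domains, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_punct)    # logits for y_t = Softmax(h_t W_y + b_y)

    def forward(self, word_vecs, domain_onehot):
        # word_vecs: (batch, seq_len, word_dim); domain_onehot: (batch, num_domains)
        seq_len = word_vecs.size(1)
        d = domain_onehot.unsqueeze(1).expand(-1, seq_len, -1)
        n = torch.cat([word_vecs, d], dim=-1)          # concatenate word vector and domain tag
        h, _ = self.bilstm(n)                          # h_t combines forward and backward states
        return self.out(h)                             # per-step punctuation logits

# logits = DomainTagPunctuator()(torch.randn(2, 7, 300), torch.eye(10)[:2])  # shape (2, 7, 5)
```

Applying a softmax over the last dimension of the returned logits gives the per-step punctuation distribution of equation (4).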
The second method uses multi-task learning: the model jointly performs two tasks, punctuation prediction and text domain classification, with the model structure shown in FIG. 2. The network structure is similar to that of the domain-label model, but only the word-embedding sequence X = (x_1, …, x_t) is used as input, and D_tag is the output of the text domain classification task. The main flow is as follows:
\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})   (5)
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]   (6)
y_t = \mathrm{Softmax}(h_t W_y + b_y)   (7)
f = \mathrm{flatten}\{h_1, h_2, …, h_t\}   (8)
D_{tag} = \mathrm{Sigmoid}(f W_D + d_{tag})   (9)
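For comparison, a minimal PyTorch sketch of the multi-task variant (FIG. 2) follows. It is an approximation under stated assumptions: the shared BiLSTM feeds a per-step punctuation head and a sentence-level domain head, but whereas the patent flattens h_1, …, h_t and applies a sigmoid (equations (8) and (9)), this sketch mean-pools the hidden states and trains the domain head with cross-entropy so that variable-length batches work; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskPunctuator(nn.Module):
    """Shared BiLSTM with two heads: per-step punctuation prediction and
    sentence-level domain classification (approximate sketch of the FIG. 2 model)."""
    def __init__(self, word_dim=300, hidden=256, num_punct=5, num_domains=10):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * hidden, num_punct)     # produces y_t
        self.domain_head = nn.Linear(2 * hidden, num_domains)  # produces D_tag

    def forward(self, word_vecs):
        h, _ = self.bilstm(word_vecs)                    # (batch, seq_len, 2*hidden)
        punct_logits = self.punct_head(h)                # per-token punctuation scores
        domain_logits = self.domain_head(h.mean(dim=1))  # pooled sentence representation
        return punct_logits, domain_logits

# Joint training combines the two losses, e.g.:
# model = MultiTaskPunctuator()
# p, d = model(torch.randn(2, 7, 300))
# loss = nn.CrossEntropyLoss()(p.reshape(-1, 5), torch.randint(0, 5, (14,))) \
#      + nn.CrossEntropyLoss()(d, torch.randint(0, 10, (2,)))
```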
4) Selection of evaluation indexes
To evaluate the performance of the punctuation prediction task, the evaluation indexes used in the present invention are precision, recall, and the F1 score, a statistic that measures the accuracy of a classification model by taking both precision and recall into account. The metrics for the four punctuation marks (comma, period, question mark and exclamation mark) are reported separately on the test set. The formulas are defined as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a given punctuation class:
Recall:
\mathrm{Recall} = \frac{TP}{TP + FN}
Precision:
\mathrm{Precision} = \frac{TP}{TP + FP}
F1 score:
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
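A small sketch of how these per-class metrics can be computed from aligned gold and predicted label sequences is given below; the function name and the label set are illustrative rather than taken from the patent.

```python
from collections import Counter

def per_class_prf(gold, pred, classes=("，", "。", "？", "！")):
    """Precision, recall and F1 per punctuation class from aligned label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and p in classes:
            tp[p] += 1
        else:
            if p in classes:
                fp[p] += 1                 # predicted this mark, but it was wrong
            if g in classes:
                fn[g] += 1                 # missed (or mislabeled) this mark
    scores = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# per_class_prf(gold=["O", "，", "O", "？"], pred=["O", "，", "，", "O"])
```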
advantageous effects
The invention addresses the punctuation prediction task in the field of natural language processing and aims to take domain information into account during punctuation prediction. Compared with the prior art, the specific beneficial effects are as follows:
1) The system trains the model using only text data, so at prediction time only text input is needed to predict the punctuation of the corresponding text.
2) The system uses multi-domain text and takes domain information into account, thereby avoiding poor performance in any particular domain, increasing the robustness of the system and at the same time improving the overall prediction accuracy.
Drawings
FIG. 1 is a diagram of a domain labeled punctuation prediction model architecture.
FIG. 2 is a diagram of a punctuation prediction model architecture based on multi-task learning incorporating domain information.
Detailed Description
The operation and effect of the present invention will be described in detail with reference to the accompanying drawings and tables.
Taking the THUCNews corpus, collected from the Sina News RSS subscription channel between 2005 and 2011, as the processing object, the invention provides a punctuation prediction system that takes domain information into account.
The method comprises the following specific steps:
1) Experimental data set and data processing
To take domain information into account, the experiments use the THUCNews corpus, built by Tsinghua University from the Sina News RSS subscription channel between 2005 and 2011. It contains about 740,000 news documents (2.19 GB) in UTF-8 plain-text format, covering domains such as technology, politics, games, home, education and so on, and provides the data for training the punctuation prediction models. In our experiments we use part of the data: two training sets of similar size, a "sports domain" set and a "multi-domain" set, and one test set. The "sports domain" training set contains only sports-domain text (53,768 paragraphs), while the "multi-domain" training set contains text from multiple domains, with 5,500 paragraphs per domain. The test set contains text from multiple domains, with 500 paragraphs per domain, as shown in Table 2.
The present invention is primarily directed at predicting the four most important and most common punctuation marks: comma, period, question mark and exclamation mark. Following previous research, semicolons and colons are replaced with commas and other punctuation marks are deleted from the corpus. In this experiment, the text is word-segmented using THULAC (THU Lexical Analyzer for Chinese), a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University.
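As a usage sketch (assuming the publicly released thulac Python package; the patent does not specify how the toolkit was invoked), segmentation-only mode can be used as follows.

```python
import thulac  # pip install thulac

seg = thulac.thulac(seg_only=True)                 # word segmentation without POS tags
print(seg.cut("我喜欢她的幽默，你呢？", text=True))
# expected output along the lines of: 我 喜欢 她 的 幽默 ， 你 呢 ？
```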
Table 2 experimental database description
(Table 2 is reproduced as an image in the original publication.)
2) Text feature extraction
300-dimensional word vectors are extracted using the GloVe tool; see the description above for details.
3) Experimental setup
In the present invention, we use the back-propagation algorithm to train all the deep neural networks. The network weights are updated with an Adam optimizer at a learning rate of 0.001. In our neural network model, each LSTM layer has 256 LSTM units, and we use an embedding layer between the word input and the LSTM layers whose dimension is empirically set to 128. During training, the order of the training sentences is first shuffled, and then 128 sentences are randomly selected as a training batch. The number of data iterations (epochs) is 30.
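The training setup above can be summarized in a short sketch; the dataset object and model class are placeholders (e.g. the multi-task model sketched earlier), and only the punctuation loss is shown, with the domain loss added in the multi-task case.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=30, batch_size=128, lr=1e-3):
    """Training-loop sketch: Adam with learning rate 0.001, shuffled batches of 128, 30 epochs."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # shuffle sentences
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for word_inputs, punct_labels in loader:
            logits = model(word_inputs)                        # (batch, seq_len, num_punct)
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             punct_labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```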
4) Analysis of results
In Table 3 we report the precision, recall and F1 score for each punctuation mark. We found that BiLSTM-Multi, the model that incorporates domain information through multi-task learning, achieved the best results on all evaluation criteria compared with the models that do and do not consider domain information.
Table 4 shows the results of the two different methods of incorporating domain information into the punctuation prediction model. From Table 4, we find that, because it considers domain information, the error rate of the domain-label model (Add Domain-tag) is 7.0% lower than that of the model that does not consider domain information (BLSTM-Multi), and the error rate of the multi-task model that considers domain information (MTL-BiLSTM) is a further 16.8% lower than that of the domain-label model (Add Domain-tag). This indicates that the multi-task structure can better integrate domain information into the punctuation prediction model.
Table 3 Experimental results for each punctuation mark
(Table 3 is reproduced as an image in the original publication.)
TABLE 4 Overall Experimental results
(Table 4 is reproduced as an image in the original publication.)

Claims (2)

1. A punctuation prediction method that takes domain information into account, characterized by mainly comprising the following steps:
1) preprocessing punctuation marks of the text;
2) selection of neural network front-end features: converting the words into 300-dimensional word vectors using a GloVe tool;
3) constructing a model: using a multi-task learning method, the model simultaneously predicts punctuation and the domain classification of the text, so that domain information is integrated into the model;
4) selecting evaluation indexes: to evaluate the performance of the punctuation prediction task, the evaluation indexes used are precision, recall, and the F1 score, a statistic used to measure the accuracy of a classification model;
Step 3) comprises two specific methods:
the first method converts the domain label into a one-hot encoding and concatenates it with the word vectors as input, so that domain information is fused into the model;
the model mainly comprises one bidirectional long short-term memory (BiLSTM) layer; the word-embedding sequence X = (x_1, …, x_t) is combined with the one-hot encoding D_tag of the sentence's domain label as the input of the bidirectional long short-term memory layer,
N_t = {x_t, D_tag}   (1)
the bidirectional long-short term memory layer is composed of two LSTM layers, wherein the forward LSTM layer processes forward value sequence, the backward LSTM layer processes backward value sequence, the two LSTM layers process information by using weighted sharing layer, and the forward LSTM layer is hidden unit at time step t
Figure FDA0002555394580000011
Figure FDA0002555394580000012
Hidden state of inverted LSTM layer
Figure FDA0002555394580000013
The calculation method of (2) is the same as that of the forward LSTM layer sequence;
BilsTM layer htThen by combining the hidden elements of the forward LSTM layer at step time t
Figure FDA0002555394580000014
And hidden units of backward LSTM layer
Figure FDA0002555394580000015
Is constructed from the states of (a):
Figure FDA0002555394580000016
thus, bi-directional LSTM can learn the expression of each input word using preceding and following sentences, identifying punctuation that often depends on context; the output layer then generates a punctuation probability y at time step ttThe following are:
yt=Softmax(htWy+by) (4)
the second method uses multi-task learning, in which the model jointly performs two tasks: one is punctuation prediction and the other is domain classification of the text; in this model, the network structure is similar to that of the domain-label model, and only the word-embedding sequence X = (x_1, …, x_t) is used as input;
D_tag is the output of the text domain classification task, and the main flow is as follows:
\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})   (5)
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]   (6)
y_t = \mathrm{Softmax}(h_t W_y + b_y)   (7)
f = \mathrm{flatten}\{h_1, h_2, …, h_t\}   (8)
D_{tag} = \mathrm{Sigmoid}(f W_D + d_{tag})   (9)
2. The punctuation prediction method that takes domain information into account as claimed in claim 1, wherein the evaluation indexes used in step 4) are precision, recall and the F1 score, a statistic that measures the accuracy of a classification model, and the metrics for the four punctuation marks (comma, period, question mark and exclamation mark) are reported separately on the test set; the formulas are defined as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a given punctuation class:
the recall:
\mathrm{Recall} = \frac{TP}{TP + FN}
the precision:
\mathrm{Precision} = \frac{TP}{TP + FP}
the F1 value:
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
CN202010590707.8A 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information Active CN111723584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590707.8A CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010590707.8A CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information

Publications (2)

Publication Number Publication Date
CN111723584A true CN111723584A (en) 2020-09-29
CN111723584B CN111723584B (en) 2024-05-07

Family

ID=72568858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590707.8A Active CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information

Country Status (1)

Country Link
CN (1) CN111723584B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880990A (en) * 2022-05-16 2022-08-09 马上消费金融股份有限公司 Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
GB201814860D0 (en) * 2017-11-14 2018-10-31 Adobe Systems Inc Predicting style breaches within textual content
CN109741604A (en) * 2019-03-05 2019-05-10 南通大学 Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
GB201814860D0 (en) * 2017-11-14 2018-10-31 Adobe Systems Inc Predicting style breaches within textual content
CN109741604A (en) * 2019-03-05 2019-05-10 南通大学 Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WAN JING; GUO YAZHI: "Research on machine reading comprehension based on multi-paragraph ranking", Journal of Beijing University of Chemical Technology (Natural Science Edition), no. 03
LI YAKUN; PAN QING; EVERETT X. WANG: "Chinese word segmentation and punctuation prediction based on improved multi-layer BLSTM", Journal of Computer Applications, no. 05
DUAN DAGAO; LIANG SHAOHU; ZHAO ZHENDONG; HAN ZHONGMING: "Chinese punctuation prediction model based on self-attention mechanism", Computer Engineering, no. 05

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880990A (en) * 2022-05-16 2022-08-09 马上消费金融股份有限公司 Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device

Also Published As

Publication number Publication date
CN111723584B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
Ghosh et al. Fracking sarcasm using neural network
US11693894B2 (en) Conversation oriented machine-user interaction
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN109214003B (en) The method that Recognition with Recurrent Neural Network based on multilayer attention mechanism generates title
Peng et al. Topic-enhanced emotional conversation generation with attention mechanism
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN114580382A (en) Text error correction method and device
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN113178193A (en) Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip
El Janati et al. Adaptive e-learning AI-powered chatbot based on multimedia indexing
CN113407711B (en) Gibbs limited text abstract generation method by using pre-training model
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN111723584B (en) Punctuation prediction method based on consideration field information
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
Van Enschot et al. Taming our wild data: On intercoder reliability in discourse research
Younes et al. A deep learning approach for the Romanized Tunisian dialect identification.
CN116881446A (en) Semantic classification method, device, equipment and storage medium thereof
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Yang [Retracted] Design of Service Robot Based on User Emotion Recognition and Environmental Monitoring
Chanda et al. Is Meta Embedding better than pre-trained word embedding to perform Sentiment Analysis for Dravidian Languages in Code-Mixed Text?
CN114841143A (en) Voice room quality evaluation method and device, equipment, medium and product thereof
Chen et al. Extractive spoken document summarization for information retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant