CN111723584A - Punctuation prediction method based on consideration of domain information - Google Patents

Punctuation prediction method based on consideration of domain information Download PDF

Info

Publication number
CN111723584A
Authority
CN
China
Prior art keywords
punctuation
model
domain
layer
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010590707.8A
Other languages
Chinese (zh)
Other versions
CN111723584B (en)
Inventor
王龙标 (Wang Longbiao)
魏文青 (Wei Wenqing)
党建武 (Dang Jianwu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010590707.8A priority Critical patent/CN111723584B/en
Publication of CN111723584A publication Critical patent/CN111723584A/en
Application granted granted Critical
Publication of CN111723584B publication Critical patent/CN111723584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a punctuation prediction method that takes domain information into account. It mainly comprises the following steps: 1) preprocessing the punctuation marks of the text; 2) selecting the front-end features of the neural network: converting words into 300-dimensional word vectors using the GloVe tool; 3) constructing the model: using a multi-task learning method, the model simultaneously predicts punctuation and the domain classification of the text, so that domain information is integrated into the model; 4) selecting the evaluation indexes. The method increases the robustness of the system and at the same time improves the overall prediction accuracy.

Description

Punctuation prediction method based on consideration of domain information
Technical Field
The invention relates to the punctuation prediction task in the field of natural language processing, and in particular to a punctuation prediction method that takes domain information into account, addressing the fact that in punctuation prediction, texts from different domains carry different domain-specific information.
Background
In recent years, with the remarkable improvement of computing power and the continuous efforts of algorithm researchers, the performance and accuracy of automatic speech recognition technology have improved significantly and now meet people's basic needs in daily life. Automatic speech recognition is becoming increasingly popular in industry and everyday use. Speech recognition technology is widely applied in smart home systems, conversation transcription, voice dictation, simultaneous interpretation and other fields, bringing great convenience to people's daily life and work. In most cases, speech recognition transcribes the speech signal into text, which is then analyzed and post-processed. The quality of the transcribed text therefore directly affects the execution of subsequent tasks, and in turn product performance and user experience. However, most automatic speech recognition systems cannot recognize punctuation marks and only generate text sequences without any punctuation, because punctuation is inaudible when people communicate and speak in daily life. Punctuation, however, is an integral part of text: it marks pauses and tone within sentences and often emphasizes certain words or phrases to better express the meaning of a sentence. The absence of punctuation causes problems, such as making sentences harder for human readers to understand and degrading the performance of existing natural language processing algorithms such as machine translation, summarization and human-machine dialogue. Therefore, automatically restoring punctuation in text is a very important task.
Up to now, there have been many studies on automatic punctuation prediction. Before deep learning became a trend, the main approach was hand-crafted rules. As the amount of available data increased, statistics-based approaches became mainstream, such as training on punctuated text with an N-gram language model, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields. With the development of deep learning, many researchers began applying it to punctuation prediction. Human communication typically spans many domains, and each domain has its own vocabulary and writing style, so taking domain information into account should facilitate punctuation prediction. In previous research, the main approach has been to use acoustic and textual features, such as part-of-speech tags, word vectors, inter-word pause duration and pitch. However, few studies consider the differences between domains. For these reasons, the invention proposes a method that uses multi-task learning to integrate domain information into punctuation prediction, giving the model good robustness and better performance.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a punctuation prediction method that takes domain information into account.
The invention provides a method for integrating domain information into the punctuation prediction task, using the THUCNews text data set as the experimental object. It mainly involves four aspects: 1) preprocessing the punctuation marks of the text; 2) selecting the front-end features of the neural network; 3) building the model; 4) selecting the evaluation indexes.
1) Preprocessing punctuation of text
This step starts by reading the data. The present invention is primarily directed at predicting the four most important and most common punctuation marks: commas, periods, question marks and exclamation marks. Following previous studies, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus.
Because Chinese words have no explicit boundaries, the invention first segments the character strings in the text into reasonable word sequences and then performs further analysis and processing on that basis. Because the Chinese vocabulary is large, the input vocabulary used in this experiment consists of the 100,000 Chinese words with the highest frequency of occurrence in the corpus plus two special symbols, one indicating unknown words (words not in the vocabulary) and the other indicating the end of the input.
The punctuation prediction task is treated here as a classification problem: which punctuation mark follows each word. A special symbol "O" is also defined to indicate that no punctuation mark follows the word. For example, for the sentence "我喜欢她的幽默，你呢？" ("I like her humor, and you?"), the input is the unpunctuated word sequence "我 喜欢 她 的 幽默 你 呢" and the output is the corresponding label sequence "O O O O ， O ？", as shown in Table 1:
TABLE 1 Example of punctuation sequence annotation for a sentence
Word:  我   喜欢   她   的   幽默   你   呢
Label: O    O     O    O    ，    O    ？
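As an illustration of this preprocessing and labeling scheme, a minimal Python sketch is given below. It assumes whitespace-segmented input and uses illustrative helper names that are not part of the patent; it replaces semicolons and colons with commas, drops all other marks, and attaches each of the four target punctuation marks as the label of the preceding word.

```python
# Minimal preprocessing sketch (assumption: the sentence is already word-segmented
# and whitespace-separated). "O" means "no punctuation follows this word".
TARGET = {"，", "。", "？", "！"}           # the four predicted punctuation classes
REMAP = {"；": "，", "：": "，"}            # semicolons and colons become commas
DROP = set("“”‘’（）《》、·…")              # all remaining marks are removed

def label_sequence(tokens):
    """Convert a segmented sentence into (word, label) pairs."""
    words, labels = [], []
    for tok in tokens:
        tok = REMAP.get(tok, tok)
        if tok in TARGET:
            if labels:                      # attach the mark to the previous word
                labels[-1] = tok
            continue
        if tok in DROP:
            continue
        words.append(tok)
        labels.append("O")
    return list(zip(words, labels))

print(label_sequence("我 喜欢 她 的 幽默 ， 你 呢 ？".split()))
# [('我', 'O'), ('喜欢', 'O'), ('她', 'O'), ('的', 'O'), ('幽默', '，'), ('你', 'O'), ('呢', '？')]
```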
2) Selection of neural network front-end features
Conventional machine learning methods and deep learning methods can only process numerical data, so character data must first be converted into numerical form. GloVe is an unsupervised word representation method that converts words into word vectors and can, to some extent, represent different words and the relations between them. The present invention uses GloVe to convert each word into a 300-dimensional vector as input to the model.
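As a rough illustration (not from the patent), the sketch below loads GloVe-style 300-dimensional vectors from a plain-text file of the usual "word v1 ... v300" form and maps a segmented sentence to a matrix of word vectors; the file name and the zero-vector fallback for unknown words are assumptions.

```python
import numpy as np

def load_word_vectors(path, dim=300):
    """Load GloVe-style vectors from a text file with one 'word v1 ... v300' line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue                              # skip malformed lines
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sentence_to_matrix(words, vectors, dim=300):
    """Stack the 300-dimensional vectors of a word sequence into a (len, dim) matrix."""
    unk = np.zeros(dim, dtype=np.float32)             # unknown words fall back to zeros here
    return np.stack([vectors.get(w, unk) for w in words])

# vectors = load_word_vectors("glove.zh.300d.txt")    # illustrative file name
# x = sentence_to_matrix(["我", "喜欢", "她", "的", "幽默", "你", "呢"], vectors)  # shape (7, 300)
```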
3) Construction of models
In the invention, two methods are used to incorporate domain information into punctuation prediction:
The first converts the domain label into a one-hot encoding and concatenates it with the word vectors as input, so that domain information is merged into the model, as shown in FIG. 1.
The model mainly consists of one bidirectional long short-term memory (BiLSTM) layer. The word-embedding sequence X = (x_1, …, x_t) is combined with the one-hot encoding D_tag of the sentence's domain label to form the input of the bidirectional long short-term memory layer:
N_t = {x_t, D_tag}   (1)
The bidirectional long short-term memory layer consists of two LSTM layers: a forward LSTM layer that processes the sequence in the forward direction and a backward LSTM layer that processes it in the backward direction, the two layers feeding a shared weighted output layer. The hidden unit of the forward LSTM layer at time step t is
\overrightarrow{h_t} = \mathrm{LSTM}(N_t, \overrightarrow{h_{t-1}})   (2)
The hidden state \overleftarrow{h_t} of the backward LSTM layer is computed in the same way over the reversed sequence. The BiLSTM hidden state h_t is then constructed by combining the hidden unit \overrightarrow{h_t} of the forward LSTM layer and the hidden unit \overleftarrow{h_t} of the backward LSTM layer at time step t:
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]   (3)
In this way, the bidirectional LSTM can learn a representation of each input word from both its preceding and following context (e.g., "我喜欢她的幽默"), which helps the model recognize punctuation marks that often depend on context. The output layer then generates the punctuation probability y_t at time step t as follows:
y_t = \mathrm{Softmax}(h_t W_y + b_y)   (4)
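A minimal PyTorch sketch of this domain-label variant (FIG. 1) is shown below, under the assumptions that the word vectors are 300-dimensional, the domain label arrives as a one-hot vector, and there are five output classes (O, comma, period, question mark, exclamation mark); the class name, layer sizes and domain count are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class DomainTagPunctuator(nn.Module):
    """BiLSTM tagger whose per-step input concatenates the word vector with a
    one-hot domain encoding, i.e. N_t = {x_t, D_tag} (sketch of the FIG. 1 model)."""
    def __init__(self, word_dim=300, num_domains=10, hidden=256, num_punct=5):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim + num_domains, hidden,
                              batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_punct)    # logits for y_t = Softmax(h_t W_y + b_y)

    def forward(self, word_vecs, domain_onehot):
        # word_vecs: (batch, seq_len, word_dim); domain_onehot: (batch, num_domains)
        seq_len = word_vecs.size(1)
        d = domain_onehot.unsqueeze(1).expand(-1, seq_len, -1)
        n = torch.cat([word_vecs, d], dim=-1)          # concatenate word vector and domain tag
        h, _ = self.bilstm(n)                          # h_t combines forward and backward states
        return self.out(h)                             # per-step punctuation logits

# logits = DomainTagPunctuator()(torch.randn(2, 7, 300), torch.eye(10)[:2])  # shape (2, 7, 5)
```

Applying a softmax over the last dimension of the returned logits gives the per-step punctuation distribution of equation (4).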
The second method uses multi-task learning: the model jointly performs two tasks, punctuation prediction and text domain classification, with the model structure shown in FIG. 2. The network structure is similar to that of the domain-label model, but only the word-embedding sequence X = (x_1, …, x_t) is used as input, and D_tag is the output of the text domain classification task. The main flow is as follows:
\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})   (5)
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]   (6)
y_t = \mathrm{Softmax}(h_t W_y + b_y)   (7)
f = \mathrm{flatten}\{h_1, h_2, …, h_t\}   (8)
D_{tag} = \mathrm{Sigmoid}(f W_D + d_{tag})   (9)
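For comparison, a minimal PyTorch sketch of the multi-task variant (FIG. 2) follows. It is an approximation under stated assumptions: the shared BiLSTM feeds a per-step punctuation head and a sentence-level domain head, but whereas the patent flattens h_1, …, h_t and applies a sigmoid (equations (8) and (9)), this sketch mean-pools the hidden states and trains the domain head with cross-entropy so that variable-length batches work; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskPunctuator(nn.Module):
    """Shared BiLSTM with two heads: per-step punctuation prediction and
    sentence-level domain classification (approximate sketch of the FIG. 2 model)."""
    def __init__(self, word_dim=300, hidden=256, num_punct=5, num_domains=10):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden, batch_first=True, bidirectional=True)
        self.punct_head = nn.Linear(2 * hidden, num_punct)     # produces y_t
        self.domain_head = nn.Linear(2 * hidden, num_domains)  # produces D_tag

    def forward(self, word_vecs):
        h, _ = self.bilstm(word_vecs)                    # (batch, seq_len, 2*hidden)
        punct_logits = self.punct_head(h)                # per-token punctuation scores
        domain_logits = self.domain_head(h.mean(dim=1))  # pooled sentence representation
        return punct_logits, domain_logits

# Joint training combines the two losses, e.g.:
# model = MultiTaskPunctuator()
# p, d = model(torch.randn(2, 7, 300))
# loss = nn.CrossEntropyLoss()(p.reshape(-1, 5), torch.randint(0, 5, (14,))) \
#      + nn.CrossEntropyLoss()(d, torch.randint(0, 10, (2,)))
```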
4) Selection of evaluation indexes
To evaluate the performance of the punctuation prediction task, the evaluation indexes used in the present invention are precision, recall, and the F1 score, a statistic that measures the accuracy of a classification model by taking both precision and recall into account. The metrics for the four punctuation marks (comma, period, question mark and exclamation mark) are reported separately on the test set. The formulas are defined as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a given punctuation class:
Recall:
\mathrm{Recall} = \frac{TP}{TP + FN}
Precision:
\mathrm{Precision} = \frac{TP}{TP + FP}
F1 score:
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
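A small sketch of how these per-class metrics can be computed from aligned gold and predicted label sequences is given below; the function name and the label set are illustrative rather than taken from the patent.

```python
from collections import Counter

def per_class_prf(gold, pred, classes=("，", "。", "？", "！")):
    """Precision, recall and F1 per punctuation class from aligned label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and p in classes:
            tp[p] += 1
        else:
            if p in classes:
                fp[p] += 1                 # predicted this mark, but it was wrong
            if g in classes:
                fn[g] += 1                 # missed (or mislabeled) this mark
    scores = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# per_class_prf(gold=["O", "，", "O", "？"], pred=["O", "，", "，", "O"])
```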
advantageous effects
The invention addresses the punctuation prediction task in the field of natural language processing and aims to take domain information into account during punctuation prediction. Compared with the prior art, the specific beneficial effects are as follows:
1) The system trains the model using only text data, so at prediction time only text input is needed to predict the punctuation of the corresponding text.
2) The system uses multi-domain text and takes domain information into account, thereby avoiding poor performance in any particular domain, increasing the robustness of the system and at the same time improving the overall prediction accuracy.
Drawings
FIG. 1 is a diagram of a domain labeled punctuation prediction model architecture.
FIG. 2 is a diagram of a punctuation prediction model architecture based on multi-task learning incorporating domain information.
Detailed Description
The operation and effect of the present invention will be described in detail with reference to the accompanying drawings and tables.
Taking the THUCNews corpus, collected from the Sina News RSS subscription channel between 2005 and 2011, as the processing object, the invention provides a punctuation prediction system that takes domain information into account.
The method comprises the following specific steps:
1) Experimental data set and data processing
To take domain information into account, the experiments use the THUCNews corpus, built by Tsinghua University from the Sina News RSS subscription channel between 2005 and 2011. It contains about 740,000 news documents (2.19 GB) in UTF-8 plain-text format, covering domains such as technology, politics, games, home, education and so on, and provides the data for training the punctuation prediction models. In our experiments we use part of the data: two training sets of similar size, a "sports domain" set and a "multi-domain" set, and one test set. The "sports domain" training set contains only sports-domain text (53,768 paragraphs), while the "multi-domain" training set contains text from multiple domains, with 5,500 paragraphs per domain. The test set contains text from multiple domains, with 500 paragraphs per domain, as shown in Table 2.
The present invention is primarily directed at predicting the four most important and most common punctuation marks: comma, period, question mark and exclamation mark. Following previous research, semicolons and colons are replaced with commas and other punctuation marks are deleted from the corpus. In this experiment, the text is word-segmented using THULAC (THU Lexical Analyzer for Chinese), a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University.
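As a usage sketch (assuming the publicly released thulac Python package; the patent does not specify how the toolkit was invoked), segmentation-only mode can be used as follows.

```python
import thulac  # pip install thulac

seg = thulac.thulac(seg_only=True)                 # word segmentation without POS tags
print(seg.cut("我喜欢她的幽默，你呢？", text=True))
# expected output along the lines of: 我 喜欢 她 的 幽默 ， 你 呢 ？
```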
Table 2 experimental database description
(Table 2 is reproduced as an image in the original publication.)
2) Text feature extraction
300-dimensional word vectors are extracted using the GloVe tool; see the description above for details.
3) Experimental setup
In the present invention, we use the back-propagation algorithm to train all the deep neural networks. The network weights are updated with an Adam optimizer at a learning rate of 0.001. In our neural network model, each LSTM layer has 256 LSTM units, and we use an embedding layer between the word input and the LSTM layers whose dimension is empirically set to 128. During training, the order of the training sentences is first shuffled, and then 128 sentences are randomly selected as a training batch. The number of data iterations (epochs) is 30.
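The training setup above can be summarized in a short sketch; the dataset object and model class are placeholders (e.g. the multi-task model sketched earlier), and only the punctuation loss is shown, with the domain loss added in the multi-task case.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=30, batch_size=128, lr=1e-3):
    """Training-loop sketch: Adam with learning rate 0.001, shuffled batches of 128, 30 epochs."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)  # shuffle sentences
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for word_inputs, punct_labels in loader:
            logits = model(word_inputs)                        # (batch, seq_len, num_punct)
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             punct_labels.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```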
4) Analysis of results
In Table 3 we report the precision, recall and F1 score for each punctuation mark. We found that BiLSTM-Multi, the model that incorporates domain information through multi-task learning, achieved the best results on all evaluation criteria compared with the models that do and do not consider domain information.
Table 4 shows the results of the two different methods of incorporating domain information into the punctuation prediction model. From Table 4, we find that, because it considers domain information, the error rate of the domain-label model (Add Domain-tag) is 7.0% lower than that of the model that does not consider domain information (BLSTM-Multi), and the error rate of the multi-task model that considers domain information (MTL-BiLSTM) is a further 16.8% lower than that of the domain-label model (Add Domain-tag). This indicates that the multi-task structure can better integrate domain information into the punctuation prediction model.
Table 3 Experimental results for each punctuation mark
(Table 3 is reproduced as an image in the original publication.)
TABLE 4 Overall Experimental results
(Table 4 is reproduced as an image in the original publication.)

Claims (2)

1. A punctuation prediction method that takes domain information into account, characterized by mainly comprising the following steps:
1) preprocessing punctuation marks of the text;
2) selection of neural network front-end features: converting the words into 300-dimensional word vectors using a GloVe tool;
3) constructing a model: using a multi-task learning method, the model simultaneously predicts punctuation and the domain classification of the text, so that domain information is integrated into the model;
4) selecting evaluation indexes: to evaluate the performance of the punctuation prediction task, the evaluation indexes used are precision, recall, and the F1 score, a statistic used to measure the accuracy of a classification model;
Step 3) comprises two specific methods:
the first method converts the domain label into a one-hot encoding and concatenates it with the word vectors as input, so that domain information is fused into the model;
the model mainly comprises one bidirectional long short-term memory (BiLSTM) layer; the word-embedding sequence X = (x_1, …, x_t) is combined with the one-hot encoding D_tag of the sentence's domain label as the input of the bidirectional long short-term memory layer,
N_t = {x_t, D_tag}   (1)
the bidirectional long-short term memory layer is composed of two LSTM layers, wherein the forward LSTM layer processes forward value sequence, the backward LSTM layer processes backward value sequence, the two LSTM layers process information by using weighted sharing layer, and the forward LSTM layer is hidden unit at time step t
Figure FDA0002555394580000011
Figure FDA0002555394580000012
Hidden state of inverted LSTM layer
Figure FDA0002555394580000013
The calculation method of (2) is the same as that of the forward LSTM layer sequence;
BilsTM layer htThen by combining the hidden elements of the forward LSTM layer at step time t
Figure FDA0002555394580000014
And hidden units of backward LSTM layer
Figure FDA0002555394580000015
Is constructed from the states of (a):
Figure FDA0002555394580000016
thus, bi-directional LSTM can learn the expression of each input word using preceding and following sentences, identifying punctuation that often depends on context; the output layer then generates a punctuation probability y at time step ttThe following are:
yt=Softmax(htWy+by) (4)
the second method uses multi-task learning, in which the model jointly performs two tasks: one is punctuation prediction and the other is domain classification of the text; in this model, the network structure is similar to that of the domain-label model, and only the word-embedding sequence X = (x_1, …, x_t) is used as input;
D_tag is the output of the text domain classification task, and the main flow is as follows:
\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})   (5)
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]   (6)
y_t = \mathrm{Softmax}(h_t W_y + b_y)   (7)
f = \mathrm{flatten}\{h_1, h_2, …, h_t\}   (8)
D_{tag} = \mathrm{Sigmoid}(f W_D + d_{tag})   (9)
2. The punctuation prediction method that takes domain information into account as claimed in claim 1, wherein the evaluation indexes used in step 4) are precision, recall and the F1 score, a statistic that measures the accuracy of a classification model, and the metrics for the four punctuation marks (comma, period, question mark and exclamation mark) are reported separately on the test set; the formulas are defined as follows, where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a given punctuation class:
the recall:
\mathrm{Recall} = \frac{TP}{TP + FN}
the precision:
\mathrm{Precision} = \frac{TP}{TP + FP}
the F1 value:
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
CN202010590707.8A 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information Active CN111723584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010590707.8A CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010590707.8A CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information

Publications (2)

Publication Number Publication Date
CN111723584A true CN111723584A (en) 2020-09-29
CN111723584B CN111723584B (en) 2024-05-07

Family

ID=72568858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010590707.8A Active CN111723584B (en) 2020-06-24 2020-06-24 Punctuation prediction method based on consideration field information

Country Status (1)

Country Link
CN (1) CN111723584B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880990A (en) * 2022-05-16 2022-08-09 马上消费金融股份有限公司 Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
GB201814860D0 (en) * 2017-11-14 2018-10-31 Adobe Systems Inc Predicting style breaches within textual content
CN109741604A (en) * 2019-03-05 2019-05-10 南通大学 Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012039686A1 (en) * 2010-09-24 2012-03-29 National University Of Singapore Methods and systems for automated text correction
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
GB201814860D0 (en) * 2017-11-14 2018-10-31 Adobe Systems Inc Predicting style breaches within textual content
CN109741604A (en) * 2019-03-05 2019-05-10 南通大学 Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow
CN111090981A (en) * 2019-12-06 2020-05-01 中国人民解放军战略支援部队信息工程大学 Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WAN JING; GUO YAZHI: "Research on machine reading comprehension based on multi-paragraph ranking", Journal of Beijing University of Chemical Technology (Natural Science Edition), no. 03
LI YAKUN; PAN QING; EVERETT X. WANG: "Chinese word segmentation and punctuation prediction based on improved multi-layer BLSTM", Journal of Computer Applications, no. 05
DUAN DAGAO; LIANG SHAOHU; ZHAO ZHENDONG; HAN ZHONGMING: "Chinese punctuation prediction model based on self-attention mechanism", Computer Engineering, no. 05

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880990A (en) * 2022-05-16 2022-08-09 马上消费金融股份有限公司 Punctuation mark prediction model training method, punctuation mark prediction method and punctuation mark prediction device

Also Published As

Publication number Publication date
CN111723584B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
Ghosh et al. Fracking sarcasm using neural network
US11693894B2 (en) Conversation oriented machine-user interaction
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN109214003B (en) The method that Recognition with Recurrent Neural Network based on multilayer attention mechanism generates title
Peng et al. Topic-enhanced emotional conversation generation with attention mechanism
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN114580382A (en) Text error correction method and device
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
CN112016320A (en) English punctuation adding method, system and equipment based on data enhancement
CN113178193A (en) Chinese self-defined awakening and Internet of things interaction method based on intelligent voice chip
El Janati et al. Adaptive e-learning AI-powered chatbot based on multimedia indexing
CN113407711B (en) Gibbs limited text abstract generation method by using pre-training model
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN111723584B (en) Punctuation prediction method based on consideration field information
KR102297480B1 (en) System and method for structured-paraphrasing the unstructured query or request sentence
Van Enschot et al. Taming our wild data: On intercoder reliability in discourse research
Younes et al. A deep learning approach for the Romanized Tunisian dialect identification.
CN116881446A (en) Semantic classification method, device, equipment and storage medium thereof
CN110750967A (en) Pronunciation labeling method and device, computer equipment and storage medium
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
Yang [Retracted] Design of Service Robot Based on User Emotion Recognition and Environmental Monitoring
Chanda et al. Is Meta Embedding better than pre-trained word embedding to perform Sentiment Analysis for Dravidian Languages in Code-Mixed Text?
CN114841143A (en) Voice room quality evaluation method and device, equipment, medium and product thereof
Chen et al. Extractive spoken document summarization for information retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant