CN111723584B - Punctuation prediction method based on consideration field information - Google Patents
- Publication number
- CN111723584B CN111723584B CN202010590707.8A CN202010590707A CN111723584B CN 111723584 B CN111723584 B CN 111723584B CN 202010590707 A CN202010590707 A CN 202010590707A CN 111723584 B CN111723584 B CN 111723584B
- Authority
- CN
- China
- Prior art keywords
- punctuation
- model
- layer
- word
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a punctuation prediction method that takes domain information into account, which mainly comprises the following steps: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features: converting each word into a 300-dimensional word vector using the GloVe tool; 3) constructing the model: the model uses multi-task learning to predict punctuation and the text's domain class simultaneously, so that domain information is integrated into the model; 4) selecting the evaluation indexes. The method increases the robustness of the system and at the same time improves the overall prediction accuracy.
Description
Technical Field
The invention relates to the punctuation prediction task in the field of natural language processing, and in particular to a punctuation prediction method that takes domain information into account, motivated by the fact that texts from different domains carry different domain-specific information relevant to punctuation prediction.
Background
In recent years, with the remarkable improvement of computing power and the continuous efforts of algorithm researchers, the performance and accuracy of automatic speech recognition have improved significantly, meeting the basic requirements of people's daily life. Automatic speech recognition is becoming increasingly popular in industry and everyday life. Speech recognition technology is widely applied in smart home systems, conversation transcription, voice dictation, simultaneous interpretation and other fields, bringing great convenience to people's daily life and work. In most cases, speech recognition transcribes the speech signal into text, which is then analyzed and post-processed. The quality of the transcribed text therefore directly affects the performance of downstream tasks, and in turn product performance and user experience. However, most automatic speech recognition systems cannot recognize punctuation marks and only generate text sequences without any punctuation, because punctuation is inaudible when people communicate and speak. Punctuation, however, is an integral part of text: it marks pauses and intonation within sentences and often emphasizes certain words or phrases to better convey the meaning of a sentence. Its absence hinders human readers in understanding sentences and degrades the performance of existing natural language processing algorithms such as machine translation, summarization and human-machine dialogue. Therefore, automatically restoring punctuation in text is a very important task.
To date, there have been many studies on automatic punctuation prediction. Before deep learning became mainstream, the main approach was hand-crafted rules. As the amount of available data increased, statistics-based approaches became dominant, such as training an N-gram language model on punctuated text, or treating punctuation prediction as a sequence labeling task and solving it with conditional random fields. With the progress of deep learning, many researchers began applying it to punctuation prediction. Human communication generally spans many domains, and each domain has its own vocabulary and writing style, so taking domain information into account facilitates punctuation prediction. In prior research, the main approach has been to use acoustic and textual features, such as part-of-speech tags, word vectors, inter-word pause duration, and pitch. However, few studies consider the particularities of different domains. For these reasons, the invention proposes a method that merges domain information via multi-task learning to predict punctuation, giving the model good robustness and better performance.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a punctuation prediction method that takes domain information into account.
The invention provides a method for integrating domain information into the punctuation prediction task, using the THUCNews text dataset as the subject. It mainly involves four aspects: 1) preprocessing the punctuation marks of the text; 2) selecting the neural network front-end features; 3) building the model; 4) selecting the evaluation indexes.
1) Preprocessing punctuation of text
This part first reads the data and is primarily directed at predicting the four most important and common punctuation marks: comma, period, question mark and exclamation mark. Following previous work, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus.
Because Chinese has no explicit boundaries between words, the invention first segments the character strings in the text into reasonable word sequences and performs further analysis on that basis. Since the Chinese vocabulary is large, the input vocabulary used in this experiment consists of the 100,000 most frequent Chinese words in the corpus plus two special symbols: one representing unknown words (words not in the vocabulary) and one representing the end of the input.
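The vocabulary construction described above can be sketched as follows. This is a minimal illustration, assuming already-segmented tokens; the special-symbol names `<UNK>` and `<EOS>` are illustrative stand-ins for the two symbols the text mentions.

```python
from collections import Counter

def build_vocab(corpus_tokens, max_words=100_000,
                unk="<UNK>", eos="<EOS>"):
    """Build an input vocabulary from the most frequent words, plus two
    special symbols: one for unknown words and one for the end of input."""
    counts = Counter(corpus_tokens)
    most_common = [w for w, _ in counts.most_common(max_words)]
    vocab = {unk: 0, eos: 1}          # special symbols take the first indices
    for w in most_common:
        vocab.setdefault(w, len(vocab))
    return vocab

# Toy corpus; any word outside the vocabulary maps to the <UNK> index.
tokens = ["我", "喜欢", "她", "的", "幽默", "我", "喜欢"]
vocab = build_vocab(tokens)
print(vocab.get("不存在", vocab["<UNK>"]))
```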
The punctuation prediction task is treated here as classifying which punctuation mark follows each word. A special symbol "O" is also defined to indicate that no punctuation mark (i.e., a space) follows the word. For example, for the sentence "我喜欢她的幽默，你呢？" ("I like her humor, and you?"), the input is the word sequence "我 喜欢 她 的 幽默 你呢" and the output is the corresponding sequence of punctuation labels "O O O O ， ？", as shown in Table 1:
TABLE 1 example of punctuation sequence labeling of a sentence
… | 我 | 喜欢 | 她 | 的 | 幽默 | 你呢 | … |
… | O | O | O | O | ， | ？ | … |
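The labeling scheme of Table 1 can be sketched as a small conversion routine. It is a minimal sketch assuming segmented tokens; the punctuation set shown covers the marks kept after preprocessing.

```python
PUNCTS = {"，", "。", "？", "！", ",", ".", "?", "!"}

def to_label_sequence(tokens):
    """Convert a segmented, punctuated token sequence into parallel
    (word, label) lists: each word is labelled with the punctuation mark
    that follows it, or "O" when no punctuation follows."""
    words, labels = [], []
    for tok in tokens:
        if tok in PUNCTS:
            if labels:                 # attach the mark to the previous word
                labels[-1] = tok
        else:
            words.append(tok)
            labels.append("O")
    return words, labels

# The example from Table 1: "我 喜欢 她 的 幽默 ， 你呢 ？"
words, labels = to_label_sequence(
    ["我", "喜欢", "她", "的", "幽默", "，", "你呢", "？"])
print(words)   # ['我', '喜欢', '她', '的', '幽默', '你呢']
print(labels)  # ['O', 'O', 'O', 'O', '，', '？']
```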
2) Selection of neural network front-end features
Conventional machine learning and deep learning methods can only process numerical data, so character data must first be converted into numerical form. GloVe is an unsupervised word representation method that converts words into word vectors which, to a certain extent, capture the differences and relationships between words. The invention uses GloVe to convert each word into a 300-dimensional vector that serves as the input to the model.
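The lookup step can be sketched as below. GloVe itself trains the vectors offline; here random vectors stand in for pretrained 300-dimensional GloVe embeddings (real ones are loaded from a GloVe text file, one "word v1 … v300" line per word), and the zero vector for unknown words is an illustrative choice.

```python
import numpy as np

DIM = 300
rng = np.random.default_rng(0)

# Stand-in for pretrained GloVe vectors.
glove = {w: rng.standard_normal(DIM) for w in ["我", "喜欢", "幽默"]}
unk_vec = np.zeros(DIM)  # unknown words map to a fixed vector

def embed(tokens):
    """Map a token sequence to a (len(tokens), 300) matrix of word vectors."""
    return np.stack([glove.get(t, unk_vec) for t in tokens])

X = embed(["我", "喜欢", "不在词表里"])
print(X.shape)  # (3, 300)
```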
3) Construction of a model
In the present invention, in order to consider domain information at the time of punctuation prediction, two methods are used:
The first method converts the domain label into a one-hot code and combines it with the word vectors as input, so that the model incorporates the domain information, as shown in FIG. 1.
The model mainly comprises one bidirectional long short-term memory (BiLSTM) layer. The sequence X = (x_1, …, x_t) of word-embedding-encoded words, combined with the one-hot code D_tag of the sentence's domain, is used as the input of the bidirectional LSTM layer:

N_t = {x_t, D_tag} (1)

The bidirectional LSTM layer consists of two LSTM layers: the forward LSTM layer processes the sequence in the forward direction, and the backward LSTM layer processes it in the reverse direction; the two LSTM layers share a weighting layer to process the information. The hidden unit of the forward LSTM layer at time step t is

h_t^fwd = LSTM(N_t, h_{t-1}^fwd) (2)

The hidden state h_t^bwd of the backward LSTM layer is computed in the same way as for the forward LSTM layer, over the reversed sequence.

The hidden state h_t of the BiLSTM layer is then constructed by concatenating the hidden unit of the forward LSTM layer and the hidden unit of the backward LSTM layer at time step t:

h_t = [h_t^fwd ; h_t^bwd] (3)

In this way the bidirectional LSTM can learn a representation of each input word using the sentence context in both directions (e.g., "我喜欢她的幽默"), which helps the model identify punctuation marks, which are usually context dependent. The output layer then generates the punctuation probabilities y_t at time step t as follows:

y_t = Softmax(h_t W_y + b_y) (4)
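Equations (1), (3) and (4) of the domain-tag method can be sketched in numpy as below. This is a shape-level sketch, not the full model: the LSTM recurrences of equation (2) are elided, and random matrices stand in for the learned forward/backward hidden states and weights; all dimensions follow the text (300-d word vectors, 256 units per LSTM direction, 5 output labels: O plus the four marks).

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_WORD, N_DOMAINS, H, N_PUNCT = 6, 300, 5, 256, 5

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# Eq (1): N_t = {x_t, D_tag} -- concatenate each word vector with the
# sentence's domain one-hot code.
X = rng.standard_normal((T, D_WORD))          # word embeddings (stand-in)
D_tag = one_hot(2, N_DOMAINS)                 # e.g. domain index 2
N = np.hstack([X, np.tile(D_tag, (T, 1))])    # shape (T, 305)

# Eq (3): h_t = [h_t^fwd ; h_t^bwd] -- LSTM recurrences elided; random
# states stand in for the two directions' hidden units.
h_fwd = rng.standard_normal((T, H))
h_bwd = rng.standard_normal((T, H))
h = np.hstack([h_fwd, h_bwd])                 # shape (T, 512)

# Eq (4): y_t = Softmax(h_t W_y + b_y)
W_y = rng.standard_normal((2 * H, N_PUNCT)) * 0.01
b_y = np.zeros(N_PUNCT)
logits = h @ W_y + b_y
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(N.shape, y.shape)  # (6, 305) (6, 5)
```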
The other method uses multi-task learning: the model has two tasks, punctuation prediction and text domain classification; the structure of this model is shown in FIG. 2. In this model, a network structure similar to that of the domain-label model uses the sequence X = (x_1, …, x_t) of word-embedding-encoded words as input. D_tag is the output of the text domain classification task, and the main flow is as follows:

y_t = Softmax(h_t W_y + b_y) (7)

f = flatten{h_1, h_2, …, h_t} (8)

D_tag = Sigmoid(f W_D + b_D) (9)
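Equations (7)-(9) of the multi-task head can be sketched as below: a per-step softmax for punctuation, plus a flatten-and-sigmoid head over all hidden states for the domain label. As in the previous sketch, random matrices stand in for the BiLSTM hidden states and the learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
T, H2, N_PUNCT, N_DOMAINS = 6, 512, 5, 5   # H2 = 2 * 256 (BiLSTM output)

h = rng.standard_normal((T, H2))           # BiLSTM hidden states (stand-in)

# Eq (7): per-step punctuation probabilities.
W_y = rng.standard_normal((H2, N_PUNCT)) * 0.01
logits = h @ W_y
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Eq (8): f = flatten{h_1, ..., h_t}.
f = h.reshape(-1)                          # shape (T * H2,)

# Eq (9): D_tag = Sigmoid(f W_D + b_D) -- one score per domain.
W_D = rng.standard_normal((T * H2, N_DOMAINS)) * 0.01
b_D = np.zeros(N_DOMAINS)
D_tag = 1.0 / (1.0 + np.exp(-(f @ W_D + b_D)))
print(y.shape, f.shape, D_tag.shape)  # (6, 5) (3072,) (5,)
```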
4) Evaluation index selection
To evaluate the performance of the punctuation prediction task, the evaluation indexes used in the invention are precision, recall, and the F_1 score, an index used in statistics to measure the accuracy of a classification model that combines its precision and recall. The metrics for comma, period, question mark and exclamation mark are reported separately on the test set. The equations are defined as follows:

Recall: R = TP / (TP + FN)

Precision: P = TP / (TP + FP)

F_1 score: F_1 = 2 · P · R / (P + R)

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
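The per-mark metrics above can be computed as in this sketch, where TP/FP/FN are counted position by position against gold labels:

```python
def prf(gold, pred, mark):
    """Precision, recall and F1 for one punctuation mark, following the
    definitions above (TP/FP/FN counted per label position)."""
    tp = sum(g == mark and p == mark for g, p in zip(gold, pred))
    fp = sum(g != mark and p == mark for g, p in zip(gold, pred))
    fn = sum(g == mark and p != mark for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["O", "，", "O", "。", "O", "，"]
pred = ["O", "，", "，", "。", "O", "O"]
print(prf(gold, pred, "，"))  # (0.5, 0.5, 0.5)
```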
Advantageous effects
For the punctuation prediction task in the field of natural language processing, the invention takes domain information into account during punctuation prediction and, compared with the prior art, has the following beneficial effects:

1) The system trains the model using text data only; at prediction time, only text is needed as input to predict the punctuation of the corresponding text.

2) The system uses multi-domain text and takes domain information into account, which avoids poor performance in any particular domain, increases the robustness of the system, and at the same time improves the overall prediction accuracy.
Drawings
FIG. 1 is a block diagram of a domain labeled punctuation prediction model.
FIG. 2 is a block diagram of a punctuation prediction model based on multi-tasking learning integration domain information.
Detailed Description
The operation and effects of the present invention will be described in detail with reference to the accompanying drawings and tables.
The invention provides a punctuation prediction system that takes domain information into account. Taking as the processing object the THUCNews corpus, collected from the Sina News RSS subscription channels between 2005 and 2011, the overall system flow comprises: the experimental dataset and data processing, text feature extraction, the experimental setup, and result analysis.
The method comprises the following specific steps:
1) Experimental data set and data processing
To take domain information into account, the experiments use the THUCNews corpus, built by Tsinghua University from the Sina News RSS subscription channels between 2005 and 2011. It contains 740,000 news documents (2.19 GB) of various categories, all in UTF-8 plain-text format, including texts from domains such as technology, current affairs, games, home, and education, which are used to train the punctuation prediction model. In our experiments we use part of this data: there are two training sets of similar size, "sports domain" and "multi-domain", and one test set. The "sports domain" training set contains only sports-domain text (53,768 paragraphs), while the "multi-domain" training set contains text from multiple domains, with 5,500 paragraphs per domain. The test set contains 500 paragraphs from each of the multiple domains, as shown in Table 2.
The invention is primarily directed at predicting the four most important and common punctuation marks: comma, period, question mark and exclamation mark. Following previous work, semicolons and colons are replaced with commas, and all other punctuation marks are deleted from the corpus. In this experiment, a Chinese lexical analysis toolkit developed by the Tsinghua University Natural Language Processing and Social Humanities Computing Laboratory is used to segment the texts into words.
Table 2 description of the experimental database
2) Text feature extraction
The GloVe tool is used to extract 300-dimensional word vectors; the details are as described above.
3) Experimental setup
In the present invention, all deep neural networks are trained with the back-propagation algorithm. The network updates its weights using the Adam optimizer with a learning rate of 0.001. In our neural network model, each LSTM layer has 256 LSTM units, and an embedding layer is used between the word input and the LSTM layer, whose dimension is empirically set to 128. During training, the order of the training sentences is first shuffled, and then 128 sentences are randomly selected as a training batch. The number of data iterations (epochs) is 30.
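The data handling of the training loop (shuffle each epoch, then draw batches of 128) can be sketched as:

```python
import random

def batches(sentences, batch_size=128, epochs=30, seed=0):
    """Yield shuffled mini-batches: the sentence order is reshuffled at
    the start of each epoch, then consumed in groups of batch_size."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in range(0, len(order), batch_size):
            yield [sentences[j] for j in order[i:i + batch_size]]

data = [f"sent-{i}" for i in range(300)]
first = next(batches(data))
print(len(first))  # 128
```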
4) Analysis of results
In Table 3 we report the precision, recall, and F1 score for each punctuation mark. We find that, compared with the models that do not consider domain information, the model that incorporates domain information through multi-task learning, MTL-BiLSTM, achieves the best results on all evaluation criteria.
Table 4 shows the results of the two different methods of incorporating domain information into the punctuation prediction model. From Table 4 we find that the error rate of the domain-tag model, which takes domain information into account, is 7.0% lower than that of BLSTM-Multi, which does not; and that, using the multi-task method, the error rate of MTL-BiLSTM is a further 16.8% lower than that of the domain-tag model. This indicates that the multi-task structure integrates domain information into the punctuation prediction model more effectively.
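The 7.0% and 16.8% figures are relative error-rate reductions, computed as in this one-line sketch (the error rates shown are illustrative, not the values behind Table 4):

```python
def rel_reduction(base_err, new_err):
    """Relative error-rate reduction: (base - new) / base."""
    return (base_err - new_err) / base_err

# Illustrative numbers only: a drop from 20.0% to 18.6% error
# is a 7.0% relative reduction.
print(round(rel_reduction(0.200, 0.186), 3))  # 0.07
```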
TABLE 3 results of experiments at each punctuation
TABLE 4 results of the overall experiments
Claims (2)
1. The punctuation prediction method based on the consideration of the field information is characterized by mainly comprising the following steps:
1) Preprocessing punctuation marks of the text;
2) Selection of neural network front-end characteristics: converting the word into a 300-dimensional word vector using GloVe tools;
3) Model construction: the model uses a multitask learning method, and simultaneously predicts punctuation and field classification of texts, so that the model is integrated with field information;
4) Selecting the evaluation indexes: to evaluate the performance of the punctuation prediction task, the evaluation indexes used are precision, recall, and the F_1 score, an index used in statistics to measure the accuracy of a classification model;
The step 3) comprises two specific methods:
converting the domain label into a one-hot code combined with the word vectors as input, so that the model incorporates the domain information;
the model mainly comprises one bidirectional long short-term memory (BiLSTM) layer, which uses the sequence X = (x_1, …, x_t) of word-embedding-encoded words combined with the one-hot code D_tag of the sentence's domain as the input of the bidirectional LSTM layer,

N_t = {x_t, D_tag} (1)

the bidirectional LSTM layer consists of two LSTM layers, wherein the forward LSTM layer processes the sequence in the forward direction, the backward LSTM layer processes it in the reverse direction, the two LSTM layers share a weighting layer to process the information, and the hidden unit of the forward LSTM layer at time step t is

h_t^fwd = LSTM(N_t, h_{t-1}^fwd) (2)

the hidden state h_t^bwd of the backward LSTM layer is computed in the same way as for the forward LSTM layer, over the reversed sequence;

the hidden state h_t of the BiLSTM layer is then constructed by concatenating the hidden unit of the forward LSTM layer and the hidden unit of the backward LSTM layer at time step t:

h_t = [h_t^fwd ; h_t^bwd] (3)

thus, the bidirectional LSTM can learn a representation of each input word using the sentence context in both directions, identifying punctuation marks, which are usually context dependent; the output layer then generates the punctuation probabilities y_t at time step t as follows:

y_t = Softmax(h_t W_y + b_y) (4)
the other method uses multi-task learning: the model has two tasks, punctuation prediction and text domain classification, with a network structure similar to that of the domain-label model, using the sequence X = (x_1, …, x_t) of word-embedding-encoded words as input;

D_tag is the output of the text domain classification task, and the main flow is as follows:

y_t = Softmax(h_t W_y + b_y) (7)

f = flatten{h_1, h_2, …, h_t} (8)

D_tag = Sigmoid(f W_D + b_D) (9).
2. The punctuation prediction method based on consideration of field information according to claim 1, wherein the evaluation indexes used in step 4) are precision, recall, and the F_1 score, an index used in statistics to measure the accuracy of a classification model; the metrics for comma, period, question mark and exclamation mark are reported separately on the test set, and the equations are defined as follows:

Recall: R = TP / (TP + FN)

Precision: P = TP / (TP + FP)

F_1 score: F_1 = 2 · P · R / (P + R),

where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010590707.8A CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111723584A CN111723584A (en) | 2020-09-29 |
CN111723584B true CN111723584B (en) | 2024-05-07 |
Family
ID=72568858
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010590707.8A Active CN111723584B (en) | 2020-06-24 | 2020-06-24 | Punctuation prediction method based on consideration field information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111723584B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012039686A1 (en) * | 2010-09-24 | 2012-03-29 | National University Of Singapore | Methods and systems for automated text correction |
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text emotion classification method based on the joint deep learning model |
GB201814860D0 (en) * | 2017-11-14 | 2018-10-31 | Adobe Systems Inc | Predicting style breaches within textual content |
CN109741604A (en) * | 2019-03-05 | 2019-05-10 | 南通大学 | Based on tranquilization shot and long term memory network model prediction intersection traffic method of flow |
CN111090981A (en) * | 2019-12-06 | 2020-05-01 | 中国人民解放军战略支援部队信息工程大学 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
Non-Patent Citations (3)
Title |
---|
Research on Machine Reading Comprehension Based on Multi-Paragraph Ranking; Wan Jing; Guo Yazhi; Journal of Beijing University of Chemical Technology (Natural Science Edition), No. 03; full text *
Chinese Word Segmentation and Punctuation Prediction Based on an Improved Multi-Layer BLSTM; Li Yakun; Pan Qing; Everett X. WANG; Journal of Computer Applications, No. 05; full text *
A Chinese Punctuation Prediction Model Based on the Self-Attention Mechanism; Duan Dagao; Liang Shaohu; Zhao Zhendong; Han Zhongming; Computer Engineering, No. 05; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111723584A (en) | 2020-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717031B (en) | Intelligent conference summary generation method and system | |
CN109241255B (en) | Intention identification method based on deep learning | |
US11693894B2 (en) | Conversation oriented machine-user interaction | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN113435203B (en) | Multi-modal named entity recognition method and device and electronic equipment | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN109508441B (en) | Method and device for realizing data statistical analysis through natural language and electronic equipment | |
CN112016320A (en) | English punctuation adding method, system and equipment based on data enhancement | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
Graja et al. | Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect | |
CN113469163B (en) | Medical information recording method and device based on intelligent paper pen | |
Nithya et al. | Deep learning based analysis on code-mixed tamil text for sentiment classification with pre-trained ulmfit | |
CN110866087A (en) | Entity-oriented text emotion analysis method based on topic model | |
CN107562907B (en) | Intelligent lawyer expert case response device | |
KR102297480B1 (en) | System and method for structured-paraphrasing the unstructured query or request sentence | |
CN115481313A (en) | News recommendation method based on text semantic mining | |
CN117591648A (en) | Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception | |
CN111723584B (en) | Punctuation prediction method based on consideration field information | |
Younes et al. | A deep learning approach for the Romanized Tunisian dialect identification. | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
Fernández-Martínez et al. | An approach to intent detection and classification based on attentive recurrent neural networks | |
CN113342964B (en) | Recommendation type determination method and system based on mobile service | |
CN114841143A (en) | Voice room quality evaluation method and device, equipment, medium and product thereof | |
CN114896966A (en) | Method, system, equipment and medium for positioning grammar error of Chinese text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||