CN116628192A - Text theme representation method based on Seq2Seq-Attention - Google Patents

Text theme representation method based on Seq2Seq-Attention

Info

Publication number
CN116628192A
Authority
CN
China
Prior art keywords
text
texts
attention
model
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310255979.6A
Other languages
Chinese (zh)
Inventor
夏琳杰
王子成
许后磊
张礼兵
舒德伟
邓键
唐季
陈昌黎
崔庆玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PowerChina Kunming Engineering Corp Ltd
Original Assignee
PowerChina Kunming Engineering Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PowerChina Kunming Engineering Corp Ltd filed Critical PowerChina Kunming Engineering Corp Ltd
Priority to CN202310255979.6A priority Critical patent/CN116628192A/en
Publication of CN116628192A publication Critical patent/CN116628192A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A text topic representation method based on Seq2Seq-Attention relates to the technical field of computer information, and in particular to a text topic representation method. The method of the invention specifically comprises the following steps: step 1, segmenting the collected text data with the Jieba segmentation tool and removing stop words; step 2, performing K-means clustering on the texts so that texts describing the same topic are grouped into one class; step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic; step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model; and step 5, outputting the desired topic character sequence. With this method, the generated text topics are concise and correct, and no ambiguity arises.

Description

Text theme representation method based on Seq2Seq-Attention
Technical Field
The invention relates to the technical field of computer information, in particular to a text theme representation method.
Background
In the Internet era, texts on the same topic are published on different websites from different angles and in different ways. To cope with information overload, topic detection technology is commonly used to classify and organize many texts by topic and present them to readers, helping readers obtain the information they want more quickly. However, organizing texts into topic clusters is not enough: from the user's perspective, one topic still contains many titles and bodies, and the user cannot grasp their general content in a short time. Topics therefore need to be presented to readers in a concise form, so that readers can understand them quickly without spending further effort and time; this lightens the readers' burden and also helps public-opinion analysts analyse the topics.
A language is a very complex system, and different word orders can carry different meanings. How to condense a concise, refined topic representation is the difficult point of this task. At present there is little research on topic representation at home and abroad, and most researchers use keywords to represent a topic; the difficulty of this approach is how to find keywords that contain the key information and can be quickly understood by users. Methods proposed so far include: 1. a linear weighting algorithm that selects the two words with the largest weights in noun phrases as the topic representation; 2. representing the topic by extracting 5W1H six-tuple features from the text; 3. a bag-of-concepts method that characterizes texts as vector clusters with word2vec and represents a document by the frequencies of the clusters; 4. a text representation model that extracts feature word vectors of the text with a Bi-LSTM recurrent neural network. In addition, some researchers represent topics with deep learning methods, for example a latent-topic text representation model that obtains a text representation by measuring the distance between texts, and a neural-network-based comment representation model that forms the text representation as a weighted combination of sentences by computing the weight of each sentence.
However, the keywords generated by the topic models above are not always contiguous. Representing a topic simply and directly as keywords means that readers tend to string the keywords together according to their own ideas, and the same keywords in different orders can produce different meanings; moreover, the relationships among multiple documents are complex, so it is difficult to generate a concise and correct topic representation.
Disclosure of Invention
The invention aims to solve the problem of automatically generating a concise topic representation, and provides a text topic representation model based on Seq2Seq-Attention.
The text theme representation method based on the Seq2Seq-Attention is characterized by comprising the following steps of:
step 1, segmenting the collected text data by using a Jieba segmentation tool, and removing stop words;
step 2, K-means clustering is carried out on the texts, and texts describing the same theme are classified into one type;
step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic;
step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model;
and step 5, outputting the desired topic character sequence.
Wherein:
the K-means clustering algorithm forms a cluster from the subject words with similar contents. Firstly, determining the value of a constant K, wherein the value of K is the category number of the final cluster. K points are randomly selected as initial centroids, and the distance from each word vector to the centroid is calculated. By comparing the distance of the vector to centroid C, the vector is assigned to the nearest topic cluster and the centroid of each cluster is recalculated. The calculation formula is as follows:
wherein the centroids of the K clusters represent centroids to which the word vector belongs. Repeating the above process until the centroid is not changed any more, thereby obtaining the clustered subject cluster.
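For illustration only, a minimal Python sketch of this clustering step is given below. The use of scikit-learn's KMeans and the choice K = 10 (matching the ten topics of the embodiment) are assumptions, not part of the claimed method.

```python
# Sketch of Step 2: cluster text/word vectors with K-means so that texts
# describing the same topic fall into one cluster. scikit-learn and K=10
# are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def cluster_texts(text_vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Assign each vector to the nearest of K centroids (Euclidean distance)."""
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=0)
    return km.fit_predict(text_vectors)

if __name__ == "__main__":
    vecs = np.random.rand(100, 128)     # placeholder document/word vectors
    labels = cluster_texts(vecs, k=10)
    print(labels[:20])
```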
The Encoder sub-module maps the externally input text content into a fixed-length text vector; this vector fuses all contextual text feature information C_t of the article. The module can use a feature-extraction model such as a recurrent neural network (GRU, LSTM) or a convolutional neural network; the Encoder sub-module of the Seq2Seq-Attention model uses a two-layer bidirectional recurrent neural network to extract text information features. The Decoder sub-module, corresponding to the right-hand part of the model structure diagram, takes the output of the Encoder sub-module as input, completes the decoding task, and outputs the desired character sequence. The Decoder sub-module of the Seq2Seq-Attention model uses a two-layer unidirectional recurrent neural network for decoding, and first defines two special characters: the <start> identifier marks the start of decoding, and the <end> identifier marks the end of decoding. When the Seq2Seq-Attention model makes a prediction, a <start> identifier is first fed to the Decoder module; the text feature vector of the hidden layer is then obtained through attention weighting and related operations; finally, this feature vector is fed to a Softmax classifier to obtain the prediction result. Prediction continues through the Decoder module in the same way, and the termination strategy of the model is that an <end> identifier is encountered, which marks the end of the whole model's prediction. The Seq2Seq-Attention model is applied to the task of automatically generating text topics: the input of the Encoder sub-module is a Chinese character sequence, and the output of the Decoder sub-module is a Chinese character sequence. To improve the accuracy of the model and reduce its complexity, the Encoder and Decoder sub-modules use the same dictionary mapping table, i.e. the parameters of the vector-conversion (embedding) layer are shared by the two modules. The vector-conversion layer uses domain-specific pre-trained word vectors; the word-vector model is trained on the crawled large-scale text corpus.
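A condensed PyTorch sketch of the structure just described (two-layer bidirectional encoder, two-layer unidirectional decoder, shared embedding layer) is shown below. The use of LSTM cells, the class name Seq2SeqBase, and all dimensions are assumptions; attention and the Softmax step are simplified to output logits.

```python
# Minimal sketch of the encoder/decoder structure, assuming LSTM cells and
# illustrative dimensions; not the patent's exact implementation.
import torch
import torch.nn as nn

class Seq2SeqBase(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # shared vector-conversion layer
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # decoder hidden size matches 2*hidden so encoder states can be reused directly
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)          # logits for the Softmax classifier

    def encode(self, src):
        enc_out, _ = self.encoder(self.embedding(src))        # (B, T, 2*hidden)
        return enc_out

    def decode_step(self, prev_tokens, state):
        dec_out, state = self.decoder(self.embedding(prev_tokens), state)
        return self.out(dec_out), state
```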
The Decoder sub-module works differently in the training stage and the prediction stage of the Seq2Seq-Attention model. In the training stage, the output text of the Decoder is known, and the true, correct character of the previous time step is used as the input of the model at the current time step. In the prediction stage, the content at the decoding end is unknown: a <start> identifier must be fed in, and the Attention vector is fused for prediction; the difference from the training stage is that the character fed in may be an incorrect one, namely the model's own previous prediction.
P(Y|X) = P(Y_1|X) · P(Y_2|X, Y_1) · … · P(Y_n|X, Y_1, Y_2, …, Y_n-1)
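The difference between the two stages can be sketched as follows. This sketch reuses the hypothetical Seq2SeqBase interface from the earlier example, omits the attention wiring for brevity, and the token ids START_ID/END_ID are assumptions.

```python
# Sketch: teacher forcing during training vs. autoregressive greedy prediction.
import torch

START_ID, END_ID, MAX_LEN = 1, 2, 30   # assumed special-token ids and length limit

def train_step(model, src, tgt):
    """Teacher forcing: the true previous character is the decoder input at each step."""
    _enc_out = model.encode(src)                        # attention over encoder states omitted
    logits, _ = model.decode_step(tgt[:, :-1], None)    # inputs are the known, correct characters
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))

def greedy_predict(model, src):
    """Prediction: start from <start> and feed back the model's own previous output."""
    model.encode(src)
    token, state, result = torch.tensor([[START_ID]]), None, []
    for _ in range(MAX_LEN):
        logits, state = model.decode_step(token, state)
        token = logits[:, -1].argmax(-1, keepdim=True)   # most probable character at this step
        if token.item() == END_ID:
            break
        result.append(token.item())
    return result
```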
the objective of the Decoder sub-module decoding prediction stage is to maximize the probability of the above formula, and the algorithm is usually to select the most predicted result with the highest probability at each moment through a greedy strategy, which has the disadvantage of easily obtaining a locally optimal result instead of a globally optimal result. Therefore, the paper adopts a set element search model, and the first N candidate prediction results of the current output are selected when each moment is predicted. For example, when n=2, the search algorithm selects P (y1|x) with maximum Top 2 at each time, then obtains P (y2|x, Y1) by taking two maximum P (y1|x) as input, and the model end is marked by predicting < end > end character through recursive prediction.
The Mask module of the Seq2Seq-Attention model uses two mask patterns. The first is the Padding Mask, which mainly addresses the inaccurate predictions caused by the length-padding strategy in the Encoder; the second is the Sequence Mask, which mainly prevents the Decoder sub-module from using text information from future time steps during prediction. The Padding Mask solution is to subtract a particularly large number at the padded positions, so that the predicted probability of those positions approaches 0 after Softmax. The Sequence Mask solution is to initialize an upper-triangular mask matrix, i.e. a matrix whose upper-triangular values are all 0, and apply this specially formatted matrix to each predicted sequence so that a position cannot see later positions.
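A minimal PyTorch sketch of the two mask patterns follows; the tensor shapes and helper names are illustrative assumptions.

```python
# Padding Mask: a very large negative number at padded positions drives their
# Softmax probability toward 0. Sequence Mask: an upper-triangular (future)
# region of zeros blocks each position from seeing later time steps.
import torch

def padding_mask(lengths, max_len):
    """1 where a real token exists, 0 where the sequence was padded."""
    idx = torch.arange(max_len).unsqueeze(0)            # (1, T)
    return (idx < lengths.unsqueeze(1)).float()         # (B, T)

def apply_padding_mask(scores, mask):
    """Subtract a very large number at padded positions before Softmax."""
    return scores.masked_fill(mask == 0, -1e9)

def sequence_mask(size):
    """Lower triangle kept (1), upper triangle zeroed: position i sees only <= i."""
    return torch.tril(torch.ones(size, size))

if __name__ == "__main__":
    print(sequence_mask(4))
```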
In order to better capture important information in the input text, an attention mechanism is introduced in the decoder. The mechanism computes an attention distribution from the decoder's current hidden state and all hidden states of the encoder, and uses the resulting weighted average so that the decoder can better focus on the information in the input text that is relevant to the current time step.
In the Seq2Seq-Attention model, the attention mechanism is used to compute the importance of each hidden state of the encoder to the current time step of the decoder; in this way the decoder can better attend to the information in the input text that is relevant to the current time step. Specifically, at each time step t an attention weight vector α_t indicates the importance of each position in the input text to the current decoder time step. To compute α_t, a learned matrix W_a first converts the current decoder hidden state into a query vector q_t:
q_t = W_a · s_t
where s_t is the hidden state of the decoder at the current time step t. The similarity between each encoder hidden state h_j and the query vector q_t is then computed to obtain the unnormalized attention weight e_t,j:
e_t,j = score(q_t, h_j)
Here score can be a different function, such as a dot product, an additive form, or a multi-layer perceptron (MLP), used to measure the similarity between q_t and h_j. Finally, the softmax function normalizes all e_t,j to obtain the attention weight vector α_t:
α_t,j = exp(e_t,j) / Σ_{k=1..T} exp(e_t,k)
where T is the sequence length of the encoder. With the attention weight vector α_t, the weighted sum z_t of the encoder hidden states is computed, capturing the information in the input text that is relevant to the current time step:
z_t = Σ_{j=1..T} α_t,j · h_j
Finally, z_t serves as the topic representation of the decoder at time step t. The attention mechanism enables the decoder to better focus on the input text information associated with the current time step, thereby improving the performance of the model.
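The computation above can be summarised in the following sketch (a dot-product score is shown; additive or MLP scores are alternatives). Dimensions and function names are assumptions for illustration.

```python
# Attention sketch: project the decoder state to a query, score it against all
# encoder states, normalise with Softmax, and take the weighted sum z_t.
import torch
import torch.nn.functional as F

def attention(s_t, enc_states, W_a):
    """
    s_t        : (B, d_dec)    decoder hidden state at time step t
    enc_states : (B, T, d_enc) encoder hidden states h_1..h_T
    W_a        : (d_dec, d_enc) learned projection matrix
    returns z_t (B, d_enc) and attention weights alpha_t (B, T)
    """
    q_t = s_t @ W_a                                                  # q_t = W_a * s_t
    e_t = torch.bmm(enc_states, q_t.unsqueeze(-1)).squeeze(-1)       # e_{t,j} = q_t . h_j
    alpha_t = F.softmax(e_t, dim=-1)                                 # normalise over positions
    z_t = torch.bmm(alpha_t.unsqueeze(1), enc_states).squeeze(1)     # weighted sum of h_j
    return z_t, alpha_t
```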
The stop words are words that carry little topical meaning, for example '!', 'at one go', 'shortly', 'here', 'very', etc.
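For step 1, a minimal sketch of segmentation and stop-word removal with the Jieba tool might look as follows; the stop-word set shown is only a placeholder, since in practice a full Chinese stop-word table would be loaded from a file.

```python
# Step 1 sketch: segment text with Jieba and drop stop words.
import jieba

STOP_WORDS = {"!", "very", "here", "shortly"}   # placeholder stop-word set

def preprocess(text: str) -> list[str]:
    """Segment a raw text with Jieba and remove stop words."""
    tokens = jieba.lcut(text)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

if __name__ == "__main__":
    print(preprocess("水利水电行业的文本主题表示方法"))
```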
With this method, the generated text topics are concise and correct, and no ambiguity arises. Compared with the traditional method of representing a topic by keywords, the method does not lead readers to interpret the generated topic representation ambiguously; compared with existing methods based on deep learning models, the method can extract the key topics from the texts, and the experimental results show that the ROUGE score of the method is higher than that of the other methods, so the generated topic representation is better than theirs.
Drawings
FIG. 1 is a diagram of the Seq2Seq-Attention model.
FIG. 2 is a diagram of the Seq2Seq-Attention architecture.
Detailed Description
Example 1: the keyword search is adopted in each large website by adopting a Scrapy crawler framework, and 8868 texts of ten subjects of 'sensitive word A', 'sensitive word B', 'sensitive word C', 'sensitive word D', 'sensitive word E', 'sensitive word F', 'sensitive word G', 'sensitive word H', 'sensitive word I', 'sensitive word J' are crawled in the water conservancy and hydropower industry, so as to be used as text data of the embodiment. The data consists of title, body, time, ID, the text of the same subject being identified with the same ID.
The text theme representation method specifically comprises the following steps:
step 1, segmenting the collected 8868 text data by using a Jieba segmentation tool, and removing stop words;
step 2, K-means clustering is carried out on the texts, and texts describing the same theme are classified into one type; the results were as follows:
table K-means clustering results
Step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic;
Step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model;
and Step 5, outputting the desired topic character sequence.
The results of this Example 1 were evaluated, with ROUGE (Recall-Oriented Understudy for Gisting Evaluation) used as the evaluation metric. ROUGE is a set of metrics for evaluating automatic text summarization and machine translation tasks; it mainly compares the similarity between the text generated by a model and a standard (reference) text. The mathematical formula of ROUGE-N is:
ROUGE-N = Σ_{gram_N ∈ standard text} Count_match(gram_N) / Σ_{gram_N ∈ standard text} Count(gram_N)
where the denominator is the total number of N-grams of the standard text, and the numerator is the number of N-grams of the text generated by the Seq2Seq-Attention model (the automatically generated text topic) that also appear in the standard text. The ROUGE score therefore mainly reflects the recall of the topic.
In the Seq2Seq-Attention model architecture, the number of hidden-layer units of the LSTM recurrent neural networks in the Encoder and Decoder sub-modules has to be set manually; based on the crawled texts, the number of hidden-layer units was set to 32, 64 and 128. The model of the invention was evaluated with ROUGE-1, ROUGE-2 and ROUGE-3 under each of these parameter settings, and the results are shown in Table 1 below.
TABLE 1 Comparison of different numbers of hidden-layer units
As can be seen from Table 1, the time needed for model training and learning increases as the number of LSTM hidden-layer units of the recurrent neural network increases. Comparing the different numbers of hidden-layer units, the ROUGE-1, ROUGE-2 and ROUGE-3 evaluation metrics are best with 128 units, so 128 is selected as the final value for the Seq2Seq model.
Neural-network methods similar to the method of the invention were selected for comparison; all of these methods feed texts into a neural network and output topic representations. The following models were selected for comparison:
(1) LSTM-Seq2Seq model: often used as a baseline method in generation tasks; it uses an LSTM neural network in the feature-extraction module, in contrast to the method herein.
(2) GRU-Seq2Seq model: GRU (Gated Recurrent Unit) is a gated recurrent unit that merges the input gate and the forget gate; using GRU in the feature-extraction module gives faster training and a simpler model than LSTM.
(3) BiLSTM-Seq2Seq model: the feature-extraction module of this model uses BiLSTM (bidirectional long short-term memory), and the other modules are consistent with the method herein.
The model of the invention was compared with the above three models on the self-built dataset, and the results are shown in Table 2 below.
Table 2 Topic representation comparison and verification
As can be seen from the results in Table 2, the ROUGE scores of the model of the invention are superior to those of the other comparison models. The ROUGE-1 score of the model improves by about 5% over the traditional LSTM-Seq2Seq model, which fully shows that extracting features with the two-layer bidirectional recurrent neural network plays an important role, and that the attention mechanism assigns higher weights to key information so the model pays more attention to it. The GRU-Seq2Seq model is slightly better than the LSTM-Seq2Seq model, which shows that the reset gate and the update gate can also capture features well.

Claims (1)

1. The text theme representation method based on the Seq2Seq-Attention is characterized by comprising the following steps of:
step 1, segmenting the collected text data by using a Jieba segmentation tool, and removing stop words;
step 2, K-means clustering is carried out on the texts, and texts describing the same theme are classified into one type;
step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic;
step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model;
and step 5, outputting the desired topic character sequence.
CN202310255979.6A 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention Pending CN116628192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255979.6A CN116628192A (en) 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310255979.6A CN116628192A (en) 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention

Publications (1)

Publication Number Publication Date
CN116628192A true CN116628192A (en) 2023-08-22

Family

ID=87619989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255979.6A Pending CN116628192A (en) 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention

Country Status (1)

Country Link
CN (1) CN116628192A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932686A (en) * 2023-09-19 2023-10-24 苏州元脑智能科技有限公司 Theme mining method and device, electronic equipment and storage medium
CN116932686B (en) * 2023-09-19 2024-01-23 苏州元脑智能科技有限公司 Theme mining method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination