CN116628192A - Text theme representation method based on Seq2Seq-Attention - Google Patents

Text theme representation method based on Seq2Seq-Attention

Info

Publication number
CN116628192A
Authority
CN
China
Prior art keywords
text
texts
attention
model
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310255979.6A
Other languages
Chinese (zh)
Inventor
夏琳杰
王子成
许后磊
张礼兵
舒德伟
邓键
唐季
陈昌黎
崔庆玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PowerChina Kunming Engineering Corp Ltd
Original Assignee
PowerChina Kunming Engineering Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PowerChina Kunming Engineering Corp Ltd filed Critical PowerChina Kunming Engineering Corp Ltd
Priority to CN202310255979.6A priority Critical patent/CN116628192A/en
Publication of CN116628192A publication Critical patent/CN116628192A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A text topic representation method based on Seq2Seq-Attention relates to the technical field of computer information, and in particular to a text topic representation method. The method of the invention specifically comprises the following steps: step 1, segmenting the collected text data with the Jieba segmentation tool and removing stop words; step 2, performing K-means clustering on the texts so that texts describing the same topic are grouped into one class; step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic; step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model; and step 5, outputting the desired topic character sequence. With this method, the generated text topics are concise and correct, and no ambiguity arises.

Description

Text theme representation method based on Seq2Seq-Attention
Technical Field
The invention relates to the technical field of computer information, in particular to a text theme representation method.
Background
In the Internet era, texts on the same topic are published on different websites from different angles and in different ways. To cope with information overload, topic detection technology is commonly used to classify and organize many texts by topic and present them to readers, helping readers obtain the information they want more quickly. However, organizing texts into topic clusters is not enough: from the user's perspective, one topic still contains many titles and bodies, and the user cannot grasp their general content in a short time. Topics therefore need to be presented to readers in a concise form, so that readers can understand them quickly without spending further effort and time; this lightens the readers' burden and also helps public-opinion analysts analyse the topics.
A language is a very complex system, and different word orders can carry different meanings. How to condense a concise, refined topic representation is the difficult point of this task. At present there is little research on topic representation at home and abroad, and most researchers use keywords to represent a topic; the difficulty of this approach is how to find keywords that contain the key information and can be quickly understood by users. Methods proposed so far include: 1. a linear weighting algorithm that selects the two words with the largest weights in noun phrases as the topic representation; 2. representing the topic by extracting 5W1H six-tuple features from the text; 3. a bag-of-concepts method that characterizes texts as vector clusters with word2vec and represents a document by the frequencies of the clusters; 4. a text representation model that extracts feature word vectors of the text with a Bi-LSTM recurrent neural network. In addition, some researchers represent topics with deep learning methods, for example a latent-topic text representation model that obtains a text representation by measuring the distance between texts, and a neural-network-based comment representation model that forms the text representation as a weighted combination of sentences by computing the weight of each sentence.
However, the keywords generated by the topic models above are not always contiguous. Representing a topic simply and directly as keywords means that readers tend to string the keywords together according to their own ideas, and the same keywords in different orders can produce different meanings; moreover, the relationships among multiple documents are complex, so it is difficult to generate a concise and correct topic representation.
Disclosure of Invention
The invention aims to solve the problem of automatically generating a concise topic representation, and provides a text topic representation model based on Seq2Seq-Attention.
The text theme representation method based on the Seq2Seq-Attention is characterized by comprising the following steps of:
step 1, segmenting the collected text data by using a Jieba segmentation tool, and removing stop words;
step 2, K-means clustering is carried out on the texts, and texts describing the same theme are classified into one type;
step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic;
step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model;
and step 5, outputting the desired topic character sequence.
Wherein:
the K-means clustering algorithm forms a cluster from the subject words with similar contents. Firstly, determining the value of a constant K, wherein the value of K is the category number of the final cluster. K points are randomly selected as initial centroids, and the distance from each word vector to the centroid is calculated. By comparing the distance of the vector to centroid C, the vector is assigned to the nearest topic cluster and the centroid of each cluster is recalculated. The calculation formula is as follows:
wherein the centroids of the K clusters represent centroids to which the word vector belongs. Repeating the above process until the centroid is not changed any more, thereby obtaining the clustered subject cluster.
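For illustration only, a minimal Python sketch of this clustering step is given below. The use of scikit-learn's KMeans and the choice K = 10 (matching the ten topics of the embodiment) are assumptions, not part of the claimed method.

```python
# Sketch of Step 2: cluster text/word vectors with K-means so that texts
# describing the same topic fall into one cluster. scikit-learn and K=10
# are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def cluster_texts(text_vectors: np.ndarray, k: int = 10) -> np.ndarray:
    """Assign each vector to the nearest of K centroids (Euclidean distance)."""
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=0)
    return km.fit_predict(text_vectors)

if __name__ == "__main__":
    vecs = np.random.rand(100, 128)     # placeholder document/word vectors
    labels = cluster_texts(vecs, k=10)
    print(labels[:20])
```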
The Encoder sub-module maps the externally input text content into a fixed-length text vector; this vector fuses all contextual text feature information C_t of the article. The module can use a feature-extraction model such as a recurrent neural network (GRU, LSTM) or a convolutional neural network; the Encoder sub-module of the Seq2Seq-Attention model uses a two-layer bidirectional recurrent neural network to extract text information features. The Decoder sub-module, corresponding to the right-hand part of the model structure diagram, takes the output of the Encoder sub-module as input, completes the decoding task, and outputs the desired character sequence. The Decoder sub-module of the Seq2Seq-Attention model uses a two-layer unidirectional recurrent neural network for decoding, and first defines two special characters: the <start> identifier marks the start of decoding, and the <end> identifier marks the end of decoding. When the Seq2Seq-Attention model makes a prediction, a <start> identifier is first fed to the Decoder module; the text feature vector of the hidden layer is then obtained through attention weighting and related operations; finally, this feature vector is fed to a Softmax classifier to obtain the prediction result. Prediction continues through the Decoder module in the same way, and the termination strategy of the model is that an <end> identifier is encountered, which marks the end of the whole model's prediction. The Seq2Seq-Attention model is applied to the task of automatically generating text topics: the input of the Encoder sub-module is a Chinese character sequence, and the output of the Decoder sub-module is a Chinese character sequence. To improve the accuracy of the model and reduce its complexity, the Encoder and Decoder sub-modules use the same dictionary mapping table, i.e. the parameters of the vector-conversion (embedding) layer are shared by the two modules. The vector-conversion layer uses domain-specific pre-trained word vectors; the word-vector model is trained on the crawled large-scale text corpus.
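A condensed PyTorch sketch of the structure just described (two-layer bidirectional encoder, two-layer unidirectional decoder, shared embedding layer) is shown below. The use of LSTM cells, the class name Seq2SeqBase, and all dimensions are assumptions; attention and the Softmax step are simplified to output logits.

```python
# Minimal sketch of the encoder/decoder structure, assuming LSTM cells and
# illustrative dimensions; not the patent's exact implementation.
import torch
import torch.nn as nn

class Seq2SeqBase(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # shared vector-conversion layer
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # decoder hidden size matches 2*hidden so encoder states can be reused directly
        self.decoder = nn.LSTM(emb_dim, 2 * hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)          # logits for the Softmax classifier

    def encode(self, src):
        enc_out, _ = self.encoder(self.embedding(src))        # (B, T, 2*hidden)
        return enc_out

    def decode_step(self, prev_tokens, state):
        dec_out, state = self.decoder(self.embedding(prev_tokens), state)
        return self.out(dec_out), state
```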
The Decoder sub-module works differently in the training stage and the prediction stage of the Seq2Seq-Attention model. In the training stage, the output text of the Decoder is known, and the true, correct character of the previous time step is used as the input of the model at the current time step. In the prediction stage, the content at the decoding end is unknown: a <start> identifier must be fed in, and the Attention vector is fused for prediction; the difference from the training stage is that the character fed in may be an incorrect one, namely the model's own previous prediction.
P(Y|X) = P(Y_1|X) · P(Y_2|X, Y_1) · … · P(Y_n|X, Y_1, Y_2, …, Y_n-1)
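The difference between the two stages can be sketched as follows. This sketch reuses the hypothetical Seq2SeqBase interface from the earlier example, omits the attention wiring for brevity, and the token ids START_ID/END_ID are assumptions.

```python
# Sketch: teacher forcing during training vs. autoregressive greedy prediction.
import torch

START_ID, END_ID, MAX_LEN = 1, 2, 30   # assumed special-token ids and length limit

def train_step(model, src, tgt):
    """Teacher forcing: the true previous character is the decoder input at each step."""
    _enc_out = model.encode(src)                        # attention over encoder states omitted
    logits, _ = model.decode_step(tgt[:, :-1], None)    # inputs are the known, correct characters
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))

def greedy_predict(model, src):
    """Prediction: start from <start> and feed back the model's own previous output."""
    model.encode(src)
    token, state, result = torch.tensor([[START_ID]]), None, []
    for _ in range(MAX_LEN):
        logits, state = model.decode_step(token, state)
        token = logits[:, -1].argmax(-1, keepdim=True)   # most probable character at this step
        if token.item() == END_ID:
            break
        result.append(token.item())
    return result
```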
the objective of the Decoder sub-module decoding prediction stage is to maximize the probability of the above formula, and the algorithm is usually to select the most predicted result with the highest probability at each moment through a greedy strategy, which has the disadvantage of easily obtaining a locally optimal result instead of a globally optimal result. Therefore, the paper adopts a set element search model, and the first N candidate prediction results of the current output are selected when each moment is predicted. For example, when n=2, the search algorithm selects P (y1|x) with maximum Top 2 at each time, then obtains P (y2|x, Y1) by taking two maximum P (y1|x) as input, and the model end is marked by predicting < end > end character through recursive prediction.
The Mask module of the Seq2Seq-Attention model uses two mask patterns. The first is the Padding Mask, which mainly addresses the inaccurate predictions caused by the length-padding strategy in the Encoder; the second is the Sequence Mask, which mainly prevents the Decoder sub-module from using text information from future time steps during prediction. The Padding Mask solution is to subtract a particularly large number at the padded positions, so that the predicted probability of those positions approaches 0 after Softmax. The Sequence Mask solution is to initialize an upper-triangular mask matrix, i.e. a matrix whose upper-triangular values are all 0, and apply this specially formatted matrix to each predicted sequence so that a position cannot see later positions.
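A minimal PyTorch sketch of the two mask patterns follows; the tensor shapes and helper names are illustrative assumptions.

```python
# Padding Mask: a very large negative number at padded positions drives their
# Softmax probability toward 0. Sequence Mask: an upper-triangular (future)
# region of zeros blocks each position from seeing later time steps.
import torch

def padding_mask(lengths, max_len):
    """1 where a real token exists, 0 where the sequence was padded."""
    idx = torch.arange(max_len).unsqueeze(0)            # (1, T)
    return (idx < lengths.unsqueeze(1)).float()         # (B, T)

def apply_padding_mask(scores, mask):
    """Subtract a very large number at padded positions before Softmax."""
    return scores.masked_fill(mask == 0, -1e9)

def sequence_mask(size):
    """Lower triangle kept (1), upper triangle zeroed: position i sees only <= i."""
    return torch.tril(torch.ones(size, size))

if __name__ == "__main__":
    print(sequence_mask(4))
```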
In order to better capture important information in the input text, an attention mechanism is introduced in the decoder. The mechanism computes an attention distribution from the decoder's current hidden state and all hidden states of the encoder, and uses the resulting weighted average so that the decoder can better focus on the information in the input text that is relevant to the current time step.
In the Seq2Seq-Attention model, the attention mechanism is used to compute the importance of each hidden state of the encoder to the current time step of the decoder; in this way the decoder can better attend to the information in the input text that is relevant to the current time step. Specifically, at each time step t an attention weight vector α_t indicates the importance of each position in the input text to the current decoder time step. To compute α_t, a learned matrix W_a first converts the current decoder hidden state into a query vector q_t:
q_t = W_a · s_t
where s_t is the hidden state of the decoder at the current time step t. The similarity between each encoder hidden state h_j and the query vector q_t is then computed to obtain the unnormalized attention weight e_t,j:
e_t,j = score(q_t, h_j)
Here score can be a different function, such as a dot product, an additive form, or a multi-layer perceptron (MLP), used to measure the similarity between q_t and h_j. Finally, the softmax function normalizes all e_t,j to obtain the attention weight vector α_t:
α_t,j = exp(e_t,j) / Σ_{k=1..T} exp(e_t,k)
where T is the sequence length of the encoder. With the attention weight vector α_t, the weighted sum z_t of the encoder hidden states is computed, capturing the information in the input text that is relevant to the current time step:
z_t = Σ_{j=1..T} α_t,j · h_j
Finally, z_t serves as the topic representation of the decoder at time step t. The attention mechanism enables the decoder to better focus on the input text information associated with the current time step, thereby improving the performance of the model.
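The computation above can be summarised in the following sketch (a dot-product score is shown; additive or MLP scores are alternatives). Dimensions and function names are assumptions for illustration.

```python
# Attention sketch: project the decoder state to a query, score it against all
# encoder states, normalise with Softmax, and take the weighted sum z_t.
import torch
import torch.nn.functional as F

def attention(s_t, enc_states, W_a):
    """
    s_t        : (B, d_dec)    decoder hidden state at time step t
    enc_states : (B, T, d_enc) encoder hidden states h_1..h_T
    W_a        : (d_dec, d_enc) learned projection matrix
    returns z_t (B, d_enc) and attention weights alpha_t (B, T)
    """
    q_t = s_t @ W_a                                                  # q_t = W_a * s_t
    e_t = torch.bmm(enc_states, q_t.unsqueeze(-1)).squeeze(-1)       # e_{t,j} = q_t . h_j
    alpha_t = F.softmax(e_t, dim=-1)                                 # normalise over positions
    z_t = torch.bmm(alpha_t.unsqueeze(1), enc_states).squeeze(1)     # weighted sum of h_j
    return z_t, alpha_t
```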
The stop words are words that carry little topical meaning, for example '!', 'at one go', 'shortly', 'here', 'very', etc.
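For step 1, a minimal sketch of segmentation and stop-word removal with the Jieba tool might look as follows; the stop-word set shown is only a placeholder, since in practice a full Chinese stop-word table would be loaded from a file.

```python
# Step 1 sketch: segment text with Jieba and drop stop words.
import jieba

STOP_WORDS = {"!", "very", "here", "shortly"}   # placeholder stop-word set

def preprocess(text: str) -> list[str]:
    """Segment a raw text with Jieba and remove stop words."""
    tokens = jieba.lcut(text)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

if __name__ == "__main__":
    print(preprocess("水利水电行业的文本主题表示方法"))
```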
With this method, the generated text topics are concise and correct, and no ambiguity arises. Compared with the traditional method of representing a topic by keywords, the method does not lead readers to interpret the generated topic representation ambiguously; compared with existing methods based on deep learning models, the method can extract the key topics from the texts, and the experimental results show that the ROUGE score of the method is higher than that of the other methods, so the generated topic representation is better than theirs.
Drawings
FIG. 1 is a diagram of the Seq2Seq-Attention model.
FIG. 2 is a diagram of the Seq2Seq-Attention architecture.
Detailed Description
Example 1: the keyword search is adopted in each large website by adopting a Scrapy crawler framework, and 8868 texts of ten subjects of 'sensitive word A', 'sensitive word B', 'sensitive word C', 'sensitive word D', 'sensitive word E', 'sensitive word F', 'sensitive word G', 'sensitive word H', 'sensitive word I', 'sensitive word J' are crawled in the water conservancy and hydropower industry, so as to be used as text data of the embodiment. The data consists of title, body, time, ID, the text of the same subject being identified with the same ID.
The text theme representation method specifically comprises the following steps:
step 1, segmenting the collected 8868 text data by using a Jieba segmentation tool, and removing stop words;
step 2, K-means clustering is carried out on the texts, and texts describing the same theme are classified into one type; the results were as follows:
table K-means clustering results
Step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic;
Step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model;
and Step 5, outputting the desired topic character sequence.
The results of this Example 1 were evaluated, with ROUGE (Recall-Oriented Understudy for Gisting Evaluation) used as the evaluation metric. ROUGE is a set of metrics for evaluating automatic text summarization and machine translation tasks; it mainly compares the similarity between the text generated by a model and a standard (reference) text. The mathematical formula of ROUGE-N is:
ROUGE-N = Σ_{gram_N ∈ standard text} Count_match(gram_N) / Σ_{gram_N ∈ standard text} Count(gram_N)
where the denominator is the total number of N-grams of the standard text, and the numerator is the number of N-grams of the text generated by the Seq2Seq-Attention model (the automatically generated text topic) that also appear in the standard text. The ROUGE score therefore mainly reflects the recall of the topic.
In the Seq2Seq-Attention model architecture, the number of hidden-layer units of the LSTM recurrent neural networks in the Encoder and Decoder sub-modules has to be set manually; based on the crawled texts, the number of hidden-layer units was set to 32, 64 and 128. The model of the invention was evaluated with ROUGE-1, ROUGE-2 and ROUGE-3 under each of these parameter settings, and the results are shown in Table 1 below.
TABLE 1 Comparison of different numbers of hidden-layer units
As can be seen from Table 1, the time needed for model training and learning increases as the number of LSTM hidden-layer units of the recurrent neural network increases. Comparing the different numbers of hidden-layer units, the ROUGE-1, ROUGE-2 and ROUGE-3 evaluation metrics are best with 128 units, so 128 is selected as the final value for the Seq2Seq model.
Neural-network methods similar to the method of the invention were selected for comparison; all of these methods feed texts into a neural network and output topic representations. The following models were selected for comparison:
(1) LSTM-Seq2Seq model: often used as a baseline method in generation tasks; it uses an LSTM neural network in the feature-extraction module, in contrast to the method herein.
(2) GRU-Seq2Seq model: GRU (Gated Recurrent Unit) is a gated recurrent unit that merges the input gate and the forget gate; using GRU in the feature-extraction module gives faster training and a simpler model than LSTM.
(3) BiLSTM-Seq2Seq model: the feature-extraction module of this model uses BiLSTM (bidirectional long short-term memory), and the other modules are consistent with the method herein.
The model of the invention was compared with the above three models on the self-built dataset, and the results are shown in Table 2 below.
Table 2 Topic representation comparison and verification
As can be seen from the results in Table 2, the ROUGE scores of the model of the invention are superior to those of the other comparison models. The ROUGE-1 score of the model improves by about 5% over the traditional LSTM-Seq2Seq model, which fully shows that extracting features with the two-layer bidirectional recurrent neural network plays an important role, and that the attention mechanism assigns higher weights to key information so the model pays more attention to it. The GRU-Seq2Seq model is slightly better than the LSTM-Seq2Seq model, which shows that the reset gate and the update gate can also capture features well.

Claims (1)

1. The text theme representation method based on the Seq2Seq-Attention is characterized by comprising the following steps of:
step 1, segmenting the collected text data by using a Jieba segmentation tool, and removing stop words;
step 2, K-means clustering is carried out on the texts, and texts describing the same theme are classified into one type;
step 3, extracting text information features with a two-layer bidirectional recurrent neural network in the Encoder module, for texts of the same topic;
step 4, completing decoding in the Decoder sub-module with a two-layer unidirectional recurrent neural network combined with an attention mechanism model;
and step 5, outputting the desired topic character sequence.
CN202310255979.6A 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention Pending CN116628192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255979.6A CN116628192A (en) 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310255979.6A CN116628192A (en) 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention

Publications (1)

Publication Number Publication Date
CN116628192A true CN116628192A (en) 2023-08-22

Family

ID=87619989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255979.6A Pending CN116628192A (en) 2023-03-16 2023-03-16 Text theme representation method based on Seq2Seq-Attention

Country Status (1)

Country Link
CN (1) CN116628192A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932686A (en) * 2023-09-19 2023-10-24 苏州元脑智能科技有限公司 Theme mining method and device, electronic equipment and storage medium
CN116932686B (en) * 2023-09-19 2024-01-23 苏州元脑智能科技有限公司 Theme mining method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination