CN116860959A - Extraction type abstract method and system combining local topic and hierarchical structure information - Google Patents
Extraction type abstract method and system combining local topic and hierarchical structure information Download PDFInfo
- Publication number
- CN116860959A CN116860959A CN202310699985.0A CN202310699985A CN116860959A CN 116860959 A CN116860959 A CN 116860959A CN 202310699985 A CN202310699985 A CN 202310699985A CN 116860959 A CN116860959 A CN 116860959A
- Authority
- CN
- China
- Prior art keywords
- sentence
- information
- text
- topic
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of text abstract extraction and discloses an extraction type abstract method and system combining local topic and hierarchical structure information. Given an original document, a context representation of the document is obtained through an encoder and input into a local topic information extraction module to extract the topic information of the segment to which each sentence belongs; the local topic information representation is fused with the context representation of the document to obtain a text context representation fused with the local topic information. A text hierarchical structure information embedding module embeds the hierarchical structure information of the text into the text context representation fused with the local topic information, and a Sigmoid layer calculates the confidence score of each sentence to judge whether the sentence belongs to the abstract. The method and system pay more attention to important parts of the text and improve the quality of the generated abstract; by modifying Longformer's Token Position Embeddings, the model is able to process longer text data.
Description
Technical Field
The invention belongs to the technical field of text abstract extraction, and particularly relates to an extraction type abstract method and system combining local topic and hierarchical structure information.
Background
At present, with the rapid development of Internet technology and the arrival of the knowledge-exploration age, people can search for data on the Internet using information search engines such as Google, Yahoo and Baidu. However, the massive amount of text information and files on the Internet is overwhelming, which has spurred a wave of research into automatic text summarization technology. Text summarization (Text Summarization) has thus emerged as a difficult challenge in the field of natural language processing (Natural Language Processing, NLP), because it requires accurate text analysis, such as semantic analysis and lexical analysis, to produce a good summary. The purpose of a text summary is to generate a concise and coherent set of important sentences that covers all the relevant important information in the original document. It can effectively reduce the information burden on users, enable them to quickly extract information from redundant content, greatly save manpower and material resources, and plays an important role in fields such as information retrieval and title generation.
Text summarization tasks may be classified into different types according to different partitioning criteria. By the number of documents, they are divided into single-document text summarization (Single-document Text Summarization) and multi-document text summarization (Multi-document Text Summarization); by the implementation, they are divided into extractive summarization (Extractive Summarization) and abstractive summarization (Abstractive Summarization). The extractive method is generally regarded as a classification problem, and directly selects sentences from the original text to form the summary according to the importance of the sentences. Summaries generated in this way usually perform well in terms of fluency and grammar, but may contain much redundant content and lack coherence between sentences. Neural-network-based abstractive methods typically adopt a Sequence-to-Sequence (Seq2Seq) architecture, i.e. an Encoder-Decoder architecture, which is similar to how humans write summaries. First, the whole text is encoded by the Encoder, and then new sentences are generated word by word by the Decoder to form the document summary. This approach produces less redundant information, but because sentences are generated from scratch they tend to be weaker in fluency and grammar. In addition, generating new words or phrases may produce summaries that are inconsistent with the original statements. In contrast, directly selecting sentences from the original text to assemble an extractive summary avoids this problem. For the extractive summarization task, the core work is to learn long-range sentence context information and to model long-range inter-sentence relationships through the Encoder, so that the sentence classifier extracts the most valuable sentences. Conventional extractive methods typically perform unsupervised extraction based on graph-based or clustering-based methods, which build correlations between sentences through cosine similarity and then apply a ranking method to calculate sentence importance. With the rapid development of deep learning, many extractive summarization methods employ recurrent neural networks (Recurrent Neural Network, RNN) to capture inter-sentence relationships; however, RNN-based methods have difficulty handling long-range dependencies, especially for long-document summarization. In recent years, Transformer language models pre-trained on large-scale corpora have achieved excellent performance when fine-tuned on downstream tasks, and Transformer-based pre-trained language models have been widely applied in the text summarization field. Liu et al. proposed the BERTSUM model by improving the BERT embedding layer, applied the BERT model to the text summarization field for the first time, and achieved state-of-the-art (SOTA) results on the CNN/DailyMail dataset. Zhang et al. designed a hierarchical Transformer for capturing long-range inter-sentence relationships, but the method did not achieve significant performance gains on summarization tasks and suffered from slow training and easy overfitting. Meanwhile, some researchers introduced neural topic models (Neural Topic Model, NTM) and graph neural networks (Graph Neural Network, GNN) into the text summarization task to capture global semantic information and further guide summary generation.
Cui et al. use NTM to capture document topic features and GNN to represent the document as a graph structure to obtain inter-sentence relationships.
With the rapid development of neural networks, the summarization task has achieved important results. At present, extractive summarization is mainly treated either as a sentence-ranking task or as a sequence-labeling binary classification task. In the sentence-ranking paradigm, the model scores each sentence in the text, placing high-scoring sentences at the front of the candidate list and low-scoring sentences at the back. An ordered sentence list is thus obtained, and the first sentences in the list are taken as the summary. Narayan et al. proposed a topic-aware convolutional neural network model that first uses a convolutional neural network to extract features from the document, then weights the features according to topic, and finally uses a selection-and-ranking method to choose the most relevant sentences as the summary. Experiments on multiple datasets show that it can produce very compact yet still informative summaries. Li et al. proposed a method for evaluating sentence importance in multi-document summarization using variational auto-encoding. Unlike traditional feature-engineering-based methods, this method learns abstract semantic representations directly from raw data, and constrains the generated sentence representations to stay close to a prior distribution by introducing a KL divergence term, improving the generalization capability of the model.
For the second paradigm, which treats extractive text summarization as a sequence-labeling problem, the method extracts features from and encodes each sentence or paragraph and then feeds it to a decoder for label prediction, determining which sentences are selected as summary sentences. The sequence-labeling approach is widely used in extractive text summarization and achieves good results. Nallapati et al. proposed the SummaRuNNer text summarization model, an RNN-based sequence model that generates a document summary by learning the importance of each sentence in the document; it achieves strong summarization performance on multiple text datasets. Zhang et al. proposed a latent-variable extractive model that treats sentences as latent variables and uses sentences with activated variables to infer the summary.
However, most of the above methods are RNN-based extractive summarization methods; RNN-based methods have difficulty handling sentence-level long-distance dependencies, and some linguistic or structural information may be lost due to the input format of the original document. To address these issues, some researchers began using Transformer-based pre-trained language models as encoders, representing the document with a more intuitive graph structure, and adding NTM to extract the document's topic features to further guide the model toward producing high-quality summaries. The deep differential amplifier proposed by Jia et al. enhances the features of summary sentences through a differential amplifier, contrasting them with non-summary sentences. Shi et al. proposed an extractive summarization method based on a star architecture, which models the sentences in a document as satellite nodes, introduces a virtual central node, and learns the inter-sentence relationships of each document through the star architecture, achieving good results on three public datasets. Ma et al. embed NTM-extracted topic features into BERT to produce vector representations with topic features, thereby improving summary quality.
Although the above methods have succeeded in modeling inter-sentence relationships and extracting global semantics, extractive summarization methods based on Transformer pre-trained language models still face a problem: the input text in summarization tasks is longer than in typical natural language processing tasks, and Transformer-based encoders struggle to fully process long text and suffer from drawbacks such as high computational cost. To better understand the original text, researchers have proposed many approaches. Xie et al. first pre-process the document and divide it into blocks of the same size, encode each block with a block encoder, feed the block-encoding results into an NTM to produce global topic features, and finally build a comparison graph between topic features and sentence features to select summary sentences; the method achieves good results on both long texts and short news texts, but is more advantageous for long texts. Beltagy et al. proposed the Longformer model, which focuses on processing long documents: by changing the Transformer's self-attention mechanism to a sliding-window self-attention mechanism, it reduces the time complexity to linear, enabling the model to process long text easily. Although Longformer handles long text well, it fails to model local semantic information and document hierarchy information, which affects performance. Therefore, the invention uses Longformer as the encoder and, on this basis, integrates the local context information of the current topic segment and the hierarchical structure information of the document, so that the model pays more attention to local topic information and the overall structural information of the whole text when processing long scientific-paper texts.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) Conventional Transformer language models typically treat text as a linear sequence, ignoring the hierarchical information inherent to text.
(2) The longer a text is, the more topics it may describe, because each section conveys different topic information. However, the Transformer language model still has limitations in capturing and integrating local context information within topic segments.
(3) Existing encoders do not handle long text well.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides an extraction type abstract method and system combining local topic and hierarchical structure information.
The invention is realized in such a way that a method for abstracting abstract combining local topic and hierarchical structure information comprises the following steps:
step one, given an original document D = {sent_1, …, sent_n}, where sent_n denotes the n-th sentence of the original document;
obtaining context representation of a document through an encoder, and inputting the context representation into a local topic information extraction module to extract topic information of a segment to which the sentence belongs;
fusing the local topic information representation and the context representation of the document to obtain a text context representation fused with the local topic information;
step four, the text hierarchical structure information embedding module embeds the hierarchical structure information of the text into the text context representation fused with local topic information, and sentence- and document-level hierarchical structure information is learned through two stacked Transformer layers so that the model better understands the contextual structure of the text;
and fifthly, calculating the confidence score of each sentence through the Sigmoid layer to judge whether the sentence belongs to the abstract sentence.
Further, in step one, [BOS] and [EOS] tags are inserted at the beginning and end of each sentence, respectively, and the [BOS] tag is used to represent the entire sentence.
Further, in step two, a Longformer pre-trained language model is used as the text encoder, and the model embedding layer includes TE (Token Embeddings), SE (Segment Embeddings) and PE (Position Embeddings):

w_{i,j} = TE + SE + PE

The embedded representation of each word is obtained from the above, and the input sequence is context-encoded using the pre-trained Longformer;

{h_{1,0}, h_{1,1}, …, h_{N,0}, …, h_{N,*}} = Longformer(w_{1,0}, w_{1,1}, …, w_{N,0}, …, w_{N,*})

where w_{i,j} denotes the j-th word of the i-th sentence, w_{i,0} and w_{i,*} denote the [BOS] and [EOS] tags of the i-th sentence respectively, and h_{i,j} denotes the hidden state of the corresponding word; after Longformer encoding, the [BOS] tags serve as the contextual representation of each sentence, i.e. H_s = (h_{1,0}, …, h_{N,0}).
Further, the specific step of extracting the topic information of the segment to which the sentence belongs by the local topic information extraction module in the second step includes:
obtaining hidden vector representation of each sentence through Bi-LSTM coding of sentence context representation;
using the subtraction between the starting and ending hidden vectors of each topic segment to represent the local context information of the topic segment to which the sentence belongs; for the i-th topic segment t_i, the specific representation is:

f_i = f_{end_i} - f_{start_i - 1}

b_i = b_{start_i} - b_{end_i + 1}

t_i = (f_i | b_i)

where f_i and b_i denote the forward-propagation and backward-propagation representations of the topic segment respectively, start_i and end_i denote the start and end positions of the topic segment, and | denotes the vector concatenation symbol;
the 0 vector is added at the beginning and end of the forward and backward propagation, respectively, to prevent the subscript from exceeding the boundary.
Further, the hierarchical structure information of the text in the fourth step includes sentence hierarchical structure information and chapter title information;
the sentence hierarchical structure information comprises a linear position of a paragraph to which the sentence belongs and a linear position representation of the sentence in the paragraph;
the positions of paragraphs and sentences are represented by their numerical sequence numbers; for a given document D = {sent_1, …, sent_n}, the i-th sentence sent_i is represented as a two-dimensional vector (s_s, g_s) that encodes the position of the sentence in the hierarchy, specifically:

vsent_i = (s_s, g_s)

where s_s denotes the linear position, relative to the entire article, of the paragraph containing the sentence, and g_s denotes the linear position of the sentence within that paragraph;
and the chapter title information uses several chapter title categories preset according to the chapter title characteristics of the PubMed data set; if the title of the chapter in which the sentence is located does not belong to one of the preset categories, the chapter's own title is used directly.
Further, the text hierarchical structure information embedding module in step four encodes the vsent vector using the BERT position encoding method;

the hierarchical structure vector (s_s, g_s) is expressed as follows:

PE(vsent_i) = PE(s_s) | PE(g_s)

where PE denotes the position encoding method of BERT, d denotes the vector dimension of the sentence, and | denotes the vector concatenation symbol;
and encoding the chapter title information of the sentence by using a pre-training encoder which is the same as the document encoding, obtaining a hidden state corresponding to each character by inputting the extracted chapter title into the pre-training encoder, and adding each hidden state.
Further, the Sigmoid layer in step five computes the confidence score of each sentence as:

ŷ_i = σ(W_h h_i + b_h)

In the training phase, binary cross entropy is used as the loss function:

loss_i = -(y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i))

Loss = {loss_1, …, loss_n}

where σ denotes the sigmoid function, W_h denotes a learnable parameter matrix, b_h denotes the bias, loss_i denotes the loss produced when judging whether each sentence belongs to the abstract, ŷ_i denotes the predicted probability of the current sentence, and y_i denotes the true label of the sentence.
Another object of the present invention is to provide a system for abstracting a summary combining local topic and hierarchical structure information, the system comprising:
a document giving module for giving an original document;
an encoding module for obtaining a contextual representation of the document using an encoder based on a pre-trained language model;
the local topic information extraction module is used for extracting topic information of fragments of sentences in the context representation of the document;
the fusion module is used for fusing the local topic information representation and the context representation of the document to obtain a text context representation fused with the local topic information;
the text hierarchical structure information embedding module is used for embedding the hierarchical structure information of the text into the text context representation fused with the local theme information;
and the judging module is used for calculating the confidence score of each sentence through the Sigmoid layer so as to judge whether the sentence belongs to the abstract sentence.
In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:
the first and inventive technical effects are specifically described as follows:
(1) By adding text hierarchical structure information to the model, the model is more focused on important text parts, and the quality of the generated abstract is improved.
(2) A local topic information extraction module is provided for capturing local topic information of a segment to which a sentence belongs, so that a model is more deeply understood for a long document, and a high-quality abstract is generated.
(3) By modifying Longformer's Token Position Embeddings, the model is able to process longer text data.
The invention provides a long document extraction abstract model which fuses local topic information and document hierarchical structure information in a current topic segment. The model consists of a text encoder, a local subject information extraction module and a text hierarchical structure information embedding module, and can effectively generate high-quality abstracts.
Second, the invention mainly considers that, in the long-text extractive summarization task, traditional methods have two shortcomings. The first is that long text data contains a clear internal hierarchical structure and chapter title information. When we summarize text manually, we tend to focus on the important parts; for example, in a scientific paper we may be more concerned with sections such as "Method", "Experiment" and "Conclusion", while paying less attention to "Background" or "Related Work". In addition, sentences inside a section are more closely related than sentences outside it, and knowing the ordering relations among sentences and the hierarchical structure inside the document helps the model better determine the important sentences. Traditional Transformer-based text summarization methods usually treat the text as a flat sequence and cannot process long documents. The second shortcoming is that the longer a document is, the more topics it may discuss, because each section conveys different topic information. The above methods focus on full-text topic information, i.e. global information, and ignore the local topic information of these sections. To solve these problems, a long-document extractive summarization model is proposed that fuses the local topic information of the current topic segment with the document hierarchical structure information. The model consists of a text encoder, a local topic information extraction module and a text hierarchical structure information embedding module, and can effectively generate high-quality abstracts.
Thirdly, as inventive supplementary evidence of the claims of the present invention, the following important aspects are also presented:
the technical scheme of the invention fills the technical blank in the domestic and foreign industries: the invention solves the problem of insufficient extraction of original text hierarchical structure information and long text local topic information in the traditional long text extraction type text abstract method, and provides a long text extraction abstract model which fuses the local topic information and document hierarchical structure information in the current topic segment.
Whether the technical scheme of the invention solves technical problems that people have long wished to solve but never succeeded in solving: the invention mainly considers that, in the long-text extractive summarization task, traditional methods have two shortcomings. First, Transformer language models usually treat text as a linear sequence and ignore the hierarchical structure information inherent to text; second, for long text data, the Transformer language model still has limitations in capturing and integrating local context information within topic segments. To solve these problems, a long-text extractive summarization model is provided that fuses the local context information within topic segments with the document hierarchical structure information.
Drawings
FIG. 1 is a flow chart of a method for providing a decimated summary that incorporates local topic and hierarchical information in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a method for extracting abstracts in combination with local topic and hierarchical information according to an embodiment of the present invention;
FIG. 3 is a block diagram of a local topic information extraction module provided by an embodiment of the invention;
fig. 4 is a diagram of a model structure provided in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The technical key points of the embodiment of the invention, and the points to be protected, are as follows: the overall architecture of the model; the use of LSTM-Minus to obtain a distributed representation of local topic information and its combination with the text summarization task; and the use of position encoding to represent the text hierarchy information and embed it into the text context representation fused with the local topic information.
As shown in fig. 1, the method for extracting abstract combining local topic and hierarchical structure information according to the embodiment of the invention includes:
s101, step one, giving an original document D= { send 1 ,…,sent n }, send n Representing an nth sentence in the original document;
s102, obtaining a context representation of a document through an encoder, and inputting the context representation into a local topic information extraction module to extract topic information of a segment to which the sentence belongs;
s103, fusing the local topic information representation with the context representation of the document to obtain a text context representation fused with the local topic information;
s104, the text hierarchical structure information embedding module embeds the hierarchical structure information of the text into text context representation fused with local subject information, and the text context structure is better known by the model through learning sentence document level hierarchical structure information by two layers of stacked transformers;
s105, calculating the confidence score of each sentence through the Sigmoid layer to judge whether the sentence belongs to the abstract sentence.
As shown in fig. 2, the model mainly includes three modules as a whole: an encoder based on a pre-training language model, a local subject information extraction module and a text hierarchical structure information embedding module.
Because the invention works with a long-text corpus, it adopts Longformer, which improves on the pre-trained Transformer language model, as the encoder to encode long documents more fully. Specifically, for a given original document D = {sent_1, …, sent_n}, sent_n denotes the n-th sentence in the original document.
To obtain a representation of each sentence, a [BOS] and an [EOS] tag are inserted at the beginning and end of each sentence, respectively, and the [BOS] tag is used to represent the entire sentence. After the context representation of the document is obtained by the encoder, it is input into the local topic information extraction module to extract the topic information of the segment to which each sentence belongs.
The local topic information representation is then fused with the text context representation to obtain a text context representation fused with the local topic information. The text hierarchical structure information embedding module embeds the hierarchical structure information of the text into this fused representation, and sentence- and document-level hierarchical structure information is learned through two stacked Transformer layers so that the model understands the contextual structure of the text more deeply.
And finally, calculating the confidence score of each sentence through the Sigmoid layer to judge whether the sentence belongs to the abstract sentence.
1. Text hierarchy information
1.1 sentence hierarchical structure information
Since a scientific paper contains many sections, a section contains many paragraphs, and different paragraphs describe different topics, the present invention uses paragraphs as the hierarchical division unit of an article.
The sentence hierarchy information includes the linear position of the paragraph to which a sentence belongs and the linear position of the sentence within that paragraph. The positions of paragraphs and sentences are represented by their numerical sequence numbers. For a given document D = {sent_1, …, sent_n}, the i-th sentence sent_i is represented as a two-dimensional vector (s_s, g_s) that encodes the position of the sentence in the hierarchy, as shown in Equation 1.

vsent_i = (s_s, g_s)    (1)

where s_s denotes the linear position, relative to the entire article, of the paragraph containing the sentence, and g_s denotes the linear position of the sentence within that paragraph. All sentences in the same paragraph share the same first dimension of vsent, indicating that the correlation between sentences in the same paragraph is higher, while g_s further indicates the linear relationship of the sentences within the paragraph.
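A minimal sketch of how the two-dimensional hierarchy vector vsent_i = (s_s, g_s) described above can be derived when the document is available as a list of paragraphs, each a list of sentences; the function and variable names are illustrative, not taken from the patent.

```python
# Sketch (assumption): compute vsent_i = (s_s, g_s) for every sentence of a
# document organised as paragraphs. s_s is the linear paragraph position in
# the article, g_s the linear sentence position inside that paragraph.
def hierarchical_positions(paragraphs):
    positions = []
    for s_s, paragraph in enumerate(paragraphs):
        for g_s, _sentence in enumerate(paragraph):
            positions.append((s_s, g_s))
    return positions

# Example: two paragraphs with 3 and 2 sentences.
doc = [["s1", "s2", "s3"], ["s4", "s5"]]
print(hierarchical_positions(doc))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]
```

Sentences of the same paragraph share the first component, matching the observation above that sentences within one paragraph are more strongly correlated.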
1.2 chapter title information
In contrast to short news articles, long scientific papers generally carry chapter (section) titles, and the content described in a chapter is usually highly related to its title; that is, the chapter title is a condensation of the chapter's content. Based on this, the present invention adds the chapter title as additional hierarchical information when encoding sentences. However, scientific papers contain many similar section headings with the same meaning; for example, "Method" and "Methods" mean the same thing and can both be categorized as "Method". Accordingly, eight chapter title categories are set for the PubMed data set used in the present invention: "Introduction", "Background", "Case", "Method", "Result", "Discussion", "Conclusion" and "Additional Information". If the section title of the section where the sentence is located does not belong to one of these eight categories, the section's own title is used directly.
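A small sketch of the title-normalization rule described above. Only the eight category names come from this description; the keyword lists used for matching are illustrative assumptions.

```python
# Sketch (assumption): map a raw section heading onto one of the eight preset
# categories; headings that match none keep their own text, as described above.
# The keyword lists are illustrative and would need tuning on the PubMed data.
SECTION_CATEGORIES = {
    "introduction": ["introduction", "intro"],
    "background": ["background", "related work"],
    "case": ["case", "case report"],
    "method": ["method", "methods", "methodology", "approach"],
    "result": ["result", "results", "findings"],
    "discussion": ["discussion"],
    "conclusion": ["conclusion", "conclusions", "summary"],
    "additional information": ["additional information", "appendix", "supplementary"],
}

def normalize_section_title(title: str) -> str:
    t = title.strip().lower()
    for category, keywords in SECTION_CATEGORIES.items():
        if any(k in t for k in keywords):
            return category
    return title  # fall back to the section's own heading

print(normalize_section_title("Methods"))           # -> "method"
print(normalize_section_title("Ethics statement"))  # -> kept as-is
```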
2. Text encoding
The purpose of document encoding is to encode the sentences of an input document into fixed-length vector representations; previous methods typically employ RNNs or the BERT pre-trained language model as the encoder for extractive summarization. BERT is a bidirectional Transformer encoder pre-trained on large corpora and achieves excellent performance on multiple natural language processing tasks, but for long text data BERT cannot take in all of the input and information is lost, so the invention uses the Longformer pre-trained language model as the text encoder. Longformer can easily process documents of thousands of tokens by replacing the conventional Transformer self-attention mechanism with a sliding-window self-attention mechanism. The conventional Transformer self-attention mechanism linearly transforms the input word vector matrix to generate a query matrix (Query, Q), a key matrix (Key, K) and a value matrix (Value, V) of dimension d. Multiplying each Q matrix by the transpose of the K matrix yields an attention weight matrix that represents the similarity between words. After the attention weight matrix is obtained, it is multiplied by the V matrix to obtain a matrix containing the similarity relations. The specific calculation is shown in Equation 2.
Attention(Q, K, V) = softmax(QK^T / √d_k) V    (2)

where (Q, K, V) ∈ R^{L×d}, L denotes the sequence length, d denotes the word vector dimension, and d_k denotes the dimension of the K matrix. It follows that QK^T ∈ R^{L×L}, i.e. the spatial complexity of the conventional Transformer self-attention mechanism is O(L^2), quadratic in the input sequence length, so it does not scale well to long input sequences. In contrast to the traditional Transformer, Longformer uses a sliding-window self-attention mechanism. If the sliding window size is set to n, the operation in Equation 2 becomes: the i-th row of Q performs dot-product operations only with the n rows of K that fall within this window. Then QK^T ∈ R^{L×n}, where n < L, i.e. the spatial complexity of Longformer's sliding-window self-attention mechanism is O(L), linear in the input sequence length, which makes Longformer better suited to encoding long text.
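A minimal sketch contrasting full self-attention with a window-limited variant, illustrating the O(L^2) versus O(L·n) argument above. This is a simplified illustration only, not Longformer's optimized implementation (which also uses dilated windows and global attention tokens).

```python
# Sketch: full self-attention vs. a naive sliding-window variant.
import math
import torch

def full_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-1, -2) / math.sqrt(d_k)    # (L, L) -> O(L^2) memory
    return torch.softmax(scores, dim=-1) @ V

def sliding_window_attention(Q, K, V, window=4):
    L, d_k = Q.shape
    out = torch.zeros_like(V)
    half = window // 2
    for i in range(L):                                   # each query attends to
        lo, hi = max(0, i - half), min(L, i + half + 1)  # at most ~window keys
        scores = Q[i] @ K[lo:hi].T / math.sqrt(d_k)      # O(L * n) overall
        out[i] = torch.softmax(scores, dim=-1) @ V[lo:hi]
    return out

L, d = 16, 8
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
print(full_attention(Q, K, V).shape, sliding_window_attention(Q, K, V).shape)
```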
To obtain a representation of each sentence, the invention inserts [BOS] and [EOS] tags at the beginning and end of each sentence, respectively. The model embedding layer includes Token Embeddings (TE), Segment Embeddings (SE) and Position Embeddings (PE).
w_{i,j} = TE + SE + PE    (3)
After the embedded representation of each word is derived from equation 3, the input sequence is context-learned using a pre-trained Longformer.
{h_{1,0}, h_{1,1}, …, h_{N,0}, …, h_{N,*}} = Longformer(w_{1,0}, w_{1,1}, …, w_{N,0}, …, w_{N,*})    (4)

where w_{i,j} denotes the j-th word of the i-th sentence, obtained from Equation 3; w_{i,0} and w_{i,*} denote the [BOS] and [EOS] tags of the i-th sentence respectively, and h_{i,j} denotes the hidden state of the corresponding word. After Longformer encoding, the [BOS] tags serve as the contextual representation of each sentence, i.e. H_s = (h_{1,0}, …, h_{N,0}).
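A hedged sketch of this encoding step with the Hugging Face Longformer. The checkpoint name is an assumption, and Longformer's tokenizer uses RoBERTa-style "<s>"/"</s>" special tokens, which stand in here for the [BOS]/[EOS] tags described above.

```python
# Sketch (assumption): encode a document with a pre-trained Longformer and take
# the hidden state of each sentence-initial boundary token as that sentence's
# contextual representation H_s.
import torch
from transformers import LongformerModel, LongformerTokenizerFast

tok = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
enc = LongformerModel.from_pretrained("allenai/longformer-base-4096")

sentences = ["First sentence of the paper.", "Second sentence of the paper."]
# Wrap every sentence in boundary tokens and join into one long sequence.
text = "".join(f"{tok.bos_token}{s}{tok.eos_token}" for s in sentences)
inputs = tok(text, return_tensors="pt", add_special_tokens=False)

# Give the sentence-boundary tokens global attention (common Longformer usage).
global_mask = (inputs["input_ids"] == tok.bos_token_id).long()

with torch.no_grad():
    hidden = enc(**inputs, global_attention_mask=global_mask).last_hidden_state

bos_positions = (inputs["input_ids"][0] == tok.bos_token_id).nonzero(as_tuple=True)[0]
H_s = hidden[0, bos_positions]      # one vector per sentence
print(H_s.shape)                    # (num_sentences, hidden_size)
```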
3. Local topic information extraction
In order to capture the local context information of the segment to which the sentence belongs, the invention uses the LSTM-Minus method for learning text segment embedding, and the detailed structure of the method is shown in figure 3.
The input to the local topic information extraction module is a contextual representation of each sentence obtained by the encoder. Since LSTM can learn and utilize previous information through its own gating structure and save it in memory cells, the present invention derives a hidden vector representation for each sentence from the Bi-LSTM encoding of the sentence context representation. The local context information of the subject segment to which the sentence belongs is then represented by a subtraction between the starting and ending hidden vectors of each subject segment.
For the i-th topic segment t_i, the specific representation is shown in Equations (5) to (7).

f_i = f_{end_i} - f_{start_i - 1}    (5)

b_i = b_{start_i} - b_{end_i + 1}    (6)

t_i = (f_i | b_i)    (7)

where f_i and b_i denote the forward-propagation and backward-propagation representations of the topic segment respectively, start_i and end_i denote the start and end positions of the topic segment, and | denotes the vector concatenation symbol. For example, the second topic segment t_2 in FIG. 3 can be expressed as [f_5 - f_2, b_3 - b_6], where f_5 and f_2 denote the forward-propagation hidden states of the 5th and 2nd sentences, and b_3 and b_6 denote the backward-propagation hidden states of the 3rd and 6th sentences. To prevent the subscripts from exceeding the boundaries, the invention adds a 0 vector at the beginning of the forward propagation and at the end of the backward propagation, respectively. After the local context information of the topic segment to which a sentence belongs is computed, it is concatenated with the contextual encoding of the document's sentences, further enriching the sentence context representation.
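A minimal sketch of the LSTM-Minus step described above: a Bi-LSTM runs over the sentence representations, and each topic segment is represented by forward/backward differences, with zero vectors padded at the sequence boundaries. Function names and the segment format are illustrative assumptions.

```python
# Sketch (assumption): LSTM-Minus topic-segment representations.
# H_s: (num_sentences, d) sentence vectors from the encoder.
# segments: list of (start, end) sentence indices per topic segment (0-based, inclusive).
import torch
import torch.nn as nn

def topic_segment_representations(H_s, segments, hidden=128):
    lstm = nn.LSTM(H_s.size(-1), hidden, bidirectional=True, batch_first=True)  # untrained, for illustration
    out, _ = lstm(H_s.unsqueeze(0))               # (1, N, 2*hidden)
    fwd, bwd = out[0, :, :hidden], out[0, :, hidden:]
    zero = torch.zeros(1, hidden)
    fwd = torch.cat([zero, fwd, zero], dim=0)     # zero-pad so subscripts never
    bwd = torch.cat([zero, bwd, zero], dim=0)     # exceed the boundaries
    reps = []
    for start, end in segments:
        s, e = start + 1, end + 1                 # shift indices for the padding
        f_i = fwd[e] - fwd[s - 1]                 # f_i = f_end - f_(start-1)
        b_i = bwd[s] - bwd[e + 1]                 # b_i = b_start - b_(end+1)
        reps.append(torch.cat([f_i, b_i]))        # t_i = (f_i | b_i)
    return torch.stack(reps)                      # (num_segments, 2*hidden)

H_s = torch.randn(7, 768)                         # e.g. 7 sentences
print(topic_segment_representations(H_s, [(0, 1), (2, 4), (5, 6)]).shape)
```

For the segment (2, 4), i.e. sentences 3 to 5 in 1-based numbering, this reproduces the example above, [f_5 - f_2, b_3 - b_6].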
4. Text hierarchical information encoding
Currently, there are two main linear position encoding methods: the fixed values generated by sin/cos functions used in the Transformer, and the randomly initialized, trainable method used in BERT. The position encoding in the Transformer can only generate fixed values through sine or cosine functions; it marks the position of a character but cannot learn the specific effect of a position from the context information of the token. The position encoding of BERT is instead performed by randomly initializing an embedding matrix of dimension [seq_length, width], where the first dimension is the sequence length and the second dimension is the vector length corresponding to each character; the matrix is trained together with the whole summarization model, so this method can both mark the character position and learn the effect of the position. The present invention encodes the vsent vector using the BERT position encoding method. The hierarchical structure vector (s_s, g_s) can be expressed as Equation 8.

PE(vsent_i) = PE(s_s) | PE(g_s)    (8)

where PE denotes the position encoding method of BERT, d denotes the vector dimension of the sentence, and | denotes the vector concatenation symbol.
In order to encode the chapter title information (STE) to which a sentence belongs, the invention encodes it with the same pre-trained encoder used for document encoding: the extracted chapter title is fed into the pre-trained encoder to obtain the hidden state corresponding to each character, and the hidden states are summed. This better fuses the semantic information of every position in the chapter title, so that the chapter title information is represented more completely.
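A hedged sketch of the two encodings just described: trainable (BERT-style) position embeddings for the hierarchy vector (s_s, g_s), and a section-title vector obtained by summing the encoder's hidden states. The split of the model dimension between the two position codes and all module names are assumptions.

```python
# Sketch (assumption): hierarchy embedding PE(s_s) | PE(g_s) and a summed
# section-title encoding. The d/2 + d/2 split is an illustrative choice
# consistent with "|" denoting vector concatenation.
import torch
import torch.nn as nn

class HierarchyEmbedding(nn.Module):
    def __init__(self, d_model=768, max_paragraphs=128, max_sents_per_par=128):
        super().__init__()
        self.par_pos = nn.Embedding(max_paragraphs, d_model // 2)      # PE(s_s)
        self.sent_pos = nn.Embedding(max_sents_per_par, d_model // 2)  # PE(g_s)

    def forward(self, s_s, g_s):
        # Concatenate the two trainable position codes.
        return torch.cat([self.par_pos(s_s), self.sent_pos(g_s)], dim=-1)

def section_title_encoding(title, tokenizer, encoder):
    """Sum the encoder hidden states of the (normalized) section title tokens."""
    inputs = tokenizer(title, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, T, d)
    return hidden.sum(dim=1).squeeze(0)                # (d,)

emb = HierarchyEmbedding()
print(emb(torch.tensor([0, 0, 1]), torch.tensor([0, 1, 0])).shape)  # torch.Size([3, 768])
```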
5. Model training and reasoning
After the output sentence vectors are obtained from the text hierarchical structure information embedding module, sentence- and document-level hierarchical structure information is learned through two stacked Transformer layers. Finally, each sentence vector is fed into a sigmoid function to predict whether the sentence belongs to the abstract, as shown in Equation 9.

ŷ_i = σ(W_h h_i + b_h)    (9)

In the training stage, the model uses binary cross entropy as the loss function, and the training objective is to minimize the binary cross-entropy loss so as to optimize the model, as detailed in Equations 10 and 11.

Loss = {loss_1, …, loss_n}    (10)

loss_i = -(y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i))    (11)

where σ in Equation 9 denotes the sigmoid function, h_i denotes the i-th sentence vector output by the stacked Transformer layers, W_h denotes a learnable parameter matrix, and b_h denotes the bias; loss_i denotes the loss produced when judging whether each sentence belongs to the abstract, ŷ_i denotes the predicted probability of the current sentence, and y_i denotes the true label of the sentence.
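A hedged sketch of this prediction head and training step: two stacked Transformer encoder layers over the fused sentence vectors, a linear layer with sigmoid for the confidence score, and binary cross-entropy training. Layer sizes and hyperparameters are illustrative assumptions.

```python
# Sketch (assumption): sentence scoring head. Two stacked Transformer layers
# model document-level structure; sigma(W_h * h_i + b_h) gives each sentence's
# confidence score; training minimizes binary cross-entropy.
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, d_model=768, nhead=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.w_h = nn.Linear(d_model, 1)                 # W_h and b_h

    def forward(self, sent_vecs):                        # (batch, num_sentences, d)
        h = self.doc_encoder(sent_vecs)
        return torch.sigmoid(self.w_h(h)).squeeze(-1)    # scores in (0, 1)

model = SentenceClassifier()
criterion = nn.BCELoss()                                 # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

sent_vecs = torch.randn(2, 7, 768)                       # fused sentence representations
labels = torch.randint(0, 2, (2, 7)).float()             # 1 = summary sentence
loss = criterion(model(sent_vecs), labels)
loss.backward()
optimizer.step()
print(float(loss))
```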
For a better understanding of the implementation, the following walks through a specific example in conjunction with the model structure.
(1) An original document D = {sent_1, sent_2, sent_3, sent_4, sent_5, sent_6, sent_7} is taken as input, comprising 7 sentences; a [BOS] and an [EOS] tag are inserted at the beginning and end of each sentence, as shown in FIG. 4. First, the sentences pass through an embedding layer to obtain an embedded representation of each token, and the embedded representations are then input into the Longformer encoder to obtain a context representation of each sentence.
(2) The word vector matrix obtained from the Longformer encoder is input into the local topic information extraction module (Topic Segment Representation in FIG. 4), whose specific structure is shown in FIG. 3. The context representation of each sentence is first encoded by a Bi-LSTM to obtain a hidden vector representation of each sentence. The subtraction between the starting and ending hidden vectors of each topic segment then represents the local context information of the topic segment to which the sentence belongs. To prevent the subscripts from exceeding the boundaries, the invention adds a 0 vector at the beginning of the forward propagation and at the end of the backward propagation, respectively. For example, the second topic segment t_2 in FIG. 3 can be expressed as [f_5 - f_2, b_3 - b_6], where f_5 and f_2 denote the forward-propagation hidden states of the 5th and 2nd sentences, and b_3 and b_6 denote the backward-propagation hidden states of the 3rd and 6th sentences.
(3) The local topic information of each sentence is fused with the text context representation to obtain a text context representation fused with local topic information. This is then input into the text hierarchical structure information embedding module, which embeds the text hierarchical structure information into the fused representation using the BERT position encoding scheme.
(4) After the local topic information extraction module and the text hierarchical structure information embedding module, sentence- and document-level hierarchical structure information is learned through two stacked Transformer layers, enabling the model to understand the contextual structure of the text more deeply. Finally, a sentence classification layer composed of a sigmoid function calculates the confidence score of each sentence to judge whether the sentence belongs to the abstract.
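As a final illustration, a sketch of how the components sketched in the earlier examples could be chained for one document. It reuses the hypothetical topic_segment_representations and SentenceClassifier defined above with toy dimensions; it is an assumption about the data flow, not the patent's actual code.

```python
# Sketch (assumption): end-to-end flow for one toy document.
import torch

d = 32                                               # toy hidden size
sentences = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
segments = [(0, 1), (2, 4), (5, 6)]                  # topic segments (start, end)

H_s = torch.randn(len(sentences), d)                 # stand-in for Longformer [BOS] states
t = topic_segment_representations(H_s, segments, hidden=d // 2)   # (3, d)

seg_of = {i: j for j, (a, b) in enumerate(segments) for i in range(a, b + 1)}
fused = torch.stack([torch.cat([H_s[i], t[seg_of[i]]]) for i in range(len(sentences))])
# ... hierarchy / section-title embeddings would be added here (see HierarchyEmbedding) ...

head = SentenceClassifier(d_model=2 * d, nhead=4)    # stacked Transformer layers + sigmoid
scores = head(fused.unsqueeze(0))[0]                 # one confidence score per sentence
summary = [sentences[i] for i in scores.topk(3).indices.sort().values]
print(summary)
```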
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.
Claims (10)
1. A method for abstracting a summary in combination with local topic and hierarchical information, comprising:
step one, given an original document D = {sent_1, ..., sent_n}, where sent_n denotes the n-th sentence in the original document;
obtaining context representation of a document through an encoder, and inputting the context representation into a local topic information extraction module to extract topic information of a segment to which the sentence belongs;
fusing the local topic information representation and the context representation of the document to obtain a text context representation fused with the local topic information;
step four, the text hierarchical structure information embedding module embeds the hierarchical structure information of the text into the text context representation fused with local topic information, and sentence- and document-level hierarchical structure information is learned through two stacked Transformer layers so that the model better understands the contextual structure of the text;
and fifthly, calculating the confidence score of each sentence through the Sigmoid layer to judge whether the sentence belongs to the abstract sentence.
2. The method for abstracting a summary in combination with local topic and hierarchical structure information according to claim 1, wherein in step one, [BOS] and [EOS] tags are inserted at the beginning and end of each sentence, respectively, and the [BOS] tag is used to represent the entire sentence.
3. The method of claim 1, wherein in step two, a Longformer pre-trained language model is used as the text encoder, and the model embedding layer includes TE (Token Embeddings), SE (Segment Embeddings) and PE (Position Embeddings):

w_{i,j} = TE + SE + PE
Obtaining embedded representation of each word from the above, and performing context learning on the input sequence by using a pre-training Longformer;
{h_{1,0}, h_{1,1}, …, h_{N,0}, …, h_{N,*}} = Longformer(w_{1,0}, w_{1,1}, …, w_{N,0}, …, w_{N,*})

where w_{i,j} denotes the j-th word of the i-th sentence, w_{i,0} and w_{i,*} denote the [BOS] and [EOS] tags of the i-th sentence respectively, and h_{i,j} denotes the hidden state of the corresponding word; after Longformer encoding, the [BOS] tags serve as the contextual representation of each sentence, i.e. H_s = (h_{1,0}, …, h_{N,0}).
4. The method for abstracting a summary by combining local topic and hierarchical structure information according to claim 1, wherein the specific step of extracting topic information of a segment to which a sentence belongs by the local topic information extraction module in the second step comprises:
obtaining hidden vector representation of each sentence through Bi-LSTM coding of sentence context representation;
using the subtraction between the starting and ending hidden vectors of each topic segment to represent the local context information of the topic segment to which the sentence belongs; for the i-th topic segment t_i, the specific representation is:

f_i = f_{end_i} - f_{start_i - 1}

b_i = b_{start_i} - b_{end_i + 1}

t_i = (f_i | b_i)

where f_i and b_i denote the forward-propagation and backward-propagation representations of the topic segment respectively, start_i and end_i denote the start and end positions of the topic segment, and | denotes the vector concatenation symbol;
the 0 vector is added at the beginning and end of the forward and backward propagation, respectively, to prevent the subscript from exceeding the boundary.
5. The method for abstracting a summary in combination with local topic and hierarchical information as claimed in claim 1, wherein the hierarchical information of the text in the fourth step includes sentence hierarchical information and chapter title information;
the sentence hierarchical structure information comprises a linear position of a paragraph to which the sentence belongs and a linear position representation of the sentence in the paragraph;
the positions of paragraphs and sentences are represented by their numerical sequence numbers; for a given document D = {sent_1, ..., sent_n}, the i-th sentence sent_i is represented as a two-dimensional vector (s_s, g_s) that encodes the position of the sentence in the hierarchy, specifically:

vsent_i = (s_s, g_s)

where s_s denotes the linear position, relative to the entire article, of the paragraph containing the sentence, and g_s denotes the linear position of the sentence within that paragraph;
the chapter title information adopts a PubMed data set preset with a plurality of chapter title categories, and if the chapter title of the chapter where the sentence is located does not belong to one of the preset chapter title categories, the chapter title of the sentence is directly used.
6. The method for abstracting a summary in combination with local topic and hierarchical information according to claim 1, wherein the text hierarchical structure information embedding module in step four encodes the vsent vector using the BERT position encoding method;

for the hierarchical structure vector (s_s, g_s), the expression is as follows:

PE(vsent_i) = PE(s_s) | PE(g_s)

where PE denotes the position encoding method of BERT, d denotes the vector dimension of the sentence, and | denotes the vector concatenation symbol;
and encoding the chapter title information of the sentence by using a pre-training encoder which is the same as the document encoding, obtaining a hidden state corresponding to each character by inputting the extracted chapter title into the pre-training encoder, and adding each hidden state.
7. The method for abstracting a summary in combination with local topic and hierarchical information according to claim 1, wherein the Sigmoid layer in step five computes the confidence score of each sentence as:

ŷ_i = σ(W_h h_i + b_h)

in the training phase, binary cross entropy is used as the loss function:

loss_i = -(y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i))

Loss = {loss_1, ..., loss_n}

where σ denotes the sigmoid function, W_h denotes a learnable parameter matrix, b_h denotes the bias, loss_i denotes the loss produced when judging whether each sentence belongs to the abstract, ŷ_i denotes the predicted probability of the current sentence, and y_i denotes the true label of the sentence.
8. A local topic and hierarchy information combined extraction summarization system for implementing the local topic and hierarchy information combined extraction summarization method of any one of claims 1-7, wherein the local topic and hierarchy information combined extraction summarization system comprises:
a document giving module for giving an original document;
an encoding module for obtaining a contextual representation of the document using an encoder based on a pre-trained language model;
the local topic information extraction module is used for extracting topic information of fragments of sentences in the context representation of the document;
the fusion module is used for fusing the local topic information representation and the context representation of the document to obtain a text context representation fused with the local topic information;
the text hierarchical structure information embedding module is used for embedding the hierarchical structure information of the text into the text context representation fused with the local theme information;
and the judging module is used for calculating the confidence score of each sentence through the Sigmoid layer so as to judge whether the sentence belongs to the abstract sentence.
9. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of abstracting incorporating local topic and hierarchical information as claimed in any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of extracting abstract combining local topic and hierarchical structure information as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310699985.0A CN116860959A (en) | 2023-06-13 | 2023-06-13 | Extraction type abstract method and system combining local topic and hierarchical structure information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310699985.0A CN116860959A (en) | 2023-06-13 | 2023-06-13 | Extraction type abstract method and system combining local topic and hierarchical structure information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116860959A true CN116860959A (en) | 2023-10-10 |
Family
ID=88218108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310699985.0A Pending CN116860959A (en) | 2023-06-13 | 2023-06-13 | Extraction type abstract method and system combining local topic and hierarchical structure information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116860959A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951291A (en) * | 2024-03-26 | 2024-04-30 | 西南石油大学 | Two-stage local generation type abstract method based on guiding mechanism |
-
2023
- 2023-06-13 CN CN202310699985.0A patent/CN116860959A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951291A (en) * | 2024-03-26 | 2024-04-30 | 西南石油大学 | Two-stage local generation type abstract method based on guiding mechanism |
CN117951291B (en) * | 2024-03-26 | 2024-05-31 | 西南石油大学 | Two-stage local generation type abstract method based on guiding mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | An overview of image caption generation methods | |
Gupta et al. | Abstractive summarization: An overview of the state of the art | |
Alomari et al. | Deep reinforcement and transfer learning for abstractive text summarization: A review | |
CN110390103B (en) | Automatic short text summarization method and system based on double encoders | |
Jung | Semantic vector learning for natural language understanding | |
Khan et al. | Deep recurrent neural networks with word embeddings for Urdu named entity recognition | |
Xiao et al. | A new attention-based LSTM for image captioning | |
Liu et al. | Uamner: uncertainty-aware multimodal named entity recognition in social media posts | |
Wang et al. | Data set and evaluation of automated construction of financial knowledge graph | |
CN116958997B (en) | Graphic summary method and system based on heterogeneous graphic neural network | |
Luo et al. | A thorough review of models, evaluation metrics, and datasets on image captioning | |
Tarride et al. | A comparative study of information extraction strategies using an attention-based neural network | |
Heo et al. | Multimodal neural machine translation with weakly labeled images | |
CN116187317A (en) | Text generation method, device, equipment and computer readable medium | |
CN116860959A (en) | Extraction type abstract method and system combining local topic and hierarchical structure information | |
Xie et al. | Extractive text-image summarization with relation-enhanced graph attention network | |
Chou et al. | An analysis of BERT (NLP) for assisted subject indexing for Project Gutenberg | |
Yazar et al. | Low-resource neural machine translation: A systematic literature review | |
Xie et al. | ReCoMIF: Reading comprehension based multi-source information fusion network for Chinese spoken language understanding | |
Wang et al. | A study of extractive summarization of long documents incorporating local topic and hierarchical information | |
Chavali et al. | A study on named entity recognition with different word embeddings on gmb dataset using deep learning pipelines | |
Qi et al. | Video captioning via a symmetric bidirectional decoder | |
Wang et al. | RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction | |
Song et al. | Generative Event Extraction via Internal Knowledge-Enhanced Prompt Learning | |
Tian et al. | Semantic similarity measure of natural language text through machine learning and a keyword‐aware cross‐encoder‐ranking summarizer—A case study using UCGIS GIS &T body of knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |