CN116049394A - Long text similarity comparison method based on graph neural network - Google Patents
- Publication number
- CN116049394A (application CN202211656521.3A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- source text
- loss function
- text
- constructing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/353 — Information retrieval of unstructured textual data; Clustering; Classification into predefined classes
- G06F16/3346 — Query execution using probabilistic model
- G06F16/345 — Browsing; Summarisation for human users
- G06F16/367 — Creation of semantic tools; Ontology
- G06F40/30 — Handling natural language data; Semantic analysis
- G06N3/08 — Neural networks; Learning methods
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of text summarization, in particular to a long text similarity comparison method based on a graph neural network. The method comprises: preprocessing a source text to obtain word embeddings for the words in each sentence; inputting the word embeddings into a pre-trained BERT; feeding the BERT output into a dilated gated convolutional network and aggregating the two outputs in residual form to prevent vanishing gradients; constructing a multi-level semantic similarity graph and a multi-level natural relation graph, passing each through a graph attention layer and then aggregating them; predicting sentence labels through an activation function and a fully connected layer; and constructing a summary loss function. Finally, a model loss function is obtained from the summary loss function and the topic model loss function, so that the extracted text summary retains the important information of the source text more comprehensively.
Description
Technical Field
The invention relates to the technical field of text summarization, in particular to a long text similarity comparison method based on a graph neural network.
Background
In recent years, the number of research papers published in conferences and journals has grown explosively. This large body of academic articles is a precious source of information and knowledge. With the rapid iteration of science and technology, beginners and newcomers to a field face so many papers that it is difficult to select suitable articles to read, and even when a suitable scientific paper is found, it is hard to grasp its general content quickly. How to effectively summarize a great number of scientific papers has therefore become a focal problem.
In addition, a large number of papers involve fabrication and plagiarism. The PubMed database contains up to 34 million papers, of which at least 340,000 are estimated to be potentially problematic. Some plagiarized papers rephrase their wording while the core content is unchanged and still falls within the scope of plagiarism, a situation that full-text comparison cannot detect. Therefore, by generating a long-text summary of the whole article and then comparing the core summaries of papers, the computational cost is greatly reduced and plagiarism can be discovered effectively.
Text summarization, which models the sentences of a document, provides an effective solution and has achieved great success, especially on news and other short texts. Abstractive and extractive approaches are the two common methods, and much current research builds on both. However, most existing methods summarize short texts; when facing long texts such as medical papers, they usually process only the abstract section, because they still have problems handling long text, such as failing to capture fine-grained structural features, failing to guarantee sentence consistency, and producing grammatical errors. If an article is summarized only from its abstract section, much detailed technical information and many experimental details are ignored. Conventional short-text summarization models are therefore not ideal for summarizing long texts directly: they cannot fully cover the important information conveyed in a given scientific text, and so cannot compare the similarity of long texts fully and effectively.
Disclosure of Invention
The invention aims to provide a long text similarity comparison method based on a graph neural network, to solve the technical problems that existing pre-trained language models cannot capture the contextual relations of long text sequences, hence cannot comprehensively construct inter-sentence relations, and ultimately cannot compare the similarity of long texts comprehensively and effectively.
In order to achieve the above purpose, the present invention provides a long text similarity comparison method based on a graph neural network, comprising the following steps:
acquiring a source text and an abstract label corresponding to the source text;
preprocessing the source text, and inputting the source text into a pre-trained BERT to obtain embedded representation of each sentence in the source text;
inputting each sentence embedding of the source text into a dilated gated convolutional network, and aggregating in residual form to obtain a new embedded representation of each sentence in the source text;
constructing a heterogeneous graph of each sentence of the source text with the words belonging to that sentence, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer;
constructing a heterogeneous graph of each sentence of the source text with its related topics, constructing a natural relation graph between sentences through a topic model, and iterating through a graph attention layer;
aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating a probability value for each sentence to obtain a predicted summary, and constructing a summary loss function and a topic loss function;
calculating a model loss function based on the summary loss function and the topic loss function;
and adjusting the weights of the whole neural network based on the model loss function to obtain a text summarization model.
The process for preprocessing the source text comprises the following steps:
inserting a start symbol [ CLS ] and an end symbol [ SEP ] into every sentence in the source text, and representing each word in the sentences in the form of word vectors;
each input sequence is fed into the pre-trained BERT with a length of no more than 512 tokens, and each sentence's features are represented by the first vector (the [ CLS ] position) of that sentence.
Embedding each sentence of the source text and inputting it into the dilated gated convolutional network means inputting the feature representation of each sentence, in sequence order, into the gated convolution, so as to reduce the risk of vanishing gradients;
the output of the gated convolution is then input into a dilated convolutional network to learn longer-range context features, and finally the features of each sentence are aggregated with the features of the dilated convolutional network in residual form to obtain a new sentence feature representation.
Wherein, the process of constructing the heterogeneous graph of each sentence of the source text with the words belonging to it includes the following steps:
constructing a heterogeneous graph connecting each sentence of the source text with the words belonging to it, with conjunctions, exclamations and other meaningless words removed;
constructing an inter-sentence graph from the sentences of the source text through BM25 semantic similarity;
and inputting the multi-level heterogeneous semantic similarity representation of each sentence into the graph attention network layer for iterative calculation to obtain a new representation.
Wherein the process of constructing a heterogeneous graph of each sentence of the source text with its related topics comprises the steps of:
inputting the source text into a neural topic model to obtain a set number of topics, and constructing heterogeneous graphs of the sentences in the source text and their related topics;
constructing a natural relation graph between the sentences of the source text by connecting sentences that conform to similar topics;
and inputting the multi-level natural relation representation of each sentence into the graph attention network layer for iterative calculation to obtain a new representation.
The steps of aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating a probability value for each sentence to obtain a predicted summary, and constructing the summary loss and topic loss functions, comprise:
selecting the sentences with the largest probability values as predicted summary sentences to obtain the predicted summary;
constructing a cross-entropy loss function based on the predicted summary of the source text and the label summary;
and constructing a KL-divergence topic loss function based on the word topics and document topics of the source text.
The summary loss function is specifically: l_1 = -Σ_i [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ], where y_i is the label of sentence i and ŷ_i is its predicted probability.
The topic model loss function is specifically: l_NTM = D_KL(p(z) ‖ q(z|x)) - E_{q(z|x)}[ p(x|z) ], where p(z) is the probability of topic z, q(z|x) is the probability of topic z given word x, and p(x|z) is the probability of word x given topic z.
The model loss function is specifically: l = α·l_1 + (1 - α)·l_NTM, where l_1 is the cross-entropy summary loss function, l_NTM is the topic model loss function, and α is a weight value that balances the two loss functions.
The invention provides a long text similarity comparison method based on a graph neural network. The method preprocesses a source text to obtain word embeddings for the words in each sentence, inputs the word embeddings into a pre-trained BERT, feeds the BERT output into a dilated gated convolutional network, and aggregates the two outputs in residual form to prevent vanishing gradients. It then constructs a multi-level semantic similarity graph and a multi-level natural relation graph, passes each through a graph attention layer before aggregating them, predicts sentence labels through an activation function and a fully connected layer, and constructs a summary loss function. Finally, a model loss function is obtained from the summary loss function and the topic model loss function, so that the extracted text summary retains the important information of the source text more comprehensively.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a long text similarity comparison method based on a graph neural network.
Fig. 2 is a schematic diagram of the overall structure of the text summarization model of the present invention.
FIG. 3 is a schematic diagram of the pre-trained BERT model structure of the present invention.
FIG. 4 is a schematic diagram of the structure of the dilated gated convolution model of the present invention.
Fig. 5 is a diagram of semantic similarity among multiple layers of sentences according to the present invention.
Fig. 6 is a diagram of natural relationships between multiple layers of sentences according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1, the invention provides a long text similarity comparison method based on a graph neural network, which comprises the following steps:
s1: acquiring a source text and an abstract label corresponding to the source text;
s2: preprocessing the source text and inputting it into a pre-trained BERT to obtain an embedded representation of each sentence in the source text;
s3: inputting each sentence embedding of the source text into a dilated gated convolutional network and aggregating in residual form to obtain a new embedded representation of each sentence;
s4: constructing a heterogeneous graph of each sentence with the words belonging to it, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer;
s5: constructing a heterogeneous graph of each sentence with its related topics, constructing a natural relation graph between sentences through a topic model, and iterating through a graph attention layer;
s6: aggregating the two different feature representations of each sentence, inputting them into a fully connected layer with an activation function, calculating a probability value for each sentence to obtain a predicted summary, and constructing a summary loss function and a topic loss function;
s7: calculating a model loss function based on the summary loss function and the topic loss function;
s8: adjusting the weights of the whole neural network based on the model loss function to obtain a text summarization model.
The invention is further described in connection with the following specific implementation steps:
s1: acquiring a source text and an abstract label corresponding to the source text;
the source text refers to the data used for training the text summarization model, and the abstract label refers to a preset label identifying the reference summary.
S2: preprocessing the source text and inputting it into a pre-trained BERT to obtain an embedded representation of each sentence in the source text;
wherein the preprocessing inserts a start symbol [ CLS ] and an end symbol [ SEP ] into each sentence in the source text and maps each word in the sentence to a word vector, including the start and end symbols. After the BERT output, each sentence is represented by the vector at its start symbol [ CLS ].
S3: inputting each sentence embedding of the source text into the dilated gated convolutional network, and aggregating in residual form to obtain a new embedded representation of each sentence in the source text.
The dilated gated convolutional network combines a dilated convolution with a gated convolution. Because of the input-length limitation of the pre-trained language model, context beyond 512 tokens would otherwise be truncated as irrelevant; the dilated gated convolutional network can capture this context and effectively alleviate the fixed-length limitation of the input sequence. The gated convolution improves performance, and the residual structure prevents vanishing gradients.
S4: constructing a heterogeneous graph of each sentence of the source text with the words belonging to it, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer.
Each sentence forms a heterogeneous graph with the words composing it; the graph does not include conjunctions, exclamations and other words without practical meaning. After the heterogeneous graph is built, the semantic similarity graph between sentences is built through BM25. Graph attention between sentences and words is calculated first and iterated several times, where iterating means that word-to-sentence attention and sentence-to-word attention are computed alternately for several rounds. Graph attention between sentences is then calculated and likewise iterated several times, so as to capture the similarity relations between sentences.
S5: constructing a heterogeneous graph of each sentence of the source text with its related topics, constructing a natural relation graph between sentences through the topic model, and iterating through a graph attention layer.
The topic model is constructed as follows: from a bag-of-words representation, a mean and a covariance are obtained through two different fully connected networks, a Gaussian distribution is constructed, the topic probabilities are obtained through an activation function, and a certain number of topics are extracted. Heterogeneous graphs between sentences and topics are then constructed through conditional probability, and the natural relation graph between sentences is constructed from the conditional topic probabilities between sentences, which effectively captures synonym relations. Next, graph attention between sentences and topics is calculated and iterated several times, where iterating means that topic-to-sentence attention and sentence-to-topic attention are computed alternately. Then the natural relations between sentences are captured by calculating the graph attention between sentences.
S6: aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating a probability value for each sentence to obtain the predicted summary, and constructing the summary loss function and the topic loss function;
wherein calculating the probability value means calculating the probability of selecting a sentence for the summary.
Specifically, the summary loss function is:
l_1 = -Σ_i [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
where y_i represents the sentence label and ŷ_i represents the predicted sentence label; the loss function is constructed by comparing the summary labels (i.e. the theoretical abstract) with the predicted summary (i.e. the abstract actually output by the fully connected neural network).
Specifically, the topic loss function is:
l_NTM = D_KL(p(z) ‖ q(z|x)) - E_{q(z|x)}[ p(x|z) ]
where p(z) is the probability of topic z, q(z|x) is the probability of topic z given word x and represents the encoder network of the topic model, and p(x|z) is the probability of word x given topic z and represents the decoder network of the topic model.
S7: calculating the model loss function based on the summary loss and the topic loss function.
The model loss function is specifically: l = α·l_1 + (1 - α)·l_NTM, where l_1 is the cross-entropy summary loss function, l_NTM is the topic model loss function, and α is the weight parameter that balances the two loss functions.
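The weighted combination of the two losses can be sketched in a few lines; the function below is an illustrative helper (the name `model_loss` and the default `alpha` are assumptions for the sketch, not values from the patent):

```python
def model_loss(summary_loss: float, topic_loss: float, alpha: float = 0.7) -> float:
    """Weighted combination l = alpha * l_1 + (1 - alpha) * l_NTM."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * summary_loss + (1.0 - alpha) * topic_loss
```

In training, `alpha` would be tuned so that neither the extraction objective nor the topic-model objective dominates the gradient updates.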
S8: and adjusting the weight of the whole neural network based on the model loss function to obtain a text abstract model.
Wherein figure 2 is the overall structure of the model.
Further, as shown in fig. 3, the source text is preprocessed and input into a pretrained BERT to obtain an embedded representation of each sentence in the source text, which specifically includes the following steps:
in the source text, respectively inserting a start symbol [ CLS ] and an end symbol [ SEP ] into the beginning and the end of each sentence, and then representing each word in the sentence by using a pre-trained word vector, wherein the word vector comprises the start symbol and the end symbol;
all word vectors are input into the BERT, specifically:
{ h_{1,0}, h_{1,1}, ..., h_{N,0}, ..., h_{N,*} } = BERT( w_{1,0}, w_{1,1}, ..., w_{N,0}, ..., w_{N,*} )
where w_{i,j} represents the j-th word of the i-th sentence, w_{i,0} and w_{i,*} respectively represent the start symbol [ CLS ] and the end symbol [ SEP ] of the i-th sentence, h_{i,j} represents the hidden state of the corresponding symbol, and h_{i,0} represents the representation of the i-th sentence.
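The preprocessing step — wrapping each sentence with [ CLS ] / [ SEP ] and packing sentences into input sequences of at most 512 tokens — can be sketched as follows; `chunk_sentences` is a hypothetical helper operating on already-tokenized sentences, not the patent's implementation:

```python
def chunk_sentences(sentences, max_len=512):
    """Wrap each tokenized sentence with [CLS]/[SEP] and pack sentences
    into input sequences of at most max_len tokens, as in step S2."""
    chunks, current = [], []
    for sent in sentences:
        wrapped = ["[CLS]"] + list(sent) + ["[SEP]"]
        if len(wrapped) > max_len:                 # truncate an over-long sentence
            wrapped = wrapped[:max_len - 1] + ["[SEP]"]
        if current and len(current) + len(wrapped) > max_len:
            chunks.append(current)                 # start a new input sequence
            current = []
        current.extend(wrapped)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be fed to BERT separately, and the hidden state at every [ CLS ] position taken as the corresponding sentence representation h_{i,0}.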
Further, as shown in fig. 4, each sentence embedding of the source text is input into the dilated gated convolutional network, with the following specific steps:
firstly, a residual structure is used with the gated convolutional neural network, with the specific formula:
Y = Conv1D1(H) ⊗ σ(Conv1D2(H)) + H
where σ is the sigmoid activation function and H = { h_{1,0}, ..., h_{i,0}, ..., h_{n,0} } is the sentence representation output by BERT; the Conv1D1 and Conv1D2 convolution layers use the same window size and number of convolution kernels, but the weights of the two layers are not shared.
The output Y of the gated convolution is input into the dilated convolution, where the dilation rates of the layers are set to 1, 2 and 4 respectively and the window size is 3 throughout, yielding the new features H′ = { h′_{1,0}, ..., h′_{i,0}, ..., h′_{n,0} }.
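A minimal numerical sketch of one such gated convolution block with dilation and a residual connection follows; the function name, the explicit loops (a real implementation would use an optimized convolution primitive), and the kernel shapes are illustrative assumptions:

```python
import numpy as np

def dilated_gated_conv1d(H, W1, W2, dilation=1):
    """One dilated gated convolution block with a residual connection:
    Y = Conv1D1(H) * sigmoid(Conv1D2(H)) + H  (sketch of the DGCNN block).
    H: (seq_len, d) sentence features; W1, W2: (window, d, d) unshared kernels."""
    seq_len, d = H.shape
    window = W1.shape[0]
    pad = dilation * (window - 1) // 2
    Hp = np.pad(H, ((pad, pad), (0, 0)))
    out1 = np.zeros_like(H)
    out2 = np.zeros_like(H)
    for t in range(seq_len):
        for k in range(window):
            x = Hp[t + k * dilation]
            out1[t] += x @ W1[k]       # content path (Conv1D1)
            out2[t] += x @ W2[k]       # gate path (Conv1D2)
    gate = 1.0 / (1.0 + np.exp(-out2)) # sigmoid gate
    return out1 * gate + H             # gated output plus residual
```

Stacking three such blocks with dilation rates 1, 2 and 4 widens the receptive field geometrically, which is how the network reaches context beyond the 512-token BERT window.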
Further, as shown in fig. 5, a heterogeneous graph is constructed from each sentence of the source text and the words belonging to it, an inter-sentence graph is constructed through semantic similarity, and iteration is performed through a graph attention layer, with the following specific steps:
the heterogeneous graph of a sentence and its words does not include stop words, conjunctions and other words without practical meaning.
Semantic similarity between sentences is constructed with the classical BM25 algorithm. Sentence-word graph attention is calculated first and iterated a certain number of times, where iterating refers to alternating word-to-sentence and sentence-to-word attention calculations; sentence-to-sentence attention is then calculated.
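BM25 itself is a standard ranking function; a self-contained sketch of scoring one sentence's words against the other sentences (to weight the inter-sentence edges) might look like this, with the default `k1` and `b` being the usual textbook values rather than values from the patent:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """BM25 score of a query (one sentence's words) against each document
    (every other sentence); used here to weight inter-sentence edges."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Sentence pairs whose BM25 score exceeds some threshold would then be connected in the semantic similarity graph.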
The word vectors use pre-trained vectors; all vocabulary is contained in glove.42B.300d.txt, meaningless words are filtered out, and the sentence vectors are output from each sentence's start symbol through BERT and the DGCNN;
The word-to-sentence graph attention is calculated as:
z_ij = LeakyReLU( W_a [ W_q h_i ; W_k w_j ; e_ij ] )
and the sentence-to-word graph attention as:
z_ji = LeakyReLU( W_a [ W_v w_j ; W_m h_i ; e_ij ] )
where h_i and w_j are the sentence and word vectors respectively, e_ij represents the edge between them, W_a, W_q, W_k, W_v, W_m are trainable parameters, LeakyReLU is the activation function, z_ij is the word-to-sentence attention score, and z_ji is the sentence-to-word attention score.
The attention weights are normalized by softmax: α_ij = exp(z_ij) / Σ_{l∈N_i} exp(z_il), where α_ij is the weight of word w_j to sentence h_i, N_i denotes the set of words connected to sentence vector h_i, α_ji is the weight of sentence h_i to word w_j, and N_j denotes the set of sentences connected to word vector w_j.
The invention adopts multi-head attention with K heads, where ‖ denotes concatenation (aggregation), σ is an activation function, W_1^k, W_2^k are trainable parameters, j ∈ N_i denotes the set of word vectors connected to sentence h_i, and i ∈ N_j the set of sentences connected to word w_j. After the multi-head attention calculation, a new sentence matrix S and a new word matrix Z are obtained respectively.
To prevent vanishing gradients, a residual structure is added:
h′_i = u_i + h_i
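A single word-to-sentence attention step with softmax normalization and the residual connection above can be sketched as follows; the edge feature e_ij and the multi-head aggregation are omitted for brevity, so this is a simplified single-head illustration, not the patent's exact layer:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def word_to_sentence_attention(H, W, neighbors, Wq, Wk, Wa):
    """One word-to-sentence graph-attention step with a residual connection.
    H: (n_sent, d) sentence vectors; W: (n_word, d) word vectors;
    neighbors[i]: word indices connected to sentence i.
    The edge feature e_ij of the patent's formula is omitted for brevity."""
    n, d = H.shape
    out = H.copy()                                   # residual: h' = u_i + h_i
    for i in range(n):
        idx = neighbors[i]
        if not idx:
            continue
        # z_ij = LeakyReLU(W_a [W_q h_i ; W_k w_j])
        z = np.array([
            float(leaky_relu(Wa @ np.concatenate([Wq @ H[i], Wk @ W[j]])))
            for j in idx
        ])
        a = np.exp(z - z.max())
        a /= a.sum()                                 # softmax over the neighborhood
        out[i] = out[i] + sum(wgt * W[j] for wgt, j in zip(a, idx))
    return out
```

Running this update alternately in both directions (words to sentences, then sentences to words) for several rounds corresponds to the iteration described above.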
At the first word-to-sentence iteration:
S^1 = GAT(S^0, Z^0, Z^0)
where the sentence embedding matrix S^0 serves as the query matrix, and the key and value matrices are given by the word embedding matrix Z^0.
At the (t+1)-th word-to-sentence iteration:
U_s^{t+1} = GAT(S^t, Z^t, Z^t)
and, to prevent vanishing gradients, a residual structure is added: S^{t+1} = U_s^{t+1} + S^t.
At the (t+1)-th sentence-to-word iteration:
U_w^{t+1} = GAT(Z^t, S^{t+1}, S^{t+1})
and, to prevent vanishing gradients, a residual structure is added: Z^{t+1} = U_w^{t+1} + Z^t.
With the new feature representations of the sentences obtained, graph attention is calculated between sentences over the semantic similarity graph, where Y is the set of sentences connected to sentence s through semantic similarity. To prevent vanishing gradients, a residual structure is again added, where the result is the new feature of the corresponding sentence, distinguished from the natural-relation features, and the previous term represents the feature of the last iteration.
Further, as shown in fig. 6, a heterogeneous graph is constructed from each sentence of the source text and its related topics, an inter-sentence graph is constructed through conditional topic probability, and iteration is performed through a graph attention layer, with the following specific steps:
The topic model is constructed by taking the source text as input and obtaining its bag-of-words representation X_bow. The bag of words is passed through two different linear fully connected layers with activation functions to obtain the mean μ and standard deviation σ, as follows:
μ = Relu(F_1(X_bow))
σ = Relu(F_2(X_bow))
where Relu is the activation function and F_1, F_2 are fully connected layers.
The topic Gaussian distribution z ~ N(μ, σ^2) is constructed from the mean and variance, and the topic distribution θ = softmax(z) is then obtained, where P_w is the topic-word distribution matrix whose entries represent the relevance of word i to topic j; the relationship between topics and words is constructed through P_w.
Specifically, a certain number of topics H_T and their associated weights Tc_d are obtained by applying a linear transfer function with a Relu activation to the transpose of P_w, where θ(i) is the probability of the i-th topic and H_T(i) is the i-th topic.
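The encoder side of such a neural topic model — two ReLU fully connected layers producing μ and σ, a reparameterized Gaussian sample, and a softmax giving the topic distribution — can be sketched as follows; the weight shapes and the absence of bias terms are simplifying assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ntm_encode(x_bow, W1, W2, rng):
    """Neural-topic-model encoder sketch: mu = Relu(F1(X_bow)),
    sigma = Relu(F2(X_bow)), z drawn by reparameterization from
    N(mu, sigma^2), and theta = softmax(z) as the topic distribution."""
    mu = np.maximum(0.0, W1 @ x_bow)                   # mean
    sigma = np.maximum(0.0, W2 @ x_bow)                # standard deviation
    z = mu + sigma * rng.standard_normal(mu.shape)     # z ~ N(mu, sigma^2)
    return softmax(z)                                  # topic distribution theta
```

The reparameterization (sampling a standard normal and scaling by σ) is what keeps the sampling step differentiable so the KL-divergence topic loss can be backpropagated.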
Firstly, sentences are connected with their related topics; then graph attention between sentences and topics is calculated and iterated a certain number of times, where iterating refers to alternating topic-to-sentence and sentence-to-topic attention calculations, yielding new embedded representations U and T of the sentences and topics; sentence-to-sentence graph attention is then calculated.
The sentence-to-topic graph attention formula is:
z_ij = LeakyReLU( W_a [ W_q h_i ; W_k t_j ; e_ij ] )
and the topic-to-sentence graph attention formula is:
z_ji = LeakyReLU( W_a [ W_v t_j ; W_m h_i ; e_ij ] )
where h_i and t_j represent the sentence and topic vectors, e_ij is the edge between them, W_a, W_q, W_k, W_v, W_m are trainable parameters, LeakyReLU is the activation function, z_ij is the sentence-to-topic attention score, and z_ji is the topic-to-sentence attention score.
α_ij is the weight of h_i to t_j, where N_i denotes the set of topics connected to sentence vector h_i; α_ji is the weight of t_j to h_i, where N_j is the set of sentences connected to topic vector t_j.
The invention adopts multi-head attention with K heads, where ‖ denotes concatenation (aggregation), σ is an activation function, W^k are trainable parameters, j ∈ N_i denotes the set of topic vectors connected to sentence vector h_i, and N_j is the set of sentence vectors connected to topic vector t_j.
Meanwhile, to prevent the gradient from vanishing, a residual structure is also adopted:

h' = u_i + h_i
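The scoring, softmax normalization, and residual steps described above can be sketched for a single attention head as follows; all weight shapes and helper names are illustrative assumptions:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # LeakyReLU activation
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def topic_to_sentence_attention(h, topics, edges, Wa, Wq, Wk):
    """One illustrative attention head: score each connected topic t_j against
    sentence h via z_ij = LeakyReLU(Wa [Wq h; Wk t_j; e_ij]), softmax-normalize
    the scores, aggregate the topic values, and apply the residual h' = u + h."""
    scores = []
    for t, e in zip(topics, edges):
        feat = np.concatenate([Wq @ h, Wk @ t, e])
        scores.append(float(leaky_relu(Wa @ feat)))
    alpha = softmax(np.array(scores))              # attention weights, sum to 1
    u = sum(a * (Wk @ t) for a, t in zip(alpha, topics))
    return u + h                                   # residual connection
```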
In the first topic-to-sentence iteration, the query matrix is the sentence embedding matrix U, and the key and value matrices are the topic embedding matrix T:

H^1 = GAT(U, T, T)

The (t+1)-th topic-to-sentence iteration is:

H^{t+1} = GAT(H^t, T^t, T^t)

and, to prevent the gradient from vanishing, a residual structure is added:

H^{t+1} = H^{t+1} + H^t

The (t+1)-th sentence-to-topic iteration is:

T^{t+1} = GAT(T^t, H^{t+1}, H^{t+1})

again with a residual structure to prevent the gradient from vanishing:

T^{t+1} = T^{t+1} + T^t
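The alternating topic-to-sentence and sentence-to-topic passes with residual connections can be sketched as follows, using a plain scaled dot-product attention as a stand-in for the graph attention layer (an assumption for brevity; the patent's layer is the edge-aware GAT described above):

```python
import numpy as np

def soft_attention(Q, K, V):
    # Scaled dot-product attention as a toy stand-in for one graph-attention pass.
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

def iterate(H, T, steps=2):
    """Alternate topic-to-sentence and sentence-to-topic updates, each followed
    by a residual connection to prevent vanishing gradients."""
    for _ in range(steps):
        H = soft_attention(H, T, T) + H   # topic -> sentence: query is H, key/value are T
        T = soft_attention(T, H, H) + T   # sentence -> topic: query is T, key/value are H
    return H, T
```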
New feature representations of the sentences are obtained, and sentence-to-sentence graph attention is then computed between sentences, where N is the set of sentences connected with sentence s through shared topics. The edge between two sentences is weighted by the KL divergence between their topic distributions, which establishes a natural relationship between the two sentences.

To prevent the gradient from vanishing, a residual structure is added, where the updated vector is the new feature of the corresponding sentence and the added term is the feature from the last iteration.
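Building sentence-to-sentence edges from the KL divergence between topic distributions, as described above, might look like the following sketch (the connection threshold is an illustrative assumption):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # D_KL(p || q) for discrete distributions, smoothed to avoid log(0).
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def sentence_edges(theta, threshold=0.5):
    """Connect sentence i to sentence j when the KL divergence between their
    topic distributions theta[i] and theta[j] is below a (hypothetical) threshold."""
    edges = {}
    for i in range(len(theta)):
        for j in range(len(theta)):
            if i != j:
                d = kl_divergence(theta[i], theta[j])
                if d < threshold:
                    edges[(i, j)] = d
    return edges
```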
Further, the two different feature representations of each sentence in the source text are aggregated and input to a fully-connected layer with an activation function; the probability value of each sentence is calculated to obtain a predicted abstract, and an abstract loss function and a theme loss function are constructed;
wherein, according to P_sent = sigmoid(f_3(H)), a certain number of sentences are selected as the text abstract by their probabilities;
f_3 is a linear fully-connected layer, sigmoid is an activation function, and P_sent is the probability value of the corresponding sentence.
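A minimal sketch of the scoring step P_sent = sigmoid(f_3(H)) followed by top-k sentence selection; for illustration the linear layer is reduced to a single weight vector w and bias b (hypothetical names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_summary(H, w, b=0.0, k=3):
    """P_sent = sigmoid(H @ w + b): one probability per sentence row of H,
    then take the k highest-scoring sentences as the predicted abstract."""
    p = sigmoid(H @ w + b)
    top = np.argsort(-p)[:k]
    return sorted(top.tolist()), p    # indices in document order, plus scores
```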
The abstract loss function is the cross entropy:

l_1 = -Σ_i [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

where y_i denotes the sentence label and ŷ_i denotes the predicted sentence label; the loss function is constructed by computing the similarity between the abstract label (i.e., the theoretical abstract) and the predicted abstract (i.e., the abstract actually output by the fully-connected neural network).
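Assuming the abstract loss is the standard binary cross entropy over sentence labels (consistent with the "cross entropy loss function" named later in the text), a sketch:

```python
import numpy as np

def abstract_loss(y_true, y_pred, eps=1e-10):
    """Binary cross entropy between sentence labels y_i and predictions y_hat_i;
    predictions are clipped away from 0 and 1 for numerical stability."""
    y_true = np.asarray(y_true, float)
    y_pred = np.clip(np.asarray(y_pred, float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))
```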
The theme loss function is:

l_NTM = D_KL(p(z) || q(z|x)) - E_{q(z|x)}[p(x|z)]

where p(z) is the probability of topic z, q(z|x) is the probability of the topic given word x, and p(x|z) is the probability of the word given topic z.
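For discrete topic distributions, the loss above can be sketched as follows; taking the log of p(x|z) inside the expectation is an interpretive assumption (the usual ELBO reading), not stated explicitly in the text:

```python
import numpy as np

def ntm_loss(p_z, q_z_x, p_x_z, eps=1e-10):
    """l_NTM = D_KL(p(z) || q(z|x)) - E_{q(z|x)}[log p(x|z)] over a discrete
    topic set; eps smoothing avoids log(0)."""
    p = np.asarray(p_z, float) + eps
    q = np.asarray(q_z_x, float) + eps
    kl = float(np.sum(p * np.log(p / q)))
    recon = float(np.sum(np.asarray(q_z_x, float) * np.log(np.asarray(p_x_z, float) + eps)))
    return kl - recon
```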
Further, a model loss function is calculated based on the abstract loss function and the theme loss function;
the model loss function is specifically: l = α l_1 + (1 - α) l_NTM, where l_1 is the cross entropy abstract loss function, l_NTM is the topic model loss function, and α is the weight parameter that balances the two loss functions.
Further, the weights of the whole neural network are adjusted based on the model loss function to obtain a long text abstract model.
The above disclosure is only a preferred embodiment of the present invention and should not be taken as limiting its scope; those skilled in the art will appreciate that all or part of the above procedures, together with equivalent changes made according to the claims, still fall within the scope of the present invention.
Claims (9)
1. The long text similarity comparison method based on the graph neural network is characterized by comprising the following steps of:
acquiring a source text and an abstract label corresponding to the source text;
preprocessing the source text, and inputting the source text into a pre-trained BERT to obtain embedded representation of each sentence in the source text;
inputting the embedding of each sentence of the source text into a dilated gated convolutional network, and aggregating through a residual structure to obtain a new embedded representation of each sentence in the source text;
constructing a heterogeneous graph of each sentence of the source text with the words belonging to the sentence, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer;
constructing a heterogeneous graph of each sentence of the source text with related topics, constructing a natural relationship graph among sentences through a topic model, and iterating through a graph attention layer;
aggregating the two different feature representations of each sentence in the source text and inputting them into a fully-connected layer with an activation function, calculating a probability value for each sentence to obtain a predicted abstract, and constructing an abstract loss function and a theme loss function;
calculating a model loss function based on the abstract loss function and the theme loss function;
and adjusting the weight of the whole neural network based on the model loss function to obtain a text abstract model.
2. The long text similarity comparison method based on a graph neural network according to claim 1,
the process of preprocessing the source text comprises the following steps:
inserting a start symbol [CLS] and an end symbol [SEP] into every sentence in the source text, and representing each word in the sentences in the form of word vectors;
inputting each input sequence, with a length of no more than 512, into the pre-trained BERT, and representing sentence features by the first vector feature of each sentence.
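The preprocessing in the claim can be sketched as follows; whitespace tokenization stands in for BERT's WordPiece tokenizer (an assumption), and a single sentence longer than max_len is kept as its own chunk rather than split:

```python
def chunk_for_bert(sentences, max_len=512):
    """Wrap each sentence in [CLS] ... [SEP] markers and pack sentences into
    input sequences of at most max_len tokens."""
    chunks, current = [], []
    for sent in sentences:
        tokens = ["[CLS]"] + sent.split() + ["[SEP]"]
        if current and len(current) + len(tokens) > max_len:
            chunks.append(current)   # current chunk is full; start a new one
            current = []
        current.extend(tokens)
    if current:
        chunks.append(current)
    return chunks
```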
3. The long text similarity comparison method based on a graph neural network according to claim 1,
inputting the embedding of each sentence of the source text into a dilated gated convolutional network; specifically, the feature representation of each sentence in the source text is input into the gated convolution in sequence to reduce the risk of gradient vanishing;
the output of the gated convolution is input into a dilated convolutional network to learn longer context features, and finally the features of each sentence are aggregated with the features of the dilated convolutional network in residual form to obtain a new sentence feature representation.
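A sketch of a causal gated convolution with dilation followed by the residual aggregation; the kernel shapes and the tanh/sigmoid gating form are illustrative assumptions, not the patent's stated parameterization:

```python
import numpy as np

def gated_dilated_conv1d(x, Wf, Wg, dilation=2):
    """For each position t, combine x[t], x[t-dilation], ... through filter
    weights Wf and gate weights Wg (shape (k, d, d)), gate the result as
    tanh(f) * sigmoid(g), and return the residual aggregation x + y."""
    n, d = x.shape
    k = Wf.shape[0]
    y = np.zeros_like(x)
    for t in range(n):
        f, g = np.zeros(d), np.zeros(d)
        for i in range(k):
            idx = t - i * dilation       # dilated, causal receptive field
            if idx >= 0:
                f += Wf[i] @ x[idx]
                g += Wg[i] @ x[idx]
        y[t] = np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))
    return x + y                          # residual aggregation
```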
4. The long text similarity comparison method based on a graph neural network according to claim 1,
a process of constructing a heterogeneous graph of each sentence of the source text with the words belonging to the sentence, comprising the steps of:
constructing a heterogeneous graph connecting each sentence of the source text with the words belonging to the sentence, wherein conjunctions, exclamations and meaningless words are removed;
constructing an inter-sentence graph for each sentence in the source text through BM25 semantic similarity;
and inputting the multi-level heterogeneous semantic similarity representation of each sentence into the graph attention network layer for iterative calculation to obtain a new representation.
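A minimal BM25 scorer over tokenized sentences, as could be used to weight the inter-sentence edges; k1 and b are the usual default parameters, and the corpus statistics are computed on the fly for illustration:

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """BM25 relevance of doc_tokens to query_tokens, with document-frequency
    statistics taken over corpus (a list of tokenized sentences)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for w in set(query_tokens):
        df = sum(1 for d in corpus if w in d)   # document frequency of w
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc_tokens.count(w)                # term frequency in the candidate
        score += idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len(doc_tokens) / avgdl))
    return score
```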
5. The long text similarity comparison method based on a graph neural network according to claim 1,
a process of constructing a heterogeneous graph of each sentence of the source text with related topics, comprising the steps of:
inputting the source text into a neural topic model to obtain a set number of topics, and constructing heterogeneous graphs of the sentences in the source text and the related topics;
constructing a natural relationship graph among the sentences of the source text by connecting sentences that share similar topics;
and inputting the multi-level natural relationship representation of each sentence into the graph attention network layer for iterative calculation to obtain a new representation.
6. The long text similarity comparison method based on a graph neural network according to claim 1,
the process of aggregating the two different feature representations of each sentence in the source text, inputting them into a fully-connected layer with an activation function, calculating the probability value of each sentence to obtain a predicted abstract, and constructing the abstract loss and theme loss functions comprises the following steps:
selecting sentences with the maximum probability value as prediction abstract sentences to obtain a prediction abstract;
constructing a cross entropy loss function based on the source text prediction digest and the tag digest;
and constructing a KL divergence topic loss function based on the word topics and the document topics of the source text.
8. The method for comparing long text similarity based on the graphic neural network of claim 7,
the topic model loss function is specifically:
l_NTM = D_KL(p(z) || q(z|x)) - E_{q(z|x)}(p(x|z)), where p(z) is the probability of topic z, q(z|x) is the probability of topic z under word x, and p(x|z) is the probability of word x under topic z.
9. The long text similarity comparison method based on the graphic neural network of claim 8,
the model loss function is specifically: l=αl 1 +(1-α)l NTM Wherein l 1 Cross entropy loss function for abstract loss, l NTM For the topic model loss function, α represents a weight value that adjusts the two loss functions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211656521.3A CN116049394A (en) | 2022-12-22 | 2022-12-22 | Long text similarity comparison method based on graph neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116049394A true CN116049394A (en) | 2023-05-02 |
Family
ID=86126588
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211656521.3A Pending CN116049394A (en) | 2022-12-22 | 2022-12-22 | Long text similarity comparison method based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116049394A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116050397A (en) * | 2023-03-07 | 2023-05-02 | 知呱呱(天津)大数据技术有限公司 | Method, system, equipment and storage medium for generating long text abstract |
CN117875268A (en) * | 2024-03-13 | 2024-04-12 | 山东科技大学 | Extraction type text abstract generation method based on clause coding |
CN117875268B (en) * | 2024-03-13 | 2024-05-31 | 山东科技大学 | Extraction type text abstract generation method based on clause coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||