CN116049394A - Long text similarity comparison method based on graph neural network - Google Patents

Long text similarity comparison method based on graph neural network

Info

Publication number
CN116049394A
Authority
CN
China
Prior art keywords
sentence
source text
loss function
text
constructing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211656521.3A
Other languages
Chinese (zh)
Inventor
王利娥
常恒通
李先贤
曾华昌
韦容文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202211656521.3A priority Critical patent/CN116049394A/en
Publication of CN116049394A publication Critical patent/CN116049394A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to the technical field of text summarization, and in particular to a long text similarity comparison method based on a graph neural network. The method preprocesses a source text to obtain word embeddings for the words in each sentence of the source text, inputs the word embeddings into a pre-trained BERT, feeds the BERT output into a dilated gated convolutional network, and aggregates the BERT output with the output of the dilated gated convolutional network in residual form to prevent gradient vanishing; a multi-level semantic similarity graph and a multi-level natural relation graph are constructed respectively, each is passed through a graph attention layer and the results are aggregated, sentence labels are predicted through an activation function and a fully connected layer, and a summary loss function is constructed; finally, a model loss function is obtained from the summary loss function and the topic model loss function, so that the extracted text summary retains the important information of the source text more comprehensively.

Description

Long text similarity comparison method based on graph neural network
Technical Field
The invention relates to the technical field of text summarization, and in particular to a long text similarity comparison method based on a graph neural network.
Background
In recent years, the number of research papers published at various conferences and in journals has grown explosively. This large body of academic articles is a source of valuable information and knowledge. With the rapid iteration of science and technology, beginners and newcomers to a field face a huge number of papers and find it difficult to select suitable articles to read; even when a suitable scientific paper is found, it is hard to grasp its general content quickly. How to effectively summarize a large number of scientific papers has therefore become a focal problem.
In addition, a large number of papers involve falsification and plagiarism. The PubMed database contains up to 34 million papers, of which at least 340,000 are estimated to be potentially problematic. Some plagiarized papers modify the wording of the original, but the core content is unchanged and still falls within the scope of plagiarism; full-text comparison cannot detect plagiarism in this situation. Therefore, summarizing the whole text by generating a long-text summary and then comparing the core summaries of the papers greatly reduces the computational cost and can effectively detect plagiarism.
Text summarization provides an effective solution: by modeling document sentences it has achieved great success, especially on news, short texts and similar data. Abstractive and extractive approaches are the common methods in text summarization, and many current studies are based on both. However, most current text summarization methods summarize short texts; when faced with long texts such as medical papers, they usually process only the abstract section, because they still have problems when handling long texts, such as failing to exploit the more detailed structural features, failing to guarantee sentence consistency, and producing grammatical errors. If an article is summarized only from its abstract section, much of the more detailed technical information and many experimental details are ignored. Conventional models designed for summarizing short texts are therefore not ideal for directly summarizing long texts: they cannot fully cover the important information conveyed in a given scientific text, so the similarity of long texts cannot be compared fully and effectively.
Disclosure of Invention
The invention aims to provide a long text similarity comparison method based on a graph neural network, so as to solve the technical problem that existing pre-trained language models cannot capture the context of long text sequences and therefore cannot comprehensively construct inter-sentence relations or fully and effectively compare the similarity of long texts.
In order to achieve the above purpose, the present invention provides a long text similarity comparison method based on a graph neural network, comprising the following steps:
acquiring a source text and the summary label corresponding to the source text;
preprocessing the source text and inputting it into a pre-trained BERT to obtain an embedded representation of each sentence in the source text;
inputting the embedding of each sentence of the source text into a dilated gated convolutional network, and aggregating in residual form to obtain a new embedded representation of each sentence in the source text;
constructing a heterogeneous graph of each sentence of the source text and the words belonging to that sentence, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer;
constructing a heterogeneous graph of each sentence of the source text and the related topics, constructing a natural relation graph between sentences through a topic model, and iterating through a graph attention layer;
aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating the probability value of each sentence to obtain a predicted summary, and constructing a summary loss function and a topic loss function;
calculating a model loss function based on the summary loss function and the topic loss function;
and adjusting the weights of the whole neural network based on the model loss function to obtain a text summarization model.
The process for preprocessing the source text comprises the following steps:
inserting a start symbol [CLS] and an end symbol [SEP] into each sentence in the source text, and representing each word in the sentences in the form of word vectors;
inputting each input sequence into the pre-trained BERT with a length of no more than 512 tokens, and representing sentence features by the first vector feature of each sentence.
The process of inputting the embedding of each sentence of the source text into the dilated gated convolutional network consists of inputting the feature representation of each sentence in the source text into the gated convolution in order, so as to reduce the risk of gradient vanishing;
the output of the gated convolution is then input into a dilated convolutional network to learn longer context features, and finally the features of each sentence are aggregated with the features of the dilated convolutional network in residual form to obtain a new sentence feature representation.
Wherein, the process of constructing a heterogeneous graph of each sentence of the source text and the words belonging to that sentence includes the following steps:
constructing a heterogeneous graph connecting each sentence of the source text with the words belonging to that sentence, with conjunctions, exclamations and other meaningless words removed;
constructing an inter-sentence graph for the sentences in the source text through BM25 semantic similarity;
and inputting the multi-level heterogeneous semantic similarity representation of each sentence into a graph attention network layer for iterative calculation to obtain a new representation.
Wherein the process of constructing a heterogeneous graph of each sentence of the source text with the related topics comprises the steps of:
inputting the source text into a neural topic model to obtain a set number of topics, and constructing a heterogeneous graph of the sentences in the source text and the related topics;
constructing a natural relation graph between sentences of the source text that share similar topics;
and inputting the multi-level natural relation representation of each sentence into the graph attention network layer for iterative calculation to obtain a new representation.
The process of aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating the probability value of each sentence, obtaining a predicted summary, and constructing the summary loss and topic loss functions comprises the following steps:
selecting the sentences with the highest probability values as predicted summary sentences to obtain the predicted summary;
constructing a cross-entropy loss function based on the predicted summary of the source text and the label summary;
and constructing a KL-divergence topic loss function based on the word topics and the document topics of the source text.
The summary loss function is specifically:
$l_1 = -\sum_i \big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \big]$
where $y_i$ represents the sentence label and $\hat{y}_i$ represents the predicted sentence label.
The topic model loss function is specifically:
$l_{NTM} = D_{KL}\big(p(z) \,\|\, q(z|x)\big) - \mathbb{E}_{q(z|x)}\big[p(x|z)\big]$
where $p(z)$ is the probability of topic $z$, $q(z|x)$ is the probability of topic $z$ given word $x$, and $p(x|z)$ is the probability of word $x$ given topic $z$.
The model loss function is specifically: $L = \alpha l_1 + (1-\alpha) l_{NTM}$, where $l_1$ is the cross-entropy summary loss function, $l_{NTM}$ is the topic model loss function, and $\alpha$ represents a weight value adjusting the two loss functions.
The invention provides a long text similarity comparison method based on a graph neural network. The method preprocesses the source text to obtain word embeddings for the words in each sentence of the source text, inputs the word embeddings into a pre-trained BERT, feeds the BERT output into a dilated gated convolutional network, and aggregates the BERT output with the output of the dilated gated convolutional network in residual form to prevent gradient vanishing; a multi-level semantic similarity graph and a multi-level natural relation graph are constructed respectively, each is passed through a graph attention layer and the results are aggregated, sentence labels are predicted through an activation function and a fully connected layer, and a summary loss function is constructed; finally, a model loss function is obtained from the summary loss function and the topic model loss function, so that the extracted text summary retains the important information of the source text more comprehensively.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a long text similarity comparison method based on a graph neural network.
Fig. 2 is a schematic diagram of the overall structure of the text summarization model of the present invention.
FIG. 3 is a schematic diagram of the pre-trained BERT model structure of the present invention.
FIG. 4 is a schematic diagram of the structure of the expansion-gated convolution model of the present invention.
Fig. 5 is a diagram of semantic similarity among multiple layers of sentences according to the present invention.
Fig. 6 is a diagram of natural relationships between multiple layers of sentences according to the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1, the invention provides a long text similarity comparison method based on a graph neural network, which comprises the following steps:
S1: acquiring a source text and the summary label corresponding to the source text;
S2: preprocessing the source text and inputting it into a pre-trained BERT to obtain an embedded representation of each sentence in the source text;
S3: inputting the embedding of each sentence of the source text into a dilated gated convolutional network, and aggregating in residual form to obtain a new embedded representation of each sentence in the source text;
S4: constructing a heterogeneous graph of each sentence of the source text and the words belonging to that sentence, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer;
S5: constructing a heterogeneous graph of each sentence of the source text and the related topics, constructing a natural relation graph between sentences through a topic model, and iterating through a graph attention layer;
S6: aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating the probability value of each sentence to obtain a predicted summary, and constructing a summary loss function and a topic loss function;
S7: calculating a model loss function based on the summary loss function and the topic loss function;
S8: and adjusting the weights of the whole neural network based on the model loss function to obtain a text summarization model.
The invention is further described in connection with the following specific implementation steps:
S1: acquiring a source text and the summary label corresponding to the source text;
the source text refers to the data used to train the text summarization model, and the summary label refers to the preset label identifying the reference summary.
S2: preprocessing the source text and inputting it into a pre-trained BERT to obtain an embedded representation of each sentence in the source text;
the preprocessing inserts a start symbol [CLS] and an end symbol [SEP] into each sentence of the source text and maps each word in the sentence, including the start and end symbols, to a word vector. Each sentence is represented by the BERT output at its start symbol [CLS].
S3: inputting the embedding of each sentence of the source text into a dilated gated convolutional network, and aggregating in residual form to obtain a new embedded representation of each sentence in the source text.
The dilated gated convolutional network is a combination of a dilated convolutional network and a gated convolutional network. Because the pre-trained language model limits the length of the input sequence, context beyond 512 tokens is truncated and lost; the dilated gated convolutional network can capture this otherwise truncated context, effectively relieving the fixed-length limitation of the input sequence. The gated convolutional network improves performance, and the residual structure prevents gradient vanishing.
S4: constructing a heterogeneous graph of each sentence of the source text and the words belonging to that sentence, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer.
Each sentence and the words that form it are built into a heterogeneous graph, which excludes conjunctions, exclamations and other words without practical meaning. After the heterogeneous graph is built, the semantic similarity graph between sentences is built through BM25. Graph attention between sentences and words is computed first and iterated several times, meaning that word-to-sentence attention and sentence-to-word attention are computed alternately over several iterations. Then graph attention between sentences is computed and iterated several times, so as to capture the similarity relations among sentences.
S5: constructing a heterogeneous graph of each sentence of the source text and the related topics, constructing a natural relation graph between sentences through a topic model, and iterating through a graph attention layer.
The topic model is constructed as follows: the bag-of-words representation is passed through two different fully connected networks to obtain a mean and a covariance, from which a Gaussian distribution is constructed; the topic probabilities are then obtained through an activation function and a set number of topics is extracted. Heterogeneous graphs between sentences and topics are constructed through conditional probability, and the natural relation graph between sentences is constructed through the conditional topic probabilities shared between sentences, which effectively captures synonym relations. Next, the graph attention between sentences and topics is computed and iterated several times, meaning that topic-to-sentence attention and sentence-to-topic attention are computed alternately over several iterations. Then the natural relations among sentences are captured by computing graph attention between sentences.
S6: aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating the probability value of each sentence to obtain a predicted summary, and constructing the summary loss function and the topic loss function;
the probability value is the probability that a sentence is selected for the summary.
Specifically, the summary loss function is:
$l_1 = -\sum_i \big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \big]$
where $y_i$ represents the sentence label and $\hat{y}_i$ represents the predicted sentence label; the loss function is constructed by comparing the summary labels (i.e. the reference summary) with the predicted summary (i.e. the summary actually output by the fully connected network).
Specifically, the topic loss function is:
$l_{NTM} = D_{KL}\big(p(z) \,\|\, q(z|x)\big) - \mathbb{E}_{q(z|x)}\big[p(x|z)\big]$
where $p(z)$ is the probability of topic $z$, $q(z|x)$ is the probability of topic $z$ given word $x$ and represents the encoder network of the topic model, and $p(x|z)$ is the probability of word $x$ given topic $z$ and represents the decoder network of the topic model.
S7: calculating the model loss function based on the summary loss and the topic loss function.
The model loss function is specifically: $L = \alpha l_1 + (1-\alpha) l_{NTM}$, where $l_1$ is the cross-entropy summary loss function, $l_{NTM}$ is the topic model loss function, and $\alpha$ is the weight parameter adjusting the two loss functions.
S8: adjusting the weights of the whole neural network based on the model loss function to obtain the text summarization model.
Wherein figure 2 is the overall structure of the model.
Further, as shown in fig. 3, the source text is preprocessed and input into a pretrained BERT to obtain an embedded representation of each sentence in the source text, which specifically includes the following steps:
in the source text, respectively inserting a start symbol [ CLS ] and an end symbol [ SEP ] into the beginning and the end of each sentence, and then representing each word in the sentence by using a pre-trained word vector, wherein the word vector comprises the start symbol and the end symbol;
all word vectors are input into BERT, specifically:
$\{h_{1,0}, h_{1,1}, \ldots, h_{N,0}, \ldots, h_{N,*}\} = \mathrm{BERT}(w_{1,0}, w_{1,1}, \ldots, w_{N,0}, \ldots, w_{N,*})$
where $w_{i,j}$ represents the $j$-th word of the $i$-th sentence, $w_{i,0}$ and $w_{i,*}$ represent the start symbol [CLS] and end symbol [SEP] of the $i$-th sentence respectively, $h_{i,j}$ represents the hidden state of the corresponding symbol, and $h_{i,0}$ represents the representation of the $i$-th sentence.
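As an illustration of this step, the following is a minimal sketch (not the patent's own code) of how per-sentence [CLS] embeddings can be extracted with a pre-trained BERT; the checkpoint name "bert-base-chinese" and the helper name sentence_embeddings are assumptions made for the example.

```python
# Hypothetical sketch: wrap each sentence in [CLS] ... [SEP], feed the packed
# sequence (at most 512 tokens) to BERT, and take the hidden state at every
# [CLS] position as that sentence's representation h_{i,0}.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def sentence_embeddings(sentences):
    token_ids = []
    for sent in sentences:
        ids = tokenizer.encode(sent, add_special_tokens=False)
        token_ids += [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
    token_ids = token_ids[:512]                        # BERT's input length limit

    input_ids = torch.tensor([token_ids])
    with torch.no_grad():
        hidden = bert(input_ids).last_hidden_state[0]  # (seq_len, 768)

    # h_{i,0}: the hidden state at each [CLS] position represents sentence i
    cls_positions = (input_ids[0] == tokenizer.cls_token_id).nonzero().squeeze(-1)
    return hidden[cls_positions]                       # (num_sentences, 768)

# Example: two sentences -> a (2, 768) matrix of sentence embeddings
emb = sentence_embeddings(["图神经网络用于文本摘要。", "本方法比较长文本相似度。"])
```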
Further, as shown in fig. 4, the embedding of each sentence of the source text is input into the dilated gated convolutional network, and the specific steps are as follows:
First, the sentence representations pass through a gated convolutional neural network with a residual structure, using the formula:
$Y = H + \mathrm{Conv1D}_1(H) \otimes \sigma(\mathrm{Conv1D}_2(H))$
where $\sigma$ is the sigmoid activation function and $H = \{h_{1,0}, \ldots, h_{i,0}, \ldots, h_{n,0}\}$ is the sentence representation output by BERT; the convolution layers $\mathrm{Conv1D}_1$ and $\mathrm{Conv1D}_2$ use the same window size and number of kernels, but their weights are not shared.
The output $Y$ of the gated convolution is input into the dilated convolution, in which the dilation rates of the layers are set to 1, 2 and 4 respectively and the window size is set to 3, yielding the new features $H' = \{h'_{1,0}, \ldots, h'_{i,0}, \ldots, h'_{n,0}\}$.
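A minimal PyTorch sketch of such a dilated gated convolution block is given below. It assumes the gated convolution takes the residual form $Y = H + \mathrm{Conv1D}_1(H) \otimes \sigma(\mathrm{Conv1D}_2(H))$ described above; the class names, the exact stacking of the dilated layers and the hidden size are illustrative assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Gated convolution with a residual connection: Y = H + Conv1(H) * sigmoid(Conv2(H))."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)  # Conv1D_1
        self.conv2 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)  # Conv1D_2 (weights not shared)

    def forward(self, h):                       # h: (batch, dim, n_sentences)
        gate = torch.sigmoid(self.conv2(h))
        return h + self.conv1(h) * gate

class DilatedGatedCNN(nn.Module):
    """Gated convolution followed by dilated convolutions (rates 1, 2, 4; window 3)."""
    def __init__(self, dim, kernel=3, dilations=(1, 2, 4)):
        super().__init__()
        self.gate = GatedConv(dim, kernel)
        self.dilated = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel, dilation=d, padding=d * (kernel - 1) // 2)
            for d in dilations
        )

    def forward(self, h):                       # h: (batch, dim, n_sentences) from BERT
        y = self.gate(h)
        for conv in self.dilated:
            y = torch.relu(conv(y))
        return h + y                            # aggregate with the input in residual form -> H'
```

In use, the sentence embeddings output by BERT would be transposed to (batch, dim, n_sentences) before passing through this block and transposed back afterwards.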
Further, as shown in fig. 5, each sentence of the source text and the words belonging to that sentence are constructed into a heterogeneous graph, an inter-sentence graph is constructed through semantic similarity, and it is iterated through a graph attention layer, specifically comprising the following steps:
the heterogeneous graph of a sentence and the words belonging to it does not include stop words, conjunctions and other words without practical meaning.
The semantic similarity between sentences is computed with the classical BM25 algorithm. Sentence-word graph attention is computed first and iterated a certain number of times, where one iteration consists of word-to-sentence attention followed by sentence-to-word attention; sentence-to-sentence attention is then computed.
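The following is a small self-contained sketch of how the inter-sentence edges could be derived from BM25 scores: each sentence is scored as a query against every other sentence, and an edge is added when the score exceeds a threshold. The BM25 parameters k1 and b and the threshold value are assumed hyperparameters, not values stated in the patent.

```python
import math
from collections import Counter

def bm25_graph(sentences_tokens, k1=1.5, b=0.75, threshold=5.0):
    """Return directed (i, j) edges between sentences whose BM25 score exceeds the threshold."""
    n = len(sentences_tokens)
    avgdl = sum(len(s) for s in sentences_tokens) / n
    df = Counter(tok for s in sentences_tokens for tok in set(s))
    idf = {t: math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1) for t in df}

    def score(query, doc):
        tf = Counter(doc)
        return sum(
            idf[t] * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            for t in query if t in tf
        )

    return [(i, j) for i in range(n) for j in range(n)
            if i != j and score(sentences_tokens[i], sentences_tokens[j]) > threshold]

# Example with already-tokenised sentences (meaningless words removed beforehand)
edges = bm25_graph([["graph", "neural", "network"], ["long", "text", "graph"], ["topic", "model"]])
```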
The word vectors are pre-trained vectors whose vocabulary is contained in glove.42B.300d.txt; meaningless words are filtered out, and the sentence vectors are the representations of each sentence's start symbol output by BERT and the DGCNN.
The word-to-sentence graph attention is computed as:
$z_{ij} = \mathrm{LeakyReLU}(W_a[W_q h_i; W_k w_j; e_{ij}])$
and the sentence-to-word graph attention as:
$z_{ji} = \mathrm{LeakyReLU}(W_a[W_v w_j; W_m h_i; e_{ij}])$
where $h_i$ and $w_j$ are the sentence and word vectors respectively, $e_{ij}$ represents the connection between them, $W_a, W_q, W_k, W_v, W_m$ are trainable parameters, LeakyReLU is the activation function, $z_{ij}$ is the word-to-sentence attention score, and $z_{ji}$ is the sentence-to-word attention score.
The attention weights are normalized with a softmax:
$\alpha_{ij} = \dfrac{\exp(z_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(z_{il})}, \qquad \alpha_{ji} = \dfrac{\exp(z_{ji})}{\sum_{l \in \mathcal{N}_j} \exp(z_{jl})}$
where $\alpha_{ij}$ is the weight of word $w_j$ to sentence $h_i$, $\mathcal{N}_i$ is the set of words connected to the sentence vector $h_i$, $\alpha_{ji}$ is the weight of sentence $h_i$ to word $w_j$, and $\mathcal{N}_j$ is the set of sentences connected to the word vector $w_j$.
The invention adopts multi-head attention with $K$ heads:
$u_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W_1^{k} w_j\Big), \qquad u_j = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{i \in \mathcal{N}_j} \alpha_{ji}^{k} W_2^{k} h_i\Big)$
where $\Vert$ denotes aggregation (concatenation) of the heads, $\sigma$ is an activation function, $W_1^{k}, W_2^{k}$ are trainable parameters, $j \in \mathcal{N}_i$ ranges over the word vectors connected to sentence $h_i$, and $i \in \mathcal{N}_j$ over the sentences connected to word $w_j$. After the multi-head attention calculation, a new sentence matrix $S$ and word matrix $Z$ are obtained respectively.
To prevent the gradient from disappearing, a residual structure was added:
$h' = u_i + h_i$
The first word-to-sentence iteration is:
$U^{1}_{s \leftarrow w} = \mathrm{GAT}(S^{0}, Z^{0}, Z^{0})$
where, at the first word-to-sentence iteration, the query matrix is the sentence embedding matrix $S^{0} = S$ and the key and value matrices are the word embedding matrix $Z^{0} = Z$.
The $(t+1)$-th word-to-sentence iteration is:
$U^{t+1}_{s \leftarrow w} = \mathrm{GAT}(S^{t}, Z^{t}, Z^{t})$
To prevent the gradient from vanishing, a residual structure is added:
$S^{t+1} = U^{t+1}_{s \leftarrow w} + S^{t}$
The $(t+1)$-th sentence-to-word iteration is:
$Z^{t+1}_{w \leftarrow s} = \mathrm{GAT}(Z^{t}, S^{t+1}, S^{t+1})$
To prevent the gradient from vanishing, a residual structure is added:
$Z^{t+1} = Z^{t+1}_{w \leftarrow s} + Z^{t}$
Once the new feature representations of the sentences are obtained, graph attention is computed between sentences:
$u_s = \mathrm{GAT}(h_s, \{h_y\}_{y \in Y})$
where $Y$ is the set of sentences connected to sentence $s$ through semantic similarity.
To prevent the gradient from vanishing, a residual structure is added:
$\tilde{h}_s = u_s + h_s$
where $\tilde{h}_s$ is the new (semantic similarity) feature of the corresponding sentence, used to distinguish it from the natural-relation feature of the sentence, and $h_s$ denotes the feature from the last iteration.
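To make the attention computation above concrete, the sketch below implements a simplified single-head word-to-sentence graph attention layer with edge features and a residual connection; the patent uses K heads whose outputs are concatenated, and the dense adjacency-matrix representation, class name and dimension names here are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordToSentenceGAT(nn.Module):
    def __init__(self, dim, edge_dim):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)               # W_q applied to sentences h_i
        self.wk = nn.Linear(dim, dim, bias=False)               # W_k applied to words w_j
        self.wa = nn.Linear(2 * dim + edge_dim, 1, bias=False)  # W_a over [W_q h_i; W_k w_j; e_ij]
        self.wv = nn.Linear(dim, dim, bias=False)               # value projection of the words

    def forward(self, h, w, e, adj):
        # h: (S, dim) sentences, w: (W, dim) words, e: (S, W, edge_dim), adj: (S, W) 0/1 mask
        S, W = adj.shape
        pair = torch.cat([self.wq(h).unsqueeze(1).expand(S, W, -1),
                          self.wk(w).unsqueeze(0).expand(S, W, -1), e], dim=-1)
        z = F.leaky_relu(self.wa(pair).squeeze(-1))             # z_ij
        z = z.masked_fill(adj == 0, float("-inf"))              # attend only to connected words
        alpha = torch.softmax(z, dim=-1)                        # alpha_ij over the words of sentence i
        u = torch.relu(alpha @ self.wv(w))                      # u_i = sigma(sum_j alpha_ij W w_j)
        return u + h                                            # residual: h' = u_i + h_i
```

A sentence-to-word layer of the same shape (with the roles of h and w swapped) would be applied alternately with this one for the iterations described above; every sentence is assumed to be connected to at least one word so that the softmax is well defined.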
Further, as shown in fig. 6, each sentence of the source text and the related topics are constructed into a heterogeneous graph, an inter-sentence graph is constructed through conditional topic probability, and it is iterated through a graph attention layer, specifically comprising the following steps:
The topic model is constructed by taking the source text as input and obtaining its bag-of-words representation $X_{bow}$. The bag of words is passed through two different linear fully connected layers with activation functions to obtain the mean $\mu$ and the standard deviation $\sigma$, as follows:
$\mu = \mathrm{ReLU}(F_1(X_{bow}))$
$\sigma = \mathrm{ReLU}(F_2(X_{bow}))$
where ReLU is the activation function and $F_1$, $F_2$ are fully connected layers.
The topic Gaussian distribution $z \sim N(\mu, \sigma^2)$ is constructed from the mean and variance, and the topic distribution $\theta = \mathrm{softmax}(z)$ is then obtained. $P_w$ is the topic-word distribution matrix, in which $P_w^{ij}$ represents the relevance of word $i$ to topic $j$; the relationship between topics and words is constructed through $P_w$.
In particular, a set number of topics $H_T$ and their associated weights $Tc_d$ are obtained:
$H_T = f_{\mathrm{ReLU}}(P_w^{\top})$
$Tc_d = \sum_i \theta^{(i)} H_T^{(i)}$
where $f_{\mathrm{ReLU}}$ is a linear transfer function with a ReLU activation, $P_w^{\top}$ is the transpose of $P_w$, $\theta^{(i)}$ is the probability of the $i$-th topic, and $H_T^{(i)}$ is the $i$-th topic.
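A hedged PyTorch sketch of such a neural topic model is shown below: μ and σ come from two fully connected branches, a reparameterised sample gives the topic distribution θ = softmax(z), and a topic-word matrix plays the role of P_w. The layer sizes are assumptions, and log σ is used here for numerical stability, whereas the patent applies ReLU to obtain σ directly.

```python
import torch
import torch.nn as nn

class NeuralTopicModel(nn.Module):
    def __init__(self, vocab_size, n_topics=50, hidden=256):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_topics))           # -> mu
        self.f2 = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU(),
                                nn.Linear(hidden, n_topics))           # -> log sigma
        self.topic_word = nn.Linear(n_topics, vocab_size, bias=False)  # P_w: topic-word matrix

    def forward(self, x_bow):                                    # x_bow: (batch, vocab_size)
        mu, log_sigma = self.f1(x_bow), self.f2(x_bow)
        z = mu + torch.randn_like(mu) * torch.exp(log_sigma)     # z ~ N(mu, sigma^2)
        theta = torch.softmax(z, dim=-1)                         # document-topic distribution
        recon = torch.softmax(self.topic_word(theta), dim=-1)    # word distribution via P_w
        # KL term of l_NTM for a standard normal prior p(z)
        kl = -0.5 * torch.sum(1 + 2 * log_sigma - mu ** 2 - torch.exp(2 * log_sigma), dim=-1)
        return theta, recon, kl.mean()
```

The reconstruction term of the topic loss would then be the negative log-likelihood of x_bow under recon.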
First, sentences are connected with their related topics; then graph attention between sentences and topics is computed and iterated a certain number of times, where one iteration consists of topic-to-sentence attention followed by sentence-to-topic attention, yielding new embedded representations $U$ and $T$ of the sentences and topics; graph attention from sentence to sentence is then computed.
The sentence-to-topic graph attention is computed as:
$z_{ij} = \mathrm{LeakyReLU}(W_a[W_q h_i; W_k t_j; e_{ij}])$
and the topic-to-sentence graph attention as:
$z_{ji} = \mathrm{LeakyReLU}(W_a[W_v t_j; W_m h_i; e_{ij}])$
where $h_i$ and $t_j$ represent the sentence and topic vectors, $e_{ij}$ is the connection between them, $W_a, W_q, W_k, W_v, W_m$ are trainable parameters, LeakyReLU is the activation function, $z_{ij}$ is the sentence-to-topic attention score, and $z_{ji}$ is the topic-to-sentence attention score.
The attention weights are normalized with a softmax:
$\alpha_{ij} = \dfrac{\exp(z_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(z_{il})}, \qquad \alpha_{ji} = \dfrac{\exp(z_{ji})}{\sum_{l \in \mathcal{N}_j} \exp(z_{jl})}$
where $\alpha_{ij}$ is the weight of $h_i$ to $t_j$, $\mathcal{N}_i$ is the set of topics connected to the sentence vector $h_i$, $\alpha_{ji}$ is the weight of $t_j$ to $h_i$, and $\mathcal{N}_j$ is the set of sentences connected to the topic vector $t_j$.
The invention adopts multi-head attention with $K$ heads:
$u_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} t_j\Big)$
where $\Vert$ denotes aggregation (concatenation) of the heads, $\sigma$ is an activation function, $W^{k}$ is a trainable parameter, $\mathcal{N}_i$ is the set of topic vectors connected to the sentence vector $h_i$, and $\mathcal{N}_j$ is the set of sentence vectors connected to the topic vector $t_j$.
Meanwhile, in order to prevent gradient from disappearing, a residual structure method is also adopted:
$h' = u_i + h_i$
The first topic-to-sentence iteration is:
$U^{1}_{s \leftarrow t} = \mathrm{GAT}(U^{0}, T^{0}, T^{0})$
where, at the first iteration, the query matrix is the sentence embedding matrix $U^{0} = U$ and the key and value matrices are the topic embedding matrix $T^{0} = T$.
The $(t+1)$-th topic-to-sentence iteration is:
$U^{t+1}_{s \leftarrow t} = \mathrm{GAT}(U^{t}, T^{t}, T^{t})$
To prevent the gradient from vanishing, a residual structure is added:
$U^{t+1} = U^{t+1}_{s \leftarrow t} + U^{t}$
The $(t+1)$-th sentence-to-topic iteration is:
$T^{t+1}_{t \leftarrow s} = \mathrm{GAT}(T^{t}, U^{t+1}, U^{t+1})$
To prevent the gradient from vanishing, a residual structure is added:
$T^{t+1} = T^{t+1}_{t \leftarrow s} + T^{t}$
Once the new feature representations of the sentences are obtained, sentence-to-sentence graph attention is computed between sentences:
$u_s = \mathrm{GAT}(h_s, \{h_n\}_{n \in N})$
where $N$ is the set of sentences connected to sentence $s$ through shared topics. The edge between two sentences $s_i$ and $s_j$ is computed as:
$e_{ij} = D_{KL}\big(\theta_{s_i} \,\|\, \theta_{s_j}\big)$
that is, the natural relationship between two sentences is established by the value of the KL divergence between their topic distributions.
To prevent the gradient from vanishing, a residual structure is added:
$\hat{h}_s = u_s + h_s$
where $\hat{h}_s$ is the new (natural relation) feature of the corresponding sentence and $h_s$ represents the feature from the last iteration.
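As a short illustration of the KL-divergence edge described above, the sketch below links sentences whose topic distributions are close; whether an edge corresponds to a small KL value or the KL value itself is used as an edge weight is a design choice not fixed by the text, and the threshold here is an assumed hyperparameter.

```python
import torch

def topic_edges(theta_sent, threshold=0.5, eps=1e-10):
    """theta_sent: (n_sentences, n_topics) topic distribution of each sentence."""
    p = theta_sent.unsqueeze(1) + eps                  # (n, 1, K)
    q = theta_sent.unsqueeze(0) + eps                  # (1, n, K)
    kl = torch.sum(p * (p.log() - q.log()), dim=-1)    # D_KL(theta_i || theta_j) for every pair
    mask = (kl < threshold) & ~torch.eye(len(theta_sent), dtype=torch.bool)
    return mask.nonzero()                              # (i, j) index pairs of connected sentences
```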
Further, the two different feature representations of each sentence in the source text (the semantic similarity feature $\tilde{h}_s$ and the natural relation feature $\hat{h}_s$) are aggregated into a new feature $H$ and input into a fully connected layer with an activation function; the probability value of each sentence is calculated, a predicted summary is obtained, and the summary loss function and topic loss function are constructed.
A set number of sentences is selected as the text summary according to the probability:
$P_{sent} = \mathrm{sigmoid}(f_3(H))$
where $f_3$ is a linear fully connected layer, sigmoid is the activation function, and $P_{sent}$ is the probability value of the corresponding sentence.
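The scoring and selection step can be sketched as follows; the aggregation by concatenation, the helper name and the number of selected sentences k are assumptions, and in the actual model f_3 would be a trained layer rather than one created on the fly.

```python
import torch
import torch.nn as nn

def select_summary(h_semantic, h_topic, f3=None, k=3):
    # h_semantic, h_topic: (n_sentences, dim) features from the two graphs
    h = torch.cat([h_semantic, h_topic], dim=-1)       # aggregate the two representations
    if f3 is None:
        f3 = nn.Linear(h.size(-1), 1)                  # linear fully connected layer f_3
    p_sent = torch.sigmoid(f3(h)).squeeze(-1)          # P_sent = sigmoid(f3(H))
    return torch.topk(p_sent, k=min(k, p_sent.numel())).indices  # indices of summary sentences
```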
The summary loss function is:
$l_1 = -\sum_i \big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \big]$
where $y_i$ represents the sentence label and $\hat{y}_i$ represents the predicted sentence label; the loss function is constructed by comparing the summary labels (i.e. the reference summary) with the predicted summary (i.e. the summary actually output by the fully connected network).
The topic loss function is:
$l_{NTM} = D_{KL}\big(p(z) \,\|\, q(z|x)\big) - \mathbb{E}_{q(z|x)}\big[p(x|z)\big]$
where $p(z)$ is the probability of topic $z$, $q(z|x)$ represents the probability of the topic given word $x$, and $p(x|z)$ represents the probability of the word given topic $z$.
Further, the model loss function is calculated based on the summary loss and the topic loss function;
the model loss function is specifically: $L = \alpha l_1 + (1-\alpha) l_{NTM}$, where $l_1$ is the cross-entropy summary loss function, $l_{NTM}$ is the topic model loss function, and $\alpha$ is the weight parameter adjusting the two loss functions.
Further, the weights of the whole neural network are adjusted based on the model loss function to obtain the long text summarization model.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.

Claims (9)

1. The long text similarity comparison method based on the graph neural network is characterized by comprising the following steps of:
acquiring a source text and the summary label corresponding to the source text;
preprocessing the source text and inputting it into a pre-trained BERT to obtain an embedded representation of each sentence in the source text;
inputting the embedding of each sentence of the source text into a dilated gated convolutional network, and aggregating in residual form to obtain a new embedded representation of each sentence in the source text;
constructing a heterogeneous graph of each sentence of the source text and the words belonging to that sentence, constructing an inter-sentence graph through semantic similarity, and iterating through a graph attention layer;
constructing a heterogeneous graph of each sentence of the source text and the related topics, constructing a natural relation graph between sentences through a topic model, and iterating through a graph attention layer;
aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating the probability value of each sentence to obtain a predicted summary, and constructing a summary loss function and a topic loss function;
calculating a model loss function based on the summary loss function and the topic loss function;
and adjusting the weights of the whole neural network based on the model loss function to obtain a text summarization model.
2. The long text similarity comparison method based on a graph neural network according to claim 1,
the process of preprocessing the source text comprises the following steps:
inserting a start symbol [CLS] and an end symbol [SEP] into each sentence in the source text, and representing each word in the sentences in the form of word vectors;
inputting each input sequence into the pre-trained BERT with a length of no more than 512 tokens, and representing sentence features by the first vector feature of each sentence.
3. The long text similarity comparison method based on a graph neural network according to claim 1,
inputting the embedding of each sentence of the source text into the dilated gated convolutional network, specifically, inputting the feature representation of each sentence in the source text into the gated convolution in order, so as to reduce the risk of gradient vanishing;
and inputting the output of the gated convolution into a dilated convolutional network to learn longer context features, and finally aggregating the features of each sentence with the features of the dilated convolutional network in residual form to obtain a new sentence feature representation.
4. The long text similarity comparison method based on a graph neural network according to claim 1,
a process of constructing a heterogeneous graph of each sentence of the source text with the words belonging to that sentence, comprising the steps of:
constructing a heterogeneous graph connecting each sentence of the source text with the words belonging to that sentence, with conjunctions, exclamations and other meaningless words removed;
constructing an inter-sentence graph for the sentences in the source text through BM25 semantic similarity;
and inputting the multi-level heterogeneous semantic similarity representation of each sentence into a graph attention network layer for iterative calculation to obtain a new representation.
5. The long text similarity comparison method based on a graph neural network according to claim 1,
a process of constructing a heterogeneous graph of each sentence of the source text with related topics, comprising the steps of:
inputting the source text into a neural topic model to obtain a set number of topics, and constructing a heterogeneous graph of the sentences in the source text and the related topics;
constructing a natural relation graph between sentences of the source text that share similar topics;
and inputting the multi-level natural relation representation of each sentence into the graph attention network layer for iterative calculation to obtain a new representation.
6. The long text similarity comparison method based on a graph neural network according to claim 1,
the process of aggregating the two different feature representations of each sentence in the source text, inputting them into a fully connected layer with an activation function, calculating the probability value of each sentence, obtaining a predicted summary, and constructing the summary loss and topic loss functions comprises the following steps:
selecting the sentences with the highest probability values as predicted summary sentences to obtain the predicted summary;
constructing a cross-entropy loss function based on the predicted summary of the source text and the label summary;
and constructing a KL-divergence topic loss function based on the word topics and the document topics of the source text.
7. The long text similarity comparison method based on a graph neural network according to claim 1,
the summary loss function is specifically:
$l_1 = -\sum_i \big[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \big]$
where $y_i$ represents the sentence label and $\hat{y}_i$ represents the predicted sentence label.
8. The long text similarity comparison method based on a graph neural network according to claim 7, wherein
the topic model loss function is specifically:
$l_{NTM} = D_{KL}\big(p(z) \,\|\, q(z|x)\big) - \mathbb{E}_{q(z|x)}\big[p(x|z)\big]$
where $p(z)$ is the probability of topic $z$, $q(z|x)$ is the probability of topic $z$ given word $x$, and $p(x|z)$ is the probability of word $x$ given topic $z$.
9. The long text similarity comparison method based on a graph neural network according to claim 8, wherein
the model loss function is specifically: $L = \alpha l_1 + (1-\alpha) l_{NTM}$, where $l_1$ is the cross-entropy summary loss function, $l_{NTM}$ is the topic model loss function, and $\alpha$ represents a weight value adjusting the two loss functions.
CN202211656521.3A 2022-12-22 2022-12-22 Long text similarity comparison method based on graph neural network Pending CN116049394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211656521.3A CN116049394A (en) 2022-12-22 2022-12-22 Long text similarity comparison method based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211656521.3A CN116049394A (en) 2022-12-22 2022-12-22 Long text similarity comparison method based on graph neural network

Publications (1)

Publication Number Publication Date
CN116049394A true CN116049394A (en) 2023-05-02

Family

ID=86126588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211656521.3A Pending CN116049394A (en) 2022-12-22 2022-12-22 Long text similarity comparison method based on graph neural network

Country Status (1)

Country Link
CN (1) CN116049394A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050397A (en) * 2023-03-07 2023-05-02 知呱呱(天津)大数据技术有限公司 Method, system, equipment and storage medium for generating long text abstract
CN117875268A (en) * 2024-03-13 2024-04-12 山东科技大学 Extraction type text abstract generation method based on clause coding
CN117875268B (en) * 2024-03-13 2024-05-31 山东科技大学 Extraction type text abstract generation method based on clause coding

Similar Documents

Publication Publication Date Title
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN111966917B (en) Event detection and summarization method based on pre-training language model
Qiu et al. Chinese clinical named entity recognition using residual dilated convolutional neural network with conditional random field
Chen et al. Knowledge-enhanced neural networks for sentiment analysis of Chinese reviews
CN116049394A (en) Long text similarity comparison method based on graph neural network
Jin et al. Inter-sentence and implicit causality extraction from chinese corpus
CN112232053A (en) Text similarity calculation system, method and storage medium based on multi-keyword pair matching
Li et al. Integrating language model and reading control gate in BLSTM-CRF for biomedical named entity recognition
Shen et al. Hashtag Recommendation Using LSTM Networks with Self-Attention.
Lin et al. Multi-label emotion classification based on adversarial multi-task learning
Wang et al. Adversarial learning for multi-task sequence labeling with attention mechanism
CN114265936A (en) Method for realizing text mining of science and technology project
Zhang et al. A named entity recognition method towards product reviews based on BiLSTM-attention-CRF
CN112507717A (en) Medical field entity classification method fusing entity keyword features
Emami et al. Designing a deep neural network model for finding semantic similarity between short persian texts using a parallel corpus
Wang et al. A hybrid model based on deep convolutional network for medical named entity recognition
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems
Yu et al. Multi-module Fusion Relevance Attention Network for Multi-label Text Classification.
CN116263786A (en) Public opinion text emotion analysis method, device, computer equipment and medium
CN112270185A (en) Text representation method based on topic model
CN112347784A (en) Cross-document entity identification method combined with multi-task learning
Wu et al. A Text Emotion Analysis Method Using the Dual‐Channel Convolution Neural Network in Social Networks
Xiao et al. Social emotion cause extraction from online texts
Tu et al. Learning regular expressions for interpretable medical text classification using a pool-based simulated annealing approach
CN117807999B (en) Domain self-adaptive named entity recognition method based on countermeasure learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination