CN115062140A - Method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN - Google Patents
Method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN
- Publication number
- CN115062140A (application CN202210593675.6A)
- Authority
- CN
- China
- Prior art keywords
- document
- sentences
- word
- generating
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/345—Information retrieval of unstructured textual data; Browsing; Visualisation therefor; Summarisation for human users
- G06F40/211—Natural language analysis; Parsing; Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/289—Natural language analysis; Recognition of textual entities; Phrasal analysis, e.g. finite state techniques or chunking
- G06N3/08—Computing arrangements based on biological models; Neural networks; Learning methods
Abstract
The invention discloses a method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, which comprises the following steps: preprocessing the data set, including segmenting the documents in the data set into sentences and segmenting the sentences into words; encoding sentences and words, encoding paragraph identifiers, encoding the positions of the sentences, and obtaining sentence vectors based on a BERT pre-training model from the sentence and word encodings, the paragraph identifier encodings and the sentence position encodings; capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model; forming a transition document from the extracted key sentences and inputting it into the generative model; copying or generating word-level information based on the pointer-generator network; and generating the document abstract. The invention uses BERT as the encoder of the extractive summarization stage, which helps the machine understand the semantic information of the text and improves the quality of the final abstract.
Description
Technical Field
The invention belongs to the field of supply chain collaborative ecosystem data processing, and particularly relates to a method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN.
Background
Text summarization aims to condense long and complex text while preserving its original meaning. Traditional manual summarization is time-consuming and labor-intensive, which is particularly evident in supply chain collaborative ecosystems. The services in the supply chain ecosystem are numerous, and document information such as technical documents, operation instructions and news announcements in each service section is produced endlessly; if key information had to be extracted after manual reading and understanding, a large amount of manpower would be consumed to maintain the ecosystem, and the speed at which users accurately acquire the latest information would be greatly affected. With the development of modern computer technology, a machine can learn how to extract the abstract content of a text automatically. Text summarization involves fields such as artificial intelligence and machine learning and is a key direction of natural language processing research; abstracts are formed mainly in two ways, extractive and abstractive.
Extractive summarization extracts key sentences and key words from the source document to form the abstract, so the abstract consists entirely of original text. This keeps the grammatical and syntactic error rate low and guarantees a certain quality. Traditional extractive methods rely on graph algorithms, clustering, neural networks and the like. The graph-based approach is represented by TextRank: imitating PageRank, the algorithm takes sentences as nodes and builds undirected weighted edges from the similarity between sentences. The node values are updated iteratively using the edge weights, and finally the N highest-scoring nodes are selected as the abstract. However, TextRank depends heavily on the word segmentation result: if one word is split into two during segmentation, the two pieces cannot be joined again during keyword extraction, so precision and recall differ greatly. Moreover, although TextRank considers the relationships between words, it still tends to take frequent words as keywords. In addition, because TextRank involves building a word graph and iterative computation, extraction is slow. In summary, graph-based unsupervised methods do not take the semantic information of sentences into account when extracting the abstract and are slow, whereas sentence feature vectors extracted by a neural network based on the BERT pre-training model express semantics better.
Extractive abstracts have certain grammatical and syntactic guarantees, but they also face problems such as content selection errors, poor coherence and poor flexibility. Abstractive summarization allows new words or phrases to appear in the abstract and is highly flexible; with the development of neural network models in recent years, the sequence-to-sequence (Seq2Seq) model has been widely used for the abstractive summarization task and has achieved certain results. However, in abstractive summarization the generation process often lacks control and guidance from key information; for example, a pointer-generator network cannot locate key words well during copying.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art in processing technical document data in a supply chain collaborative ecosystem, and provides a method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, which comprises the following steps:
the method comprises the following steps that firstly, a data set is preprocessed, wherein the preprocessing comprises the steps of carrying out sentence segmentation on documents in the data set to obtain sentences, and carrying out word segmentation on the sentences to obtain words;
secondly, coding sentences and words, coding paragraph marks, coding the positions of the sentences, and acquiring sentence vectors based on a BERT pre-training model according to the codes of the sentences and the words, the codes of the paragraph marks and the position codes of the sentences;
step three, capturing document-level features based on a transformer according to sentence vectors output by the BERT pre-training model;
step four, forming a transition document by the extracted key sentences and inputting the transition document into the generation model;
step five, copying or generating word-level information based on the pointer-generator network;
and step six, generating a document abstract.
Further, the method for encoding sentences and words, paragraph marks and sentence positions includes the following steps of:
carrying out word vector coding, namely firstly coding a beginning symbol [ CLS ] and a sentence ending symbol [ SEP ] of a sentence and all words in a document to obtain the characteristics of a plurality of sentences;
carrying out the encoding of paragraph identification, and distinguishing a plurality of sentences in the document by using the paragraph identification;
coding the positions of sentences, wherein a position vector obtained after coding represents the absolute position of each word, each word in an input sequence is sequentially converted into position codes according to the subscript sequence of the word, and then the position codes are converted into real-value vectors by using a position vector matrix to obtain position vectors;
and adding the word vector codes, the paragraph marks and the sentence position codes to obtain sentence vectors.
Further, the sentence vector output according to the BERT pre-training model captures document-level features based on a transformer, comprising the following steps:
the inter-sentence Transformer applies further Transformer layers on top of the sentence representations to extract document-level features dedicated to the summarization task from the BERT output:

\tilde{h}^l = \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1}))

h^l = \mathrm{LN}(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l))

wherein h^0 = \mathrm{PosEmb}(T), T is the sentence vectors output by BERT, and PosEmb adds position encodings representing the position information of each sentence in T; LN is the layer normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;

calculating the final output layer with a sigmoid classifier:

\hat{y}_i = \sigma(W_o h_i^L + b_o)

wherein h_i^L is the vector of sentence sent_i from the top (L-th) Transformer layer; L is set to 2, i.e., a 2-layer Transformer is used.
Further, copying or generating word-level information based on the pointer-generator network includes the following steps:
s51, constructing a sequence-to-sequence attention model framework;
s52, constructing a pointer generation copy mechanism;
s53, constructing a coverage mechanism that adds a penalty term to suppress words that have already appeared in the generated summary and reduce the repetition rate.
Further, forming a transition document from the extracted key sentences and inputting it into the generative model includes the following steps:
extracting key sentences according to the document-level features to obtain a key sentence set, performing word segmentation on the extracted key sentence set, cleaning it, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model.
Further, the constructing of the sequence-to-sequence attention model framework comprises the following processes:
S511, the word vectors of the words are sequentially input into the encoder, a single-layer bidirectional LSTM, generating the sequence of encoder hidden states h_i; at each step t, the decoder receives the word vector of the previous word, and the attention distribution over the source words is computed as:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn})

a^t = \mathrm{softmax}(e^t)

wherein v, W_h, W_s and b_{attn} are learnable parameters; the attention distribution can be seen as a probability distribution over the source words that tells the decoder where to look when generating the next word;

S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector:

h_t^* = \sum_i a_i^t h_i

the context vector is a fixed-size representation of what has been read from the source at this step; it is concatenated with the decoder state s_t and passed through two linear layers;

S513, generating the vocabulary distribution P_{vocab}:

P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')

wherein V, V', b and b' are learnable parameters; P_{vocab} is the probability distribution over all words in the vocabulary and provides the final distribution of the predicted word w;

S514, the distribution of the word w is:

P(w) = P_{vocab}(w)

during training, the loss at time step t is the negative log probability of the target word w_t^* for that time step:

loss_t = -\log P(w_t^*)

and the total loss over the entire sequence is:

loss = \frac{1}{T} \sum_{t=0}^{T} loss_t
Further, constructing the pointer-generator copy mechanism includes the following process:

with the decoder state denoted s_t and the decoder input denoted x_t, the generation probability p_{gen} at time step t is computed as:

p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})

wherein the vectors w_{h^*}, w_s, w_x and the scalar b_{ptr} are learnable parameters, and σ is the sigmoid function; p_{gen} acts as a soft selector that chooses between generating a word from the vocabulary according to P_{vocab} and copying a word from the input sequence according to the attention distribution a^t; for each document, the extended vocabulary is the union of the fixed vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:

P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t

if w is an out-of-vocabulary (OOV) word, then P_{vocab}(w) is zero; similarly, if w does not appear in the source document, then \sum_{i: w_i = w} a_i^t is also zero.
Further, constructing the coverage mechanism, which adds a penalty term to suppress words that have already appeared in the generated summary and thereby reduces the repetition rate, includes the following process:

in the coverage mechanism, a coverage vector c^t is maintained:

c^t = \sum_{t'=0}^{t-1} a^{t'}

c^t is a distribution over the source document words that represents the degree of coverage these words have received from the attention mechanism so far; c^0 is a zero vector, and the attention distribution becomes:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})

wherein w_c is a learnable parameter vector of the same length as v;

a coverage loss is defined to penalize repeatedly attending to the same words:

covloss_t = \sum_i \min(a_i^t, c_i^t)

the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model, yielding a new composite loss function:

loss_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)
The invention has the following beneficial effects: 1. the invention uses BERT as the encoder of the extractive summarization stage, an advanced choice that helps the machine understand the semantic information of the text and improves the quality of the final text abstract;
2. document-level features are extracted with an inter-sentence Transformer, so the text information is used more fully;
3. the pointer-generator model is used as the tool for generating the abstract, which reduces the repetition rate in the generated sentences, avoids sentences looping, and makes the generated sentences more fluent.
Drawings
FIG. 1 is a schematic flow chart of the method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN;
FIG. 2 is a schematic diagram of an extraction model based on a BERT pre-training model;
FIG. 3 is a diagram of the Transformer-based document-level feature extraction architecture;
FIG. 4 is a schematic diagram of a sequence-to-sequence model;
FIG. 5 is a schematic diagram of a pointer generator model.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
As shown in fig. 1 to 5, a method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN includes the following steps:
the method comprises the following steps that firstly, a data set is preprocessed, wherein the preprocessing comprises the steps of carrying out sentence segmentation on documents in the data set to obtain sentences, and carrying out word segmentation on the sentences to obtain words;
secondly, coding sentences and words, coding paragraph marks, coding the positions of the sentences, and acquiring sentence vectors based on a BERT pre-training model according to the codes of the sentences and the words, the codes of the paragraph marks and the position codes of the sentences;
step three, capturing document-level features based on a transformer according to sentence vectors output by the BERT pre-training model;
step four, forming a transition document by the extracted key sentences and inputting the transition document into the generation model;
step five, copying or generating word-level information based on the pointer-generator network;
and step six, generating a document abstract.
The method for coding sentences and words, paragraph marks and sentence positions comprises the following steps of:
carrying out word vector coding, namely firstly coding a beginning symbol [ CLS ] and a sentence ending symbol [ SEP ] of a sentence and all words in a document to obtain the characteristics of a plurality of sentences;
carrying out encoding of paragraph identification, and distinguishing a plurality of sentences in the document by using the paragraph identification;
coding the positions of sentences, wherein a position vector obtained after coding represents the absolute position of each word, each word in an input sequence is sequentially converted into position codes according to the subscript sequence of the word, and then the position codes are converted into real-value vectors by utilizing a position vector matrix to obtain position vectors;
and adding the word vector code, the paragraph mark code and the sentence position code to obtain a sentence vector.
The sentence vector output according to the BERT pre-training model captures the document-level features based on a transformer, and the method comprises the following steps:
the inter-sentence Transformer applies further Transformer layers on top of the sentence representations to extract document-level features dedicated to the summarization task from the BERT output:

\tilde{h}^l = \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1}))

h^l = \mathrm{LN}(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l))

wherein h^0 = \mathrm{PosEmb}(T), T is the sentence vectors output by BERT, and PosEmb adds position encodings representing the position information of each sentence in T; LN is the layer normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;

calculating the final output layer with a sigmoid classifier:

\hat{y}_i = \sigma(W_o h_i^L + b_o)

wherein h_i^L is the vector of sentence sent_i from the top (L-th) Transformer layer; L is set to 2, i.e., a 2-layer Transformer is used.
The method for forming the transition document by the extracted key sentences and inputting the transition document into the generating model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key sentence set, performing word segmentation on the extracted key sentence set, cleaning it, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model. The key sentence extraction method is as follows, and a minimal code sketch is given after this passage:
A bidirectional Transformer is adopted as the encoder for feature extraction; a Transformer encoding unit consists of two parts, a self-attention mechanism and a feed-forward neural network. The input of the self-attention mechanism is made up of three different vectors derived from the same word: a Query vector (Q), a Key vector (K) and a Value vector (V). The similarity between the word vectors of the input is expressed by multiplying the Query and Key vectors, written QK^T, and scaled by \sqrt{d_k} so that the result has a moderate magnitude. Finally, softmax normalization yields a probability distribution, from which the weighted sum over all word vectors in the sentence is obtained:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The sentences whose summed values rank in the top n are taken as key sentences and extracted, where n is the user-defined number of key sentences.
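The following sketch illustrates the key-sentence scoring just described: scaled dot-product self-attention weights are computed per sentence and the top-n sentences are kept. The per-sentence score used here (the norm of the attention-weighted sum of word vectors) is an illustrative assumption; the invention only requires ranking sentences by their summed attention values.

```python
# Illustrative key-sentence selection via scaled dot-product self-attention (NumPy).
import numpy as np

def scaled_dot_product_weights(Q, K):
    """softmax(QK^T / sqrt(d_k)) over one sentence's word vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def top_n_key_sentences(sentence_qkv, n=3):
    """sentence_qkv: list of (Q, K, V) triples, one per sentence; returns indices of the top-n sentences."""
    scores = []
    for Q, K, V in sentence_qkv:
        attn = scaled_dot_product_weights(Q, K)
        weighted = attn @ V                                # weighted sum of word vectors
        scores.append(float(np.linalg.norm(weighted.sum(axis=0))))  # assumed sentence score
    order = np.argsort(scores)[::-1][:n]
    return sorted(order.tolist())
```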
Copying or generating word-level information based on the pointer-generator network comprises the following steps:
s51, constructing a sequence-to-sequence attention model framework;
s52, constructing a pointer generation copy mechanism;
s53, constructing a coverage mechanism that adds a penalty term to suppress words that have already appeared in the generated summary and reduce the repetition rate.
The method for constructing the sequence-to-sequence attention model framework comprises the following steps:
S511, the word vectors of the words are sequentially input into the encoder, a single-layer bidirectional LSTM, generating the sequence of encoder hidden states h_i; at each step t, the decoder receives the word vector of the previous word, and the attention distribution over the source words is computed as:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn})

a^t = \mathrm{softmax}(e^t)

wherein v, W_h, W_s and b_{attn} are learnable parameters; the attention distribution can be seen as a probability distribution over the source words that tells the decoder where to look when generating the next word;

S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector:

h_t^* = \sum_i a_i^t h_i

the context vector is a fixed-size representation of what has been read from the source at this step; it is concatenated with the decoder state s_t and passed through two linear layers;

S513, generating the vocabulary distribution P_{vocab}:

P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')

wherein V, V', b and b' are learnable parameters; P_{vocab} is the probability distribution over all words in the vocabulary and provides the final distribution of the predicted word w;

S514, the distribution of the word w is:

P(w) = P_{vocab}(w)

during training, the loss at time step t is the negative log probability of the target word w_t^* for that time step:

loss_t = -\log P(w_t^*)

and the total loss over the entire sequence is:

loss = \frac{1}{T} \sum_{t=0}^{T} loss_t
Constructing the pointer-generator copy mechanism comprises the following process:

with the decoder state denoted s_t and the decoder input denoted x_t, the generation probability p_{gen} at time step t is computed as:

p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})

wherein the vectors w_{h^*}, w_s, w_x and the scalar b_{ptr} are learnable parameters, and σ is the sigmoid function; p_{gen} acts as a soft selector that chooses between generating a word from the vocabulary according to P_{vocab} and copying a word from the input sequence according to the attention distribution a^t; for each document, the extended vocabulary is the union of the fixed vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:

P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t

if w is an out-of-vocabulary (OOV) word, then P_{vocab}(w) is zero; similarly, if w does not appear in the source document, then \sum_{i: w_i = w} a_i^t is also zero.
Constructing the coverage mechanism, which adds a penalty term to suppress words that have already appeared in the generated summary and thereby reduces the repetition rate, comprises the following process:

in the coverage mechanism, a coverage vector c^t is maintained:

c^t = \sum_{t'=0}^{t-1} a^{t'}

c^t is a distribution over the source document words that represents the degree of coverage these words have received from the attention mechanism so far; c^0 is a zero vector, and the attention distribution becomes:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})

wherein w_c is a learnable parameter vector of the same length as v;

a coverage loss is defined to penalize repeatedly attending to the same words:

covloss_t = \sum_i \min(a_i^t, c_i^t)

the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model, yielding a new composite loss function:

loss_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)
Specifically, the method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN comprises the following steps:
s1, preprocessing the data set, including sentence segmentation and word segmentation (a minimal preprocessing sketch is given after this list of steps);
s2, obtaining sentence vectors based on the BERT pre-training model;
s3, capturing document-level features based on a transformer;
s4, forming a transition document by the extracted key sentences and inputting the transition document into the generation model;
s5, copying or generating word-level information based on the pointer-generator network;
and S6, generating a final abstract.
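As a minimal illustration of step S1, the following sketch splits documents into sentences and words. It is illustrative only; it assumes jieba for Chinese word segmentation and a simple punctuation-based sentence splitter, neither of which is prescribed by the invention.

```python
# Minimal preprocessing sketch for step S1 (sentence segmentation + word segmentation).
import re
import jieba

def split_sentences(document: str):
    """Split a Chinese document into sentences on common end-of-sentence punctuation."""
    parts = re.split(r"(?<=[。！？；])", document)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence: str):
    """Segment one sentence into words."""
    return [w for w in jieba.cut(sentence) if w.strip()]

def preprocess(documents):
    """Return, for each document, its sentences and the per-sentence word lists."""
    corpus = []
    for doc in documents:
        sentences = split_sentences(doc)
        words = [tokenize(s) for s in sentences]
        corpus.append({"sentences": sentences, "words": words})
    return corpus
```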
Further, the step S2 includes the following sub-steps:
s21, carrying out word vector coding, firstly, coding the beginning symbol [ CLS ], the ending symbol [ SEP ] of the sentence and all the words in the document. In normal BERT, [ CLS ] is used as a symbol to aggregate features in one or a pair of sentences. Modifying the model using a plurality of [ CLS ] symbols to obtain characteristics of a plurality of sentences;
s22, the paragraph identifiers are encoded, where the paragraph identifiers are used to distinguish the multiple sentences in the document; specifically, the i-th sentence is encoded with E_A or E_B according to whether its position is odd or even, e.g. odd-numbered sentences are marked with E_A and even-numbered sentences with E_B;
s23, coding the position of the sentence, wherein the position vector obtained after coding can represent the absolute position of each word, each word in the input sequence is sequentially converted into position codes according to the subscript sequence of the word, and then the position codes are converted into real value vectors by utilizing a position vector matrix to obtain the position vector;
and S24, adding the three encodings, where the three vectors have the same dimension and the size of each is the product of the maximum sequence length and the word-vector dimension; a minimal code sketch of sub-steps S21 to S24 is given below.
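The following sketch illustrates sub-steps S21 to S24 in PyTorch: token (word) embeddings including the [CLS] and [SEP] symbols, alternating segment embeddings E_A/E_B, and absolute position embeddings are summed into the input representation. The hidden size, vocabulary size and maximum length are illustrative assumptions, not values fixed by the invention.

```python
# Sketch of S21-S24: sum of token, segment and position embeddings (PyTorch).
import torch
import torch.nn as nn

class SummarizationInputEmbedding(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # word vectors incl. [CLS]/[SEP]
        self.seg = nn.Embedding(2, hidden)            # E_A / E_B, alternating per sentence
        self.pos = nn.Embedding(max_len, hidden)      # absolute position vectors

    def forward(self, token_ids, sentence_index):
        # token_ids:      (batch, seq_len) word-piece ids, with a [CLS] before each sentence
        # sentence_index: (batch, seq_len) index of the sentence each token belongs to
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        segment_ids = sentence_index % 2              # odd/even sentence -> E_A / E_B
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)
```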
Further, capturing document-level features based on the Transformer in step S3 includes the following sub-steps:
S31, the inter-sentence Transformer applies further Transformer layers on top of the sentence representations to extract document-level features dedicated to the summarization task from the BERT output:

\tilde{h}^l = \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1}))

h^l = \mathrm{LN}(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l))

wherein h^0 = \mathrm{PosEmb}(T), T is the sentence vectors output by BERT, and PosEmb adds position encodings representing the position information of each sentence in T; LN is the layer normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer.
S32, calculating the final output layer with a sigmoid classifier:

\hat{y}_i = \sigma(W_o h_i^L + b_o)

wherein h_i^L is the vector of sentence sent_i from the top (L-th) Transformer layer; here L is set to 2, i.e., a 2-layer Transformer is used. A minimal code sketch of S31 and S32 is given below.
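The following sketch illustrates S31 and S32 with a 2-layer inter-sentence Transformer over the sentence vectors followed by a sigmoid classifier. It uses PyTorch's built-in Transformer encoder as a stand-in for the LN/MHAtt/FFN stack; layer sizes are illustrative assumptions.

```python
# Sketch of S31-S32: inter-sentence Transformer + sigmoid sentence classifier (PyTorch).
import torch
import torch.nn as nn

class InterSentenceTransformer(nn.Module):
    def __init__(self, hidden=768, heads=8, layers=2, max_sents=128):
        super().__init__()
        self.pos_emb = nn.Embedding(max_sents, hidden)           # position encodings for sentences
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=2048,
            batch_first=True)                                     # LN + MHAtt + FFN per layer
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(hidden, 1)                    # W_o, b_o

    def forward(self, sent_vecs):
        # sent_vecs: (batch, n_sentences, hidden) sentence vectors T from BERT
        pos = torch.arange(sent_vecs.size(1), device=sent_vecs.device)
        h = sent_vecs + self.pos_emb(pos)                         # add position information to T
        h = self.encoder(h)                                       # h^L
        return torch.sigmoid(self.classifier(h)).squeeze(-1)      # one score per sentence
```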
Further, the step S5 includes the following sub-steps:
s51, constructing a sequence-to-sequence attention model framework;
S511, the word vectors of the words are input into the encoder one by one, the encoder being a single-layer bidirectional LSTM, generating the sequence of encoder hidden states h_i. At each step t, the decoder (a single-layer unidirectional LSTM) receives the word vector of the previous word, and the attention distribution over the source words is computed as:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn})

a^t = \mathrm{softmax}(e^t)   (5)

wherein v, W_h, W_s and b_{attn} are learnable parameters. The attention distribution can be seen as a probability distribution over the source words that tells the decoder where to look when generating the next word.

S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector:

h_t^* = \sum_i a_i^t h_i

The context vector can be seen as a fixed-size representation of what has been read from the source at this step; it is concatenated with the decoder state s_t and passed through two linear layers.

S513, generating the vocabulary distribution P_{vocab}:

P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')

wherein V, V', b and b' are learnable parameters. P_{vocab} is the probability distribution over all words in the vocabulary and provides the final distribution of the predicted word w.

S514, the distribution of the word w is:

P(w) = P_{vocab}(w)   (8)

During training, the loss at time step t is the negative log probability of the target word w_t^* for that time step:

loss_t = -\log P(w_t^*)

and the total loss over the entire sequence is:

loss = \frac{1}{T} \sum_{t=0}^{T} loss_t

A minimal code sketch of one decoding step (S511 to S513) is given below.
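The following sketch shows one decoding step of the attention model (S511 to S513), assuming PyTorch and a bidirectional encoder; shapes and layer sizes are illustrative assumptions.

```python
# Sketch of one attention decoding step: attention distribution, context vector, P_vocab.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, hidden=256, vocab_size=50000):
        super().__init__()
        self.W_h = nn.Linear(2 * hidden, 2 * hidden, bias=False)  # encoder states are bidirectional
        self.W_s = nn.Linear(hidden, 2 * hidden)                  # b_attn folded into this bias
        self.v = nn.Linear(2 * hidden, 1, bias=False)
        self.out1 = nn.Linear(3 * hidden, hidden)                 # V, b
        self.out2 = nn.Linear(hidden, vocab_size)                 # V', b'

    def forward(self, enc_h, s_t):
        # enc_h: (batch, src_len, 2*hidden) encoder hidden states h_i
        # s_t:   (batch, hidden) decoder state at step t
        e_t = self.v(torch.tanh(self.W_h(enc_h) + self.W_s(s_t).unsqueeze(1))).squeeze(-1)
        a_t = F.softmax(e_t, dim=-1)                              # attention distribution a^t
        h_star = torch.bmm(a_t.unsqueeze(1), enc_h).squeeze(1)    # context vector h_t^*
        p_vocab = F.softmax(self.out2(self.out1(torch.cat([s_t, h_star], dim=-1))), dim=-1)
        return a_t, h_star, p_vocab
```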
S52, constructing the pointer-generator copy mechanism, which solves the out-of-vocabulary word problem;
the pointer-generator network allows words both to be copied by pointing and to be generated from a fixed vocabulary. In the pointer-generator model (shown in FIG. 5), the attention distribution a^t and the context vector h_t^* are computed as in S51; in addition, the generation probability p_{gen} ∈ [0, 1] at time step t is computed from the context vector h_t^*, the decoder state s_t and the decoder input x_t:

p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})

wherein the vectors w_{h^*}, w_s, w_x and the scalar b_{ptr} are learnable parameters, and σ is the sigmoid function. p_{gen} acts as a soft selector that decides whether to generate a word from the vocabulary according to P_{vocab} or to copy a word from the input sequence according to the attention distribution a^t. For each document, the extended vocabulary is the union of the fixed vocabulary and all words appearing in the source document. The probability distribution over the extended vocabulary is obtained as follows:

P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t

if w is an out-of-vocabulary (OOV) word, then P_{vocab}(w) is zero; similarly, if w does not appear in the source document, then \sum_{i: w_i = w} a_i^t is also zero.
Therefore, the copy mechanism solves the problem that new vocabulary cannot be produced, and removes the restriction of sequence-to-sequence models to a preset vocabulary. A minimal code sketch of this mixture is given below.
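The following sketch shows the copy mechanism of S52: p_gen mixes the vocabulary distribution with the copy (attention) distribution over an extended vocabulary. The extended-vocabulary indexing scheme shown here is an assumption that follows the usual pointer-generator setup.

```python
# Sketch of the pointer-generator mixture P(w) over the extended vocabulary (PyTorch).
import torch
import torch.nn as nn

class PointerGeneratorHead(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # w_h*, w_s, w_x and b_ptr realised as a single linear layer
        self.p_gen_linear = nn.Linear(2 * hidden + hidden + hidden, 1)

    def forward(self, h_star, s_t, x_t, p_vocab, a_t, src_ext_ids, extended_vocab_size):
        # src_ext_ids: (batch, src_len) long tensor of source-word ids in the extended vocabulary
        p_gen = torch.sigmoid(self.p_gen_linear(torch.cat([h_star, s_t, x_t], dim=-1)))
        batch = p_vocab.size(0)
        p_final = torch.zeros(batch, extended_vocab_size, device=p_vocab.device)
        p_final[:, : p_vocab.size(1)] = p_gen * p_vocab             # generate from the vocabulary
        p_final.scatter_add_(1, src_ext_ids, (1.0 - p_gen) * a_t)   # copy from the source document
        return p_final                                              # P(w) over the extended vocabulary
```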
S53, constructing a coverage mechanism to increase punishment items and reduce the words which are already generated in the generated abstract, thereby reducing the repetition rate;
in the overlay mechanism, one overlay vector c is maintained t I.e. the sum of the attention distributions of all previous decoder time steps:
c t is a (non-normalized) distribution of source document words that represents the degree of coverage that these words have obtained from the attention mechanism to date. c. C 0 Is a zero vector because in the first time step, no source documents are involved. The coverage vector is used as an additional input to the attention mechanism, so the attention distribution at this time is as follows:
wherein w c Is a learnable parameter vector of the same length as v, which ensures that where the attention mechanism chooses as the next place of attention is influenced by the words already in the generated summary. This will make it easier for the attention mechanism to avoid repeatedly paying attention to the same location, and thus avoiding the generation of repeated text.
In addition, an extra coverage loss is defined to punish multiple attention to the same vocabulary:
And finally, weighting the coverage loss through the hyper-parameter lambda, and adding the coverage loss into a loss function of the sequence model to obtain a new composite loss function:
therefore, the problem of word circulation and repetition based on a sequence-to-sequence model during the generation of the long document abstract is solved through a coverage mechanism.
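The following sketch illustrates the coverage mechanism of S53: the coverage vector accumulates past attention, feeds back into the attention scores through w_c, and yields the coverage loss that is added to the main loss. Shapes and layer sizes are illustrative assumptions.

```python
# Sketch of coverage-augmented attention and the per-step coverage loss (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.W_h = nn.Linear(2 * hidden, 2 * hidden, bias=False)
        self.W_s = nn.Linear(hidden, 2 * hidden)
        self.w_c = nn.Linear(1, 2 * hidden, bias=False)   # coverage feature, same length as v
        self.v = nn.Linear(2 * hidden, 1, bias=False)

    def forward(self, enc_h, s_t, coverage):
        # coverage: (batch, src_len) running sum of previous attention distributions c^t
        features = self.W_h(enc_h) + self.W_s(s_t).unsqueeze(1) + self.w_c(coverage.unsqueeze(-1))
        a_t = F.softmax(self.v(torch.tanh(features)).squeeze(-1), dim=-1)
        cov_loss = torch.sum(torch.min(a_t, coverage), dim=-1)   # covloss_t
        new_coverage = coverage + a_t                            # c^{t+1}
        return a_t, new_coverage, cov_loss

# The composite per-step loss then adds the weighted coverage term:
#   loss_t = -log P(w*_t) + lambda * cov_loss
```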
The foregoing is illustrative of the preferred embodiments of this invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept described above, whether through the teachings above or through the skill or knowledge of the relevant art, may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. A method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, characterized by comprising the following steps:
the method comprises the following steps that firstly, a data set is preprocessed, wherein the preprocessing comprises the steps of carrying out sentence segmentation on documents in the data set to obtain sentences, and carrying out word segmentation on the sentences to obtain words;
secondly, coding sentences and words, coding paragraph marks, coding the positions of the sentences, and acquiring sentence vectors based on a BERT pre-training model according to the codes of the sentences and the words, the codes of the paragraph marks and the position codes of the sentences;
step three, capturing document-level features based on a transformer according to sentence vectors output by the BERT pre-training model;
step four, forming a transition document by the extracted key sentences and inputting the transition document into the generation model;
step five, copying or generating word-level information based on the pointer-generator network;
and step six, generating a document abstract.
2. The method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein encoding sentences and words, encoding paragraph identifiers, encoding the positions of sentences, and obtaining sentence vectors based on the BERT pre-training model from the sentence and word encodings, the paragraph identifier encodings and the sentence position encodings comprises the following steps:
carrying out word vector coding, namely firstly coding a beginning symbol [ CLS ] and a sentence ending symbol [ SEP ] of a sentence and all words in a document to obtain the characteristics of a plurality of sentences;
carrying out the encoding of paragraph identification, and distinguishing a plurality of sentences in the document by using the paragraph identification;
coding the positions of sentences, wherein a position vector obtained after coding represents the absolute position of each word, each word in an input sequence is sequentially converted into position codes according to the subscript sequence of the word, and then the position codes are converted into real-value vectors by utilizing a position vector matrix to obtain position vectors;
and adding the word vector codes, the paragraph marks and the sentence position codes to obtain sentence vectors.
3. The method as claimed in claim 1, wherein capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer applies further Transformer layers on top of the sentence representations to extract document-level features dedicated to the summarization task from the BERT output:

\tilde{h}^l = \mathrm{LN}(h^{l-1} + \mathrm{MHAtt}(h^{l-1}))

h^l = \mathrm{LN}(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l))

wherein h^0 = \mathrm{PosEmb}(T), T is the sentence vectors output by BERT, and PosEmb adds position encodings representing the position information of each sentence in T; LN is the layer normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
calculating the final output layer with a sigmoid classifier:

\hat{y}_i = \sigma(W_o h_i^L + b_o)

wherein h_i^L is the vector of sentence sent_i from the top (L-th) Transformer layer; L is set to 2, i.e., a 2-layer Transformer is used.
4. The method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein forming a transition document from the extracted key sentences and inputting it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key sentence set, performing word segmentation on the extracted key sentence set, cleaning it, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model.
5. The method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein copying or generating word-level information based on the pointer-generator network comprises the following steps:
s51, constructing a sequence-to-sequence attention model framework;
s52, constructing a pointer generation copy mechanism;
s53, constructing a coverage mechanism that adds a penalty term to suppress words that have already appeared in the generated summary and reduce the repetition rate.
6. The method of claim 5, wherein the constructing the sequence-to-sequence attention model framework comprises the following steps:
S511, the word vectors of the words are sequentially input into the encoder, a single-layer bidirectional LSTM, generating the sequence of encoder hidden states h_i; at each step t, the decoder receives the word vector of the previous word, and the attention distribution over the source words is computed as:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn})

a^t = \mathrm{softmax}(e^t)

wherein v, W_h, W_s and b_{attn} are learnable parameters; the attention distribution can be seen as a probability distribution over the source words that tells the decoder where to look when generating the next word;

S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector:

h_t^* = \sum_i a_i^t h_i

the context vector is a fixed-size representation of what has been read from the source at this step; it is concatenated with the decoder state s_t and passed through two linear layers;

S513, generating the vocabulary distribution P_{vocab}:

P_{vocab} = \mathrm{softmax}(V'(V[s_t, h_t^*] + b) + b')

wherein V, V', b and b' are learnable parameters; P_{vocab} is the probability distribution over all words in the vocabulary and provides the final distribution of the predicted word w;

S514, the distribution of the word w is:

P(w) = P_{vocab}(w)

during training, the loss at time step t is the negative log probability of the target word w_t^* for that time step:

loss_t = -\log P(w_t^*)

and the total loss over the entire sequence is:

loss = \frac{1}{T} \sum_{t=0}^{T} loss_t
7. The method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 5, wherein constructing the pointer-generator copy mechanism includes the following process:
with the decoder state denoted s_t and the decoder input denoted x_t, the generation probability p_{gen} at time step t is computed as:

p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})

wherein the vectors w_{h^*}, w_s, w_x and the scalar b_{ptr} are learnable parameters, and σ is the sigmoid function; p_{gen} acts as a soft selector that chooses between generating a word from the vocabulary according to P_{vocab} and copying a word from the input sequence according to the attention distribution a^t; for each document, the extended vocabulary is the union of the fixed vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:

P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t
8. The method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 5, wherein constructing a coverage mechanism, which adds a penalty term to suppress words that have already appeared in the generated abstract and thereby reduces the repetition rate, includes the following process:
in the coverage mechanism, a coverage vector c^t is maintained:

c^t = \sum_{t'=0}^{t-1} a^{t'}

c^t is a distribution over the source document words that represents the degree of coverage these words have received from the attention mechanism so far; c^0 is a zero vector, and the attention distribution becomes:

e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})

wherein w_c is a learnable parameter vector of the same length as v;

a coverage loss is defined to penalize repeatedly attending to the same words:

covloss_t = \sum_i \min(a_i^t, c_i^t)

the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model, yielding a new composite loss function:

loss_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210593675.6A CN115062140A (en) | 2022-05-27 | 2022-05-27 | Method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210593675.6A CN115062140A (en) | 2022-05-27 | 2022-05-27 | Method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN
Publications (1)
Publication Number | Publication Date |
---|---|
CN115062140A true CN115062140A (en) | 2022-09-16 |
Family
ID=83197907
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210593675.6A Pending CN115062140A (en) | 2022-05-27 | 2022-05-27 | Method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115062140A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116187163A (en) * | 2022-12-20 | 2023-05-30 | 北京知呱呱科技服务有限公司 | Construction method and system of pre-training model for patent document processing |
CN116245178A (en) * | 2023-05-08 | 2023-06-09 | 中国人民解放军国防科技大学 | Biomedical knowledge extraction method and device of decoder based on pointer network |
CN116933785A (en) * | 2023-06-30 | 2023-10-24 | 国网湖北省电力有限公司武汉供电公司 | Transformer-based electronic file abstract generation method, system and medium |
CN117668213A (en) * | 2024-01-29 | 2024-03-08 | 南京争锋信息科技有限公司 | Chaotic engineering abstract generation method based on cascade extraction and graph comparison model |
CN118506387A (en) * | 2024-07-17 | 2024-08-16 | 中科晶锐(苏州)科技有限公司 | Radar display control key information extraction device and method in electronic countermeasure |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112559730A (en) * | 2020-12-08 | 2021-03-26 | 北京京航计算通讯研究所 | Text abstract automatic generation method and system based on global feature extraction |
CN113360601A (en) * | 2021-06-10 | 2021-09-07 | 东北林业大学 | PGN-GAN text abstract model fusing topics |
CN114139497A (en) * | 2021-12-13 | 2022-03-04 | 国家电网有限公司大数据中心 | Text abstract extraction method based on BERTSUM model |
CN114281982A (en) * | 2021-12-29 | 2022-04-05 | 中山大学 | Book propaganda abstract generation method and system based on multi-mode fusion technology |
- 2022-05-27: application CN202210593675.6A filed in CN; published as CN115062140A, status pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112559730A (en) * | 2020-12-08 | 2021-03-26 | 北京京航计算通讯研究所 | Text abstract automatic generation method and system based on global feature extraction |
CN113360601A (en) * | 2021-06-10 | 2021-09-07 | 东北林业大学 | PGN-GAN text abstract model fusing topics |
CN114139497A (en) * | 2021-12-13 | 2022-03-04 | 国家电网有限公司大数据中心 | Text abstract extraction method based on BERTSUM model |
CN114281982A (en) * | 2021-12-29 | 2022-04-05 | 中山大学 | Book propaganda abstract generation method and system based on multi-mode fusion technology |
Non-Patent Citations (2)
Title |
---|
ABIGAIL SEE et al.: "Get To The Point: Summarization with Pointer-Generator Networks", arXiv:1704.04368v2, 25 April 2017 (2017-04-25), pages 1 - 2 *
PAN QIAN: "Research on Text Summarization Methods Based on the BERT Model", China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2022 (2022-01-15), pages 2 *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116187163A (en) * | 2022-12-20 | 2023-05-30 | 北京知呱呱科技服务有限公司 | Construction method and system of pre-training model for patent document processing |
CN116187163B (en) * | 2022-12-20 | 2024-02-20 | 北京知呱呱科技有限公司 | Construction method and system of pre-training model for patent document processing |
CN116245178A (en) * | 2023-05-08 | 2023-06-09 | 中国人民解放军国防科技大学 | Biomedical knowledge extraction method and device of decoder based on pointer network |
CN116245178B (en) * | 2023-05-08 | 2023-07-21 | 中国人民解放军国防科技大学 | Biomedical knowledge extraction method and device of decoder based on pointer network |
CN116933785A (en) * | 2023-06-30 | 2023-10-24 | 国网湖北省电力有限公司武汉供电公司 | Transformer-based electronic file abstract generation method, system and medium |
CN117668213A (en) * | 2024-01-29 | 2024-03-08 | 南京争锋信息科技有限公司 | Chaotic engineering abstract generation method based on cascade extraction and graph comparison model |
CN117668213B (en) * | 2024-01-29 | 2024-04-09 | 南京争锋信息科技有限公司 | Chaotic engineering abstract generation method based on cascade extraction and graph comparison model |
CN118506387A (en) * | 2024-07-17 | 2024-08-16 | 中科晶锐(苏州)科技有限公司 | Radar display control key information extraction device and method in electronic countermeasure |
CN118506387B (en) * | 2024-07-17 | 2024-10-15 | 中科晶锐(苏州)科技有限公司 | Radar display control key information extraction device and method in electronic countermeasure |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113010693B (en) | Knowledge graph intelligent question-answering method integrating pointer generation network | |
CN115062140A (en) | Method for generating abstracts of long documents in a supply chain ecosystem by fusing BERTSUM and PGN | |
CN111310471B (en) | Travel named entity identification method based on BBLC model | |
CN111708882B (en) | Transformer-based Chinese text information missing completion method | |
CN114201581B (en) | Long text retrieval model based on contrast learning | |
CN110796160B (en) | Text classification method, device and storage medium | |
CN113128229A (en) | Chinese entity relation joint extraction method | |
CN111881677A (en) | Address matching algorithm based on deep learning model | |
CN113190656A (en) | Chinese named entity extraction method based on multi-label framework and fusion features | |
CN114757182A (en) | BERT short text sentiment analysis method for improving training mode | |
CN114154504B (en) | Chinese named entity recognition algorithm based on multi-information enhancement | |
CN112417901A (en) | Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention | |
CN114281982B (en) | Book propaganda abstract generation method and system adopting multi-mode fusion technology | |
CN114169312A (en) | Two-stage hybrid automatic summarization method for judicial official documents | |
CN113821635A (en) | Text abstract generation method and system for financial field | |
CN112417138A (en) | Short text automatic summarization method combining pointer generation type and self-attention mechanism | |
CN116663578A (en) | Neural machine translation method based on strategy gradient method improvement | |
CN112069827B (en) | Data-to-text generation method based on fine-grained subject modeling | |
CN112765983A (en) | Entity disambiguation method based on neural network combined with knowledge description | |
CN115048511A (en) | Bert-based passport layout analysis method | |
CN117744658A (en) | Ship naming entity identification method based on BERT-BiLSTM-CRF | |
CN113297374A (en) | Text classification method based on BERT and word feature fusion | |
CN112434512A (en) | New word determining method and device in combination with context | |
CN116611428A (en) | Non-autoregressive decoding Vietnam text regularization method based on editing alignment algorithm | |
Aharoni et al. | Sequence to sequence transduction with hard monotonic attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |