CN115062140A - Method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN - Google Patents

Method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN

Info

Publication number
CN115062140A
Authority
CN
China
Prior art keywords
document
sentences
word
generating
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210593675.6A
Other languages
Chinese (zh)
Inventor
张川东
张卜心
阎德劲
王伟
廖伟智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210593675.6A priority Critical patent/CN115062140A/en
Publication of CN115062140A publication Critical patent/CN115062140A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, which comprises the following steps: preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words; encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding; capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model; forming a transition document from the extracted key sentences and feeding it into the generative model; copying or generating word-level information with a pointer-generator network; and generating the document summary. The invention uses BERT as the encoder of the extractive part of text summarization, which helps the machine understand the semantic information of the text and improves the quality of the final summary.

Description

Method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN
Technical Field
The invention belongs to the field of data processing for collaborative supply chain ecosystems, and particularly relates to a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN.
Background
Text summarization aims to condense and simplify complex, lengthy text while preserving its original meaning. Traditional manual summarization is time-consuming and labor-intensive, which is particularly evident in collaborative supply chain ecosystems. The ecosystem hosts numerous services, and document information such as technical documents, operating instructions and news announcements in each service section is produced endlessly. If key information has to be extracted after manual reading and understanding, a large amount of manpower is consumed to maintain the ecosystem, and the speed at which users can accurately obtain the latest information is greatly reduced. With the development of modern computer technology, machines can learn how to automatically extract the summary content of a text. Text summarization involves several fields such as artificial intelligence and machine learning and is a key direction of natural language processing research; summarization approaches are mainly divided into extractive and abstractive.
Extractive summarization selects key sentences and key words from the source document to form the summary, so the summary consists entirely of original text. This approach has a low error rate in grammar and syntax and guarantees a certain level of quality. Traditional extractive methods rely on graph algorithms, clustering, neural networks, and the like. Graph methods are represented by TextRank, an algorithm modeled on PageRank that takes sentences as nodes and uses inter-sentence similarity to build undirected weighted edges. Node values are iteratively updated with the edge weights, and finally the N highest-scoring nodes are selected as the summary. However, TextRank depends heavily on the word-segmentation result: if one term is split into two words during segmentation, they cannot be joined again during keyword extraction, so precision and recall differ greatly. Moreover, although TextRank considers relationships between words, it still tends to treat frequent words as keywords. In addition, because TextRank involves building a word graph and iterative computation, extraction is slow. In summary, graph-based unsupervised methods do not take the semantic information of sentences into account when extracting the summary, and they are slow. In contrast, sentence feature vectors extracted by a neural network based on the BERT pre-training model provide a better semantic representation.
Extractive summaries have certain grammatical and syntactic guarantees, but they also face problems such as content-selection errors, poor coherence and poor flexibility. Abstractive summarization allows new words or phrases to appear in the summary and is highly flexible; with the development of neural network models in recent years, sequence-to-sequence (Seq2Seq) models have been widely used for the abstractive summarization task and have achieved certain results. However, in abstractive summarization the generation process often lacks control and guidance from key information; for example, a pointer-generator network cannot accurately locate key words during copying.
Disclosure of Invention
The invention aims to overcome the defects of the prior art in processing technical document data in a collaborative supply chain ecosystem, and provides a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, which comprises the following steps:
step one, preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words;
step two, encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding;
step three, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model;
step four, forming a transition document from the extracted key sentences and feeding it into the generative model;
step five, copying or generating word-level information with the pointer-generator network;
step six, generating the document summary.
Further, encoding the sentences and words, the paragraph identifiers and the sentence positions comprises the following steps:
performing word-vector encoding, namely first encoding the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document to obtain the features of multiple sentences;
performing paragraph-identifier encoding, using the paragraph identifiers to distinguish the multiple sentences in the document;
encoding the sentence positions, wherein the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and adding the word-vector encoding, the paragraph-identifier encoding and the sentence-position encoding to obtain the sentence vectors.
Further, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, PosEmb is the added positional-encoding function representing the position information of sentence T, LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); L is set to 2, i.e., a 2-layer Transformer is used.
Further, copying or generating word-level information with the pointer-generator network comprises the following steps:
S51, constructing a sequence-to-sequence attention model framework;
S52, constructing a pointer-generator copy mechanism;
S53, constructing a coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary, and lowers the repetition rate.
Further, forming a transition document from the extracted key sentences and feeding it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key-sentence set, segmenting the extracted key sentences into words, cleaning them, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model, as sketched below.
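The following is a minimal sketch of this transition step. It assumes the gensim library for word2vec, key sentences that are already whitespace-tokenized (for example by a Chinese word segmenter), and an illustrative stop-word set and vector size; none of these specifics are fixed by the invention.

    from gensim.models import Word2Vec

    def build_transition_inputs(key_sentences, stop_words=frozenset()):
        # Word-segment and clean the extracted key sentences (assumes the sentences
        # are already whitespace-tokenized).
        cleaned = [[w for w in s.split() if w not in stop_words] for s in key_sentences]
        # Obtain word2vec vector representations of the cleaned text (illustrative parameters).
        w2v = Word2Vec(sentences=cleaned, vector_size=128, window=5, min_count=1, workers=1)
        # Encode each key sentence as a sequence of word vectors; together these sequences
        # form the transition document fed to the pointer-generator network.
        return [[w2v.wv[w] for w in sent] for sent in cleaned]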
Further, constructing the sequence-to-sequence attention model framework comprises the following processes:
S511, the word vectors of the words are input into the encoder in sequence; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
a^t = softmax(e^t)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word;
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i
the context vector, a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers;
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w;
S514, the distribution of word w is
P(w) = P_vocab(w)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t
further, the constructing pointer generates copy mechanism, which includes the following processes:
the decoder state is denoted as s t The decoder input is denoted x t :
Figure BDA0003666765690000042
Wherein the vector
Figure BDA0003666765690000043
w s ,w x And a scalar b ptr Is a learnable parameter, δ is a sigmoid function, p gen For flexible selectors, selecting from P vocab Generating a word from the sample vocabulary of (a) or from a t The probability distribution of the extended vocabulary, where the sample copies one word from the input sequence, is obtained for each document by adding the extra vocabulary to all the vocabularies present in the source document, as follows:
Figure BDA0003666765690000044
if w is an out-of-vocabulary word (OOV), then P vocab (w) is zero; similarly, if w does not appear in the source document, then
Figure BDA0003666765690000045
Also 0.
Further, constructing the coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary and lowers the repetition rate comprises the following processes:
in the coverage mechanism, a coverage vector c^t is maintained:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far; c^0 is a zero vector. The attention distribution then becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v;
a coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
the coverage loss is bounded, satisfying covloss_t ≤ Σ_i a_i^t = 1;
the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
The beneficial effects of the invention are: 1. the invention uses BERT as the encoder of the extractive part of text summarization, which is advanced, helps the machine understand the semantic information of the text, and improves the quality of the final summary;
2. document-level features are extracted with an inter-sentence Transformer, so the text information is used more fully;
3. the pointer-generator model is used as the summary-generation tool, which reduces the repetition rate in the generated summary sentences, avoids sentence loops, and makes the generated sentences more fluent.
Drawings
FIG. 1 is a schematic flow chart of the method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN;
FIG. 2 is a schematic diagram of an extraction model based on a BERT pre-training model;
FIG. 3 is a diagram of the Transformer-based document-level feature extraction architecture;
FIG. 4 is a schematic diagram of a sequence-to-sequence model;
FIG. 5 is a schematic diagram of a pointer generator model.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
As shown in FIG. 1 to FIG. 5, a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN includes the following steps:
step one, preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words;
step two, encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding;
step three, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model;
step four, forming a transition document from the extracted key sentences and feeding it into the generative model;
step five, copying or generating word-level information with the pointer-generator network;
step six, generating the document summary.
Encoding the sentences and words, the paragraph identifiers and the sentence positions comprises the following steps:
performing word-vector encoding, namely first encoding the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document to obtain the features of multiple sentences;
performing paragraph-identifier encoding, using the paragraph identifiers to distinguish the multiple sentences in the document;
encoding the sentence positions, wherein the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and adding the word-vector encoding, the paragraph-identifier encoding and the sentence-position encoding to obtain the sentence vectors.
Capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, PosEmb is the added positional-encoding function representing the position information of sentence T, LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); L is set to 2, i.e., a 2-layer Transformer is used.
Forming a transition document from the extracted key sentences and feeding it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key-sentence set, segmenting the extracted key sentences into words, cleaning them, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model. The key-sentence extraction method is as follows:
a bidirectional Transformer is adopted as the encoder for feature extraction; a Transformer encoding unit consists of two parts, a self-attention mechanism and a feed-forward neural network. The input of the self-attention mechanism is made up of three different vectors derived from the same word: a Query vector (Q), a Key vector (K) and a Value vector (V). The similarity between the input word vectors is expressed by multiplying the Query and Key vectors, written QK^T, which is scaled by √d_k so that the result has a moderate magnitude. Finally, softmax normalization yields a probability distribution, from which the weighted sum over all word vectors in the sentence is obtained:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The sentences whose summed attention values rank in the top n are taken as key sentences and extracted, where n is the user-defined number of key sentences, as sketched below.
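A minimal sketch of this scaled dot-product self-attention scoring and top-n selection, assuming NumPy and one matrix of sentence vectors per document; scoring sentences by their total received attention mass and the default n are illustrative assumptions.

    import numpy as np

    def self_attention(X):
        # X: (num_sentences, d_k) matrix of vectors; here Q = K = V = X.
        d_k = X.shape[-1]
        scores = X @ X.T / np.sqrt(d_k)                      # QK^T / sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # softmax -> probability distribution
        return weights @ X, weights                          # weighted sum and attention weights

    def top_n_key_sentences(X, n=3):
        _, weights = self_attention(X)
        totals = weights.sum(axis=0)        # total attention each sentence receives
        return np.argsort(-totals)[:n]      # indices of the n highest-scoring sentences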
Copying or generating word-level information with the pointer-generator network comprises the following steps:
S51, constructing a sequence-to-sequence attention model framework;
S52, constructing a pointer-generator copy mechanism;
S53, constructing a coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary, and lowers the repetition rate.
Constructing the sequence-to-sequence attention model framework comprises the following steps:
S511, the word vectors of the words are input into the encoder in sequence; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
a^t = softmax(e^t)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word;
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i
the context vector, a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers;
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w;
S514, the distribution of word w is
P(w) = P_vocab(w)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t
Constructing the pointer-generator copy mechanism comprises the following processes:
the decoder state is denoted s_t and the decoder input is denoted x_t:
p_gen = σ(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
where the vectors w_{h*}, w_s, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid function; p_gen acts as a soft selector that chooses between generating a word by sampling from P_vocab and copying a word from the input sequence by sampling from the attention distribution a^t. For each document, the extended vocabulary is the union of the base vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a_i^t
if w is an out-of-vocabulary (OOV) word, then P_vocab(w) is zero; similarly, if w does not appear in the source document, then Σ_{i: w_i = w} a_i^t is also zero.
Constructing the coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary and lowers the repetition rate comprises the following processes:
in the coverage mechanism, a coverage vector c^t is maintained:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far; c^0 is a zero vector. The attention distribution then becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v;
a coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
the coverage loss is bounded, satisfying covloss_t ≤ Σ_i a_i^t = 1;
the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
Specifically, the method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN comprises the following steps:
S1, preprocessing the data set, including sentence segmentation and word segmentation;
S2, obtaining sentence vectors based on the BERT pre-training model;
S3, capturing document-level features with a Transformer;
S4, forming a transition document from the extracted key sentences and feeding it into the generative model;
S5, copying or generating word-level information with the pointer-generator network;
and S6, generating the final summary.
Further, the step S2 includes the following sub-steps:
S21, performing word-vector encoding: first encode the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document. In standard BERT, [CLS] is used as a symbol that aggregates the features of one sentence or a pair of sentences. Here the model is modified to use multiple [CLS] symbols so as to obtain the features of multiple sentences;
S22, performing paragraph-identifier encoding, where the paragraph identifiers are used to distinguish the multiple sentences in the document. Specifically, for the i-th sentence, E_A or E_B is used to encode whether its position is odd or even; for example, odd-numbered sentences are marked with E_A and even-numbered sentences with E_B;
S23, encoding the sentence positions, where the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and S24, adding the three encodings; the three vectors have the same dimension, and their size is the product of the maximum sequence length and the word-vector dimension. A minimal sketch of this input construction is given below.
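The following PyTorch-style sketch illustrates sub-steps S21 to S24; the vocabulary size, hidden dimension, maximum length and module name are illustrative assumptions rather than values fixed by the invention.

    import torch
    import torch.nn as nn

    class SentenceInputEmbedding(nn.Module):
        # Sum of token, segment (paragraph-identifier) and position embeddings.
        def __init__(self, vocab_size=30522, hidden=768, max_len=512):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, hidden)   # word-vector encoding ([CLS]/[SEP] included)
            self.seg = nn.Embedding(2, hidden)            # E_A / E_B for odd / even sentences
            self.pos = nn.Embedding(max_len, hidden)      # absolute position of each word

        def forward(self, token_ids, segment_ids):
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            positions = positions.unsqueeze(0).expand_as(token_ids)
            # The three encodings share the same dimension and are added element-wise (S24).
            return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)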
Further, the Transformer-based capture of document-level features in step S3 includes the following sub-steps:
S31, the inter-sentence Transformer applies additional Transformer layers on the sentence representations and extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, and PosEmb is the added positional-encoding function representing the position information of sentence T. LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer.
S32, the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); here L is set to 2, i.e., a 2-layer Transformer. A minimal sketch of these layers is given below.
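The following PyTorch-style sketch shows the stacked inter-sentence Transformer layers and the sigmoid classifier of S31 and S32; the head count, feed-forward width and module names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class InterSentenceTransformer(nn.Module):
        def __init__(self, hidden=768, heads=8, ff=2048, layers=2):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.ModuleDict({
                    "attn": nn.MultiheadAttention(hidden, heads, batch_first=True),
                    "ln1": nn.LayerNorm(hidden),
                    "ffn": nn.Sequential(nn.Linear(hidden, ff), nn.ReLU(), nn.Linear(ff, hidden)),
                    "ln2": nn.LayerNorm(hidden),
                }) for _ in range(layers)
            ])
            self.classifier = nn.Linear(hidden, 1)          # W_o, b_o

        def forward(self, sent_vecs, pos_emb):
            h = sent_vecs + pos_emb                         # h^0 = PosEmb(T)
            for layer in self.layers:
                attn_out, _ = layer["attn"](h, h, h)        # multi-head attention between sentences
                h_tilde = layer["ln1"](h + attn_out)        # h~^l = LN(h^(l-1) + MHAtt(h^(l-1)))
                h = layer["ln2"](h_tilde + layer["ffn"](h_tilde))  # h^l = LN(h~^l + FFN(h~^l))
            return torch.sigmoid(self.classifier(h)).squeeze(-1)   # y^_i = sigma(W_o h_i^L + b_o)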
Further, the step S5 includes the following sub-steps:
S51, constructing the sequence-to-sequence attention model framework;
S511, the word vectors of the words are input into the encoder one by one; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder (a single-layer unidirectional LSTM) receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)    (4)
a^t = softmax(e^t)    (5)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word.
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i    (6)
the context vector, which can be seen as a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers.
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')    (7)
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w.
S514, the distribution of word w is
P(w) = P_vocab(w)    (8)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)    (9)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t    (10)
A minimal sketch of one step of this attention decoder is given below.
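The following PyTorch-style sketch shows one decoder step of this sequence-to-sequence attention framework, covering equations (4) to (8); the tensor shapes and the module name are illustrative assumptions.

    import torch
    import torch.nn as nn

    class AttentionDecoderStep(nn.Module):
        def __init__(self, hidden, vocab_size):
            super().__init__()
            self.W_h = nn.Linear(hidden, hidden, bias=False)
            self.W_s = nn.Linear(hidden, hidden)             # its bias plays the role of b_attn
            self.v = nn.Linear(hidden, 1, bias=False)
            self.out = nn.Sequential(                        # the two linear layers producing P_vocab
                nn.Linear(2 * hidden, hidden),
                nn.Linear(hidden, vocab_size),
            )

        def forward(self, h, s_t):
            # h: (batch, src_len, hidden) encoder states; s_t: (batch, hidden) decoder state.
            e_t = self.v(torch.tanh(self.W_h(h) + self.W_s(s_t).unsqueeze(1))).squeeze(-1)  # (4)
            a_t = torch.softmax(e_t, dim=-1)                                                # (5)
            h_star = torch.bmm(a_t.unsqueeze(1), h).squeeze(1)           # context vector    (6)
            p_vocab = torch.softmax(self.out(torch.cat([s_t, h_star], dim=-1)), dim=-1)     # (7), (8)
            return p_vocab, a_t, h_star

The training loss of equation (9) is then the negative log probability that p_vocab assigns to the target word at step t, averaged over the sequence as in equation (10).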
s52, constructing a pointer to generate a copy mechanism, and solving the problem of unregistered words;
since it allows both copying words from the source by pointing and generating words from a fixed vocabulary. In the pointer-generator model (shown in FIG. 5), the attention distribution a^t and the context vector h_t^* computed in step S51 are reused.
Furthermore, the generation probability p_gen ∈ [0, 1] for time step t is calculated from the context vector h_t^*, the decoder state s_t and the decoder input x_t:
p_gen = σ(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
where the vectors w_{h*}, w_s, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid function. p_gen is regarded as a soft selector that chooses between generating a word from the vocabulary by sampling from P_vocab and copying a word from the input sequence by sampling from the attention distribution a^t. For each document, the extended vocabulary is the union of the base vocabulary and all words appearing in the source document. The probability distribution over the extended vocabulary is obtained as follows:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a_i^t
if w is an out-of-vocabulary (OOV) word, then P_vocab(w) is zero; similarly, if w does not appear in the source document, then Σ_{i: w_i = w} a_i^t is also zero.
Therefore, the copy mechanism solves the problem that new words cannot be generated and breaks the limitation of sequence-to-sequence models whose output is restricted to a preset vocabulary.
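The following PyTorch-style sketch illustrates p_gen and the extended-vocabulary distribution of this copy mechanism; src_ext_ids (mapping each source position to an index in the extended vocabulary) and the function names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PointerSwitch(nn.Module):
        def __init__(self, hidden, emb):
            super().__init__()
            self.w_hstar = nn.Linear(hidden, 1, bias=False)
            self.w_s = nn.Linear(hidden, 1, bias=False)
            self.w_x = nn.Linear(emb, 1)                  # its bias plays the role of b_ptr

        def forward(self, h_star, s_t, x_t):
            # p_gen = sigmoid(w_h*^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr)
            return torch.sigmoid(self.w_hstar(h_star) + self.w_s(s_t) + self.w_x(x_t))

    def extended_vocab_distribution(p_gen, p_vocab, a_t, src_ext_ids, extended_size):
        # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t
        batch, vocab = p_vocab.shape
        dist = torch.zeros(batch, extended_size)
        dist[:, :vocab] = p_gen * p_vocab                       # generation part (zero for OOV words)
        dist.scatter_add_(1, src_ext_ids, (1 - p_gen) * a_t)    # copy part from the attention weights
        return dist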
S53, constructing a coverage mechanism that adds a penalty term and reduces the words that have already appeared in the generated summary, thereby lowering the repetition rate;
in the coverage mechanism, a coverage vector c^t is maintained, namely the sum of the attention distributions over all previous decoder time steps:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a (non-normalized) distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far. c^0 is a zero vector, because at the first time step none of the source document has been covered. The coverage vector is used as an additional input to the attention mechanism, so the attention distribution becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v. This ensures that the attention mechanism's choice of where to attend next is influenced by the words already present in the generated summary, which makes it easier to avoid attending repeatedly to the same locations and thus avoids generating repeated text.
In addition, an extra coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
this coverage loss is bounded and satisfies covloss_t ≤ Σ_i a_i^t = 1.
Finally, the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
Therefore, the coverage mechanism solves the problem of word looping and repetition in sequence-to-sequence models when generating long-document summaries.
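The following sketch shows the coverage-vector update, the coverage loss and the composite loss of this mechanism, assuming PyTorch; the helper names are illustrative, and the coverage-aware attention score differs from the plain score only by the added w_c c_i^t term.

    import torch

    def update_coverage(c_t, a_t):
        # c^{t+1} = c^t + a^t, with c^0 initialized to a zero vector.
        return c_t + a_t

    def coverage_loss(a_t, c_t):
        # covloss_t = sum_i min(a_i^t, c_i^t); bounded above by sum_i a_i^t = 1.
        return torch.minimum(a_t, c_t).sum(dim=-1)

    def composite_loss(nll_t, a_t, c_t, lam=1.0):
        # loss_t = -log P(w*_t) + lambda * sum_i min(a_i^t, c_i^t)
        return nll_t + lam * coverage_loss(a_t, c_t)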
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept described above or apparent to those skilled in the relevant art may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, characterized by comprising the following steps:
step one, preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words;
step two, encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding;
step three, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model;
step four, forming a transition document from the extracted key sentences and feeding it into the generative model;
step five, copying or generating word-level information with the pointer-generator network;
step six, generating the document summary.
2. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein encoding the sentences and words, the paragraph identifiers and the sentence positions, and obtaining sentence vectors from the BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding, comprises the following steps:
performing word-vector encoding, namely first encoding the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document to obtain the features of multiple sentences;
performing paragraph-identifier encoding, using the paragraph identifiers to distinguish the multiple sentences in the document;
encoding the sentence positions, wherein the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and adding the word-vector encoding, the paragraph-identifier encoding and the sentence-position encoding to obtain the sentence vectors.
3. The method as claimed in claim 1, wherein capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, PosEmb is the added positional-encoding function representing the position information of sentence T, LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); L is set to 2, i.e., a 2-layer Transformer is used.
4. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein forming a transition document from the extracted key sentences and feeding it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key-sentence set, segmenting the extracted key sentences into words, cleaning them, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model.
5. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein copying or generating word-level information with the pointer-generator network comprises the following steps:
S51, constructing a sequence-to-sequence attention model framework;
S52, constructing a pointer-generator copy mechanism;
S53, constructing a coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary, and lowers the repetition rate.
6. The method of claim 5, wherein constructing the sequence-to-sequence attention model framework comprises the following processes:
S511, the word vectors of the words are input into the encoder in sequence; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
a^t = softmax(e^t)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word;
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i
the context vector, a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers;
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w;
S514, the distribution of word w is
P(w) = P_vocab(w)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t
7. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 5, wherein constructing the pointer-generator copy mechanism comprises the following processes:
the decoder state is denoted s_t and the decoder input is denoted x_t:
p_gen = σ(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
where the vectors w_{h*}, w_s, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid function; p_gen acts as a soft selector that chooses between generating a word by sampling from P_vocab and copying a word from the input sequence by sampling from the attention distribution a^t. For each document, the extended vocabulary is the union of the base vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a_i^t
if w is an out-of-vocabulary (OOV) word, then P_vocab(w) is zero; similarly, if w does not appear in the source document, then Σ_{i: w_i = w} a_i^t is also zero.
8. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 5, wherein constructing the coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary and lowers the repetition rate comprises the following processes:
in the coverage mechanism, a coverage vector c^t is maintained:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far; c^0 is a zero vector. The attention distribution then becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v;
a coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
the coverage loss is bounded, satisfying covloss_t ≤ Σ_i a_i^t = 1;
the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
CN202210593675.6A 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document Pending CN115062140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210593675.6A CN115062140A (en) 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210593675.6A CN115062140A (en) 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document

Publications (1)

Publication Number Publication Date
CN115062140A true CN115062140A (en) 2022-09-16

Family

ID=83197907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210593675.6A Pending CN115062140A (en) 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document

Country Status (1)

Country Link
CN (1) CN115062140A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187163A (en) * 2022-12-20 2023-05-30 北京知呱呱科技服务有限公司 Construction method and system of pre-training model for patent document processing
CN116245178A (en) * 2023-05-08 2023-06-09 中国人民解放军国防科技大学 Biomedical knowledge extraction method and device of decoder based on pointer network
CN116933785A (en) * 2023-06-30 2023-10-24 国网湖北省电力有限公司武汉供电公司 Transformer-based electronic file abstract generation method, system and medium
CN117668213A (en) * 2024-01-29 2024-03-08 南京争锋信息科技有限公司 Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN118506387A (en) * 2024-07-17 2024-08-16 中科晶锐(苏州)科技有限公司 Radar display control key information extraction device and method in electronic countermeasure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559730A (en) * 2020-12-08 2021-03-26 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN113360601A (en) * 2021-06-10 2021-09-07 东北林业大学 PGN-GAN text abstract model fusing topics
CN114139497A (en) * 2021-12-13 2022-03-04 国家电网有限公司大数据中心 Text abstract extraction method based on BERTSUM model
CN114281982A (en) * 2021-12-29 2022-04-05 中山大学 Book propaganda abstract generation method and system based on multi-mode fusion technology

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559730A (en) * 2020-12-08 2021-03-26 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN113360601A (en) * 2021-06-10 2021-09-07 东北林业大学 PGN-GAN text abstract model fusing topics
CN114139497A (en) * 2021-12-13 2022-03-04 国家电网有限公司大数据中心 Text abstract extraction method based on BERTSUM model
CN114281982A (en) * 2021-12-29 2022-04-05 中山大学 Book propaganda abstract generation method and system based on multi-mode fusion technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABIGAIL SEE et al.: "Get To The Point: Summarization with Pointer-Generator Networks", arXiv:1704.04368v2, 25 April 2017 (2017-04-25), pages 1 - 2 *
潘倩 (PAN QIAN): "基于BERT模型的文本摘要方法研究" [Research on Text Summarization Methods Based on the BERT Model], China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2022 (2022-01-15), pages 2 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187163A (en) * 2022-12-20 2023-05-30 北京知呱呱科技服务有限公司 Construction method and system of pre-training model for patent document processing
CN116187163B (en) * 2022-12-20 2024-02-20 北京知呱呱科技有限公司 Construction method and system of pre-training model for patent document processing
CN116245178A (en) * 2023-05-08 2023-06-09 中国人民解放军国防科技大学 Biomedical knowledge extraction method and device of decoder based on pointer network
CN116245178B (en) * 2023-05-08 2023-07-21 中国人民解放军国防科技大学 Biomedical knowledge extraction method and device of decoder based on pointer network
CN116933785A (en) * 2023-06-30 2023-10-24 国网湖北省电力有限公司武汉供电公司 Transformer-based electronic file abstract generation method, system and medium
CN117668213A (en) * 2024-01-29 2024-03-08 南京争锋信息科技有限公司 Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN117668213B (en) * 2024-01-29 2024-04-09 南京争锋信息科技有限公司 Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN118506387A (en) * 2024-07-17 2024-08-16 中科晶锐(苏州)科技有限公司 Radar display control key information extraction device and method in electronic countermeasure
CN118506387B (en) * 2024-07-17 2024-10-15 中科晶锐(苏州)科技有限公司 Radar display control key information extraction device and method in electronic countermeasure

Similar Documents

Publication Publication Date Title
CN113010693B (en) Knowledge graph intelligent question-answering method integrating pointer generation network
CN115062140A (en) Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document
CN111310471B (en) Travel named entity identification method based on BBLC model
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN114201581B (en) Long text retrieval model based on contrast learning
CN110796160B (en) Text classification method, device and storage medium
CN113128229A (en) Chinese entity relation joint extraction method
CN111881677A (en) Address matching algorithm based on deep learning model
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN114154504B (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN112417901A (en) Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN114169312A (en) Two-stage hybrid automatic summarization method for judicial official documents
CN113821635A (en) Text abstract generation method and system for financial field
CN112417138A (en) Short text automatic summarization method combining pointer generation type and self-attention mechanism
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN112069827B (en) Data-to-text generation method based on fine-grained subject modeling
CN112765983A (en) Entity disambiguation method based on neural network combined with knowledge description
CN115048511A (en) Bert-based passport layout analysis method
CN117744658A (en) Ship naming entity identification method based on BERT-BiLSTM-CRF
CN113297374A (en) Text classification method based on BERT and word feature fusion
CN112434512A (en) New word determining method and device in combination with context
CN116611428A (en) Non-autoregressive decoding Vietnam text regularization method based on editing alignment algorithm
Aharoni et al. Sequence to sequence transduction with hard monotonic attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination