CN115062140A - Method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN - Google Patents

Method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN

Info

Publication number
CN115062140A
Authority
CN
China
Prior art keywords
document
sentences
word
generating
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210593675.6A
Other languages
Chinese (zh)
Inventor
张川东
张卜心
阎德劲
王伟
廖伟智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210593675.6A priority Critical patent/CN115062140A/en
Publication of CN115062140A publication Critical patent/CN115062140A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, which comprises the following steps: preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words; encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding; capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model; forming a transition document from the extracted key sentences and feeding it into the generative model; copying or generating word-level information with a pointer-generator network; and generating the document summary. The invention uses BERT as the encoder of the extractive part of text summarization, which helps the machine understand the semantic information of the text and improves the quality of the final summary.

Description

Method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN
Technical Field
The invention belongs to the field of data processing for collaborative supply chain ecosystems, and particularly relates to a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN.
Background
Text summarization aims to condense and simplify complex, lengthy text while preserving its original meaning. Traditional manual summarization is time-consuming and labor-intensive, which is particularly evident in collaborative supply chain ecosystems. The ecosystem hosts numerous services, and document information such as technical documents, operating instructions and news announcements in each service section is produced endlessly. If key information has to be extracted after manual reading and understanding, a large amount of manpower is consumed to maintain the ecosystem, and the speed at which users can accurately obtain the latest information is greatly reduced. With the development of modern computer technology, machines can learn how to automatically extract the summary content of a text. Text summarization involves several fields such as artificial intelligence and machine learning and is a key direction of natural language processing research; summarization approaches are mainly divided into extractive and abstractive.
Extractive summarization selects key sentences and key words from the source document to form the summary, so the summary consists entirely of original text. This approach has a low error rate in grammar and syntax and guarantees a certain level of quality. Traditional extractive methods rely on graph algorithms, clustering, neural networks, and the like. Graph methods are represented by TextRank, an algorithm modeled on PageRank that takes sentences as nodes and uses inter-sentence similarity to build undirected weighted edges. Node values are iteratively updated with the edge weights, and finally the N highest-scoring nodes are selected as the summary. However, TextRank depends heavily on the word-segmentation result: if one term is split into two words during segmentation, they cannot be joined again during keyword extraction, so precision and recall differ greatly. Moreover, although TextRank considers relationships between words, it still tends to treat frequent words as keywords. In addition, because TextRank involves building a word graph and iterative computation, extraction is slow. In summary, graph-based unsupervised methods do not take the semantic information of sentences into account when extracting the summary, and they are slow. In contrast, sentence feature vectors extracted by a neural network based on the BERT pre-training model provide a better semantic representation.
Extractive summaries have certain grammatical and syntactic guarantees, but they also face problems such as content-selection errors, poor coherence and poor flexibility. Abstractive summarization allows new words or phrases to appear in the summary and is highly flexible; with the development of neural network models in recent years, sequence-to-sequence (Seq2Seq) models have been widely used for the abstractive summarization task and have achieved certain results. However, in abstractive summarization the generation process often lacks control and guidance from key information; for example, a pointer-generator network cannot accurately locate key words during copying.
Disclosure of Invention
The invention aims to overcome the defects of the prior art in processing technical document data in a collaborative supply chain ecosystem, and provides a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, which comprises the following steps:
step one, preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words;
step two, encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding;
step three, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model;
step four, forming a transition document from the extracted key sentences and feeding it into the generative model;
step five, copying or generating word-level information with the pointer-generator network;
step six, generating the document summary.
Further, encoding the sentences and words, the paragraph identifiers and the sentence positions comprises the following steps:
performing word-vector encoding, namely first encoding the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document to obtain the features of multiple sentences;
performing paragraph-identifier encoding, using the paragraph identifiers to distinguish the multiple sentences in the document;
encoding the sentence positions, wherein the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and adding the word-vector encoding, the paragraph-identifier encoding and the sentence-position encoding to obtain the sentence vectors.
Further, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, PosEmb is the added positional-encoding function representing the position information of sentence T, LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); L is set to 2, i.e., a 2-layer Transformer is used.
Further, copying or generating word-level information with the pointer-generator network comprises the following steps:
S51, constructing a sequence-to-sequence attention model framework;
S52, constructing a pointer-generator copy mechanism;
S53, constructing a coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary, and lowers the repetition rate.
Further, forming a transition document from the extracted key sentences and feeding it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key-sentence set, segmenting the extracted key sentences into words, cleaning them, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model, as sketched below.
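The following is a minimal sketch of this transition step. It assumes the gensim library for word2vec, key sentences that are already whitespace-tokenized (for example by a Chinese word segmenter), and an illustrative stop-word set and vector size; none of these specifics are fixed by the invention.

    from gensim.models import Word2Vec

    def build_transition_inputs(key_sentences, stop_words=frozenset()):
        # Word-segment and clean the extracted key sentences (assumes the sentences
        # are already whitespace-tokenized).
        cleaned = [[w for w in s.split() if w not in stop_words] for s in key_sentences]
        # Obtain word2vec vector representations of the cleaned text (illustrative parameters).
        w2v = Word2Vec(sentences=cleaned, vector_size=128, window=5, min_count=1, workers=1)
        # Encode each key sentence as a sequence of word vectors; together these sequences
        # form the transition document fed to the pointer-generator network.
        return [[w2v.wv[w] for w in sent] for sent in cleaned]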
Further, constructing the sequence-to-sequence attention model framework comprises the following processes:
S511, the word vectors of the words are input into the encoder in sequence; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
a^t = softmax(e^t)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word;
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i
the context vector, a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers;
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w;
S514, the distribution of word w is
P(w) = P_vocab(w)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t
further, the constructing pointer generates copy mechanism, which includes the following processes:
the decoder state is denoted as s t The decoder input is denoted x t :
Figure BDA0003666765690000042
Wherein the vector
Figure BDA0003666765690000043
w s ,w x And a scalar b ptr Is a learnable parameter, δ is a sigmoid function, p gen For flexible selectors, selecting from P vocab Generating a word from the sample vocabulary of (a) or from a t The probability distribution of the extended vocabulary, where the sample copies one word from the input sequence, is obtained for each document by adding the extra vocabulary to all the vocabularies present in the source document, as follows:
Figure BDA0003666765690000044
if w is an out-of-vocabulary word (OOV), then P vocab (w) is zero; similarly, if w does not appear in the source document, then
Figure BDA0003666765690000045
Also 0.
Further, constructing the coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary and lowers the repetition rate comprises the following processes:
in the coverage mechanism, a coverage vector c^t is maintained:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far; c^0 is a zero vector. The attention distribution then becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v;
a coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
the coverage loss is bounded, satisfying covloss_t ≤ Σ_i a_i^t = 1;
the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
The beneficial effects of the invention are: 1. the invention uses BERT as the encoder of the extractive part of text summarization, which is advanced, helps the machine understand the semantic information of the text, and improves the quality of the final summary;
2. document-level features are extracted with an inter-sentence Transformer, so the text information is used more fully;
3. the pointer-generator model is used as the summary-generation tool, which reduces the repetition rate in the generated summary sentences, avoids sentence loops, and makes the generated sentences more fluent.
Drawings
FIG. 1 is a schematic flow chart of the method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN;
FIG. 2 is a schematic diagram of an extraction model based on a BERT pre-training model;
FIG. 3 is a diagram of the Transformer-based document-level feature extraction architecture;
FIG. 4 is a schematic diagram of a sequence-to-sequence model;
FIG. 5 is a schematic diagram of a pointer generator model.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following descriptions.
As shown in FIG. 1 to FIG. 5, a method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN includes the following steps:
step one, preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words;
step two, encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding;
step three, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model;
step four, forming a transition document from the extracted key sentences and feeding it into the generative model;
step five, copying or generating word-level information with the pointer-generator network;
step six, generating the document summary.
Encoding the sentences and words, the paragraph identifiers and the sentence positions comprises the following steps:
performing word-vector encoding, namely first encoding the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document to obtain the features of multiple sentences;
performing paragraph-identifier encoding, using the paragraph identifiers to distinguish the multiple sentences in the document;
encoding the sentence positions, wherein the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and adding the word-vector encoding, the paragraph-identifier encoding and the sentence-position encoding to obtain the sentence vectors.
Capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, PosEmb is the added positional-encoding function representing the position information of sentence T, LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); L is set to 2, i.e., a 2-layer Transformer is used.
Forming a transition document from the extracted key sentences and feeding it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key-sentence set, segmenting the extracted key sentences into words, cleaning them, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model. The key-sentence extraction method is as follows:
a bidirectional Transformer is adopted as the encoder for feature extraction; a Transformer encoding unit consists of two parts, a self-attention mechanism and a feed-forward neural network. The input of the self-attention mechanism is made up of three different vectors derived from the same word: a Query vector (Q), a Key vector (K) and a Value vector (V). The similarity between the input word vectors is expressed by multiplying the Query and Key vectors, written QK^T, which is scaled by √d_k so that the result has a moderate magnitude. Finally, softmax normalization yields a probability distribution, from which the weighted sum over all word vectors in the sentence is obtained:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The sentences whose summed attention values rank in the top n are taken as key sentences and extracted, where n is the user-defined number of key sentences, as sketched below.
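A minimal sketch of this scaled dot-product self-attention scoring and top-n selection, assuming NumPy and one matrix of sentence vectors per document; scoring sentences by their total received attention mass and the default n are illustrative assumptions.

    import numpy as np

    def self_attention(X):
        # X: (num_sentences, d_k) matrix of vectors; here Q = K = V = X.
        d_k = X.shape[-1]
        scores = X @ X.T / np.sqrt(d_k)                      # QK^T / sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # softmax -> probability distribution
        return weights @ X, weights                          # weighted sum and attention weights

    def top_n_key_sentences(X, n=3):
        _, weights = self_attention(X)
        totals = weights.sum(axis=0)        # total attention each sentence receives
        return np.argsort(-totals)[:n]      # indices of the n highest-scoring sentences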
Copying or generating word-level information with the pointer-generator network comprises the following steps:
S51, constructing a sequence-to-sequence attention model framework;
S52, constructing a pointer-generator copy mechanism;
S53, constructing a coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary, and lowers the repetition rate.
Constructing the sequence-to-sequence attention model framework comprises the following steps:
S511, the word vectors of the words are input into the encoder in sequence; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
a^t = softmax(e^t)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word;
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i
the context vector, a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers;
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w;
S514, the distribution of word w is
P(w) = P_vocab(w)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t
Constructing the pointer-generator copy mechanism comprises the following processes:
the decoder state is denoted s_t and the decoder input is denoted x_t:
p_gen = σ(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
where the vectors w_{h*}, w_s, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid function; p_gen acts as a soft selector that chooses between generating a word by sampling from P_vocab and copying a word from the input sequence by sampling from the attention distribution a^t. For each document, the extended vocabulary is the union of the base vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a_i^t
if w is an out-of-vocabulary (OOV) word, then P_vocab(w) is zero; similarly, if w does not appear in the source document, then Σ_{i: w_i = w} a_i^t is also zero.
Constructing the coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary and lowers the repetition rate comprises the following processes:
in the coverage mechanism, a coverage vector c^t is maintained:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far; c^0 is a zero vector. The attention distribution then becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v;
a coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
the coverage loss is bounded, satisfying covloss_t ≤ Σ_i a_i^t = 1;
the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
Specifically, the method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN comprises the following steps:
S1, preprocessing the data set, including sentence segmentation and word segmentation;
S2, obtaining sentence vectors based on the BERT pre-training model;
S3, capturing document-level features with a Transformer;
S4, forming a transition document from the extracted key sentences and feeding it into the generative model;
S5, copying or generating word-level information with the pointer-generator network;
and S6, generating the final summary.
Further, the step S2 includes the following sub-steps:
S21, performing word-vector encoding: first encode the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document. In standard BERT, [CLS] is used as a symbol that aggregates the features of one sentence or a pair of sentences. Here the model is modified to use multiple [CLS] symbols so as to obtain the features of multiple sentences;
S22, performing paragraph-identifier encoding, where the paragraph identifiers are used to distinguish the multiple sentences in the document. Specifically, for the i-th sentence, E_A or E_B is used to encode whether its position is odd or even; for example, odd-numbered sentences are marked with E_A and even-numbered sentences with E_B;
S23, encoding the sentence positions, where the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and S24, adding the three encodings; the three vectors have the same dimension, and their size is the product of the maximum sequence length and the word-vector dimension. A minimal sketch of this input construction is given below.
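The following PyTorch-style sketch illustrates sub-steps S21 to S24; the vocabulary size, hidden dimension, maximum length and module name are illustrative assumptions rather than values fixed by the invention.

    import torch
    import torch.nn as nn

    class SentenceInputEmbedding(nn.Module):
        # Sum of token, segment (paragraph-identifier) and position embeddings.
        def __init__(self, vocab_size=30522, hidden=768, max_len=512):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, hidden)   # word-vector encoding ([CLS]/[SEP] included)
            self.seg = nn.Embedding(2, hidden)            # E_A / E_B for odd / even sentences
            self.pos = nn.Embedding(max_len, hidden)      # absolute position of each word

        def forward(self, token_ids, segment_ids):
            positions = torch.arange(token_ids.size(1), device=token_ids.device)
            positions = positions.unsqueeze(0).expand_as(token_ids)
            # The three encodings share the same dimension and are added element-wise (S24).
            return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)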
Further, the Transformer-based capture of document-level features in step S3 includes the following sub-steps:
S31, the inter-sentence Transformer applies additional Transformer layers on the sentence representations and extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, and PosEmb is the added positional-encoding function representing the position information of sentence T. LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer.
S32, the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); here L is set to 2, i.e., a 2-layer Transformer. A minimal sketch of these layers is given below.
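The following PyTorch-style sketch shows the stacked inter-sentence Transformer layers and the sigmoid classifier of S31 and S32; the head count, feed-forward width and module names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class InterSentenceTransformer(nn.Module):
        def __init__(self, hidden=768, heads=8, ff=2048, layers=2):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.ModuleDict({
                    "attn": nn.MultiheadAttention(hidden, heads, batch_first=True),
                    "ln1": nn.LayerNorm(hidden),
                    "ffn": nn.Sequential(nn.Linear(hidden, ff), nn.ReLU(), nn.Linear(ff, hidden)),
                    "ln2": nn.LayerNorm(hidden),
                }) for _ in range(layers)
            ])
            self.classifier = nn.Linear(hidden, 1)          # W_o, b_o

        def forward(self, sent_vecs, pos_emb):
            h = sent_vecs + pos_emb                         # h^0 = PosEmb(T)
            for layer in self.layers:
                attn_out, _ = layer["attn"](h, h, h)        # multi-head attention between sentences
                h_tilde = layer["ln1"](h + attn_out)        # h~^l = LN(h^(l-1) + MHAtt(h^(l-1)))
                h = layer["ln2"](h_tilde + layer["ffn"](h_tilde))  # h^l = LN(h~^l + FFN(h~^l))
            return torch.sigmoid(self.classifier(h)).squeeze(-1)   # y^_i = sigma(W_o h_i^L + b_o)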
Further, the step S5 includes the following sub-steps:
S51, constructing the sequence-to-sequence attention model framework;
S511, the word vectors of the words are input into the encoder one by one; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder (a single-layer unidirectional LSTM) receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)    (4)
a^t = softmax(e^t)    (5)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word.
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i    (6)
the context vector, which can be seen as a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers.
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')    (7)
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w.
S514, the distribution of word w is
P(w) = P_vocab(w)    (8)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)    (9)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t    (10)
A minimal sketch of one step of this attention decoder is given below.
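The following PyTorch-style sketch shows one decoder step of this sequence-to-sequence attention framework, covering equations (4) to (8); the tensor shapes and the module name are illustrative assumptions.

    import torch
    import torch.nn as nn

    class AttentionDecoderStep(nn.Module):
        def __init__(self, hidden, vocab_size):
            super().__init__()
            self.W_h = nn.Linear(hidden, hidden, bias=False)
            self.W_s = nn.Linear(hidden, hidden)             # its bias plays the role of b_attn
            self.v = nn.Linear(hidden, 1, bias=False)
            self.out = nn.Sequential(                        # the two linear layers producing P_vocab
                nn.Linear(2 * hidden, hidden),
                nn.Linear(hidden, vocab_size),
            )

        def forward(self, h, s_t):
            # h: (batch, src_len, hidden) encoder states; s_t: (batch, hidden) decoder state.
            e_t = self.v(torch.tanh(self.W_h(h) + self.W_s(s_t).unsqueeze(1))).squeeze(-1)  # (4)
            a_t = torch.softmax(e_t, dim=-1)                                                # (5)
            h_star = torch.bmm(a_t.unsqueeze(1), h).squeeze(1)           # context vector    (6)
            p_vocab = torch.softmax(self.out(torch.cat([s_t, h_star], dim=-1)), dim=-1)     # (7), (8)
            return p_vocab, a_t, h_star

The training loss of equation (9) is then the negative log probability that p_vocab assigns to the target word at step t, averaged over the sequence as in equation (10).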
s52, constructing a pointer to generate a copy mechanism, and solving the problem of unregistered words;
since it allows both copying words from the source by pointing and generating words from a fixed vocabulary. In the pointer-generator model (shown in FIG. 5), the attention distribution a^t and the context vector h_t^* computed in step S51 are reused.
Furthermore, the generation probability p_gen ∈ [0, 1] for time step t is calculated from the context vector h_t^*, the decoder state s_t and the decoder input x_t:
p_gen = σ(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
where the vectors w_{h*}, w_s, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid function. p_gen is regarded as a soft selector that chooses between generating a word from the vocabulary by sampling from P_vocab and copying a word from the input sequence by sampling from the attention distribution a^t. For each document, the extended vocabulary is the union of the base vocabulary and all words appearing in the source document. The probability distribution over the extended vocabulary is obtained as follows:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a_i^t
if w is an out-of-vocabulary (OOV) word, then P_vocab(w) is zero; similarly, if w does not appear in the source document, then Σ_{i: w_i = w} a_i^t is also zero.
Therefore, the copy mechanism solves the problem that new words cannot be generated and breaks the limitation of sequence-to-sequence models whose output is restricted to a preset vocabulary.
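The following PyTorch-style sketch illustrates p_gen and the extended-vocabulary distribution of this copy mechanism; src_ext_ids (mapping each source position to an index in the extended vocabulary) and the function names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PointerSwitch(nn.Module):
        def __init__(self, hidden, emb):
            super().__init__()
            self.w_hstar = nn.Linear(hidden, 1, bias=False)
            self.w_s = nn.Linear(hidden, 1, bias=False)
            self.w_x = nn.Linear(emb, 1)                  # its bias plays the role of b_ptr

        def forward(self, h_star, s_t, x_t):
            # p_gen = sigmoid(w_h*^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr)
            return torch.sigmoid(self.w_hstar(h_star) + self.w_s(s_t) + self.w_x(x_t))

    def extended_vocab_distribution(p_gen, p_vocab, a_t, src_ext_ids, extended_size):
        # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t
        batch, vocab = p_vocab.shape
        dist = torch.zeros(batch, extended_size)
        dist[:, :vocab] = p_gen * p_vocab                       # generation part (zero for OOV words)
        dist.scatter_add_(1, src_ext_ids, (1 - p_gen) * a_t)    # copy part from the attention weights
        return dist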
S53, constructing a coverage mechanism that adds a penalty term and reduces the words that have already appeared in the generated summary, thereby lowering the repetition rate;
in the coverage mechanism, a coverage vector c^t is maintained, namely the sum of the attention distributions over all previous decoder time steps:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a (non-normalized) distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far. c^0 is a zero vector, because at the first time step none of the source document has been covered. The coverage vector is used as an additional input to the attention mechanism, so the attention distribution becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v. This ensures that the attention mechanism's choice of where to attend next is influenced by the words already present in the generated summary, which makes it easier to avoid attending repeatedly to the same locations and thus avoids generating repeated text.
In addition, an extra coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
this coverage loss is bounded and satisfies covloss_t ≤ Σ_i a_i^t = 1.
Finally, the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
Therefore, the coverage mechanism solves the problem of word looping and repetition in sequence-to-sequence models when generating long-document summaries.
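The following sketch shows the coverage-vector update, the coverage loss and the composite loss of this mechanism, assuming PyTorch; the helper names are illustrative, and the coverage-aware attention score differs from the plain score only by the added w_c c_i^t term.

    import torch

    def update_coverage(c_t, a_t):
        # c^{t+1} = c^t + a^t, with c^0 initialized to a zero vector.
        return c_t + a_t

    def coverage_loss(a_t, c_t):
        # covloss_t = sum_i min(a_i^t, c_i^t); bounded above by sum_i a_i^t = 1.
        return torch.minimum(a_t, c_t).sum(dim=-1)

    def composite_loss(nll_t, a_t, c_t, lam=1.0):
        # loss_t = -log P(w*_t) + lambda * sum_i min(a_i^t, c_i^t)
        return nll_t + lam * coverage_loss(a_t, c_t)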
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept described above or apparent to those skilled in the relevant art may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN, characterized by comprising the following steps:
step one, preprocessing the data set, the preprocessing including segmenting the documents in the data set into sentences and segmenting the sentences into words;
step two, encoding the sentences and words, encoding the paragraph identifiers, encoding the sentence positions, and obtaining sentence vectors from a BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding;
step three, capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model;
step four, forming a transition document from the extracted key sentences and feeding it into the generative model;
step five, copying or generating word-level information with the pointer-generator network;
step six, generating the document summary.
2. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein encoding the sentences and words, the paragraph identifiers and the sentence positions, and obtaining sentence vectors from the BERT pre-training model according to the sentence-and-word encoding, the paragraph-identifier encoding and the sentence-position encoding, comprises the following steps:
performing word-vector encoding, namely first encoding the sentence start symbol [CLS], the sentence end symbol [SEP] and all words in the document to obtain the features of multiple sentences;
performing paragraph-identifier encoding, using the paragraph identifiers to distinguish the multiple sentences in the document;
encoding the sentence positions, wherein the position vector obtained after encoding represents the absolute position of each word; each word in the input sequence is converted into a position code in the order of its index, and the position codes are then converted into real-valued vectors by a position-vector matrix to obtain the position vectors;
and adding the word-vector encoding, the paragraph-identifier encoding and the sentence-position encoding to obtain the sentence vectors.
3. The method as claimed in claim 1, wherein capturing document-level features with a Transformer from the sentence vectors output by the BERT pre-training model comprises the following steps:
the inter-sentence Transformer extracts from the BERT output the document-level features dedicated to the summarization task:
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T is the sentence vector output by BERT, PosEmb is the added positional-encoding function representing the position information of sentence T, LN is the layer-normalization operation, MHAtt is the multi-head attention mechanism, and the superscript l denotes the depth of the stacked layer;
the final output layer is calculated with a sigmoid classifier:
ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector of sent_i from the top Transformer layer (the L-th layer); L is set to 2, i.e., a 2-layer Transformer is used.
4. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein forming a transition document from the extracted key sentences and feeding it into the generative model comprises the following steps:
extracting key sentences according to the document-level features to obtain a key-sentence set, segmenting the extracted key sentences into words, cleaning them, representing the cleaned text as vectors with word2vec, encoding them, and using the encoded vectors as the input of the pointer-generator network model.
5. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 1, wherein copying or generating word-level information with the pointer-generator network comprises the following steps:
S51, constructing a sequence-to-sequence attention model framework;
S52, constructing a pointer-generator copy mechanism;
S53, constructing a coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary, and lowers the repetition rate.
6. The method of claim 5, wherein constructing the sequence-to-sequence attention model framework comprises the following processes:
S511, the word vectors of the words are input into the encoder in sequence; the encoder is a single-layer bidirectional LSTM and produces the encoder hidden-state sequence h_i. At each step t, the decoder receives the word vector of the previous word:
e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
a^t = softmax(e^t)
where v, W_h, W_s and b_attn are learnable parameters; the attention distribution can be seen as a probability distribution over the source words, which tells the decoder where to look when generating the next word;
S512, the attention distribution is used to produce a weighted sum of the encoder hidden states, called the context vector h_t^*:
h_t^* = Σ_i a_i^t h_i
the context vector, a fixed-size representation of what has been read from the source at this step, is concatenated with the decoder state and fed through two linear layers;
S513, the vocabulary distribution P_vocab is generated:
P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
where V, V', b and b' are learnable parameters; P_vocab is the probability distribution over all words in the vocabulary and provides the final distribution for predicting word w;
S514, the distribution of word w is
P(w) = P_vocab(w)
during training, the loss at time step t is the negative log probability of the target word w_t^* at that time step:
loss_t = -log P(w_t^*)
and the total loss of the whole sequence is:
loss = (1/T) Σ_{t=0}^{T} loss_t
7. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 5, wherein constructing the pointer-generator copy mechanism comprises the following processes:
the decoder state is denoted s_t and the decoder input is denoted x_t:
p_gen = σ(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
where the vectors w_{h*}, w_s, w_x and the scalar b_ptr are learnable parameters and σ is the sigmoid function; p_gen acts as a soft selector that chooses between generating a word by sampling from P_vocab and copying a word from the input sequence by sampling from the attention distribution a^t. For each document, the extended vocabulary is the union of the base vocabulary and all words appearing in the source document, and the probability distribution over the extended vocabulary is obtained as follows:
P(w) = p_gen P_vocab(w) + (1 - p_gen) Σ_{i: w_i = w} a_i^t
if w is an out-of-vocabulary (OOV) word, then P_vocab(w) is zero; similarly, if w does not appear in the source document, then Σ_{i: w_i = w} a_i^t is also zero.
8. The method for generating summaries of long documents in a supply chain ecosystem by fusing BERTSUM and PGN according to claim 5, wherein constructing the coverage mechanism that adds a penalty term, reduces the words that have already appeared in the generated summary and lowers the repetition rate comprises the following processes:
in the coverage mechanism, a coverage vector c^t is maintained:
c^t = Σ_{t'=0}^{t-1} a^{t'}
c^t is a distribution over the source-document words that represents the degree of coverage those words have received from the attention mechanism so far; c^0 is a zero vector. The attention distribution then becomes:
e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
where w_c is a learnable parameter vector of the same length as v;
a coverage loss is defined to penalize repeatedly attending to the same words:
covloss_t = Σ_i min(a_i^t, c_i^t)
the coverage loss is bounded, satisfying covloss_t ≤ Σ_i a_i^t = 1;
the coverage loss is weighted by the hyper-parameter λ and added to the loss function of the sequence model to obtain a new composite loss function:
loss_t = -log P(w_t^*) + λ Σ_i min(a_i^t, c_i^t)
CN202210593675.6A 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document Pending CN115062140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210593675.6A CN115062140A (en) 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210593675.6A CN115062140A (en) 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document

Publications (1)

Publication Number Publication Date
CN115062140A true CN115062140A (en) 2022-09-16

Family

ID=83197907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210593675.6A Pending CN115062140A (en) 2022-05-27 2022-05-27 Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document

Country Status (1)

Country Link
CN (1) CN115062140A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187163A (en) * 2022-12-20 2023-05-30 北京知呱呱科技服务有限公司 Construction method and system of pre-training model for patent document processing
CN116245178A (en) * 2023-05-08 2023-06-09 中国人民解放军国防科技大学 Biomedical knowledge extraction method and device of decoder based on pointer network
CN116933785A (en) * 2023-06-30 2023-10-24 国网湖北省电力有限公司武汉供电公司 Transformer-based electronic file abstract generation method, system and medium
CN117668213A (en) * 2024-01-29 2024-03-08 南京争锋信息科技有限公司 Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN118506387A (en) * 2024-07-17 2024-08-16 中科晶锐(苏州)科技有限公司 Radar display control key information extraction device and method in electronic countermeasure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559730A (en) * 2020-12-08 2021-03-26 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN113360601A (en) * 2021-06-10 2021-09-07 东北林业大学 PGN-GAN text abstract model fusing topics
CN114139497A (en) * 2021-12-13 2022-03-04 国家电网有限公司大数据中心 Text abstract extraction method based on BERTSUM model
CN114281982A (en) * 2021-12-29 2022-04-05 中山大学 Book propaganda abstract generation method and system based on multi-mode fusion technology

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559730A (en) * 2020-12-08 2021-03-26 北京京航计算通讯研究所 Text abstract automatic generation method and system based on global feature extraction
CN113360601A (en) * 2021-06-10 2021-09-07 东北林业大学 PGN-GAN text abstract model fusing topics
CN114139497A (en) * 2021-12-13 2022-03-04 国家电网有限公司大数据中心 Text abstract extraction method based on BERTSUM model
CN114281982A (en) * 2021-12-29 2022-04-05 中山大学 Book propaganda abstract generation method and system based on multi-mode fusion technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ABIGAIL SEE et al.: "Get To The Point: Summarization with Pointer-Generator Networks", arXiv:1704.04368v2, 25 April 2017 (2017-04-25), pages 1 - 2 *
潘倩 (PAN QIAN): "基于BERT模型的文本摘要方法研究" [Research on Text Summarization Methods Based on the BERT Model], China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2022 (2022-01-15), pages 2 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116187163A (en) * 2022-12-20 2023-05-30 北京知呱呱科技服务有限公司 Construction method and system of pre-training model for patent document processing
CN116187163B (en) * 2022-12-20 2024-02-20 北京知呱呱科技有限公司 Construction method and system of pre-training model for patent document processing
CN116245178A (en) * 2023-05-08 2023-06-09 中国人民解放军国防科技大学 Biomedical knowledge extraction method and device of decoder based on pointer network
CN116245178B (en) * 2023-05-08 2023-07-21 中国人民解放军国防科技大学 Biomedical knowledge extraction method and device of decoder based on pointer network
CN116933785A (en) * 2023-06-30 2023-10-24 国网湖北省电力有限公司武汉供电公司 Transformer-based electronic file abstract generation method, system and medium
CN117668213A (en) * 2024-01-29 2024-03-08 南京争锋信息科技有限公司 Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN117668213B (en) * 2024-01-29 2024-04-09 南京争锋信息科技有限公司 Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN118506387A (en) * 2024-07-17 2024-08-16 中科晶锐(苏州)科技有限公司 Radar display control key information extraction device and method in electronic countermeasure
CN118506387B (en) * 2024-07-17 2024-10-15 中科晶锐(苏州)科技有限公司 Radar display control key information extraction device and method in electronic countermeasure

Similar Documents

Publication Publication Date Title
CN113010693B (en) Knowledge graph intelligent question-answering method integrating pointer generation network
CN115062140A (en) Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document
CN111310471B (en) Travel named entity identification method based on BBLC model
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN114201581B (en) Long text retrieval model based on contrast learning
CN110796160B (en) Text classification method, device and storage medium
CN113128229A (en) Chinese entity relation joint extraction method
CN111881677A (en) Address matching algorithm based on deep learning model
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN114154504B (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN112417901A (en) Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN114169312A (en) Two-stage hybrid automatic summarization method for judicial official documents
CN113821635A (en) Text abstract generation method and system for financial field
CN112417138A (en) Short text automatic summarization method combining pointer generation type and self-attention mechanism
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN112069827B (en) Data-to-text generation method based on fine-grained subject modeling
CN112765983A (en) Entity disambiguation method based on neural network combined with knowledge description
CN115048511A (en) Bert-based passport layout analysis method
CN117744658A (en) Ship naming entity identification method based on BERT-BiLSTM-CRF
CN113297374A (en) Text classification method based on BERT and word feature fusion
CN112434512A (en) New word determining method and device in combination with context
CN116611428A (en) Non-autoregressive decoding Vietnam text regularization method based on editing alignment algorithm
Aharoni et al. Sequence to sequence transduction with hard monotonic attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination