CN114254175A - Method for extracting generative abstract of power policy file - Google Patents

Method for extracting generative abstract of power policy file

Info

Publication number
CN114254175A
CN114254175A (application CN202111550623.2A)
Authority
CN
China
Prior art keywords
attention
model
decoder
encoder
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111550623.2A
Other languages
Chinese (zh)
Inventor
郑福康
陈正飞
王嘉豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Power Supply Bureau Co Ltd filed Critical Shenzhen Power Supply Bureau Co Ltd
Priority to CN202111550623.2A priority Critical patent/CN114254175A/en
Publication of CN114254175A publication Critical patent/CN114254175A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting a generative summary of a power policy file, comprising the following steps: step S10, acquiring an electronic document of the power policy file using web-crawler technology; step S11, performing word segmentation on the electronic document, forming initial embedding data with a word vector model, and inputting the initial embedding data into a pre-trained summary generation model; step S12, adding positional encoding to the bottom-layer embeddings of the encoder and the decoder; and step S13, automatically generating the summary content using the generation probability of a pointer-generator network, obtained by concatenating the decoder outputs at the current and previous time steps with the attention distribution. The invention can improve both the efficiency and the accuracy of summary generation.

Description

Method for extracting generative abstract of power policy file
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method for extracting a generative (abstractive) automatic summary of a power policy file.
Background
For power supply enterprises, strengthening electricity price management is an important guarantee for realizing sales revenue and improving profitability. Strictly implementing national electricity price policies and regulations and standardizing the order of electricity price management are of great significance for ensuring national industrial-policy regulation, saving energy, and protecting the economic interests of both the power supplier and the power consumer. Electricity price policies therefore need to be tracked in a timely manner, so that reasonable electricity marketing strategies can be formulated and the development of power enterprises promoted.
The rise of artificial intelligence and deep learning is rapidly changing ingrained habits in daily work and life, and automatic summarization based on deep learning can bring this technical expertise to the field of electricity price policy management. Electricity price policy information is generally published on authoritative national websites, from which electronic policy documents can be obtained. To help managers quickly grasp the key content of an electricity price policy text, the key information in the electronic document needs to be extracted and a summary document automatically generated from it, helping policy makers obtain the relevant information more efficiently.
In the prior art, automatic summarization is realized mainly through extractive summarization and abstractive (generative) summarization. Extractive methods typically use the TextRank ranking algorithm and are widely applied in industry thanks to their simplicity and efficiency. However, extractive summarization mainly considers word frequency, carries little semantic information, and cannot build complete semantic information across text paragraphs.
Abstractive text summarization is mainly realized with deep neural networks. The basic framework is the Sequence-to-Sequence (Seq2Seq) model proposed by the Google Brain team in 2014, which adopts an encoder-decoder architecture. In the classical form, both the encoder and the decoder consist of several layers of RNN/LSTM units: the encoder encodes the source text into a vector, and the decoder extracts information from that vector, acquires the semantics, and generates the summary.
However, when existing automatic summarization techniques are applied to texts with a fixed writing format, such as scientific papers and policy documents, the summarization quality is still insufficient; the main problems are out-of-vocabulary (OOV) words, repetitive or incoherent text, and long-distance dependencies.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method for extracting a generative summary of a power policy document that solves the above problems and improves the efficiency and accuracy of summary generation.
To solve the above technical problem, in one aspect of the present invention there is provided a method for extracting a generative summary of a power policy file, comprising the following steps:
step S10, acquiring an electronic document of the power policy file from a specified website using web-crawler technology;
step S11, performing word segmentation on the electronic document, forming initial embedding data with a word vector model, and inputting the initial embedding data into a pre-trained summary generation model;
wherein the summary generation model employs an encoder-decoder framework in which an attention-based bidirectional Transformer model is used as the language representation model;
the encoder consists of two sublayers, a multi-head attention layer and a fully connected feed-forward neural network layer, connected by residual connections followed by layer normalization; the decoder comprises at least a masked multi-head attention layer, a multi-head attention layer and a fully connected feed-forward neural network layer, with the sublayers likewise residually connected and layer-normalized;
step S12, adding positional encoding to the bottom-layer embeddings of the encoder and the decoder; and
step S13, obtaining the generation probability of a pointer-generator network (Pointer-Generator Network) by concatenating the decoder outputs at the current and previous time steps with the attention distribution, and, according to the generation probability, either copying content from the source text of the electronic document or generating the corresponding summary content from the attention.
Preferably, the method further comprises:
constructing a summary generation model in advance and training it to obtain the trained summary generation model.
Preferably, constructing the summary generation model in advance and training it to obtain the trained summary generation model further comprises:
constructing a summary generation model employing an encoder-decoder framework, wherein an attention-based bidirectional Transformer model is used in both the encoder and the decoder;
counting all words in the training corpus, generating a dictionary file, and forming a training set; and
initially embedding the dictionary file of the training set into the encoder of the summary generation model through the word vector model, and training the summary generation model to finally obtain the trained summary generation model.
Preferably, in step S13, the generation probability is obtained in the pointer-generator network by:
calculating, for each word-embedding input, the attention product with the decoder output, normalizing to obtain the weights, and taking the weighted sum to obtain the attention score $e_i$:

$e_i = v^{T} \tanh(W_h h_i + W_s s_t + W_c c_t)$

then applying a softmax operation to the attention scores to obtain the attention distribution:

$a_i = \mathrm{softmax}(e_i) = \dfrac{\exp(e_i)}{\sum_{k} \exp(e_k)}$

calculating the content vector $c_i$ and the vocabulary distribution $P_{vocab}$ by multiplying the attention with the encoder hidden states:

$c_i = \sum_{i} a_i h_i$

$P_{vocab} = \mathrm{softmax}(L(s_t, c_i))$

wherein $h_i$ denotes the encoder hidden state of the i-th word, $c_i$ the content vector, and $s_i$ the decoder hidden state of the i-th word;

calculating the generation probability $p_{gen}$ by:

$p_{gen} = \sigma(W_c' c_i + W_h' h_i + W_x x_t + b_{ptr})$

and obtaining the final vocabulary probability distribution by combining the vocabulary distribution and the attention distribution:

$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i$

wherein $P_{vocab}$ is the vocabulary distribution and $a_i$ is the attention distribution.
Preferably, the pointer-generator network is a hierarchical pointer-generator network.
The implementation of the invention has the following beneficial effects:
the invention provides a method for extracting a generative abstract of a power policy file, which adopts a Seq2Seq framework integrated with an attention mechanism as a basic model for generating the abstract, and adds a pointer to generate a network at the same time, so that words are directly copied from a source document to solve the OOV problem;
and then combining a hierarchical structure of the policy document, and adding language model modeling language segment information of a language segment level (section level) on the basis of a pointer generation network. In the technology of language model modeling language segment information, the invention abandons the traditional RNN and LSTM structures, introduces a bidirectional converter model as a language representation model in a Seq2Seq framework integrated with an attention mechanism, and effectively solves the problem of long-distance dependence. The invention designs an improved attention mechanism to solve the problems of incoherent irrelevant content and repeated sentences in long texts.
The invention designs an automatic abstract identification method suitable for the long text aiming at the characteristics of the electricity price policy text that the long text and the writing format are relatively fixed, and integrates the special format characteristics of the automatic abstract identification method. The efficiency and the accuracy of the abstract extraction process can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is within the scope of the present invention for those skilled in the art to obtain other drawings based on the drawings without inventive exercise.
Fig. 1 is a schematic main flow chart of an embodiment of the method for extracting a generative summary of a power policy file according to the present invention;
Fig. 2 is a schematic diagram of the hierarchical pointer-generator network referred to in Fig. 1.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Fig. 1 is a main flow diagram illustrating an embodiment of the method for extracting a generative summary of a power policy file according to the present invention. Referring also to Fig. 2, in this embodiment the method includes the following steps:
step S10, acquiring an electronic document of the power policy file from a specified website using web-crawler technology;
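By way of illustration, the following is a minimal sketch of the crawling in step S10. The URL and the CSS selector are hypothetical placeholders, and the patent does not name a crawling framework; requests with BeautifulSoup is merely one common choice.

```python
import requests
from bs4 import BeautifulSoup

def fetch_policy_document(url: str) -> str:
    """Download one policy page and return its plain text."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # Government policy pages are often GBK/GB2312-encoded; let requests guess.
    resp.encoding = resp.apparent_encoding
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical selector for the article body; adjust per site.
    body = soup.select_one("div.article-content") or soup
    return body.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    text = fetch_policy_document("https://example.gov.cn/policy/12345.html")
    print(text[:500])
```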
step S11, performing word segmentation on the electronic document, forming initial embedding data with a word vector model, and inputting the initial embedding data into a pre-trained summary generation model;
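The segmentation and embedding of step S11 can be sketched as follows, assuming jieba for Chinese word segmentation and a gensim Word2Vec model as the word vector model; both choices are assumptions, since the patent only requires word segmentation and a word vector model.

```python
import jieba
import numpy as np
from gensim.models import Word2Vec

def build_word_vectors(segmented_sentences, dim=128):
    """Train a simple Word2Vec word vector model on pre-segmented sentences."""
    return Word2Vec(sentences=segmented_sentences, vector_size=dim,
                    window=5, min_count=1, workers=4)

def embed_document(text, w2v):
    """Segment a document and stack the vectors of in-vocabulary tokens."""
    tokens = jieba.lcut(text)
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.stack(vecs)  # shape: (num_tokens, dim), the initial embedding data
```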
wherein the summary generation model employs an encoder-decoder framework in which an attention-based bidirectional Transformer model is used as the language representation model;
the encoder consists of two sublayers, a multi-head attention layer and a fully connected feed-forward neural network layer, connected by residual connections followed by layer normalization; the decoder comprises at least a masked multi-head attention layer, a multi-head attention layer and a fully connected feed-forward neural network layer, with the sublayers likewise residually connected and layer-normalized;
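The encoder sublayer structure just described can be sketched in PyTorch as follows: one multi-head attention sublayer and one fully connected feed-forward sublayer, each wrapped in a residual connection followed by layer normalization. The hyper-parameters (d_model=512, 8 heads, d_ff=2048) are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # Sublayer 1: multi-head self-attention, residual connection, layer norm.
        a, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(a))
        # Sublayer 2: position-wise feed-forward, residual connection, layer norm.
        return self.norm2(x + self.drop(self.ffn(x)))
```

A decoder layer would add a masked multi-head attention sublayer before these two, as described above.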
step S12, adding positional encoding to the bottom-layer embeddings of the encoder and the decoder;
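Step S12 can be realized, for example, with the sinusoidal positional encoding of the original Transformer; whether sinusoidal or learned position embeddings are intended is not stated in the patent, so the sinusoidal form (with an even d_model) is an assumption.

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

# Usage: x = token_embeddings + positional_encoding(seq_len, d_model)
```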
step S13, obtaining the generation probability of a pointer-generator network (Pointer-Generator Network) by concatenating the decoder outputs at the current and previous time steps with the attention distribution, and, according to the generation probability, either copying content from the source text of the electronic document or generating the corresponding summary content from the attention. Specifically, if the word to be decoded is not in the vocabulary distribution, it is copied using the multi-head attention distribution; if it is in the vocabulary distribution, the distribution over the vocabulary is used to generate it.
It is understood that, in a specific example of the present invention, the method further comprises:
constructing a summary generation model in advance and training it to obtain the trained summary generation model.
In one example, constructing the summary generation model in advance and training it further includes:
constructing a summary generation model employing an encoder-decoder framework, wherein an attention-based bidirectional Transformer model is used in both the encoder and the decoder;
counting all words in the training corpus, generating a dictionary file, and forming a training set; and
initially embedding the dictionary file of the training set into the encoder of the summary generation model through the word vector model, and training the summary generation model to finally obtain the trained summary generation model.
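The dictionary-building step can be sketched as follows; the file format (one word per line, ordered by frequency) is an assumption, as the patent only requires counting all words and generating a dictionary file.

```python
from collections import Counter
import jieba

def build_dictionary(documents, path="vocab.txt", min_count=1):
    """Count every word in the training corpus and write a word->index dictionary."""
    counter = Counter()
    for doc in documents:
        counter.update(jieba.lcut(doc))
    vocab = [w for w, c in counter.most_common() if c >= min_count]
    with open(path, "w", encoding="utf-8") as f:
        for word in vocab:
            f.write(word + "\n")
    return {w: i for i, w in enumerate(vocab)}
```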
Specifically, in step S13, the generation probability is obtained in the pointer-generator network as follows:
calculating, for each word-embedding input, the attention product with the decoder output, normalizing to obtain the weights, and taking the weighted sum to obtain the attention score $e_i$:

$e_i = v^{T} \tanh(W_h h_i + W_s s_t + W_c c_t)$

then applying a softmax operation to the attention scores to obtain the attention distribution:

$a_i = \mathrm{softmax}(e_i) = \dfrac{\exp(e_i)}{\sum_{k} \exp(e_k)}$

calculating the content vector $c_i$ and the vocabulary distribution $P_{vocab}$ by multiplying the attention with the encoder hidden states:

$c_i = \sum_{i} a_i h_i$

$P_{vocab} = \mathrm{softmax}(L(s_t, c_i))$

where $h_i$ denotes the encoder hidden state of the i-th word, $c_i$ the content vector, and $s_i$ the decoder hidden state of the i-th word;

calculating the generation probability $p_{gen}$ by:

$p_{gen} = \sigma(W_c' c_i + W_h' h_i + W_x x_t + b_{ptr})$

and obtaining the final vocabulary probability distribution by combining the vocabulary distribution and the attention distribution:

$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i$

where $P_{vocab}$ is the vocabulary distribution and $a_i$ is the attention distribution.
In one example of the present invention, the pointer-generator network is a hierarchical pointer-generator network.
For better understanding, each aspect of the present invention is further described below.
First, in the embodiments provided herein, a Sequence-to-Sequence (Seq2Seq) framework with an attention mechanism is adopted, which initially uses an RNN as the encoder and the decoder. A pointer-generator network (Pointer-Generator Network) is added at the same time, and the initial word embeddings obtained from a pre-trained word vector model are used as the input of the model.
It can be understood that this differs from a plain attention-based Seq2Seq framework, in which the decoder generates a vocabulary distribution $P_{vocab}$ through the softmax function: the pointer-generator network additionally performs an attention computation over the words of the source document at the decoder stage, thereby producing an attention distribution. Concretely, the pointer-generator network computes the attention product of each word-embedding input with the decoder output, normalizes to obtain the weights, and takes the weighted sum to obtain the attention score $e_i$:
$e_i = v^{T} \tanh(W_h h_i + W_s s_t + W_c c_t)$

A softmax operation is then applied to the attention scores to obtain the attention distribution:

$a_i = \mathrm{softmax}(e_i) = \dfrac{\exp(e_i)}{\sum_{k} \exp(e_k)}$

The content vector $c_i$ and the vocabulary distribution $P_{vocab}$ are calculated by multiplying the attention with the encoder hidden states:

$c_i = \sum_{i} a_i h_i$

$P_{vocab} = \mathrm{softmax}(L(s_t, c_i))$

where $h_i$ denotes the encoder hidden state of the i-th word, $c_i$ the content vector, and $s_i$ the decoder hidden state of the i-th word.

The generation probability $p_{gen}$ is calculated by:

$p_{gen} = \sigma(W_c' c_i + W_h' h_i + W_x x_t + b_{ptr})$

The final vocabulary probability distribution is obtained by combining the vocabulary distribution and the attention distribution:

$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i$

where $P_{vocab}$ is the vocabulary distribution and $a_i$ is the attention distribution. $p_{gen}$ can be viewed as a soft switch that controls whether to copy words from the input sequence or to generate new words: for an out-of-vocabulary word, $P_{vocab}(w) = 0$, so the word can only be obtained by copying; conversely, a word that does not appear in the input text can only be generated from the vocabulary by the model.

The word probabilities obtained in this way form the final output result.
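A minimal sketch of this mixing step, directly following the formulas above: compute $p_{gen}$ from the content vector, hidden state and decoder input, then blend the vocabulary distribution with the attention distribution scattered back onto the source-word ids. The tensor shapes and the extended-vocabulary handling of OOV words are assumptions consistent with the pointer-generator literature, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class PointerMixer(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()
        self.w_c = nn.Linear(hidden_dim, 1, bias=False)  # W_c'
        self.w_h = nn.Linear(hidden_dim, 1, bias=False)  # W_h'
        self.w_x = nn.Linear(emb_dim, 1, bias=True)      # W_x, bias plays b_ptr

    def forward(self, c_i, h_i, x_t, p_vocab, attn, src_ids, ext_vocab_size):
        # p_gen = sigma(W_c' c_i + W_h' h_i + W_x x_t + b_ptr)
        p_gen = torch.sigmoid(self.w_c(c_i) + self.w_h(h_i) + self.w_x(x_t))
        # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum over {i: w_i = w} of a_i
        p_final = torch.zeros(p_vocab.size(0), ext_vocab_size,
                              device=p_vocab.device)
        p_final[:, :p_vocab.size(1)] = p_gen * p_vocab
        p_final.scatter_add_(1, src_ids, (1.0 - p_gen) * attn)
        return p_final
```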
Secondly, the invention adopts a pointer-generator network with a hierarchical structure.
Since electricity price policy articles are usually structured, they are organized into sections. When summarizing, people generally extract information from the different sections and then condense it. The invention therefore adds section-level language modeling of passage information on top of the prior art.
Fig. 2 shows a schematic diagram of the hierarchical pointer-generator network employed by the present invention.
For the encoder:

The lowest-layer word-level RNN generates a representation of each section, where the superscript (s) denotes the section level, (t) the decoding step, (e) the encoder and (d) the decoder, and the subscript i indexes the word while j indexes the section:

$h_{(j,i)}^{(e)} = \mathrm{RNN}^{(e)}\big(h_{(j,i-1)}^{(e)},\, x_{(j,i)}\big)$

where $x_{(j,i)}$ denotes the word-embedding vector of the i-th word of the j-th section.

The section-level RNN then generates a representation of the document from the word-level outputs:

$h_{j}^{(s)} = \mathrm{RNN}^{(s)}\big(h_{j-1}^{(s)},\, h_{(j,\cdot)}^{(e)}\big)$

where $h_{j}^{(s)}$ denotes the hidden state of the j-th section.
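A sketch of this hierarchical encoder, with GRU cells standing in for the unspecified RNN: the word-level RNN produces $h_{(j,i)}^{(e)}$ within each section, and the section-level RNN consumes one summary state per section to produce $h_{j}^{(s)}$.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, emb_dim=128, hidden=256):
        super().__init__()
        self.word_rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.sect_rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, sections):
        # sections: list of (1, num_words_j, emb_dim) tensors, one per section.
        word_states, sect_inputs = [], []
        for sec in sections:
            out, h_n = self.word_rnn(sec)   # h_(j,i) for every word i
            word_states.append(out)
            sect_inputs.append(h_n[-1])     # final state summarizes section j
        sect_seq = torch.stack(sect_inputs, dim=1)  # (1, num_sections, hidden)
        sect_states, _ = self.sect_rnn(sect_seq)    # h_j^(s)
        return word_states, sect_states
```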
For the decoder:

Specifically, the context vector is equipped with section-level information: the attention is first summed within each section and then summed over all the sections,

$c_t = \sum_{j} \sum_{i} a_{(j,i)}^{(t)}\, h_{(j,i)}^{(e)}$

where $h_{(j,i)}^{(e)}$ denotes the encoder hidden state of the i-th word of the j-th section, $a_{(j,i)}^{(t)}$ the corresponding attention weight, and $c_t$ the content vector.

The newly introduced variable is the section-level attention score:

$e_{j}^{(s,t)} = v_s^{T} \tanh\big(W_s h_{j}^{(s)} + W_d\, s_{t-1}^{(d)}\big)$

where $h_{j}^{(s)}$ denotes the encoder hidden state of the j-th section and $s_{t-1}^{(d)}$ the decoder hidden state at time t-1.
In general, the word-level attention coefficient is calculated as follows,

$e_{(j,i)}^{(t)} = v^{T} \tanh\big(W_h h_{(j,i)}^{(e)} + W_d\, s_{t-1}^{(d)}\big)$

where $h_{(j,i)}^{(e)}$ denotes the encoder hidden state of the i-th word of the j-th section and $s_{t-1}^{(d)}$ the decoder hidden state at time t-1.

The coverage vector accumulates the attention distributions of all previous decoding steps:

$cov^{(t)} = \sum_{t'=0}^{t-1} a^{(t')}$

The final probability is calculated as:

$P_{vocab} = \mathrm{softmax}\big(L(s_{t}^{(d)},\, c_t)\big)$

where $s_{t}^{(d)}$ denotes the decoder hidden state at time t and $c_t$ is the context vector.
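A sketch of the hierarchical attention above: word-level scores are weighted by the section-level attention of their section and jointly normalized, and the context vector is the attention-weighted sum over all word states. The exact normalization order is an assumption consistent with "first summing within a section, then over all sections".

```python
import torch

def hierarchical_attention(word_scores, sect_scores, word_states):
    """word_scores: list of (1, num_words_j); sect_scores: (1, num_sections);
    word_states: list of (1, num_words_j, hidden)."""
    sect_attn = torch.softmax(sect_scores, dim=-1)            # a_j^(s,t)
    # Weight each section's word distribution by its section attention,
    # then renormalize jointly over all (j, i) pairs.
    scaled = [sect_attn[:, j:j + 1] * torch.softmax(word_scores[j], dim=-1)
              for j in range(len(word_scores))]
    flat = torch.cat(scaled, dim=-1)
    attn = flat / flat.sum(dim=-1, keepdim=True)              # a_(j,i)^(t)
    states = torch.cat(word_states, dim=1)                    # (1, total, hidden)
    c_t = torch.bmm(attn.unsqueeze(1), states).squeeze(1)     # context vector
    return attn, c_t
```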
Third, in the present invention, the RNN model in the encoder-decoder framework is replaced with a bidirectional Transformer model.
The bidirectional Transformer model can effectively solve the problems of long-distance dependency and parallel computation. It is built on self-attention layers and is divided into an encoder part and a decoder part, and this model structure can be combined with the model structures of the prior art.
At the encoder side, the input word-embedding vector and the corresponding elements of the position-embedding vector are added element-wise, so that the model can learn more information about word positions and ultimately distinguish words at different positions in a sentence. The result is input to the self-attention layer, the attention coefficients are computed, and finally a vector Z is output to the next encoder layer.
$Z = \mathrm{softmax}\!\left(\dfrac{Q K^{T}}{\sqrt{d_k}}\right) V$
where the input to the attention mechanism is the query Q, and the key-value pair (K, V) stores the context. For self-attention, Q, K and V are all derived from the same text, so attention is computed from the similarity of the text with itself. The results are concatenated by the multi-head attention mechanism, which allows the current word to express richer relationships with the other words.
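The formula above corresponds to the standard scaled dot-product attention, sketched here; the projection of the input into Q, K and V is left to the caller.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Z = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```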
Then an addition (Add) layer performs element-wise addition of the corresponding elements, and Layer Normalization is applied to stabilize training. A feed-forward neural network (Feed Forward) then maps the attention matrix Z into a higher-dimensional space, a ReLU performs the nonlinear operation, and the result is finally projected back to the same dimensionality as Z.
After six identical encoder layers, a vector R is finally output, representing the complete encoded representation of the source sequence. The vector R is transformed into the two vectors K and V, the key-value pair (K, V) that stores the context; these are used in the computation of the encoder-decoder attention layer in the decoder part, thereby integrating the information of the encoder and the decoder.
In the decoder part, the processing before the Linear layer is the same as in the encoder. Because the decoder carries out the prediction process, a Linear operation is performed in the Linear layer to expand the dimensionality (the vector dimension becomes the length of the dictionary); the final probability distribution over the whole dictionary is obtained through softmax normalization, and the index with the maximum probability is selected to obtain the corresponding generated word.
This word is then fed back as the input for the next prediction, and so on, until the sentence-end token <EOS> is generated, at which point the decoder part ends.
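The decoding loop just described can be sketched as follows. `model.encode`, `model.decode` and the token ids are hypothetical stand-ins for the trained summary generation model, since the patent does not specify an interface.

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=200):
    memory = model.encode(src)                    # the vector R from the encoder
    ys = torch.tensor([[bos_id]], dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(ys, memory)         # (1, current_len, dict_size)
        next_id = logits[0, -1].softmax(-1).argmax().item()
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                     # stop at the <EOS> token
            break
    return ys.squeeze(0).tolist()
```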
It can be appreciated that the method adopted by the present invention solves the OOV problem in policy documents by using the attention-based Seq2Seq framework with an improved pointer-generator network. Compared with the traditional attention-based Seq2Seq framework, the pointer-generator network can copy words directly from the source document, which is very effective for producing OOV words. Since policy articles are usually structured into sections, and people typically extract information from the different sections before summarizing, the invention adds section-level language modeling of passage information to the basic pointer-generator network, yielding the hierarchical pointer-generator network designed here.
Meanwhile, a bidirectional Transformer model is introduced into the attention-based Seq2Seq framework as the language representation model, addressing the tendency of existing attention mechanisms to produce repetitive and incoherent long texts. The summary generation model designed by the invention does not over-attend to any one part of the input and therefore does not generate repeated sentences. First, the attention layer at the encoder computes weights for every word of the input, so the generated content covers the original text rather than focusing on a single passage. The attention layer at the decoder likewise computes weights over the words already generated, which avoids producing duplicate content. After attention has been applied at the encoder and the decoder respectively, the two are concatenated and decoded to generate the next word, so the generation of repeated sentences is avoided.
The implementation of the invention has the following beneficial effects:
the invention provides a method for extracting a generative abstract of a power policy file, which adopts a Seq2Seq framework integrated with an attention mechanism as a basic model for generating the abstract, and adds a pointer to generate a network at the same time, so that words are directly copied from a source document to solve the OOV problem;
and then combining a hierarchical structure of the policy document, and adding language model modeling language segment information of a language segment level (section level) on the basis of a pointer generation network. In the technology of language model modeling language segment information, the invention abandons the traditional RNN and LSTM structures, introduces a bidirectional converter model as a language representation model in a Seq2Seq framework integrated with an attention mechanism, and effectively solves the problem of long-distance dependence. The invention designs an improved attention mechanism to solve the problems of incoherent irrelevant content and repeated sentences in long texts.
The invention designs an automatic abstract identification method suitable for the long text aiming at the characteristics of the electricity price policy text that the long text and the writing format are relatively fixed, and integrates the special format characteristics of the automatic abstract identification method. The efficiency and the accuracy of the abstract extraction process can be improved.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (5)

1. A method for extracting a generative abstract of a power policy document, characterized by comprising the following steps:
step S10, acquiring an electronic document of the power policy file from a specified website using web-crawler technology;
step S11, performing word segmentation on the electronic document, forming initial embedding data with a word vector model, and inputting the initial embedding data into a pre-trained summary generation model;
wherein the summary generation model employs an encoder-decoder framework in which an attention-based bidirectional Transformer model is used as the language representation model;
the encoder consists of two sublayers, a multi-head attention layer and a fully connected feed-forward neural network layer, connected by residual connections followed by layer normalization; the decoder comprises at least a masked multi-head attention layer, a multi-head attention layer and a fully connected feed-forward neural network layer, with the sublayers likewise residually connected and layer-normalized;
step S12, adding positional encoding to the bottom-layer embeddings of the encoder and the decoder; and
step S13, obtaining the generation probability of a pointer-generator network (Pointer-Generator Network) by concatenating the decoder outputs at the current and previous time steps with the attention distribution, and, according to the generation probability, either copying content from the source text of the electronic document or generating the corresponding summary content from the attention.
2. The method of claim 1, further comprising:
constructing a summary generation model in advance and training it to obtain the trained summary generation model.
3. The method of claim 2, wherein constructing the summary generation model in advance and training it to obtain the trained summary generation model further comprises:
constructing a summary generation model employing an encoder-decoder framework, wherein an attention-based bidirectional Transformer model is used in both the encoder and the decoder;
counting all words in the training corpus, generating a dictionary file, and forming a training set; and
initially embedding the dictionary file of the training set into the encoder of the summary generation model through the word vector model, and training the summary generation model to finally obtain the trained summary generation model.
4. The method of claim 1, wherein in step S13, the generation probability is obtained in the pointer-generator network by:
calculating, for each word-embedding input, the attention product with the decoder output, normalizing to obtain the weights, and taking the weighted sum to obtain the attention score $e_i$:

$e_i = v^{T} \tanh(W_h h_i + W_s s_t + W_c c_t)$

then applying a softmax operation to the attention scores to obtain the attention distribution:

$a_i = \mathrm{softmax}(e_i) = \dfrac{\exp(e_i)}{\sum_{k} \exp(e_k)}$

calculating the content vector $c_i$ and the vocabulary distribution $P_{vocab}$ by multiplying the attention with the encoder hidden states:

$c_i = \sum_{i} a_i h_i$

$P_{vocab} = \mathrm{softmax}(L(s_t, c_i))$

wherein $h_i$ denotes the encoder hidden state of the i-th word, $c_i$ the content vector, and $s_i$ the decoder hidden state of the i-th word;

calculating the generation probability $p_{gen}$ by:

$p_{gen} = \sigma(W_c' c_i + W_h' h_i + W_x x_t + b_{ptr})$

and obtaining the final vocabulary probability distribution by combining the vocabulary distribution and the attention distribution:

$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i$

wherein $P_{vocab}$ is the vocabulary distribution and $a_i$ is the attention distribution.
5. The method of any of claims 1 to 4, wherein the pointer-generator network is a hierarchical pointer-generator network.
CN202111550623.2A 2021-12-17 2021-12-17 Method for extracting generative abstract of power policy file Pending CN114254175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111550623.2A CN114254175A (en) 2021-12-17 2021-12-17 Method for extracting generative abstract of power policy file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111550623.2A CN114254175A (en) 2021-12-17 2021-12-17 Method for extracting generative abstract of power policy file

Publications (1)

Publication Number Publication Date
CN114254175A true CN114254175A (en) 2022-03-29

Family

ID=80795597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111550623.2A Pending CN114254175A (en) 2021-12-17 2021-12-17 Method for extracting generative abstract of power policy file

Country Status (1)

Country Link
CN (1) CN114254175A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116933785A (en) * 2023-06-30 2023-10-24 国网湖北省电力有限公司武汉供电公司 Transformer-based electronic file abstract generation method, system and medium

Similar Documents

Publication Publication Date Title
CN110134771B (en) Implementation method of multi-attention-machine-based fusion network question-answering system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN113158665B (en) Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation
CN111460092B (en) Multi-document-based automatic complex problem solving method
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN110413768B (en) Automatic generation method of article titles
CN109992775B (en) Text abstract generation method based on high-level semantics
CN112765345A (en) Text abstract automatic generation method and system fusing pre-training model
CN111666756B (en) Sequence model text abstract generation method based on theme fusion
CN112417901A (en) Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN115062140A (en) Method for generating abstract of BERT SUM and PGN fused supply chain ecological district length document
CN115600581B (en) Controlled text generation method using syntactic information
CN114139497A (en) Text abstract extraction method based on BERTSUM model
CN114218928A (en) Abstract text summarization method based on graph knowledge and theme perception
CN113239666A (en) Text similarity calculation method and system
CN112417138A (en) Short text automatic summarization method combining pointer generation type and self-attention mechanism
CN115048511A (en) Bert-based passport layout analysis method
Qiu et al. Text summarization based on multi-head self-attention mechanism and pointer network
Wang et al. Vector-to-sequence models for sentence analogies
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN114254175A (en) Method for extracting generative abstract of power policy file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination