CN115034217A

CN115034217A - Microblog text-oriented generative automatic text summarization method based on key information guidance

Info

Publication number: CN115034217A
Application number: CN202210608239.1A
Authority: CN
Inventors: 赵铁军; 郭常江; 杨沐昀; 朱聪慧; 徐冰; 曹海龙
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-09-09

Abstract

The invention provides a microblog text-oriented generating type automatic text summarization method based on key information guidance, which comprises the steps of firstly cleaning microblog texts and removing redundant information and other non-key information in the microblog texts; then, obtaining key words and key phrases in the microblog text through a key information extraction module; then designing a special deep learning neural network aiming at the task and training a model by using a public data set; finally, the processed microblog text and the key information are used as input, and the key information is used for guiding abstract generation to obtain a final abstract result; the invention aims to improve the precision of generating the abstract according to the microblog text, further improve the accuracy of content retrieval when a public opinion analysis system analyzes aiming at the microblog text, more briefly and accurately cover the main information of the microblog text and save the time for manually reading the full text.

Description

Microblog text-oriented generative automatic text summarization method based on key information guidance

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a microblog text-oriented generation type automatic text summarization method based on key information guidance.

Background

The automatic text abstract generation is a method for semantic compression and key information extraction aiming at a text, is commonly used for assisting reading, filtering redundant information and other scenes, and is also applied to public opinion monitoring systems recently so as to provide more comprehensive event development analysis.

The 21 st century can be said as the age of the internet, and the convenient condition of the internet changes a plurality of life styles of people, wherein the change of reading habits is only after the change of network e-commerce, and is accompanied by everyone. Because the mobile phone is portable and easy to use, people who are used to paper media for reading are willing to use electronic media for reading at present, various news websites and articles are endlessly developed, and the situation that tens of thousands of articles appear every day is caused while the people conveniently read the articles. Particularly, in a microblog platform, as a social network, microblog is undoubtedly one of the most fierce network media at present. Many network events are discovered at the first time on the microblog, however, for the general public without knowledge, for a certain public opinion event, much time is spent if the public opinion event is known from the beginning, so that the public can quickly know the general situation of the public opinion event;

on the other hand, currently, large enterprises, government departments and the like are all engaged in developing a specific or whole-domain internet public opinion monitoring system to deal with emergencies. Microblog is a very successful social media and is a first choice of all public opinion monitoring systems, and countless microblog texts also have an intangible burden on the public opinion monitoring systems, so that automatic text summarization technology aiming at similar microblog short texts is also needed in the public opinion monitoring systems.

Current automatic text summarization techniques are becoming mature, but depending on where they can be improved:

the existing automatic text summarization technology mainly takes an extraction formula as a main part and a generation formula as an auxiliary part, namely, most of texts are sequenced by using a model, and are extracted according to the importance of sentences to obtain a summarization result, so that the problem of incoherence among the sentences is caused;

secondly, the text generated by the generation type text summarization technology is limited by a word list, so that the phenomenon of 'unknown words' can occur, and the problems of blank space and incoherence occur when sentences are generated;

third, the generated text summarization technique may have no way to customize for the user at the time of generation, i.e. the generated summarization is fixed.

Disclosure of Invention

The invention provides a microblog text-oriented generation type automatic text summarization method based on key information guidance, which is combined with a pre-training model, a keyword extraction technology and a reinforcement learning method and aims to solve the problems that sentences are not connected with each other, blank regions are easy to appear and summaries cannot be customized when text summaries are generated.

The invention is realized by the following technical scheme:

a microblog text-oriented generative automatic text summarization method based on key information guidance comprises the following steps:

the method specifically comprises the following steps:

step 1: cleaning the microblog text to remove redundant information and other unnecessary information;

step 2: obtaining key words (groups) in the microblog text through a key information extraction module;

and step 3: designing an automatic microblog text abstract model based on a deep learning neural network, and training the model by using a public data set;

and 4, step 4: and (3) inputting the microblog text cleaned in the step one and the keyword (group) obtained in the step two into the model trained in the step 3, and using the key information to guide abstract generation to obtain a final abstract result.

Further, in the step 1,

the other unnecessary information is a special label of the microblog platform and comprises an '@' user name, a microblog station internal link, a hypertext link and a microblog emoticon.

Further, in step 1,

step 1.1: cleaning the acquired microblog text by using a regular expression, reserving Chinese, English and numeric characters, and removing useless microblog user names, microblog in-station links, hyperphone links, emoticons, spaces and non-Chinese characters;

step 1.2: and simplifying the source text by using a library function in a Python programming language, changing the traditional words in the source text into simplified words, and skipping the step if the original text has no traditional words.

Further, in step 2,

step 2.1: performing word segmentation on the text obtained in the step 1 by using a word segmentation tool to obtain a word segmentation result;

step 2.2: combining the grammar parsing tree and the set part of speech needing to be reserved to obtain keywords (groups) to be extracted;

step 2.3: counting the position information and frequency information of the keyword (group) to be extracted;

step 2.4: embedding and calculating the texts obtained in the step 1 and the step 2.2 by using a pre-trained word embedding model to obtain keyword distribution 1;

step 2.5: the keyword distribution 2 is obtained by applying the text of step 1 and step 2.2 and the text embedded representation of step 2.4 to the graph model TextRank.

Step 2.6: and (5) fusing the keyword distributions 1 and 2 in the step 2.4 and the step 2.5 to obtain final keyword distributions, and selecting the top 10 keywords as keywords (groups).

Further, in step 3, the public data set is an lcts data set, all data in the data set come from microblog texts, and the data set is preprocessed first;

step 3.1: scoring each abstract in the data set, wherein the score is the accuracy of the abstract, and keeping all data with accuracy for the training data set; while only data with a score greater than or equal to 3 are retained for the validation set and the test set;

step 3.2: processing the data screened in the step 3.1 by using the method in the step 2 to obtain a keyword (group) of each microblog text;

step 3.3: and (3) combining the keyword (group) obtained in the step (3.2) with the corresponding microblog text and the abstract into a new data, and finally obtaining a new data set.

Further, in step 3, the microblog text automatic summarization model comprises a microblog text encoder, a keyword (group) encoder and a decoder;

the microblog text encoder comprises a word embedding layer and a two-way LSTM network, and semantic expression vectors c of all moments are obtained by combining an attention mechanism _t Specifically, the method comprises the following steps:

mapping each word segmentation result in the step 2.1 through an Embedding layer to obtain a vector Embedding _i Wherein i represents the ith word in the sentence;

vector Embedding _i Inputting into a layer of bidirectional LSTM to obtain the expression of front and back semantics, and recording the forward expression as

Backward direction is shown as

Splicing the forward and backward vectorsA block being denoted as the representation of the word at decoding time t

Calculating the attention score of the current moment and the vector representation c of the whole microblog text at each moment _t ：

Wherein v, W _h ,W _s B are all learnable parameters, s _t Is the output result of the decoder at the moment t;

the keyword(s) set encoder uses the pre-training model albert-tiny as the Embedding layer,

the embedding results of the keywords (groups) are subjected to a layer of selective gating network to obtain scores score with different importance _i And combining word embedding results to obtain a semantic center vector of the keyword (group) at the time t:

wherein W, b are trainable parameters, s _t Is the state vector representation of the decoder at time t.

Modifying the preamble p _gen The generation method of (1):

p _gen ＝sigmoid(W _s ·s _t +W _h ·c _t +W _x ·x _t +W _k ·key _t +b)

wherein W _k Is a trainable parameter;

the decoder comprises an Embedding layer, a one-way LSTM layer and two fully connected layers;

the decoder maps the words to the previous time instant into a vector y _t-1 Then the vector and the semantic vector c of the microblog text at the previous moment are combined _t-1 Splicing to obtain input x of current time t _t ：

x _t ＝[y _t-1 ；c _t-1 ]

Inputting x at the time t _t Inputting into LSTM network of decoder, obtaining its hidden vector representation s _t (ii) a Then the hidden vectors are summed with c _t Splicing, and obtaining word distribution P (w) at the current moment through two full-connection layers:

P(w)＝Dense ₁ (Dense ₂ ([s _t ；c _t ]))

selecting the word w with the highest probability according to the word distribution _t As a result of the decoding at the current time.

Further, the air conditioner is characterized in that,

in step 3, the microblog text automatic abstracting model further comprises a pointer generating mechanism and a historical information covering mechanism;

the history information coverage mechanism: distributing historical attention

Recording, adding to obtain history information H _t And modifying the attention distribution at the current moment according to the information to avoid generating the same words:

wherein W _H Is a trainable parameter;

the pointer generation mechanism is used for modifying the generation mode of the word distribution P (w) in the decoder:

p _gen ＝sigmoid(W _s ·s _t +W _h ·c _t +W _x ·x _t +b)

P(w)＝p _gen ·P(w)+(1-p _gen )·H _t

wherein W _s ,W _h ,W _x And b are trainable parameters.

Further, in step 3,

the training process of the microblog text automatic abstract model comprises the following steps:

step 3.4: conventional supervised training is performed using a pointer generation mechanism, with the loss function of the training set to loss _sl ：

Step 3.5: starting historical information coverage mechanism training, setting the loss function of the training as

Step 3.6: training is carried out by combining a Self-critic reinforcement learning method, and a loss function of the training is set as loss:

loss＝α·loss _sl +(1-α)·loss _rl

wherein loss _rl In order to strengthen the loss function of learning, alpha is a hyperparameter, and the action degrees of the two loss functions are balanced.

An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.

A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.

The invention has the beneficial effects that

According to the method, the precision of generating the abstract according to the microblog text is improved, the accuracy of content retrieval when a public opinion analysis system analyzes the microblog text is further improved, the main information of the microblog text is more briefly and accurately covered, and the time for manually reading the full text is saved;

according to the method, the microblog texts obtained by grabbing are cleaned, key words and key phrases are obtained by combining a key information extraction technology, the effect of model optimization and abstract generation is achieved by utilizing a deep learning technology, a pre-training model and a reinforcement learning method, and finally an abstract result is obtained;

the method can be applied to a public opinion analysis system and helps a user to quickly know the rough situation of a public opinion event; or the method is applied to social media and is used as a function provided by a platform, namely, the event overall situation is rapidly known, and the user experience is further improved. Meanwhile, in order to meet the customized requirement, keywords can be manually input, and the abstract is generated according to the part which is interested by the user and has the key point.

Drawings

FIG. 1 is a system flow diagram of the present invention;

FIG. 2 is a flow chart of the data set LCTS processing of the present invention;

FIG. 3 is a diagram of a model for text word embedding according to the present invention;

fig. 4 is an exemplary diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

With reference to fig. 1 to 4.

the method specifically comprises the following steps:

in practical application, after the microblog text is acquired, some peculiar labels of the microblog platform are included, including a "@" user name, a microblog station in-link, a hypertext link, a microblog emoticon and the like. The structured text is unnecessary information for the subsequent extraction of key information and summary generation, and even the generation result of the summary can be influenced.

The step aims to clear the acquired microblog texts, and clear microblog texts are obtained after clearing, so that subsequent related processing can be greatly facilitated.

the step is the same as the step 1, and is one of the preceding steps, the clean microblog texts obtained in the step 1 are used, and the texts are sent into a specific keyword (group) extraction model to obtain default keywords of the microblog. This keyword result can be considered to be artifact free; if the user manually enters the keywords at this step, the summary generated at the subsequent step is targeted according to the user's preferences. These keywords (groups) are then applied in step 4 to generate a summary.

the method comprises the main steps of building a special generative abstract model aiming at microblog texts by using a deep neural network, and then training by using a public microblog data set to obtain a final application model. It should be noted that this step is only required to be completed at the beginning, and generally repeated training is not required in the subsequent use; if there is a need for greater accuracy in later applications and there is enough new data to be iteratively trained using the trained model to achieve better results.

Because the public data set does not have keyword (group) information, the keyword (group) extraction needs to be carried out on the text in the data set in advance in the step, and the used model is consistent with the step 2;

after data processing and model building are finished, a pre-training model is used for modeling keywords (groups), a reinforcement learning method is integrated for training, and a final application model is obtained through tuning.

And 4, step 4: and (4) inputting the microblog text cleaned in the step one and the keyword (group) obtained in the step two into the trained model in the step three by taking the microblog text cleaned in the step one and the keyword (group) obtained in the step two as input, and generating a guidance abstract by using the key information to obtain a final abstract result.

And taking the step as the last step, and applying the model in the step 3 to obtain a final abstract result. If manually customized keywords are used in this step, then relevant digest results for the user's preferences are generated.

In the step 1, the process is carried out,

the other unnecessary information is a unique label of the microblog platform and comprises an '@' user name, a microblog station internal link, a hypertext link, a microblog emoticon and the like.

step 1.2: and (3) simplifying the source text by using a library function in a Python programming language, changing the traditional words in the source text into simplified words, and skipping the step if the original text has no traditional words.

In the step 2, the process is carried out,

step 2.3: counting the position information and the frequency information of the keywords (groups) to be extracted;

In step 3, the public data set is an LCTS data set, all data in the data set come from microblog texts, the information of keywords (groups) is not considered in the construction process of the data set, and in order to adapt to the invention, the data set needs to be processed first, and is preprocessed first;

step 3.1: scoring each abstract in the data set, wherein the score is the accuracy of the abstract, and for the training data set, retaining all data with accuracy; while only data with a score greater than or equal to 3 are retained for the validation set and the test set;

step 3.3: combining the keyword (group) obtained in the step 3.2 with the corresponding microblog text and abstract into a new data, and finally obtaining a new data set which can be completely applied to the invention.

In step 3, the microblog text automatic summarization model comprises a microblog text encoder, a keyword (group) encoder and a decoder;

Backward direction is shown as

Splicing the forward and backward vectors into a block to be marked as the representation of the word when the decoding time is t

the keyword (group) encoder uses a pre-training model albert-tiny as an Embedding layer, and as the pre-training model is trained on large-scale Chinese data, the pre-training model has certain semantic knowledge, and the keyword (group) is not a continuous statement, the effect of directly using a common Embedding layer is not good;

embedding the keyword (group) into a gating network to obtain scores score with different importance _i And combining word embedding results to obtain a semantic center vector of the keyword (group) at the time t:

Modifying the preamble p _gen The generation method of (1):

p _gen ＝sigmoid(W _s ·s _t +W _h ·c _t +W _x ·x _t +W _k ·key _t +b)

wherein W _k Is a trainable parameter;

the decoder comprises an Embedding layer, a one-way LSTM layer and two full-connection layers;

similar to the encoder section, the decoder maps the words to the previous time instant into a vector y _t-1 Then the vector and the semantic vector c of the microblog text at the previous moment are combined _t-1 Splicing to obtain input x of current time t _t ：

x _t ＝[y _t-1 ；c _t-1 ]

P(w)＝Dense ₁ (Dense ₂ ([s _t ；c _t ]))

the history information override mechanism: distributing historical attention

wherein W _H Is a trainable parameter;

in order to improve the size limitation of the vocabulary, the pointer generation mechanism causes the decoded word to be a seen word, and is used for modifying the generation mode of the word distribution P (w) in the decoder:

p _gen ＝sigmoid(W _s ·s _t +W _h ·c _t +W _x ·x _t +b)

P(w)＝p _gen ·P(w)+(1-p _gen )·H _t

wherein W _s ,W _h ,W _x And b is all ofAnd training parameters.

In the step 3, the process is carried out,

loss＝α·loss _sl +(1-α)·loss _rl

Examples

According to the steps, a simple microblog text automatic summary module can be realized, the module can be embedded into any existing system, the plug-and-play effect is achieved, and the beneficial effects of the invention are specifically verified as follows:

this example was carried out according to the scheme shown in FIG. 1. The system built by using the invention is divided into two parts: the data acquisition part and the algorithm extract the keyword part. The data acquisition part mainly acquires a microblog text from a microblog platform and stores the microblog text in a database or a file; the algorithm generation abstract part mainly comprises three parts, namely data cleaning, extraction of key words (groups) to be selected and abstract generation.

Fig. 4 is a demonstration example result.

After the system is started, the pre-training model albert-tiny is loaded into the memory; then starting a crawler module to collect microblog texts in real time;

the crawler process stores the crawled microblog texts in a system database; meanwhile, another process sequentially takes out microblog texts from a system database, generates a summary by using an algorithm module and stores the summary in the database;

when the abnormality occurs in the process, the process where the crawler module and the algorithm module are located is terminated, and the system is quitted.

The final practical operation result of the invention can be seen in fig. 4. According to the abstract generating effect and the generating process in the figure, the microblog text abstract technology realized by the invention is a generating type abstract technology, and the generated sentences can cover key information in microblog texts. The technology can be used as an auxiliary technology to help a user to read quickly and clear the context of an event.

The method for generating the microblog-text-oriented automatic text summarization based on the key information guidance is introduced in detail, the principle and the implementation mode of the method are explained, and the explanation of the embodiment is only used for helping to understand the method and the core idea of the method; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A microblog text-oriented generative automatic text summarization method based on key information guidance is characterized in that:

the method specifically comprises the following steps:

2. The method of claim 1, further comprising: in the step 1, the process is carried out,

the other unnecessary information is a unique label of the microblog platform and comprises an '@' user name, a microblog station internal link, a hypertext link and a microblog emoticon.

3. The method of claim 2, wherein:

in the step 1, the process is carried out,

4. The method of claim 3, further comprising:

in the step 2, the process is carried out,

5. The method of claim 4, wherein:

in step 3, the public data set is an LCTS data set, all data in the data set come from microblog texts, and the data set is preprocessed firstly;

step 3.2: processing the data screened out in the step 3.1 by using the method in the step 2 to obtain a keyword (group) of each microblog text;

6. The method of claim 5, further comprising:

the microblog text encoder includes wordsEmbedding a layer and a layer of bidirectional LSTM network, and obtaining a semantic representation vector c of each moment by combining an attention mechanism _t Specifically, the method comprises the following steps:

Backward direction is shown as

Splicing the forward and backward vectors into a block is marked as the expression of the word when the decoding time is t

Wherein v, W _h ，W _s B are all learnable parameters, s _t Is time tThe output result of the decoder;

the keyword(s) encoder uses the pre-trained model albert-tiny as the Embedding layer,

Modifying the preamble p _gen The generation method of (1):

p _gen ＝sigmoid(W _s ·s _t +W _h ·c _t +W _x ·x _t +W _k ·key _t +b)

wherein W _k Is a trainable parameter;

the decoder maps the words to the previous time instant into a vector y _t-1 Then the vector and the semantic vector c of the microblog text at the previous moment are combined _t-1 Splicing to obtain the input x of the current moment t _t ：

x _t ＝[y _t-1 ；c _t-1 ]

Inputting x at the time t _t Inputting into LSTM network of decoder, obtaining its hidden vector representation s _t (ii) a Then the hidden vectors are summed with c _t SplicingObtaining the word distribution P (w) of the current moment through two fully-connected layers:

P(w)＝Dense ₁ (Dense ₂ ([s _t ；c _t ]))

7. The method of claim 6, further comprising:

the history information coverage mechanism: distributing historical attention

wherein W _H Is a trainable parameter;

p _gen ＝sigmoid(W _s ·s _t +W _h ·c _t +W _x ·x _t +b)

P(w)＝p _gen ·P(w)+(1-p _gen )·H _t

wherein W _s ，W _h ，W _x And b are trainable parameters.

8. The method of claim 7, further comprising: in the step 3, the process is carried out,

Step 3.6: training is carried out by combining a Self-critic reinforcement learning method, and the loss function of the training is set as loss:

loss＝α·loss _sl +(1-α)loss _rl

9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.

10. A computer readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 8.