CN115034217A - Microblog text-oriented generative automatic text summarization method based on key information guidance - Google Patents


Info

Publication number
CN115034217A
CN115034217A
Authority
CN
China
Prior art keywords
text
microblog
keyword
training
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210608239.1A
Other languages
Chinese (zh)
Inventor
赵铁军
郭常江
杨沐昀
朱聪慧
徐冰
曹海龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210608239.1A priority Critical patent/CN115034217A/en
Publication of CN115034217A publication Critical patent/CN115034217A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9035Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a key-information-guided generative automatic text summarization method for microblog text. The method first cleans the microblog text, removing redundant and other non-essential information; it then obtains keywords and key phrases from the microblog text through a key information extraction module; next, a dedicated deep learning neural network is designed for the task and trained on a public dataset; finally, the processed microblog text and the key information are taken as input, and the key information guides summary generation to produce the final summary. The invention aims to improve the precision of summaries generated from microblog text, and thereby the accuracy of content retrieval when a public opinion analysis system analyzes microblog text, covering the main information of a microblog post more concisely and accurately and saving the time of reading the full text manually.

Description

Microblog text-oriented generative automatic text summarization method based on key information guidance
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to a microblog text-oriented generative automatic text summarization method based on key information guidance.
Background
Automatic text summarization is a method of semantic compression and key information extraction for a text. It is commonly used to assist reading and to filter redundant information, and has recently also been applied in public opinion monitoring systems to provide more comprehensive analysis of how events develop.
The 21st century can fairly be called the age of the internet, and the convenience of the internet has changed many aspects of daily life; among these, the change in reading habits is second only to the rise of e-commerce and affects everyone. Because mobile phones are portable and easy to use, people who once read print media now willingly read on electronic media, and news websites and articles have multiplied endlessly, so that tens of thousands of articles appear every day even as reading becomes more convenient. The microblog platform in particular, as a social network, is without doubt one of the most active online media today. Many online events first come to light on microblogs; however, for an ordinary reader with no background, following a public opinion event from its beginning takes a great deal of time, so a tool is needed that lets the public quickly grasp the outline of such an event;
on the other hand, large enterprises, government departments and the like are all building domain-specific or general internet public opinion monitoring systems to respond to emergencies. As a highly successful social medium, microblog is the first choice of every public opinion monitoring system, yet the sheer volume of microblog text places an invisible burden on such systems, so they too need automatic text summarization technology for short texts of the microblog type.
Current automatic text summarization technology is maturing, but several aspects can still be improved:
first, existing automatic text summarization technology is mainly extractive, with generative methods playing a secondary role; that is, most approaches use a model to rank the sentences of a text and extract them by importance to form the summary, which leads to incoherence between sentences;
second, the text produced by generative summarization is limited by the vocabulary, so the "unknown word" phenomenon occurs, producing gaps and incoherence in the generated sentences;
third, generative summarization offers no way to customize the output for the user at generation time; that is, the generated summary is fixed.
Disclosure of Invention
The invention provides a microblog text-oriented generative automatic text summarization method based on key information guidance, which combines a pre-trained model, keyword extraction technology and a reinforcement learning method, and aims to solve the problems that generated summaries have incoherent sentences, are prone to gaps, and cannot be customized.
The invention is realized by the following technical scheme:
a microblog text-oriented generative automatic text summarization method based on key information guidance comprises the following steps:
the method specifically comprises the following steps:
step 1: cleaning the microblog text to remove redundant information and other unnecessary information;
step 2: obtaining key words (groups) in the microblog text through a key information extraction module;
and step 3: designing an automatic microblog text abstract model based on a deep learning neural network, and training the model by using a public data set;
and 4, step 4: and (3) inputting the microblog text cleaned in the step one and the keyword (group) obtained in the step two into the model trained in the step 3, and using the key information to guide abstract generation to obtain a final abstract result.
Further, in step 1,
the other non-essential information consists of tags peculiar to the microblog platform, including "@" user names, microblog in-site links, hypertext links and microblog emoticons.
Further, in step 1,
step 1.1: cleaning the acquired microblog text with regular expressions, retaining Chinese, English and numeric characters, and removing useless microblog user names, microblog in-site links, hypertext links, emoticons, spaces and other non-text characters;
step 1.2: converting the source text to simplified Chinese with a library function of the Python programming language, changing traditional Chinese characters into their simplified forms; this step is skipped if the original text contains no traditional characters.
Further, in step 2,
step 2.1: segmenting the text obtained in step 1 with a word segmentation tool to obtain the segmentation result;
step 2.2: obtaining the candidate keywords (key phrases) to be extracted by combining the syntactic parse tree with the configured set of parts of speech to retain;
step 2.3: counting position and frequency information of the candidate keywords (key phrases);
step 2.4: computing embeddings of the texts obtained in step 1 and step 2.2 with a pre-trained word embedding model to obtain keyword distribution 1;
step 2.5: applying the texts of step 1 and step 2.2 and the embedded representations of step 2.4 to the graph model TextRank to obtain keyword distribution 2;
step 2.6: fusing keyword distributions 1 and 2 from steps 2.4 and 2.5 into the final keyword distribution, and selecting the top 10 words as the keywords (key phrases).
Further, in step 3, the public dataset is the LCSTS dataset, whose data all come from microblog text, and the dataset is preprocessed first;
step 3.1: each summary in the dataset carries a score reflecting how accurately it represents its source text; for the training set all scored data are retained, while for the validation and test sets only data with a score of at least 3 are retained;
step 3.2: processing the data screened in step 3.1 with the method of step 2 to obtain the keywords (key phrases) of each microblog text;
step 3.3: combining the keywords (key phrases) obtained in step 3.2 with the corresponding microblog text and summary into a new record, finally obtaining a new dataset.
Further, in step 3, the automatic microblog text summarization model comprises a microblog text encoder, a keyword (key phrase) encoder and a decoder;
the microblog text encoder comprises a word embedding layer and a two-way LSTM network, and semantic expression vectors c of all moments are obtained by combining an attention mechanism t Specifically, the method comprises the following steps:
mapping each word segmentation result in the step 2.1 through an Embedding layer to obtain a vector Embedding i Wherein i represents the ith word in the sentence;
vector Embedding i Inputting into a layer of bidirectional LSTM to obtain the expression of front and back semantics, and recording the forward expression as
Figure BDA0003672300590000031
Backward direction is shown as
Figure BDA0003672300590000032
Splicing the forward and backward vectorsA block being denoted as the representation of the word at decoding time t
Figure BDA0003672300590000033
Figure BDA0003672300590000034
Calculating the attention score of the current moment and the vector representation c of the whole microblog text at each moment t
Figure BDA0003672300590000035
Figure BDA0003672300590000036
Figure BDA0003672300590000037
Wherein v, W h ,W s B are all learnable parameters, s t Is the output result of the decoder at the moment t;
the keyword (key phrase) encoder uses the pre-trained model albert-tiny as its embedding layer.

The keyword (key phrase) embeddings $k_i$ pass through a layer of selective gating network to obtain importance scores $\mathrm{score}_i$, which are combined with the word embeddings to obtain the semantic center vector of the keywords (key phrases) at step $t$:

$\mathrm{score}_i^t = \sigma(W [k_i; s_t] + b)$

$\alpha_i^t = \mathrm{score}_i^t / \sum_j \mathrm{score}_j^t$

$key_t = \sum_i \alpha_i^t k_i$

where $W$ and $b$ are trainable parameters and $s_t$ is the state vector representation of the decoder at step $t$.

The generation of the pointer probability $p_{gen}$ (introduced with the pointer generation mechanism below) is modified to incorporate the keyword vector:

$p_{gen} = \mathrm{sigmoid}(W_s s_t + W_h c_t + W_x x_t + W_k key_t + b)$

where $W_k$ is a trainable parameter;
the decoder comprises an embedding layer, a layer of unidirectional LSTM and two fully connected layers;

the decoder maps the word of the previous step to a vector $y_{t-1}$, then concatenates this vector with the semantic vector $c_{t-1}$ of the microblog text at the previous step to obtain the input $x_t$ of the current step $t$:

$x_t = [y_{t-1}; c_{t-1}]$

the input $x_t$ at step $t$ is fed into the decoder LSTM to obtain its hidden vector representation $s_t$; the hidden vector is then concatenated with $c_t$ and passed through the two fully connected layers to obtain the word distribution $P(w)$ of the current step:

$P(w) = \mathrm{Dense}_1(\mathrm{Dense}_2([s_t; c_t]))$

the word $w_t$ with the highest probability under this distribution is selected as the decoding result of the current step.
Further,
in step 3, the automatic microblog text summarization model further comprises a pointer generation mechanism and a historical information coverage mechanism;

the historical information coverage mechanism records the historical attention distributions $a^{t'}$ and sums them to obtain the history information $H_t$, then modifies the attention distribution of the current step according to this information to avoid generating repeated words:

$H_t = \sum_{t'=0}^{t-1} a^{t'}$

$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + W_H H_i^t + b)$

where $W_H$ is a trainable parameter;

the pointer generation mechanism modifies how the word distribution $P(w)$ is produced in the decoder:

$p_{gen} = \mathrm{sigmoid}(W_s s_t + W_h c_t + W_x x_t + b)$

$P(w) = p_{gen} \cdot P_{vocab}(w) + (1 - p_{gen}) \cdot H_t$

where $W_s$, $W_h$, $W_x$ and $b$ are all trainable parameters.
Further, in step 3,
the training process of the automatic microblog text summarization model is as follows:
Step 3.4: conventional supervised training is performed with the pointer generation mechanism, the training loss being set to $loss_{sl}$:

$loss_{sl} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t^*)$

where $w_t^*$ is the $t$-th word of the reference summary;
Step 3.5: training with the historical information coverage mechanism is then enabled, the training loss being set to:

$loss_{cov} = \sum_i \min(a_i^t, H_i^t)$

$loss = loss_{sl} + \lambda \cdot loss_{cov}$

Step 3.6: training is carried out in combination with the self-critical reinforcement learning method, the training loss being set to $loss$:

$loss = \alpha \cdot loss_{sl} + (1 - \alpha) \cdot loss_{rl}$

where $loss_{rl}$ is the reinforcement learning loss and $\alpha$ is a hyperparameter that balances the contributions of the two loss functions.
An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
The invention has the beneficial effects that
The method improves the precision of summaries generated from microblog text, and thereby the accuracy of content retrieval when a public opinion analysis system analyzes microblog text, covering the main information of a microblog post more concisely and accurately and saving the time of reading the full text manually;
the method cleans the crawled microblog text, obtains keywords and key phrases with a key information extraction technique, optimizes the model and generates the summary by means of deep learning, a pre-trained model and a reinforcement learning method, and finally produces the summary result;
the method can be applied in a public opinion analysis system to help users quickly grasp the outline of a public opinion event, or in social media as a platform feature for quickly understanding the overall situation of an event, further improving the user experience. Meanwhile, to satisfy customization needs, keywords can be entered manually so that the summary is generated with emphasis on the parts the user is interested in.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a flow chart of the LCSTS dataset processing of the present invention;
FIG. 3 is a diagram of a model for text word embedding according to the present invention;
fig. 4 is an exemplary diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With reference to fig. 1 to 4.
A microblog text-oriented generative automatic text summarization method based on key information guidance specifically comprises the following steps:
Step 1: cleaning the microblog text to remove redundant information and other non-essential information;
in practical applications, the acquired microblog text contains tags peculiar to the microblog platform, including "@" user names, microblog in-site links, hypertext links, microblog emoticons and the like. Such markup is unnecessary for the subsequent key information extraction and summary generation, and may even degrade the generated summary.
This step clears such content from the acquired microblog text; the clean text obtained greatly facilitates the subsequent processing.
Step 2: obtaining the keywords (key phrases) in the microblog text through a key information extraction module;
like step 1, this is a preparatory step: the clean microblog text from step 1 is fed into a dedicated keyword (key phrase) extraction model to obtain the default keywords of the post. These default keywords are produced without manual intervention; if the user enters keywords manually at this step, the summary generated later is targeted to the user's preferences. The keywords (key phrases) are then applied in step 4 to generate the summary.
Step 3: designing an automatic microblog text summarization model based on a deep learning neural network, and training the model with a public dataset;
the main work of this step is to build a dedicated generative summarization model for microblog text with a deep neural network and then train it on a public microblog dataset to obtain the final application model. Note that this step only needs to be completed once at the beginning, and repeated training is generally not required in later use; if higher accuracy is needed in later applications and enough new data are available, the trained model can be trained further iteratively to achieve better results.
Because the public dataset contains no keyword (key phrase) information, keywords (key phrases) must first be extracted from the texts in the dataset in this step, using the same model as in step 2;
after the data processing and model building are finished, the keywords (key phrases) are modeled with a pre-trained model, a reinforcement learning method is incorporated into training, and the final application model is obtained by tuning.
Step 4: taking the microblog text cleaned in step 1 and the keywords (key phrases) obtained in step 2 as input to the model trained in step 3, and using the key information to guide summary generation to obtain the final summary.
As the last step, the model of step 3 is applied to obtain the final summary; if manually customized keywords are used in this step, a summary reflecting the user's preferences is generated.
In step 1,
the other non-essential information consists of tags peculiar to the microblog platform, including "@" user names, microblog in-site links, hypertext links, microblog emoticons and the like.
Step 1.1: cleaning the acquired microblog text with regular expressions, retaining Chinese, English and numeric characters, and removing useless microblog user names, microblog in-site links, hypertext links, emoticons, spaces and other non-text characters;
step 1.2: converting the source text to simplified Chinese with a library function of the Python programming language, changing traditional Chinese characters into their simplified forms; this step is skipped if the original text contains no traditional characters.
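For illustration only, steps 1.1 and 1.2 can be sketched in Python as follows; the specific regular expressions, the retained punctuation and the choice of the zhconv library are assumptions of this sketch rather than requirements of the method:

```python
import re
import zhconv  # third-party library: pip install zhconv

def clean_weibo_text(text: str) -> str:
    text = re.sub(r"@[\w\-]+[:：]?", " ", text)      # "@" user names
    text = re.sub(r"https?://\S+", " ", text)         # hypertext and in-site links
    text = re.sub(r"\[[^\[\]]{1,8}\]", " ", text)     # microblog emoticons such as [笑]
    # keep Chinese, English and digits (plus basic CJK punctuation, an assumption);
    # this also removes spaces and other non-text characters
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、]", "", text)
    return zhconv.convert(text, "zh-hans")            # step 1.2: traditional -> simplified
```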
In step 2,
step 2.1: segmenting the text obtained in step 1 with a word segmentation tool to obtain the segmentation result;
step 2.2: obtaining the candidate keywords (key phrases) to be extracted by combining the syntactic parse tree with the configured set of parts of speech to retain;
step 2.3: counting position and frequency information of the candidate keywords (key phrases);
step 2.4: computing embeddings of the texts obtained in step 1 and step 2.2 with a pre-trained word embedding model to obtain keyword distribution 1;
step 2.5: applying the texts of step 1 and step 2.2 and the embedded representations of step 2.4 to the graph model TextRank to obtain keyword distribution 2;
step 2.6: fusing keyword distributions 1 and 2 from steps 2.4 and 2.5 into the final keyword distribution, and selecting the top 10 words as the keywords (key phrases). A minimal sketch of these steps follows.
In step 3, the public dataset is the LCSTS dataset, whose data all come from microblog text. Keyword (key phrase) information was not considered when the dataset was constructed, so to adapt it to the invention the dataset must first be preprocessed;
step 3.1: each summary in the dataset carries a score reflecting how accurately it represents its source text; for the training set all scored data are retained, while for the validation and test sets only data with a score of at least 3 are retained;
step 3.2: processing the data screened in step 3.1 with the method of step 2 to obtain the keywords (key phrases) of each microblog text;
step 3.3: combining the keywords (key phrases) obtained in step 3.2 with the corresponding microblog text and summary into a new record, finally obtaining a new dataset fully applicable to the invention.
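A sketch of steps 3.1 to 3.3 under assumed field names (records with "text", "summary" and a human "score"), reusing extract_keywords from the sketch above:

```python
def build_lcsts_split(records, split, word_vectors):
    dataset = []
    for rec in records:   # assumed record layout: {"text": ..., "summary": ..., "score": ...}
        if split in ("valid", "test") and rec.get("score", 5) < 3:
            continue                                   # step 3.1: keep only score >= 3
        dataset.append({
            "text": rec["text"],
            "summary": rec["summary"],
            "keywords": extract_keywords(rec["text"], word_vectors),  # step 3.2
        })                                             # step 3.3: one combined record
    return dataset
```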
In step 3, the automatic microblog text summarization model comprises a microblog text encoder, a keyword (key phrase) encoder and a decoder;
the microblog text encoder comprises a word embedding layer and a two-way LSTM network, and semantic expression vectors c of all moments are obtained by combining an attention mechanism t Specifically, the method comprises the following steps:
mapping each word segmentation result in the step 2.1 through an Embedding layer to obtain a vector Embedding i Wherein i represents the ith word in the sentence;
vector Embedding i Inputting into a layer of bidirectional LSTM to obtain the expression of front and back semantics, and recording the forward expression as
Figure BDA0003672300590000081
Backward direction is shown as
Figure BDA0003672300590000082
Splicing the forward and backward vectors into a block to be marked as the representation of the word when the decoding time is t
Figure BDA0003672300590000083
Figure BDA0003672300590000084
Calculating the attention score of the current moment and the vector representation c of the whole microblog text at each moment t
Figure BDA0003672300590000091
Figure BDA0003672300590000092
Figure BDA0003672300590000093
Wherein v, W h ,W s B are all learnable parameters, s t Is the output result of the decoder at the moment t;
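An illustrative PyTorch sketch of the text encoder and the attention equations above; all layer sizes are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Word embedding + one bidirectional LSTM layer + attention (illustrative sizes)."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.W_h = nn.Linear(2 * hid_dim, hid_dim, bias=False)  # applied to h_i
        self.W_s = nn.Linear(hid_dim, hid_dim, bias=True)       # applied to s_t (b folded in)
        self.v = nn.Linear(hid_dim, 1, bias=False)

    def forward(self, token_ids):
        # h_i = [forward h_i ; backward h_i], shape (batch, src_len, 2*hid_dim)
        h, _ = self.bilstm(self.embedding(token_ids))
        return h

    def attention(self, h, s_t):
        # e_i^t = v^T tanh(W_h h_i + W_s s_t + b); a^t = softmax(e^t); c_t = sum_i a_i^t h_i
        e = self.v(torch.tanh(self.W_h(h) + self.W_s(s_t).unsqueeze(1))).squeeze(-1)
        a_t = torch.softmax(e, dim=-1)
        c_t = torch.bmm(a_t.unsqueeze(1), h).squeeze(1)
        return c_t, a_t
```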
the keyword (key phrase) encoder uses the pre-trained model albert-tiny as its embedding layer: having been trained on large-scale Chinese data, the pre-trained model carries semantic knowledge, and since the keywords (key phrases) do not form a continuous sentence, directly using an ordinary embedding layer works poorly.

The keyword (key phrase) embeddings $k_i$ pass through a layer of selective gating network to obtain importance scores $\mathrm{score}_i$, which are combined with the word embeddings to obtain the semantic center vector of the keywords (key phrases) at step $t$:

$\mathrm{score}_i^t = \sigma(W [k_i; s_t] + b)$

$\alpha_i^t = \mathrm{score}_i^t / \sum_j \mathrm{score}_j^t$

$key_t = \sum_i \alpha_i^t k_i$

where $W$ and $b$ are trainable parameters and $s_t$ is the state vector representation of the decoder at step $t$.

The generation of the pointer probability $p_{gen}$ (introduced with the pointer generation mechanism below) is modified to incorporate the keyword vector:

$p_{gen} = \mathrm{sigmoid}(W_s s_t + W_h c_t + W_x x_t + W_k key_t + b)$

where $W_k$ is a trainable parameter;
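A hedged sketch of the selective gating network, assuming the sigmoid-plus-normalization form written above and the 312-dimensional hidden size of albert-tiny (both assumptions):

```python
class KeywordEncoder(nn.Module):
    """Selective gating over keyword (key phrase) embeddings from albert-tiny."""
    def __init__(self, kw_dim=312, dec_dim=256):  # 312 = assumed albert-tiny hidden size
        super().__init__()
        self.gate = nn.Linear(kw_dim + dec_dim, 1)

    def forward(self, kw_emb, s_t):
        # score_i^t = sigmoid(W [k_i; s_t] + b), then normalized; key_t = sum_i alpha_i^t k_i
        s = s_t.unsqueeze(1).expand(-1, kw_emb.size(1), -1)
        score = torch.sigmoid(self.gate(torch.cat([kw_emb, s], dim=-1)))  # (batch, K, 1)
        alpha = score / score.sum(dim=1, keepdim=True)
        key_t = (alpha * kw_emb).sum(dim=1)     # semantic center vector at step t
        return key_t
```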
the decoder comprises an embedding layer, a layer of unidirectional LSTM and two fully connected layers;

similar to the encoder part, the decoder maps the word of the previous step to a vector $y_{t-1}$, then concatenates this vector with the semantic vector $c_{t-1}$ of the microblog text at the previous step to obtain the input $x_t$ of the current step $t$:

$x_t = [y_{t-1}; c_{t-1}]$

the input $x_t$ at step $t$ is fed into the decoder LSTM to obtain its hidden vector representation $s_t$; the hidden vector is then concatenated with $c_t$ and passed through the two fully connected layers to obtain the word distribution $P(w)$ of the current step:

$P(w) = \mathrm{Dense}_1(\mathrm{Dense}_2([s_t; c_t]))$

the word $w_t$ with the highest probability under this distribution is selected as the decoding result of the current step.
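One decoder step can be sketched as follows; dimensions are again assumed, and the softmax over the vocabulary is made explicit:

```python
class Decoder(nn.Module):
    """Embedding + unidirectional LSTM + two fully connected layers (illustrative sizes)."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, ctx_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim + ctx_dim, hid_dim)
        self.dense2 = nn.Linear(hid_dim + ctx_dim, hid_dim)
        self.dense1 = nn.Linear(hid_dim, vocab_size)

    def step(self, w_prev, c_prev, state):
        # x_t = [y_{t-1}; c_{t-1}]
        x_t = torch.cat([self.embedding(w_prev), c_prev], dim=-1)
        s_t, cell = self.lstm(x_t, state)
        return x_t, s_t, (s_t, cell)

    def vocab_dist(self, s_t, c_t):
        # P(w) = softmax(Dense_1(Dense_2([s_t; c_t])))
        return torch.softmax(self.dense1(self.dense2(torch.cat([s_t, c_t], dim=-1))), dim=-1)
```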
In step 3, the automatic microblog text summarization model further comprises a pointer generation mechanism and a historical information coverage mechanism.

The historical information coverage mechanism records the historical attention distributions $a^{t'}$ and sums them to obtain the history information $H_t$, then modifies the attention distribution of the current step according to this information to avoid generating repeated words:

$H_t = \sum_{t'=0}^{t-1} a^{t'}$

$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + W_H H_i^t + b)$

where $W_H$ is a trainable parameter.

To relax the limitation imposed by the vocabulary size, the pointer generation mechanism lets the decoder output words seen in the source text; it modifies how the word distribution $P(w)$ is produced in the decoder:

$p_{gen} = \mathrm{sigmoid}(W_s s_t + W_h c_t + W_x x_t + b)$

$P(w) = p_{gen} \cdot P_{vocab}(w) + (1 - p_{gen}) \cdot H_t$

where $W_s$, $W_h$, $W_x$ and $b$ are all trainable parameters.
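A sketch of the pointer and coverage computations following the formulas above; scattering the accumulated attention $H_t$ onto the vocabulary ids of the source words is an implementation assumption of this sketch:

```python
def pointer_prob(W_s, W_h, W_x, W_k, s_t, c_t, x_t, key_t):
    # p_gen = sigmoid(W_s s_t + W_h c_t + W_x x_t + W_k key_t + b);
    # each W_* is an nn.Linear(..., 1), with the bias b folded into W_s
    return torch.sigmoid(W_s(s_t) + W_h(c_t) + W_x(x_t) + W_k(key_t))

def final_dist(P_vocab, attn_history, src_ids, p_gen):
    # H_t: sum of the attention distributions of all previous steps, shape (batch, src_len);
    # attn_history is assumed non-empty here
    H_src = torch.stack(attn_history, dim=0).sum(dim=0)
    # scatter the accumulated attention onto the vocabulary ids of the source words
    H_t = torch.zeros_like(P_vocab).scatter_add(1, src_ids, H_src)
    # P(w) = p_gen * P_vocab(w) + (1 - p_gen) * H_t
    return p_gen * P_vocab + (1.0 - p_gen) * H_t
```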
In step 3,
the training process of the automatic microblog text summarization model is as follows:
Step 3.4: conventional supervised training is performed with the pointer generation mechanism, the training loss being set to $loss_{sl}$:

$loss_{sl} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t^*)$

where $w_t^*$ is the $t$-th word of the reference summary;
Step 3.5: training with the historical information coverage mechanism is then enabled, the training loss being set to:

$loss_{cov} = \sum_i \min(a_i^t, H_i^t)$

$loss = loss_{sl} + \lambda \cdot loss_{cov}$

Step 3.6: training is carried out in combination with the self-critical reinforcement learning method, the training loss being set to $loss$:

$loss = \alpha \cdot loss_{sl} + (1 - \alpha) \cdot loss_{rl}$

where $loss_{rl}$ is the reinforcement learning loss and $\alpha$ is a hyperparameter that balances the contributions of the two loss functions.
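The text does not expand $loss_{rl}$; a common self-critical form, given here as an assumption rather than the patent's definition, is:

```latex
% Common self-critical (SCST) form of the RL loss -- an assumption, since the
% patent does not expand loss_rl. Here y^s is a summary sampled from the model,
% \hat{y} the greedily decoded baseline, and r(.) a reward such as a ROUGE score:
\mathrm{loss}_{rl} = -\bigl(r(y^{s}) - r(\hat{y})\bigr)\,\sum_{t}\log P\bigl(y^{s}_{t}\mid y^{s}_{<t},\, x\bigr)
```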
An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
Examples
Following the steps above, a simple automatic microblog text summarization module can be implemented; the module can be embedded into any existing system in plug-and-play fashion. The beneficial effects of the invention are verified as follows:
this example was carried out according to the scheme shown in FIG. 1. The system built with the invention consists of two parts: a data acquisition part and a summary generation part. The data acquisition part collects microblog text from the microblog platform and stores it in a database or file; the summary generation part comprises three stages: data cleaning, extraction of candidate keywords (key phrases), and summary generation.
Fig. 4 is a demonstration example result.
After the system starts, the pre-trained model albert-tiny is loaded into memory, and a crawler module is then started to collect microblog text in real time;
the crawler process stores the crawled microblog text in the system database; meanwhile, another process takes microblog texts out of the system database in turn, generates summaries with the algorithm module, and stores them in the database;
if an exception occurs in either process, the processes hosting the crawler module and the algorithm module are terminated and the system exits.
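For illustration, the summarization process of the embodiment can be outlined as follows; the database interface (fetch_next_weibo, save_summary) and the model.summarize call are hypothetical names, not part of the patent:

```python
def summarizer_loop(db, model, word_vectors):
    """Second process of the embodiment: drain the database and store summaries."""
    while True:
        post = db.fetch_next_weibo()   # hypothetical DB call: next crawled post, or None
        if post is None:
            break
        text = clean_weibo_text(post)                     # step 1: cleaning
        keywords = extract_keywords(text, word_vectors)   # step 2: key information
        summary = model.summarize(text, keywords)         # steps 3-4: guided generation
        db.save_summary(post, summary)
```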
The practical results of the invention can be seen in FIG. 4. As the summary generation effect and process in the figure show, the microblog text summarization technique implemented by the invention is generative, and the generated sentences cover the key information of the microblog text. The technique can serve as an aid that helps users read quickly and untangle the thread of an event.
The microblog text-oriented generative automatic text summarization method based on key information guidance provided by the invention has been described in detail above; the principle and implementation of the invention have been explained, and the description of the embodiments is only intended to help readers understand the method and its core idea. Meanwhile, those skilled in the art may vary the specific embodiments and the application scope according to the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A microblog text-oriented generative automatic text summarization method based on key information guidance is characterized in that:
the method specifically comprises the following steps:
step 1: cleaning the microblog text to remove redundant information and other non-essential information;
step 2: obtaining the keywords (key phrases) in the microblog text through a key information extraction module;
step 3: designing an automatic microblog text summarization model based on a deep learning neural network, and training the model with a public dataset;
step 4: taking the microblog text cleaned in step 1 and the keywords (key phrases) obtained in step 2 as input to the model trained in step 3, and using the key information to guide summary generation to obtain the final summary.
2. The method of claim 1, wherein: in step 1,
the other non-essential information consists of tags peculiar to the microblog platform, including "@" user names, microblog in-site links, hypertext links and microblog emoticons.
3. The method of claim 2, wherein:
in step 1,
step 1.1: cleaning the acquired microblog text with regular expressions, retaining Chinese, English and numeric characters, and removing useless microblog user names, microblog in-site links, hypertext links, emoticons, spaces and other non-text characters;
step 1.2: converting the source text to simplified Chinese with a library function of the Python programming language, changing traditional Chinese characters into their simplified forms; this step is skipped if the original text contains no traditional characters.
4. The method of claim 3, wherein:
in step 2,
step 2.1: segmenting the text obtained in step 1 with a word segmentation tool to obtain the segmentation result;
step 2.2: obtaining the candidate keywords (key phrases) to be extracted by combining the syntactic parse tree with the configured set of parts of speech to retain;
step 2.3: counting position and frequency information of the candidate keywords (key phrases);
step 2.4: computing embeddings of the texts obtained in step 1 and step 2.2 with a pre-trained word embedding model to obtain keyword distribution 1;
step 2.5: applying the texts of step 1 and step 2.2 and the embedded representations of step 2.4 to the graph model TextRank to obtain keyword distribution 2;
step 2.6: fusing keyword distributions 1 and 2 from steps 2.4 and 2.5 into the final keyword distribution, and selecting the top 10 words as the keywords (key phrases).
5. The method of claim 4, wherein:
in step 3, the public dataset is the LCSTS dataset, whose data all come from microblog text, and the dataset is preprocessed first;
step 3.1: each summary in the dataset carries a score reflecting how accurately it represents its source text; for the training set all scored data are retained, while for the validation and test sets only data with a score of at least 3 are retained;
step 3.2: processing the data screened in step 3.1 with the method of step 2 to obtain the keywords (key phrases) of each microblog text;
step 3.3: combining the keywords (key phrases) obtained in step 3.2 with the corresponding microblog text and summary into a new record, finally obtaining a new dataset.
6. The method of claim 5, wherein:
in step 3, the automatic microblog text summarization model comprises a microblog text encoder, a keyword (key phrase) encoder and a decoder;

the microblog text encoder comprises a word embedding layer and a layer of bidirectional LSTM, and obtains the semantic representation vector $c_t$ of each decoding step by combining an attention mechanism; specifically:

each segmented word from step 2.1 is mapped through an embedding layer to a vector $\mathrm{Embedding}_i$, where $i$ denotes the $i$-th word of the sentence;

the vectors $\mathrm{Embedding}_i$ are fed into one layer of bidirectional LSTM to obtain representations of the preceding and following context, the forward representation being denoted $\overrightarrow{h_i}$ and the backward representation $\overleftarrow{h_i}$; the forward and backward vectors are concatenated as the representation of the word used at decoding step $t$:

$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$

the attention scores of the current step and the vector representation $c_t$ of the whole microblog text at each step are computed as:

$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b)$

$a^t = \mathrm{softmax}(e^t)$

$c_t = \sum_i a_i^t h_i$

where $v$, $W_h$, $W_s$ and $b$ are all learnable parameters and $s_t$ is the output of the decoder at step $t$;
the keyword (key phrase) encoder uses the pre-trained model albert-tiny as its embedding layer;

the keyword (key phrase) embeddings $k_i$ pass through a layer of selective gating network to obtain importance scores $\mathrm{score}_i$, which are combined with the word embeddings to obtain the semantic center vector of the keywords (key phrases) at step $t$:

$\mathrm{score}_i^t = \sigma(W [k_i; s_t] + b)$

$\alpha_i^t = \mathrm{score}_i^t / \sum_j \mathrm{score}_j^t$

$key_t = \sum_i \alpha_i^t k_i$

where $W$ and $b$ are trainable parameters and $s_t$ is the state vector representation of the decoder at step $t$;

the generation of the pointer probability $p_{gen}$ is modified to incorporate the keyword vector:

$p_{gen} = \mathrm{sigmoid}(W_s s_t + W_h c_t + W_x x_t + W_k key_t + b)$

where $W_k$ is a trainable parameter;
the decoder comprises an embedding layer, a layer of unidirectional LSTM and two fully connected layers;

the decoder maps the word of the previous step to a vector $y_{t-1}$, then concatenates this vector with the semantic vector $c_{t-1}$ of the microblog text at the previous step to obtain the input $x_t$ of the current step $t$:

$x_t = [y_{t-1}; c_{t-1}]$

the input $x_t$ at step $t$ is fed into the decoder LSTM to obtain its hidden vector representation $s_t$; the hidden vector is then concatenated with $c_t$ and passed through the two fully connected layers to obtain the word distribution $P(w)$ of the current step:

$P(w) = \mathrm{Dense}_1(\mathrm{Dense}_2([s_t; c_t]))$

the word $w_t$ with the highest probability under this distribution is selected as the decoding result of the current step.
7. The method of claim 6, wherein:
in step 3, the automatic microblog text summarization model further comprises a pointer generation mechanism and a historical information coverage mechanism;

the historical information coverage mechanism records the historical attention distributions $a^{t'}$ and sums them to obtain the history information $H_t$, then modifies the attention distribution of the current step according to this information to avoid generating repeated words:

$H_t = \sum_{t'=0}^{t-1} a^{t'}$

$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + W_H H_i^t + b)$

where $W_H$ is a trainable parameter;

the pointer generation mechanism modifies how the word distribution $P(w)$ is produced in the decoder:

$p_{gen} = \mathrm{sigmoid}(W_s s_t + W_h c_t + W_x x_t + b)$

$P(w) = p_{gen} \cdot P_{vocab}(w) + (1 - p_{gen}) \cdot H_t$

where $W_s$, $W_h$, $W_x$ and $b$ are all trainable parameters.
8. The method of claim 7, wherein: in step 3,
the training process of the automatic microblog text summarization model is as follows:
step 3.4: conventional supervised training is performed with the pointer generation mechanism, the training loss being set to $loss_{sl}$:

$loss_{sl} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t^*)$

where $w_t^*$ is the $t$-th word of the reference summary;
step 3.5: training with the historical information coverage mechanism is then enabled, the training loss being set to:

$loss_{cov} = \sum_i \min(a_i^t, H_i^t)$

$loss = loss_{sl} + \lambda \cdot loss_{cov}$

step 3.6: training is carried out in combination with the self-critical reinforcement learning method, the training loss being set to $loss$:

$loss = \alpha \cdot loss_{sl} + (1 - \alpha) \cdot loss_{rl}$

where $loss_{rl}$ is the reinforcement learning loss and $\alpha$ is a hyperparameter that balances the contributions of the two loss functions.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 8 when executing the computer program.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 8.
CN202210608239.1A 2022-05-31 2022-05-31 Microblog text-oriented generative automatic text summarization method based on key information guidance Pending CN115034217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210608239.1A CN115034217A (en) 2022-05-31 2022-05-31 Microblog text-oriented generative automatic text summarization method based on key information guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210608239.1A CN115034217A (en) 2022-05-31 2022-05-31 Microblog text-oriented generative automatic text summarization method based on key information guidance

Publications (1)

Publication Number Publication Date
CN115034217A true CN115034217A (en) 2022-09-09

Family

ID=83122656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210608239.1A Pending CN115034217A (en) 2022-05-31 2022-05-31 Microblog text-oriented generative automatic text summarization method based on key information guidance

Country Status (1)

Country Link
CN (1) CN115034217A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination