CN115438654B - Article title generation method and device, storage medium and electronic equipment - Google Patents


Publication number
CN115438654B
CN115438654B (application CN202211383959.9A)
Authority
CN
China
Prior art keywords
title
article
target
candidate
model
Prior art date
Legal status
Active
Application number
CN202211383959.9A
Other languages
Chinese (zh)
Other versions
CN115438654A (en)
Inventor
熊汉卿
阙越
谭林丰
郝书乐
Current Assignee
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202211383959.9A
Publication of CN115438654A
Application granted
Publication of CN115438654B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an article title generation method and device, a storage medium, and electronic equipment. The generation method comprises: extracting a target abstract from a target article with a text summarization algorithm; generating first candidate article titles from a pre-trained title generation model and the target abstract; generating second candidate article titles from the title generation model and the target article; and calculating the title matching degree between the first and second candidate article titles, then determining the target article title from the first candidates according to that matching degree. By matching the first candidate titles (generated from the target abstract) against the second candidate titles (generated from the full target article), the method selects from the first candidates the title that best fits the article's content as the target article title, thereby improving the accuracy of article title generation.

Description

Article title generation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of text processing, in particular to an article title generation method, an article title generation device, a storage medium and electronic equipment.
Background
Titles and abstracts matter greatly to the writing of an article, yet it is not trivial to devise a title that is both attractive and faithful to the article's content, or to extract or generate an abstract that matches the article's subject. Abstract generation requires compressing and summarizing a long original text into a short, concise passage. Title generation goes one step further: on top of abstract generation, the text must be distilled again and given an appropriate style.
The traditional approach of manual reading and summarizing is inefficient, strongly influenced by the author's subjectivity, and prone to misjudgment caused by subjective factors, so the resulting titles and abstracts are often not accurate enough. In natural language processing, the current mainstream automatic title and abstract generation methods fall into two classes: extractive and abstractive. Extractive methods assemble an abstract from key sentences of the original text and therefore risk losing information; abstractive methods re-express the original text in new language after understanding it, but struggle to keep the generated abstract or title fluent, and their quality is hard to guarantee when the original text is very long.
Disclosure of Invention
The invention aims to provide an article title generation method and device, a storage medium, and electronic equipment, so as to improve the accuracy of article title generation.
The invention provides an article title generation method, which comprises the following steps:
extracting a target abstract from a target article according to a text abstract algorithm;
generating a first candidate article title based on a pre-trained title generation model and the target abstract;
generating a second candidate article title based on the title generation model and the target article;
calculating the title matching degree of the first candidate article title and the second candidate article title, and determining a target article title from the first candidate article title according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the format of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
The article title generation method provided by the invention has the following beneficial effects. The method generates first candidate article titles from the title generation model and the target abstract, generates second candidate article titles from the title generation model and the target article, and performs a matching calculation between the two sets of candidates. From the result, the first candidate title that best fits the content of the target article is taken as the target article title, which improves the accuracy of article title generation.
In addition, the article title generation method provided by the invention can also have the following additional technical characteristics:
further, the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
and calculating the title matching degree of the first candidate article title and the second candidate article title, and taking the first candidate article title with the highest matching degree with the second candidate article title as a target article title.
Further, the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
calculating the title matching degree of the first candidate article titles and the second candidate article titles, and calculating the title fluency of the first candidate article titles;
and determining the target article title from the first candidate article titles according to the title matching degree and the title fluency.
Further, the step of inputting the preprocessed input data into the improved GPT-2 model and performing training to obtain a pre-trained title generation model includes:
inputting the input data into an improved GPT-2 model, outputting each predicted token value by the improved GPT-2 model, calculating a loss value of the improved GPT-2 model according to the predicted token value and an original token value, and continuously optimizing the improved GPT-2 model according to the loss value to obtain a pre-trained title generation model.
Further, the step of extracting the target abstract from the target article according to the text abstract algorithm comprises:
calculating the total character length and the number of sentences of the target article, and calculating the abstract length from the total character length and the number of sentences;
and calculating, with the TextRank algorithm, the weight of each sentence within the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight ranking and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
Further, the step of generating a second candidate article title based on the title generation model and the target article is as follows:
importing a target article into a pre-trained title generation model to obtain a predicted title list;
calculating the perplexity of each predicted title in the predicted title list with Kenlm, sorting the predicted titles in ascending order of perplexity, and taking the predicted titles whose perplexity is below a preset threshold as second candidate article titles.
The invention also provides an article title generation device, comprising:
the extraction module is used for extracting a target abstract from a target article according to a text abstract algorithm;
the first generation module is used for generating a first candidate article title based on a pre-trained title generation model and the target abstract;
a second generation module, configured to generate a second candidate article title based on the title generation model and the target article;
the calculation module is used for calculating the title matching degree of the first candidate article title and the second candidate article title and determining a target article title from the first candidate article title according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the format of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
The present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the article title generation method as described above.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the article title generation method.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an article title generation method of a first embodiment of the present invention;
FIG. 2 is a flowchart of an article title generation method of a second embodiment of the present invention;
fig. 3 is a block diagram showing the structure of an article title generating apparatus according to a third embodiment of the present invention;
reference numerals:
10. an extraction module; 20. a first generation module; 30. a second generation module; 40. and a calculation module.
Detailed Description
In order to make the objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. Several embodiments of the invention are presented in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Referring to fig. 1 and 3, a first embodiment of the present invention provides a method for generating an article title, including steps S101 to S104:
s101, extracting a target abstract from a target article according to a text abstract algorithm;
the target article is a tested article which is to automatically generate a title by adopting the method of the invention, and the target article corresponding to the target article is obtained by the tested target article through the existing text summarization algorithm, including but not limited to a TextRank algorithm, a hash algorithm, an MD5 algorithm and the like.
The TextRank algorithm is an existing text sorting algorithm, and can extract key words and key word groups of a given text and extract key sentences of the text by using an extraction type automatic abstract method.
The TextRank algorithm constructs a network from the co-occurrence relations among words; the edges of this network are undirected and weighted. The importance conveyed by the edges between two nodes can be computed with the TextRank algorithm, so the sentences carrying the most important information are obtained by ranking, and these important sentences are spliced into an abstract in their original order in the text.
In this embodiment, in the step S101, the step of extracting the target abstract from the target article according to the text abstract algorithm specifically includes:
step S1011, calculating the total character length and the sentence number of a target article, and calculating the abstract length according to the total character length and the sentence number of the target article;
for example, if the length of the article is smaller than the input (512 characters) of the title generation model, the article is not processed and is directly used as the target abstract, otherwise, the TextRank algorithm is used to calculate the weight (contribution) of each sentence in the article to the whole article, on the premise that the sum of the lengths of the extracted clauses is not more than 512, the clauses are sorted according to the weight, and the clauses are spliced into the target abstract according to the sequence of the clauses in the original article.
Step S1012, calculating, with the TextRank algorithm, the weight of each sentence within the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight ranking and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
Specifically, following the TextRank algorithm, a network is constructed over the article from the co-occurrence relations among words, and the t most important words are taken from it as the top-t keywords; these keywords are marked in the original article and key phrases are extracted. The TextRank value of each keyword in a sentence is computed iteratively, the TextRank value of each sentence in the article is computed and ranked, sentences of suitable length are taken in descending order (with the total character count capped at 512), and they are re-ordered by their position in the article to obtain the article's abstract.
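As an illustration of the extraction step above, the following is a minimal pure-Python sketch of TextRank-based sentence selection. The function and parameter names are ours; the similarity measure and the damping factor d = 0.85 follow the original TextRank formulation rather than any value stated in the patent.

```python
import math

def similarity(a, b):
    # TextRank co-occurrence similarity between two tokenised sentences.
    wa, wb = set(a), set(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_summary(sentences, max_chars=512, d=0.85, iters=50):
    """Rank sentences with TextRank and keep the top-weighted ones whose
    combined length fits max_chars, re-joined in their original order."""
    if sum(len(s) for s in sentences) <= max_chars:
        return " ".join(sentences)          # short article: use it as-is
    tokens = [s.split() for s in sentences]
    n = len(sentences)
    sim = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iters):                  # power iteration on the sentence graph
        scores = [(1 - d) + d * sum(sim[j][i] / out_sum[j] * scores[j]
                                    for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:                        # greedy selection under the length cap
        if used + len(sentences[i]) <= max_chars:
            chosen.append(i)
            used += len(sentences[i])
    return " ".join(sentences[i] for i in sorted(chosen))
```

The final `sorted(chosen)` realises the re-ordering described above: high-weight sentences are kept, but emitted in their original article order.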
By adding the TextRank-Rouge algorithm to the data preprocessing stage, the method reduces the feature loss of the text fed into the model, preserves the integrity of the text information to the greatest extent, and lowers the GPU-memory requirement of subsequent computation, thereby improving the accuracy of abstract generation while reducing the computational cost.
Step S102, generating a first candidate article title based on a pre-trained title generation model and the target abstract;
the present invention is preferably a modified GPT-2 (Generative Pre-Training) model, which is a title generation model based on Pre-Training, and which takes the target abstract obtained in step S101 as an input of the model and outputs a refined sentence as a title for the target abstract.
In this embodiment, the method for training the title generation model in step S102 includes the following steps:
s1021, acquiring an original text set for training, wherein the original text set comprises an original article and an original title;
before training, an original text set is divided into a training set and a verification set, wherein the training set is used for training a model, and the verification set is used for verifying the model.
Step S1022, preprocessing the original text set to obtain input data with a standard format, where the preprocessing is to unify the formats of the original text set;
specifically, the preprocessing comprises character coding standardization, english upper and lower case letter unification, chinese complex and simplified characters unification, and special symbol and space deletion. Through a regular expression, html labels in the article, such as special symbols like "/n", "NBSP", "s" and the like, and continuous redundant spaces are deleted, and input data with a uniform and standard format are obtained.
And S1023, inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
Specifically, the text is converted into the model input as the sum of word embeddings, segment embeddings, and position embeddings; the three encodings are combined into a tensor of shape (n, 3, 512).
An FC layer is added downstream of the GPT-2 model, amplifying the output of its last layer to the size of the dictionary. Based on the MASK mechanism, each token value is predicted from the model's FC output, and the loss between the predicted title tokens and the original tokens is calculated. These operations are iterated, the model from each iteration is saved, and the checkpoint that performs best on the validation set is finally selected.
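The projection performed by the added FC layer can be sketched with NumPy stand-ins. The hidden size 768 matches the standard GPT-2 base model; the vocabulary size 5000, the random tensors, and the placeholder labels are purely illustrative and not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab = 512, 768, 5000    # 768 is GPT-2 base; vocab is illustrative

# Stand-in for the last-layer hidden states of GPT-2 for one 512-token sequence.
h = rng.standard_normal((seq_len, hidden))

# The added FC layer: project each hidden state up to the dictionary size.
W = rng.standard_normal((hidden, vocab)) * 0.02
b = np.zeros(vocab)
logits = h @ W + b                          # shape (seq_len, vocab)

# Softmax over the dictionary, predicted token per position, cross-entropy loss.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
pred_tokens = probs.argmax(axis=-1)
orig_tokens = rng.integers(0, vocab, size=seq_len)   # placeholder reference tokens
loss = -np.log(probs[np.arange(seq_len), orig_tokens]).mean()
```

In real training the loss would be backpropagated through both the FC layer and GPT-2; here the forward pass alone shows the shape of the computation.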
By using the improved GPT-2 model during training and matching the titles generated from the article against the titles generated from the abstract, the method improves both the efficiency of model training and the accuracy of title generation.
step S103, generating a second candidate article title based on the title generation model and the target article;
the headline generation models in step S102 and step S103 are the same model, and the target sentence is input to the headline generation model, and the second candidate sentence headline is output.
In this embodiment, the step S103 specifically includes:
step S1031, importing a target article into the pre-trained title generation model to obtain a predicted title list;
step S1032 is to calculate a perplexity for each predicted title in the predicted title list by Kenlm, sort the perplexity of each predicted title in an ascending order, and take the predicted title with the perplexity smaller than a preset perplexity as a second candidate article title.
Kenlm is a statistical language model toolkit written in C++.
The perplexity is defined as:

$$PP(S) = p(\omega_1 \omega_2 \cdots \omega_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i \mid \omega_1 \omega_2 \cdots \omega_{i-1})}}$$

where S denotes the current sentence; N the sentence length; p(ω_i) the probability of the i-th word; p(ω_i | ω_1 ω_2 ⋯ ω_{i-1}) the probability of the i-th word given the first i-1 words; and PP(S) the perplexity of the sentence, i.e. a measure of its fluency. The lower the perplexity, the more fluent the sentence.
In this step, the perplexity of each predicted title is calculated with the Kenlm tool, and the titles with lower perplexity, i.e. the smoother predicted titles, are selected as the second candidate article titles, which ensures the fluency of the target article title.
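The perplexity computation itself reduces to the following sketch, evaluated in log space for numerical stability. In practice Kenlm supplies these conditional probabilities from its n-gram tables; here they are passed in directly.

```python
import math

def perplexity(word_probs):
    """PP(S) for a sentence of N words, computed in log space.
    word_probs[i] is the language model's p(w_i | w_1 .. w_{i-1})."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```

A two-word sentence with per-word probability 0.5 gets PP = 2, while a sentence the model predicts more confidently (i.e. a more fluent one) scores lower.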
And step S104, calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree.
For each title among the first candidate article titles, the title matching degree against the second candidate article titles is calculated: the Rouge-1, Rouge-2, and Rouge-L scores are computed separately and summed to give the final matching-degree score.
Rouge is an existing method for evaluating a summary based on the co-occurrence of n-grams between summaries; it is a recall-oriented evaluation of n-grams. The Rouge criterion comprises a series of evaluation methods, including Rouge-N (where N is the n of the n-gram, taking values 1, 2, 3, 4), Rouge-L, and so on.
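A self-contained sketch of the matching-degree calculation over tokenised titles: recall-oriented Rouge-1 and Rouge-2 plus an LCS-based Rouge-L, summed as described above. The function names are ours, and production systems would typically use a maintained library such as rouge-score instead.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(cand, ref, n):
    # n-gram recall: reference n-grams that also occur in the candidate,
    # divided by the number of reference n-grams.
    ref_grams = ngrams(ref, n)
    if not ref_grams:
        return 0.0
    cand_set = set(ngrams(cand, n))
    return sum(1 for g in ref_grams if g in cand_set) / len(ref_grams)

def lcs_len(a, b):
    # Longest common subsequence length, for Rouge-L.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def match_score(cand, ref):
    # Matching degree: Rouge-1 + Rouge-2 + Rouge-L, as described above.
    rl = lcs_len(cand, ref) / len(ref) if ref else 0.0
    return rouge_n(cand, ref, 1) + rouge_n(cand, ref, 2) + rl
```

Identical titles score 3.0 (each component equals 1), and fully disjoint titles score 0.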
In this embodiment, step S104 specifically includes:
and calculating the title matching degree of the first candidate article title and the second candidate article title, and taking the first candidate article title with the highest matching degree with the second candidate article title as a target article title.
The method takes the title with the highest matching-degree score, i.e. the one that best fits the content of the target article, as the target article title, thereby improving the accuracy of article title generation.
Referring to fig. 2, a second embodiment of the present invention provides a method for generating a title of an article, including the following steps:
step S201, extracting a target abstract from a target article according to a text abstract algorithm;
step S202, generating a first candidate article title based on a pre-trained title generation model and the target abstract;
step S203, generating a second candidate article title based on the title generation model and the target article;
step S204, calculating the title matching degree of the first candidate article title and the second candidate article title, and calculating the title passing degree of the first candidate article title;
step S205, determining a target article title of the first candidate article title according to the title matching degree and the title passing degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the formats of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
Here Model score denotes the final total matching score, PP(S) the sentence perplexity score, and Score the composite score; the first candidate article title with the highest composite score is determined to be the target article title.
Steps S201 to S203 of the second embodiment of the present invention are the same as steps S101 to S103 of the first embodiment. The difference is that the second embodiment also calculates the fluency of the first candidate article titles and determines the final target article title after jointly considering the title matching degree and the title fluency, so that both the accuracy and the fluency of the resulting target article title are ensured; the fluency is calculated with Kenlm as in step S1032.
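The composite-score formula itself is published only as an image and is not recoverable from this text, so the sketch below shows one plausible combination as an assumption: rewarding a high matching degree while penalising a high perplexity by simple division. The candidate names and numbers are hypothetical.

```python
def composite_score(match, ppl):
    # Assumed combination (the patent's exact formula is not reproduced here):
    # reward a high matching degree, penalise a high perplexity.
    return match / ppl

# Hypothetical candidates: (first candidate title, match score, Kenlm perplexity).
candidates = [("title A", 2.4, 35.0), ("title B", 2.1, 12.0)]
best = max(candidates, key=lambda c: composite_score(c[1], c[2]))
```

Under this assumed formula "title B" wins: its slightly lower matching score is outweighed by its much better fluency.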
Referring to fig. 3, a third embodiment of the present invention provides an article title generating apparatus, including:
the extraction module 10 is used for extracting a target abstract from a target article according to a text abstract algorithm;
a first generation module 20, configured to generate a first candidate article title based on a pre-trained title generation model and the target abstract;
a second generation module 30, configured to generate a second candidate article title based on the title generation model and the target article;
a calculating module 40, configured to calculate a title matching degree between the first candidate article title and the second candidate article title, and determine a target article title from the first candidate article title according to the title matching degree.
In an embodiment of the present invention, the calculating module 40 specifically includes:
and calculating the title matching degree of the first candidate article title and the second candidate article title, and taking the first candidate article title with the highest matching degree with the second candidate article title as a target article title.
In another embodiment of the present invention, the calculation module 40 includes:
calculating the title matching degree of the first candidate article titles and the second candidate article titles, and calculating the title fluency of the first candidate article titles;
and determining the target article title from the first candidate article titles according to the title matching degree and the title fluency.
In an embodiment of the present invention, the extraction module 10 includes:
calculating the total character length and the sentence number of a target article, and calculating the abstract length according to the total character length and the sentence number of the target article;
and calculating, with the TextRank algorithm, the weight of each sentence within the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight ranking and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
In an embodiment of the present invention, the second generating module 30 includes:
importing a target article into a pre-trained title generation model to obtain a predicted title list;
calculating the perplexity of each predicted title in the predicted title list with Kenlm, sorting the predicted titles in ascending order of perplexity, and taking the predicted titles whose perplexity is below a preset threshold as second candidate article titles.
The article title generation method provided by the invention generates first candidate article titles with the title generation model and the target abstract, generates second candidate article titles with the title generation model and the target article, and performs a matching calculation between the two sets of candidates. From the result, the first candidate title that best fits the content of the target article is taken as the target article title, thereby improving the accuracy of article title generation.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims (8)

1. An article title generation method, characterized by comprising the following steps:
extracting a target abstract from a target article according to a text abstract algorithm;
generating a first candidate article title based on a pre-trained title generation model and the target abstract;
generating a second candidate article title based on the title generation model and the target article;
calculating the title matching degree of the first candidate article titles and the second candidate article title, and determining a target article title from the first candidate article titles according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the format of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training it, wherein the improved GPT-2 model adds an FC layer downstream of the GPT-2 model, which amplifies the output of the last layer of the GPT-2 model to the size of the dictionary; each predicted token value is then output from the FC-layer output of the model based on the principle of the MASK mechanism, the loss value of the improved GPT-2 model is calculated from the predicted token values and the original token values, and the improved GPT-2 model is continuously optimized according to the loss value to obtain the pre-trained title generation model.
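The FC-layer modification and loss computation described in claim 1 can be sketched numerically. The dimensions, weights, and token ids below are hypothetical toy values for illustration, not the patent's actual model:

```python
import numpy as np

def fc_project(hidden, W, b):
    # FC layer: amplify the last hidden states (hidden_dim) to dictionary size.
    return hidden @ W + b

def cross_entropy_loss(logits, target_ids):
    # Softmax over the dictionary, then mean negative log-likelihood of the
    # original token values -- the loss used to optimize the improved model.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids]))

rng = np.random.default_rng(0)
hidden_dim, vocab_size, seq_len = 8, 50, 5      # toy sizes, not real ones
hidden = rng.normal(size=(seq_len, hidden_dim)) # stand-in for GPT-2's last layer
W = rng.normal(size=(hidden_dim, vocab_size))
b = np.zeros(vocab_size)

logits = fc_project(hidden, W, b)               # shape (seq_len, vocab_size)
predicted = logits.argmax(axis=-1)              # predicted token values
loss = cross_entropy_loss(logits, rng.integers(0, vocab_size, seq_len))
```

In a real implementation the hidden states would come from the GPT-2 model itself and the loss would drive gradient updates; this sketch only shows the shape of the projection and the loss term.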
2. The article title generation method of claim 1, wherein the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
calculating the title matching degree of the first candidate article titles and the second candidate article title, and taking the first candidate article title having the highest matching degree with the second candidate article title as the target article title.
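Claim 2 does not fix a particular matching measure; a token-overlap (Jaccard) score is one plausible stand-in for the selection it describes:

```python
def matching_degree(a, b):
    # Jaccard overlap between token sets -- an illustrative matching
    # measure; the claim itself does not specify a particular metric.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pick_target_title(first_candidates, second_candidate):
    # Take the first-candidate title with the highest matching degree
    # to the second candidate title, as in claim 2.
    return max(first_candidates,
               key=lambda t: matching_degree(t, second_candidate))
```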
3. The article title generation method of claim 1, wherein the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
calculating the title matching degree of the first candidate article title and the second candidate article title, and calculating the title smoothness of the first candidate article title;
and determining the target article title from the first candidate article titles according to the title matching degree and the title smoothness.
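Claim 3 scores each first-candidate title on both matching degree and smoothness. One way to combine the two criteria is a linear weighting; the weights `alpha` and `beta` below are illustrative, not values fixed by the patent:

```python
def pick_by_match_and_smoothness(candidates, match_scores, smoothness_scores,
                                 alpha=0.5, beta=0.5):
    # Linear combination of matching degree and smoothness; alpha and
    # beta are illustrative weights, not values fixed by the claim.
    combined = [alpha * m + beta * s
                for m, s in zip(match_scores, smoothness_scores)]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best]
```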
4. The article title generation method of claim 1, wherein said step of extracting a target abstract from a target article according to a text abstract algorithm comprises:
calculating the total character length and the sentence number of a target article, and calculating the abstract length according to the total character length and the sentence number of the target article;
and calculating, by using the TextRank algorithm, the weight that each sentence in the target article occupies in the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight order and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
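The TextRank weighting and splicing steps of claim 4 can be sketched as follows. The overlap-based similarity and the fixed iteration count are common TextRank choices, assumed here rather than taken from the patent:

```python
import numpy as np

def sentence_similarity(a, b):
    # Classic TextRank similarity: token overlap normalized by log lengths.
    wa, wb = set(a.split()), set(b.split())
    overlap = len(wa & wb)
    if overlap == 0:
        return 0.0
    return overlap / (np.log(len(wa) + 1) + np.log(len(wb) + 1))

def textrank_weights(sentences, d=0.85, iters=50):
    # Power-iteration PageRank over the sentence-similarity graph.
    n = len(sentences)
    sim = np.array([[sentence_similarity(si, sj) if i != j else 0.0
                     for j, sj in enumerate(sentences)]
                    for i, si in enumerate(sentences)])
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero
    trans = sim / row_sums
    scores = np.ones(n) / n
    for _ in range(iters):
        scores = (1 - d) / n + d * trans.T @ scores
    return scores

def extract_summary(sentences, k):
    # Pick the top-k sentences by weight, then splice them back together
    # in the order they appear in the article, as claim 4 requires.
    w = textrank_weights(sentences)
    top = sorted(np.argsort(w)[::-1][:k])
    return " ".join(sentences[i] for i in top)
```

The abstract length `k` would itself be derived from the article's total character length and sentence count, per the first step of claim 4.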
5. The article title generation method of claim 1, wherein the step of generating a second candidate article title based on the title generation model and the target article comprises:
importing a target article into a pre-trained title generation model to obtain a predicted title list;
calculating the perplexity of each predicted title in the predicted title list through Kenlm, sorting the predicted titles in ascending order of perplexity, and taking the predicted titles whose perplexity is smaller than a preset perplexity as second candidate article titles.
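Claim 5's perplexity filter can be sketched with the scorer injected as a parameter. In practice kenlm's Python bindings expose `model.perplexity(sentence)`, but a stand-in scorer keeps the sketch self-contained and runnable:

```python
def filter_by_perplexity(predicted_titles, perplexity_fn, max_perplexity):
    # Score every predicted title, sort ascending (lower perplexity = more
    # fluent), and keep only titles below the preset threshold (claim 5).
    scored = sorted(((perplexity_fn(t), t) for t in predicted_titles),
                    key=lambda pair: pair[0])
    return [title for ppl, title in scored if ppl < max_perplexity]

# Stand-in scorer for this sketch only: word count as a fake "perplexity".
toy_perplexity = lambda title: float(len(title.split()))
```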
6. An article title generation apparatus, comprising:
the extraction module is used for extracting a target abstract from a target article according to a text abstract algorithm;
the first generation module is used for generating a first candidate article title based on a pre-trained title generation model and the target abstract;
a second generation module, configured to generate a second candidate article title based on the title generation model and the target article;
the calculation module is used for calculating the title matching degree of the first candidate article titles and the second candidate article title and determining a target article title from the first candidate article titles according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the formats of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training it, wherein the improved GPT-2 model adds an FC layer downstream of the GPT-2 model, which amplifies the output of the last layer of the GPT-2 model to the size of the dictionary; each predicted token value is then output from the FC-layer output of the model based on the principle of the MASK mechanism, the loss value of the improved GPT-2 model is calculated from the predicted token values and the original token values, and the improved GPT-2 model is continuously optimized according to the loss value to obtain the pre-trained title generation model.
7. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the article title generation method according to any one of claims 1 to 5.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the article title generation method of any one of claims 1 to 5 when executing the computer program.
CN202211383959.9A 2022-11-07 2022-11-07 Article title generation method and device, storage medium and electronic equipment Active CN115438654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211383959.9A CN115438654B (en) 2022-11-07 2022-11-07 Article title generation method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115438654A CN115438654A (en) 2022-12-06
CN115438654B true CN115438654B (en) 2023-03-24

Family

ID=84252564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211383959.9A Active CN115438654B (en) 2022-11-07 2022-11-07 Article title generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115438654B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022104967A1 (en) * 2020-11-19 2022-05-27 深圳大学 Pre-training language model-based summarization generation method

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2580533B2 (en) * 1994-03-31 1997-02-12 国立衛生試験所長 Transgenic mouse for gene mutation detection
CN101393545A (en) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for implementing automatic abstracting by utilizing association model
US9342561B2 (en) * 2014-01-08 2016-05-17 International Business Machines Corporation Creating and using titles in untitled documents to answer questions
CN105631018B (en) * 2015-12-29 2018-12-18 上海交通大学 Article Feature Extraction Method based on topic model
CN107832299B (en) * 2017-11-17 2021-11-23 北京百度网讯科技有限公司 Title rewriting processing method and device based on artificial intelligence and readable medium
US10839013B1 (en) * 2018-05-10 2020-11-17 Facebook, Inc. Generating a graphical representation of relationships among a set of articles and information associated with the set of articles
CN110866391A (en) * 2019-11-15 2020-03-06 腾讯科技(深圳)有限公司 Title generation method, title generation device, computer readable storage medium and computer equipment
CN111507090A (en) * 2020-02-27 2020-08-07 平安科技(深圳)有限公司 Abstract extraction method, device, equipment and computer readable storage medium
CN111930929B (en) * 2020-07-09 2023-11-10 车智互联(北京)科技有限公司 Article title generation method and device and computing equipment
CN111898369B (en) * 2020-08-17 2024-03-08 腾讯科技(深圳)有限公司 Article title generation method, model training method and device and electronic equipment
CN112231468A (en) * 2020-10-15 2021-01-15 平安科技(深圳)有限公司 Information generation method and device, electronic equipment and storage medium
CN113434664A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Text abstract generation method, device, medium and electronic equipment
CN113378552B (en) * 2021-07-06 2024-04-19 焦点科技股份有限公司 Commodity title generation method based on multi-mode GPT2 model
CN114611498A (en) * 2022-03-18 2022-06-10 腾讯科技(深圳)有限公司 Title generation method, model training method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022104967A1 (en) * 2020-11-19 2022-05-27 深圳大学 Pre-training language model-based summarization generation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
News headline generation model combining an attention mechanism; Li Hui et al.; Journal of Shanxi University (Natural Science Edition); 2017-11-15 (No. 04); full text *

Also Published As

Publication number Publication date
CN115438654A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
US20180267956A1 (en) Identification of reading order text segments with a probabilistic language model
Mohtaj et al. Parsivar: A language processing toolkit for Persian
Eskander et al. Foreign words and the automatic processing of Arabic social media text written in Roman script
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111310470B (en) Chinese named entity recognition method fusing word and word features
WO2001037128A2 (en) A system and iterative method for lexicon, segmentation and language model joint optimization
CN111611810A (en) Polyphone pronunciation disambiguation device and method
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
US11983501B2 (en) Apparatus and method for automatic generation of machine reading comprehension training data
Mukund et al. A vector space model for subjectivity classification in Urdu aided by co-training
CN111368130A (en) Quality inspection method, device and equipment for customer service recording and storage medium
Weerasinghe et al. Feature Vector Difference based Authorship Verification for Open-World Settings.
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
Singh et al. Review of real-word error detection and correction methods in text documents
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
Van Den Bosch Scalable classification-based word prediction and confusible correction
Hao et al. SCESS: a WFSA-based automated simplified chinese essay scoring system with incremental latent semantic analysis
CN112380834A (en) Tibetan language thesis plagiarism detection method and system
JP6495124B2 (en) Term semantic code determination device, term semantic code determination model learning device, method, and program
CN115438654B (en) Article title generation method and device, storage medium and electronic equipment
CN116629238A (en) Text enhancement quality evaluation method, electronic device and storage medium
CN113553853B (en) Named entity recognition method and device, computer equipment and storage medium
US8977538B2 (en) Constructing and analyzing a word graph
JPWO2009113289A1 (en) NEW CASE GENERATION DEVICE, NEW CASE GENERATION METHOD, AND NEW CASE GENERATION PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant