CN115438654B - Article title generation method and device, storage medium and electronic equipment - Google Patents


Publication number
CN115438654B
CN115438654B (application CN202211383959.9A)
Authority
CN
China
Prior art keywords
title
article
target
candidate
model
Prior art date
Legal status
Active
Application number
CN202211383959.9A
Other languages
Chinese (zh)
Other versions
CN115438654A (en)
Inventor
熊汉卿
阙越
谭林丰
郝书乐
Current Assignee
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date
Filing date
Publication date
Application filed by East China Jiaotong University
Priority to CN202211383959.9A
Publication of CN115438654A
Application granted
Publication of CN115438654B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an article title generation method and device, a storage medium, and electronic equipment. The generation method comprises: extracting a target abstract from a target article with a text summarization algorithm; generating first candidate article titles from a pre-trained title generation model and the target abstract; generating second candidate article titles from the title generation model and the target article; and calculating the title matching degree between the first and second candidate article titles, then determining the target article title from the first candidates according to that matching degree. By matching the first candidate titles (generated from the target abstract) against the second candidate titles (generated from the full target article), the method selects from the first candidates the title that best fits the article's content as the target article title, thereby improving the accuracy of article title generation.

Description

Article title generation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of text processing, in particular to an article title generation method, an article title generation device, a storage medium and electronic equipment.
Background
Titles and abstracts matter greatly to the writing of an article, yet it is not trivial to devise a title that is both attractive and faithful to the article's content, or to extract or generate an abstract that matches the article's subject. Abstract generation requires compressing and summarizing a long original text into a short, concise passage. Title generation goes one step further: on top of abstract generation, the text must be distilled again and given an appropriate style.
The traditional approach of manual reading and summarizing is inefficient, strongly influenced by the author's subjectivity, and prone to misjudgment caused by subjective factors, so the resulting titles and abstracts are often not accurate enough. In natural language processing, the current mainstream automatic title and abstract generation methods fall into two classes: extractive and abstractive. Extractive methods assemble an abstract from key sentences of the original text and therefore risk losing information; abstractive methods re-express the original text in new language after understanding it, but struggle to keep the generated abstract or title fluent, and their quality is hard to guarantee when the original text is very long.
Disclosure of Invention
The invention aims to provide an article title generation method and device, a storage medium, and electronic equipment, so as to improve the accuracy of article title generation.
The invention provides an article title generation method, which comprises the following steps:
extracting a target abstract from a target article according to a text abstract algorithm;
generating a first candidate article title based on a pre-trained title generation model and the target abstract;
generating a second candidate article title based on the title generation model and the target article;
calculating the title matching degree of the first candidate article title and the second candidate article title, and determining a target article title from the first candidate article title according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the format of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
The article title generation method provided by the invention has the following beneficial effects. The method generates first candidate article titles from the title generation model and the target abstract, generates second candidate article titles from the title generation model and the target article, and performs a matching calculation between the two sets of candidates. From the result, the first candidate title that best fits the content of the target article is taken as the target article title, which improves the accuracy of article title generation.
In addition, the article title generation method provided by the invention can also have the following additional technical characteristics:
further, the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
and calculating the title matching degree of the first candidate article title and the second candidate article title, and taking the first candidate article title with the highest matching degree with the second candidate article title as a target article title.
Further, the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
calculating the title matching degree of the first candidate article titles and the second candidate article titles, and calculating the title fluency of the first candidate article titles;
and determining the target article title from the first candidate article titles according to the title matching degree and the title fluency.
Further, the step of inputting the preprocessed input data into the improved GPT-2 model and performing training to obtain a pre-trained title generation model includes:
inputting the input data into an improved GPT-2 model, outputting each predicted token value by the improved GPT-2 model, calculating a loss value of the improved GPT-2 model according to the predicted token value and an original token value, and continuously optimizing the improved GPT-2 model according to the loss value to obtain a pre-trained title generation model.
Further, the step of extracting the target abstract from the target article according to the text abstract algorithm comprises:
calculating the total character length and the number of sentences of the target article, and calculating the abstract length from the total character length and the number of sentences;
and calculating, with the TextRank algorithm, the weight of each sentence within the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight ranking and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
Further, the step of generating a second candidate article title based on the title generation model and the target article is as follows:
importing a target article into a pre-trained title generation model to obtain a predicted title list;
calculating the perplexity of each predicted title in the predicted title list with Kenlm, sorting the predicted titles in ascending order of perplexity, and taking the predicted titles whose perplexity is below a preset threshold as second candidate article titles.
The invention also provides an article title generation device, comprising:
the extraction module is used for extracting a target abstract from a target article according to a text abstract algorithm;
the first generation module is used for generating a first candidate article title based on a pre-trained title generation model and the target abstract;
a second generation module, configured to generate a second candidate article title based on the title generation model and the target article;
the calculation module is used for calculating the title matching degree of the first candidate article title and the second candidate article title and determining a target article title from the first candidate article title according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the format of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
The present invention also provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the article title generation method as described above.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the article title generation method.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an article title generation method of a first embodiment of the present invention;
FIG. 2 is a flowchart of an article title generation method of a second embodiment of the present invention;
fig. 3 is a block diagram showing the structure of an article title generating apparatus according to a third embodiment of the present invention;
reference numerals:
10. an extraction module; 20. a first generation module; 30. a second generation module; 40. and a calculation module.
Detailed Description
In order to make the objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. Several embodiments of the invention are presented in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Referring to fig. 1 and 3, a first embodiment of the present invention provides a method for generating an article title, including steps S101 to S104:
s101, extracting a target abstract from a target article according to a text abstract algorithm;
the target article is a tested article which is to automatically generate a title by adopting the method of the invention, and the target article corresponding to the target article is obtained by the tested target article through the existing text summarization algorithm, including but not limited to a TextRank algorithm, a hash algorithm, an MD5 algorithm and the like.
The TextRank algorithm is an existing text sorting algorithm, and can extract key words and key word groups of a given text and extract key sentences of the text by using an extraction type automatic abstract method.
The TextRank algorithm constructs a network from the co-occurrence relations among words; the edges of this network are undirected and weighted. The importance conveyed by the edges between two nodes can be computed with the TextRank algorithm, so the sentences carrying the most important information are obtained by ranking, and these important sentences are spliced into an abstract in their original order in the text.
In this embodiment, in the step S101, the step of extracting the target abstract from the target article according to the text abstract algorithm specifically includes:
step S1011, calculating the total character length and the sentence number of a target article, and calculating the abstract length according to the total character length and the sentence number of the target article;
for example, if the length of the article is smaller than the input (512 characters) of the title generation model, the article is not processed and is directly used as the target abstract, otherwise, the TextRank algorithm is used to calculate the weight (contribution) of each sentence in the article to the whole article, on the premise that the sum of the lengths of the extracted clauses is not more than 512, the clauses are sorted according to the weight, and the clauses are spliced into the target abstract according to the sequence of the clauses in the original article.
Step S1012, calculating, with the TextRank algorithm, the weight of each sentence within the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight ranking and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
Specifically, following the TextRank algorithm, a network is constructed over the article from the co-occurrence relations among words, and the t most important words are taken from it as the top-t keywords; these keywords are marked in the original article and key phrases are extracted. The TextRank value of each keyword in a sentence is computed iteratively, the TextRank value of each sentence in the article is computed and ranked, sentences of suitable length are taken in descending order (with the total character count capped at 512), and they are re-ordered by their position in the article to obtain the article's abstract.
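As an illustration of the extraction step above, the following is a minimal pure-Python sketch of TextRank-based sentence selection. The function and parameter names are ours; the similarity measure and the damping factor d = 0.85 follow the original TextRank formulation rather than any value stated in the patent.

```python
import math

def similarity(a, b):
    # TextRank co-occurrence similarity between two tokenised sentences.
    wa, wb = set(a), set(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_summary(sentences, max_chars=512, d=0.85, iters=50):
    """Rank sentences with TextRank and keep the top-weighted ones whose
    combined length fits max_chars, re-joined in their original order."""
    if sum(len(s) for s in sentences) <= max_chars:
        return " ".join(sentences)          # short article: use it as-is
    tokens = [s.split() for s in sentences]
    n = len(sentences)
    sim = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iters):                  # power iteration on the sentence graph
        scores = [(1 - d) + d * sum(sim[j][i] / out_sum[j] * scores[j]
                                    for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:                        # greedy selection under the length cap
        if used + len(sentences[i]) <= max_chars:
            chosen.append(i)
            used += len(sentences[i])
    return " ".join(sentences[i] for i in sorted(chosen))
```

The final `sorted(chosen)` realises the re-ordering described above: high-weight sentences are kept, but emitted in their original article order.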
By adding the TextRank-Rouge algorithm to the data preprocessing stage, the method reduces the feature loss of the text fed into the model, preserves the integrity of the text information to the greatest extent, and lowers the GPU-memory requirement of subsequent computation, thereby improving the accuracy of abstract generation while reducing the computational cost.
Step S102, generating a first candidate article title based on a pre-trained title generation model and the target abstract;
the present invention is preferably a modified GPT-2 (Generative Pre-Training) model, which is a title generation model based on Pre-Training, and which takes the target abstract obtained in step S101 as an input of the model and outputs a refined sentence as a title for the target abstract.
In this embodiment, the method for training the title generation model in step S102 includes the following steps:
s1021, acquiring an original text set for training, wherein the original text set comprises an original article and an original title;
before training, an original text set is divided into a training set and a verification set, wherein the training set is used for training a model, and the verification set is used for verifying the model.
Step S1022, preprocessing the original text set to obtain input data with a standard format, where the preprocessing is to unify the formats of the original text set;
specifically, the preprocessing comprises character coding standardization, english upper and lower case letter unification, chinese complex and simplified characters unification, and special symbol and space deletion. Through a regular expression, html labels in the article, such as special symbols like "/n", "NBSP", "s" and the like, and continuous redundant spaces are deleted, and input data with a uniform and standard format are obtained.
And S1023, inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
Specifically, the text is converted into the model input as the sum of word embeddings, segment embeddings, and position embeddings; the three encodings are combined into a tensor of shape (n, 3, 512).
An FC layer is added downstream of the GPT-2 model, amplifying the output of its last layer to the size of the dictionary. Based on the MASK mechanism, each token value is predicted from the model's FC output, and the loss between the predicted title tokens and the original tokens is calculated. These operations are iterated, the model from each iteration is saved, and the checkpoint that performs best on the validation set is finally selected.
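The projection performed by the added FC layer can be sketched with NumPy stand-ins. The hidden size 768 matches the standard GPT-2 base model; the vocabulary size 5000, the random tensors, and the placeholder labels are purely illustrative and not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab = 512, 768, 5000    # 768 is GPT-2 base; vocab is illustrative

# Stand-in for the last-layer hidden states of GPT-2 for one 512-token sequence.
h = rng.standard_normal((seq_len, hidden))

# The added FC layer: project each hidden state up to the dictionary size.
W = rng.standard_normal((hidden, vocab)) * 0.02
b = np.zeros(vocab)
logits = h @ W + b                          # shape (seq_len, vocab)

# Softmax over the dictionary, predicted token per position, cross-entropy loss.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
pred_tokens = probs.argmax(axis=-1)
orig_tokens = rng.integers(0, vocab, size=seq_len)   # placeholder reference tokens
loss = -np.log(probs[np.arange(seq_len), orig_tokens]).mean()
```

In real training the loss would be backpropagated through both the FC layer and GPT-2; here the forward pass alone shows the shape of the computation.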
By using the improved GPT-2 model during training and matching the titles generated from the article against the titles generated from the abstract, the method improves both the efficiency of model training and the accuracy of title generation.
step S103, generating a second candidate article title based on the title generation model and the target article;
the headline generation models in step S102 and step S103 are the same model, and the target sentence is input to the headline generation model, and the second candidate sentence headline is output.
In this embodiment, the step S103 specifically includes:
step S1031, importing a target article into the pre-trained title generation model to obtain a predicted title list;
step S1032 is to calculate a perplexity for each predicted title in the predicted title list by Kenlm, sort the perplexity of each predicted title in an ascending order, and take the predicted title with the perplexity smaller than a preset perplexity as a second candidate article title.
Kenlm is a statistical language model toolkit written in C++.
The perplexity is defined as:

$$PP(S) = p(\omega_1 \omega_2 \cdots \omega_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(\omega_i \mid \omega_1 \omega_2 \cdots \omega_{i-1})}}$$

where S denotes the current sentence; N the sentence length; p(ω_i) the probability of the i-th word; p(ω_i | ω_1 ω_2 ⋯ ω_{i-1}) the probability of the i-th word given the first i-1 words; and PP(S) the perplexity of the sentence, i.e. a measure of its fluency. The lower the perplexity, the more fluent the sentence.
In this step, the perplexity of each predicted title is calculated with the Kenlm tool, and the titles with lower perplexity, i.e. the smoother predicted titles, are selected as the second candidate article titles, which ensures the fluency of the target article title.
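The perplexity computation itself reduces to the following sketch, evaluated in log space for numerical stability. In practice Kenlm supplies these conditional probabilities from its n-gram tables; here they are passed in directly.

```python
import math

def perplexity(word_probs):
    """PP(S) for a sentence of N words, computed in log space.
    word_probs[i] is the language model's p(w_i | w_1 .. w_{i-1})."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```

A two-word sentence with per-word probability 0.5 gets PP = 2, while a sentence the model predicts more confidently (i.e. a more fluent one) scores lower.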
And step S104, calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree.
For each title among the first candidate article titles, the title matching degree against the second candidate article titles is calculated: the Rouge-1, Rouge-2, and Rouge-L scores are computed separately and summed to give the final matching-degree score.
Rouge is an existing method for evaluating a summary based on the co-occurrence of n-grams between summaries; it is a recall-oriented evaluation of n-grams. The Rouge criterion comprises a series of evaluation methods, including Rouge-N (where N is the n of the n-gram, taking values 1, 2, 3, 4), Rouge-L, and so on.
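A self-contained sketch of the matching-degree calculation over tokenised titles: recall-oriented Rouge-1 and Rouge-2 plus an LCS-based Rouge-L, summed as described above. The function names are ours, and production systems would typically use a maintained library such as rouge-score instead.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(cand, ref, n):
    # n-gram recall: reference n-grams that also occur in the candidate,
    # divided by the number of reference n-grams.
    ref_grams = ngrams(ref, n)
    if not ref_grams:
        return 0.0
    cand_set = set(ngrams(cand, n))
    return sum(1 for g in ref_grams if g in cand_set) / len(ref_grams)

def lcs_len(a, b):
    # Longest common subsequence length, for Rouge-L.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def match_score(cand, ref):
    # Matching degree: Rouge-1 + Rouge-2 + Rouge-L, as described above.
    rl = lcs_len(cand, ref) / len(ref) if ref else 0.0
    return rouge_n(cand, ref, 1) + rouge_n(cand, ref, 2) + rl
```

Identical titles score 3.0 (each component equals 1), and fully disjoint titles score 0.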
In this embodiment, step S104 specifically includes:
and calculating the title matching degree of the first candidate article title and the second candidate article title, and taking the first candidate article title with the highest matching degree with the second candidate article title as a target article title.
The method takes the title with the highest matching-degree score, i.e. the one that best fits the content of the target article, as the target article title, thereby improving the accuracy of article title generation.
Referring to fig. 2, a second embodiment of the present invention provides a method for generating a title of an article, including the following steps:
step S201, extracting a target abstract from a target article according to a text abstract algorithm;
step S202, generating a first candidate article title based on a pre-trained title generation model and the target abstract;
step S203, generating a second candidate article title based on the title generation model and the target article;
step S204, calculating the title matching degree of the first candidate article title and the second candidate article title, and calculating the title passing degree of the first candidate article title;
step S205, determining a target article title of the first candidate article title according to the title matching degree and the title passing degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the formats of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training, wherein the improved GPT-2 model is obtained by adding an FC layer at the downstream of the GPT-2 model.
Here Model score denotes the final total matching score, PP(S) the sentence perplexity score, and Score the composite score; the first candidate article title with the highest composite score is determined to be the target article title.
Steps S201 to S203 of the second embodiment of the present invention are the same as steps S101 to S103 of the first embodiment. The difference is that the second embodiment also calculates the fluency of the first candidate article titles and determines the final target article title after jointly considering the title matching degree and the title fluency, so that both the accuracy and the fluency of the resulting target article title are ensured; the fluency is calculated with Kenlm as in step S1032.
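The composite-score formula itself is published only as an image and is not recoverable from this text, so the sketch below shows one plausible combination as an assumption: rewarding a high matching degree while penalising a high perplexity by simple division. The candidate names and numbers are hypothetical.

```python
def composite_score(match, ppl):
    # Assumed combination (the patent's exact formula is not reproduced here):
    # reward a high matching degree, penalise a high perplexity.
    return match / ppl

# Hypothetical candidates: (first candidate title, match score, Kenlm perplexity).
candidates = [("title A", 2.4, 35.0), ("title B", 2.1, 12.0)]
best = max(candidates, key=lambda c: composite_score(c[1], c[2]))
```

Under this assumed formula "title B" wins: its slightly lower matching score is outweighed by its much better fluency.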
Referring to fig. 3, a third embodiment of the present invention provides an article title generating apparatus, including:
the extraction module 10 is used for extracting a target abstract from a target article according to a text abstract algorithm;
a first generation module 20, configured to generate a first candidate article title based on a pre-trained title generation model and the target abstract;
a second generation module 30, configured to generate a second candidate article title based on the title generation model and the target article;
a calculating module 40, configured to calculate a title matching degree between the first candidate article title and the second candidate article title, and determine a target article title from the first candidate article title according to the title matching degree.
In an embodiment of the present invention, the calculating module 40 specifically includes:
and calculating the title matching degree of the first candidate article title and the second candidate article title, and taking the first candidate article title with the highest matching degree with the second candidate article title as a target article title.
In another embodiment of the present invention, the calculation module 40 includes:
calculating the title matching degree of the first candidate article titles and the second candidate article titles, and calculating the title fluency of the first candidate article titles;
and determining the target article title from the first candidate article titles according to the title matching degree and the title fluency.
In an embodiment of the present invention, the extraction module 10 includes:
calculating the total character length and the sentence number of a target article, and calculating the abstract length according to the total character length and the sentence number of the target article;
and calculating, with the TextRank algorithm, the weight of each sentence within the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight ranking and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
In an embodiment of the present invention, the second generating module 30 includes:
importing a target article into a pre-trained title generation model to obtain a predicted title list;
calculating the perplexity of each predicted title in the predicted title list with Kenlm, sorting the predicted titles in ascending order of perplexity, and taking the predicted titles whose perplexity is below a preset threshold as second candidate article titles.
The article title generation method provided by the invention generates first candidate article titles with the title generation model and the target abstract, generates second candidate article titles with the title generation model and the target article, and performs a matching calculation between the two sets of candidates. From the result, the first candidate title that best fits the content of the target article is taken as the target article title, thereby improving the accuracy of article title generation.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Further, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims (8)

1. An article title generation method, characterized by comprising the following steps:
extracting a target abstract from a target article according to a text abstract algorithm;
generating a first candidate article title based on a pre-trained title generation model and the target abstract;
generating a second candidate article title based on the title generation model and the target article;
calculating the title matching degree of the first candidate article titles and the second candidate article title, and determining a target article title from the first candidate article titles according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the format of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training it, wherein the improved GPT-2 model adds an FC layer downstream of the GPT-2 model, which amplifies the output of the last layer of the GPT-2 model to the size of the dictionary; each predicted token value is then output from the FC-layer output of the model based on the principle of the MASK mechanism, the loss value of the improved GPT-2 model is calculated from the predicted token values and the original token values, and the improved GPT-2 model is continuously optimized according to the loss value to obtain the pre-trained title generation model.
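The FC-layer modification and loss computation described in claim 1 can be sketched numerically. The dimensions, weights, and token ids below are hypothetical toy values for illustration, not the patent's actual model:

```python
import numpy as np

def fc_project(hidden, W, b):
    # FC layer: amplify the last hidden states (hidden_dim) to dictionary size.
    return hidden @ W + b

def cross_entropy_loss(logits, target_ids):
    # Softmax over the dictionary, then mean negative log-likelihood of the
    # original token values -- the loss used to optimize the improved model.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids]))

rng = np.random.default_rng(0)
hidden_dim, vocab_size, seq_len = 8, 50, 5      # toy sizes, not real ones
hidden = rng.normal(size=(seq_len, hidden_dim)) # stand-in for GPT-2's last layer
W = rng.normal(size=(hidden_dim, vocab_size))
b = np.zeros(vocab_size)

logits = fc_project(hidden, W, b)               # shape (seq_len, vocab_size)
predicted = logits.argmax(axis=-1)              # predicted token values
loss = cross_entropy_loss(logits, rng.integers(0, vocab_size, seq_len))
```

In a real implementation the hidden states would come from the GPT-2 model itself and the loss would drive gradient updates; this sketch only shows the shape of the projection and the loss term.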
2. The article title generation method of claim 1, wherein the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
calculating the title matching degree of the first candidate article titles and the second candidate article title, and taking the first candidate article title having the highest matching degree with the second candidate article title as the target article title.
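Claim 2 does not fix a particular matching measure; a token-overlap (Jaccard) score is one plausible stand-in for the selection it describes:

```python
def matching_degree(a, b):
    # Jaccard overlap between token sets -- an illustrative matching
    # measure; the claim itself does not specify a particular metric.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pick_target_title(first_candidates, second_candidate):
    # Take the first-candidate title with the highest matching degree
    # to the second candidate title, as in claim 2.
    return max(first_candidates,
               key=lambda t: matching_degree(t, second_candidate))
```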
3. The article title generation method of claim 1, wherein the step of calculating the title matching degree of the first candidate article title and the second candidate article title, and determining the target article title from the first candidate article title according to the title matching degree comprises:
calculating the title matching degree of the first candidate article title and the second candidate article title, and calculating the title smoothness of the first candidate article title;
and determining the target article title from the first candidate article titles according to the title matching degree and the title smoothness.
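Claim 3 scores each first-candidate title on both matching degree and smoothness. One way to combine the two criteria is a linear weighting; the weights `alpha` and `beta` below are illustrative, not values fixed by the patent:

```python
def pick_by_match_and_smoothness(candidates, match_scores, smoothness_scores,
                                 alpha=0.5, beta=0.5):
    # Linear combination of matching degree and smoothness; alpha and
    # beta are illustrative weights, not values fixed by the claim.
    combined = [alpha * m + beta * s
                for m, s in zip(match_scores, smoothness_scores)]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best]
```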
4. The article title generation method of claim 1, wherein said step of extracting a target abstract from a target article according to a text abstract algorithm comprises:
calculating the total character length and the sentence number of a target article, and calculating the abstract length according to the total character length and the sentence number of the target article;
and calculating, by using the TextRank algorithm, the weight that each sentence in the target article occupies in the whole target article, sorting the sentences in descending order of weight, selecting target sentences according to the weight order and the abstract length, and splicing the target sentences into the target abstract in the order in which they appear in the target article.
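The TextRank weighting and splicing steps of claim 4 can be sketched as follows. The overlap-based similarity and the fixed iteration count are common TextRank choices, assumed here rather than taken from the patent:

```python
import numpy as np

def sentence_similarity(a, b):
    # Classic TextRank similarity: token overlap normalized by log lengths.
    wa, wb = set(a.split()), set(b.split())
    overlap = len(wa & wb)
    if overlap == 0:
        return 0.0
    return overlap / (np.log(len(wa) + 1) + np.log(len(wb) + 1))

def textrank_weights(sentences, d=0.85, iters=50):
    # Power-iteration PageRank over the sentence-similarity graph.
    n = len(sentences)
    sim = np.array([[sentence_similarity(si, sj) if i != j else 0.0
                     for j, sj in enumerate(sentences)]
                    for i, si in enumerate(sentences)])
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero
    trans = sim / row_sums
    scores = np.ones(n) / n
    for _ in range(iters):
        scores = (1 - d) / n + d * trans.T @ scores
    return scores

def extract_summary(sentences, k):
    # Pick the top-k sentences by weight, then splice them back together
    # in the order they appear in the article, as claim 4 requires.
    w = textrank_weights(sentences)
    top = sorted(np.argsort(w)[::-1][:k])
    return " ".join(sentences[i] for i in top)
```

The abstract length `k` would itself be derived from the article's total character length and sentence count, per the first step of claim 4.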
5. The article title generation method of claim 1, wherein the step of generating a second candidate article title based on the title generation model and the target article comprises:
importing a target article into a pre-trained title generation model to obtain a predicted title list;
calculating the perplexity of each predicted title in the predicted title list through Kenlm, sorting the predicted titles in ascending order of perplexity, and taking the predicted titles whose perplexity is smaller than a preset perplexity as second candidate article titles.
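Claim 5's perplexity filter can be sketched with the scorer injected as a parameter. In practice kenlm's Python bindings expose `model.perplexity(sentence)`, but a stand-in scorer keeps the sketch self-contained and runnable:

```python
def filter_by_perplexity(predicted_titles, perplexity_fn, max_perplexity):
    # Score every predicted title, sort ascending (lower perplexity = more
    # fluent), and keep only titles below the preset threshold (claim 5).
    scored = sorted(((perplexity_fn(t), t) for t in predicted_titles),
                    key=lambda pair: pair[0])
    return [title for ppl, title in scored if ppl < max_perplexity]

# Stand-in scorer for this sketch only: word count as a fake "perplexity".
toy_perplexity = lambda title: float(len(title.split()))
```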
6. An article title generation apparatus, comprising:
the extraction module is used for extracting a target abstract from a target article according to a text abstract algorithm;
the first generation module is used for generating a first candidate article title based on a pre-trained title generation model and the target abstract;
a second generation module, configured to generate a second candidate article title based on the title generation model and the target article;
the calculation module is used for calculating the title matching degree of the first candidate article titles and the second candidate article title and determining a target article title from the first candidate article titles according to the title matching degree;
the title generation model training method comprises the following steps:
acquiring an original text set for training, wherein the original text set comprises original articles and original titles;
preprocessing the original text set to obtain input data with a standard format, wherein the preprocessing is to unify the formats of the original text set;
inputting the preprocessed input data into an improved GPT-2 model and training it, wherein the improved GPT-2 model adds an FC layer downstream of the GPT-2 model, which amplifies the output of the last layer of the GPT-2 model to the size of the dictionary; each predicted token value is then output from the FC-layer output of the model based on the principle of the MASK mechanism, the loss value of the improved GPT-2 model is calculated from the predicted token values and the original token values, and the improved GPT-2 model is continuously optimized according to the loss value to obtain the pre-trained title generation model.
7. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the article title generation method according to any one of claims 1 to 5.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the article title generation method of any one of claims 1 to 5 when executing the computer program.
CN202211383959.9A 2022-11-07 2022-11-07 Article title generation method and device, storage medium and electronic equipment Active CN115438654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211383959.9A CN115438654B (en) 2022-11-07 2022-11-07 Article title generation method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115438654A CN115438654A (en) 2022-12-06
CN115438654B true CN115438654B (en) 2023-03-24

Family

ID=84252564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211383959.9A Active CN115438654B (en) 2022-11-07 2022-11-07 Article title generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115438654B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022104967A1 (en) * 2020-11-19 2022-05-27 深圳大学 Pre-training language model-based summarization generation method

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2580533B2 (en) * 1994-03-31 1997-02-12 国立衛生試験所長 Transgenic mouse for gene mutation detection
CN101393545A (en) * 2008-11-06 2009-03-25 新百丽鞋业(深圳)有限公司 Method for implementing automatic abstracting by utilizing association model
US9342561B2 (en) * 2014-01-08 2016-05-17 International Business Machines Corporation Creating and using titles in untitled documents to answer questions
CN105631018B (en) * 2015-12-29 2018-12-18 上海交通大学 Article Feature Extraction Method based on topic model
CN107832299B (en) * 2017-11-17 2021-11-23 北京百度网讯科技有限公司 Title rewriting processing method and device based on artificial intelligence and readable medium
US10839013B1 (en) * 2018-05-10 2020-11-17 Facebook, Inc. Generating a graphical representation of relationships among a set of articles and information associated with the set of articles
CN110866391A (en) * 2019-11-15 2020-03-06 腾讯科技(深圳)有限公司 Title generation method, title generation device, computer readable storage medium and computer equipment
CN111507090A (en) * 2020-02-27 2020-08-07 平安科技(深圳)有限公司 Abstract extraction method, device, equipment and computer readable storage medium
CN111930929B (en) * 2020-07-09 2023-11-10 车智互联(北京)科技有限公司 Article title generation method and device and computing equipment
CN111898369B (en) * 2020-08-17 2024-03-08 腾讯科技(深圳)有限公司 Article title generation method, model training method and device and electronic equipment
CN112231468A (en) * 2020-10-15 2021-01-15 平安科技(深圳)有限公司 Information generation method and device, electronic equipment and storage medium
CN113434664A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Text abstract generation method, device, medium and electronic equipment
CN113378552B (en) * 2021-07-06 2024-04-19 焦点科技股份有限公司 Commodity title generation method based on multi-mode GPT2 model
CN114611498A (en) * 2022-03-18 2022-06-10 腾讯科技(深圳)有限公司 Title generation method, model training method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022104967A1 (en) * 2020-11-19 2022-05-27 深圳大学 Pre-training language model-based summarization generation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
News headline generation model combining an attention mechanism; Li Hui et al.; Journal of Shanxi University (Natural Science Edition); 2017-11-15 (No. 04); full text *

Also Published As

Publication number Publication date
CN115438654A (en) 2022-12-06

Similar Documents

Publication Publication Date Title
US20180267956A1 (en) Identification of reading order text segments with a probabilistic language model
Mohtaj et al. Parsivar: A language processing toolkit for Persian
Eskander et al. Foreign words and the automatic processing of Arabic social media text written in Roman script
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111310470B (en) Chinese named entity recognition method fusing word and word features
WO2001037128A2 (en) A system and iterative method for lexicon, segmentation and language model joint optimization
CN111611810A (en) Polyphone pronunciation disambiguation device and method
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
US11983501B2 (en) Apparatus and method for automatic generation of machine reading comprehension training data
Mukund et al. A vector space model for subjectivity classification in Urdu aided by co-training
CN111368130A (en) Quality inspection method, device and equipment for customer service recording and storage medium
Weerasinghe et al. Feature Vector Difference based Authorship Verification for Open-World Settings.
JP5441937B2 (en) Language model learning device, language model learning method, language analysis device, and program
Singh et al. Review of real-word error detection and correction methods in text documents
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN110929518A (en) Text sequence labeling algorithm using overlapping splitting rule
Van Den Bosch Scalable classification-based word prediction and confusible correction
Hao et al. SCESS: a WFSA-based automated simplified chinese essay scoring system with incremental latent semantic analysis
CN112380834A (en) Tibetan language thesis plagiarism detection method and system
JP6495124B2 (en) Term semantic code determination device, term semantic code determination model learning device, method, and program
CN115438654B (en) Article title generation method and device, storage medium and electronic equipment
CN116629238A (en) Text enhancement quality evaluation method, electronic device and storage medium
CN113553853B (en) Named entity recognition method and device, computer equipment and storage medium
US8977538B2 (en) Constructing and analyzing a word graph
JPWO2009113289A1 (en) NEW CASE GENERATION DEVICE, NEW CASE GENERATION METHOD, AND NEW CASE GENERATION PROGRAM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant