CN112860881A - Abstract generation method and device, electronic equipment and storage medium - Google Patents

Abstract generation method and device, electronic equipment and storage medium

Info

Publication number
CN112860881A
CN112860881A (Application CN201911182819.3A)
Authority
CN
China
Prior art keywords
candidate, abstract, text, sentence, calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911182819.3A
Other languages
Chinese (zh)
Inventor
刘龑龙
佟津乐
谢海华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pku Founder Information Industry Group Co ltd
Peking University Founder Group Co Ltd
Original Assignee
Pku Founder Information Industry Group Co ltd
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pku Founder Information Industry Group Co ltd, Peking University Founder Group Co Ltd filed Critical Pku Founder Information Industry Group Co ltd
Priority to CN201911182819.3A priority Critical patent/CN112860881A/en
Publication of CN112860881A publication Critical patent/CN112860881A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an abstract generation method and apparatus, an electronic device, and a storage medium. The abstract generation method includes: extracting at least one abstract sentence from a plurality of text documents, fully permuting the at least one abstract sentence to generate a first candidate abstract set, calculating the fluency of each first candidate abstract in the set with a language model, and outputting the first candidate abstract with the highest fluency as the abstract. Because the candidate abstracts are scored for fluency by the language model and the most fluent candidate is output as the abstract, the generated abstract has good logical order, high fluency, and good readability.

Description

Abstract generation method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of automatic summarization in natural language processing, and in particular to an abstract generation method and apparatus, an electronic device, and a storage medium.
Background
Multi-document automatic summarization refers to compressing the main information of multiple texts on the same topic into a single summary.
The existing multi-document automatic summarization methods mainly use extractive techniques: a number of topic words are extracted from the documents, the topic words appearing at different positions in the documents are concatenated into a paragraph, and the summary is generated from that paragraph.
However, because the topic words from different positions are dropped directly into a paragraph without any ordering step, the generated summary lacks logical order and has poor readability.
Disclosure of Invention
The invention provides an abstract generation method and apparatus, an electronic device, and a storage medium, to solve the technical problem that existing abstract generation methods produce abstracts lacking logical order because no ordering is performed.
In a first aspect, the present invention provides an abstract generation method, including:
extracting at least one abstract sentence from a plurality of text documents;
fully permuting the at least one abstract sentence to generate a first candidate abstract set, wherein the first candidate abstract set comprises at least one first candidate abstract;
and calculating the fluency of each first candidate abstract in the first candidate abstract set by using a language model, so as to obtain the first candidate abstract with the highest fluency.
Optionally, calculating the fluency of each first candidate abstract in the first candidate abstract set by using a language model to obtain the first candidate abstract with the highest fluency specifically includes:
for each first candidate abstract, extracting some of its abstract sentences to generate a second candidate abstract;
processing the second candidate abstract with the language model to obtain the fluency of the second candidate abstract;
selecting some of the second candidate abstracts according to their fluency;
and calculating, with the language model, the fluency of the first candidate abstracts corresponding to the selected second candidate abstracts, to obtain the first candidate abstract with the highest fluency.
Optionally, extracting some abstract sentences from the first candidate abstract to generate a second candidate abstract specifically includes:
taking the first M abstract sentences of the first candidate abstract to generate the second candidate abstract, wherein M is a positive integer, M ≤ N, and N is the number of abstract sentences in the first candidate abstract.
Optionally, calculating the fluency of each first candidate abstract in the first candidate abstract set by using the language model specifically includes:
processing the first candidate abstract with the language model to obtain N probabilities;
summing the N probabilities to obtain the fluency of the first candidate abstract;
wherein the i-th of the N probabilities represents the probability that the i-th position is abstract sentence x_i, N represents the number of abstract sentences, and 1 ≤ i ≤ N.
Optionally, summing the N probabilities to obtain the fluency of the first candidate abstract includes calculating the fluency of the first candidate abstract according to the following formula:
P(S) = Σ_{i=1}^{N} P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N)
wherein P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N) represents the probability that the i-th position is abstract sentence x_i.
Optionally, extracting at least one abstract sentence from a plurality of text documents specifically includes:
extracting at least one topic word from the plurality of text documents;
calculating the relevance between the topic words and the text documents;
and selecting some of the text documents according to the relevance, and extracting at least one abstract sentence from the selected documents.
Optionally, calculating the relevance between the topic words and the text documents specifically includes:
calculating, for each topic word and each text document, the relevance of that topic word to that document;
and summing, for each text document, all the relevance values corresponding to that document to obtain the overall relevance between the topic words and that document.
In a second aspect, the present invention provides an abstract generation apparatus, including:
an extraction module, configured to extract at least one abstract sentence from a plurality of text documents;
a permutation module, configured to fully permute the at least one abstract sentence to generate a first candidate abstract set, wherein the first candidate abstract set comprises at least one first candidate abstract;
and a calculation module, configured to calculate the fluency of each first candidate abstract in the first candidate abstract set by using a language model, so as to obtain the first candidate abstract with the highest fluency.
Optionally, the calculation module is specifically configured to:
for each first candidate abstract, extracting some of its abstract sentences to generate a second candidate abstract;
processing the second candidate abstract with the language model to obtain the fluency of the second candidate abstract;
and selecting some of the second candidate abstracts according to their fluency, and calculating, with the language model, the fluency of the first candidate abstracts corresponding to the selected second candidate abstracts, so as to output the first candidate abstract with the highest fluency as the abstract.
Optionally, the calculation module is specifically configured to:
taking the first M abstract sentences of the first candidate abstract to generate the second candidate abstract, wherein M is a positive integer, M ≤ N, and N is the number of abstract sentences in the first candidate abstract.
Optionally, the calculation module is specifically configured to:
processing the first candidate abstract with the language model to obtain N probabilities;
summing the N probabilities to obtain the fluency of the first candidate abstract;
wherein the i-th of the N probabilities represents the probability that the i-th position is abstract sentence x_i, and 1 ≤ i ≤ N.
Optionally, the calculation module is specifically configured to:
calculating the fluency of the first candidate abstract according to the following formula:
P(S) = Σ_{i=1}^{N} P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N)
wherein P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N) represents the probability that the i-th position is abstract sentence x_i.
Optionally, the extraction module is specifically configured to:
extracting at least one topic word from a plurality of text documents;
calculating the relevance between the topic words and the text documents;
and selecting some of the text documents according to the relevance, and extracting at least one abstract sentence from the selected documents.
Optionally, the extraction module is specifically configured to:
calculating, for each topic word and each text document, the relevance of that topic word to that document;
and summing, for each text document, all the relevance values corresponding to that document to obtain the overall relevance between the topic words and that document.
In a third aspect, the present invention provides an electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, the processor being configured, when the program is executed, to perform the abstract generation method according to the first aspect and its optional implementations.
In a fourth aspect, the present invention provides a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the abstract generation method according to the first aspect and its optional implementations.
In the abstract generation method provided by the invention, the language model scores each candidate abstract for fluency, and the most fluent candidate is output as the abstract, so the generated abstract has good logical order, high fluency, and good readability. In addition, second candidate abstracts containing fewer abstract sentences are formed by truncating the first candidate abstracts; because many first candidate abstracts share the same truncated prefix, the number of distinct second candidate abstracts is far smaller than the number of first candidate abstracts. The language model therefore scores the second candidate abstracts first, and full fluency is calculated only for the first candidate abstracts corresponding to the most fluent second candidate abstracts, which greatly reduces the computation scale.
Drawings
FIG. 1 is a flowchart of an abstract generation method according to an exemplary embodiment of the present invention;
FIG. 2 is a flowchart of an abstract generation method according to another exemplary embodiment of the present invention;
FIG. 3 is a schematic diagram of the abstract generation method of the embodiment shown in FIG. 2;
FIG. 4 is a schematic structural diagram of an abstract generation apparatus according to an exemplary embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Existing multi-document automatic summarization methods use extractive techniques: topic words are extracted from the texts, and the topic words at different positions in the texts are concatenated into paragraphs. Because no ordering is performed, the generated paragraphs have no logical order and poor readability.
The inventive concept of the abstract generation method provided by the invention is as follows. First, the candidate abstracts are processed with a language model to obtain the fluency of each candidate abstract, and the most fluent candidate is output as the abstract, so the generated abstract has good logical order, high fluency, and good readability. Second, if fluency were calculated over the full permutation of the abstract sentences directly, the computation scale would be N! (the number of permutations of N abstract sentences). The method therefore first calculates the fluency of a partial prefix of each first candidate abstract, then selects only the first candidate abstracts whose prefixes are highly fluent and calculates their full fluency, reducing the computation scale.
Fig. 1 is a flowchart of an abstract generation method according to an exemplary embodiment of the present invention. As shown in fig. 1, the present invention provides an abstract generation method, including:
s101, extracting at least one abstract sentence from a plurality of text chapters.
Here, the plurality of text documents are multiple documents, composed of natural-language sentences, whose contents are related, for example a number of news reports on the same event. Extracting at least one abstract sentence from the plurality of text documents specifically includes: for each text document, extracting abstract sentences from it according to the importance of each sentence.
S102, fully permuting the at least one abstract sentence to generate a first candidate abstract set.
Fully permuting the at least one abstract sentence to generate a first candidate abstract set specifically includes: fully permuting all the abstract sentences extracted in step S101 to generate a first candidate abstract set, where the first candidate abstract set comprises at least one first candidate abstract. For example, 4 abstract sentences, fully permuted, generate 4! = 24 first candidate abstracts.
S103, calculating the fluency of each first candidate abstract by using a language model, and outputting the first candidate abstract with the highest fluency as the abstract.
The language model may be a neural-network language model: the model is trained on a large number of sentences and paragraphs, and the trained model is then used to calculate the fluency of each first candidate abstract. The invention is not limited to calculating fluency with a neural-network language model; another kind of language model may also be used. Once the fluency values are obtained, the first candidate abstract with the highest fluency is output as the abstract.
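The patent does not fix a particular language model. As one hedged illustration, a pretrained causal language model (GPT-2 via the Hugging Face transformers library is an arbitrary assumption here; a Chinese model would be substituted for Chinese news) can score a candidate by its negative average token loss, so that more fluent orderings score higher:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any pretrained causal LM can serve as the fluency scorer; GPT-2 is
# only an illustrative choice, not the model used by the invention.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def fluency(candidate_sentences):
    """Higher (less negative) return value means a more fluent candidate."""
    text = " ".join(candidate_sentences)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # average per-token cross-entropy
    return -loss.item()

# `first_candidates` comes from the permutation sketch under step S102.
best = max(first_candidates, key=fluency)
```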
In the abstract generation method provided by the invention, the at least one abstract sentence is fully permuted to obtain a first candidate abstract set, the fluency of each first candidate abstract in the set is calculated with a language model, and the most fluent first candidate abstract is output, so the generated abstract has good logical order and high readability.
Fig. 2 is a flowchart of an abstract generation method according to another exemplary embodiment of the present invention. As shown in fig. 2, the abstract generation method provided by the present invention includes the following steps:
s201, extracting at least one subject term from a plurality of text chapters.
First, a plurality of text documents are acquired and each document is denoised, for example by removing titles, remarks, references, pictures, garbled characters, and other non-body parts, keeping only the body text. The documents are then deduplicated to remove repeated documents. Finally, topic words are extracted from the documents. In this embodiment, the topic words may be extracted with the LDA, TF-IDF, or TextRank algorithms; the type of algorithm is not limited here. After the topic words are extracted, the most important topic words may be selected according to their importance, and the relevance between the topic words and the text documents is then calculated.
For example: 33 hot news articles on the topic "Marvel's father Stan Lee dies" are collected. Each of the 33 articles is segmented into words, stop words are removed, the articles are concatenated into strings, topic words are extracted, and the top 5 are selected, in this example "Marvel", "Stan Lee", "died", "comics", and "movie".
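As a hedged sketch of this topic-word step, the jieba library's TF-IDF keyword extractor (one of the algorithm choices the embodiment allows) could be used; the topK value and the example outputs are assumptions:

```python
import jieba.analyse

def extract_topic_words(documents, top_k=5):
    """Concatenate the cleaned documents and extract the top_k topic words
    by TF-IDF weight; jieba.analyse.textrank is the TextRank alternative."""
    corpus = "".join(documents)
    return jieba.analyse.extract_tags(corpus, topK=top_k)

# For the 33 Stan Lee articles this might return words such as
# "漫威" (Marvel), "去世" (died), "漫画" (comics), "电影" (movie).
```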
S202, calculating the relevance between the topic words and the text documents.
Calculating the relevance between the topic words and the text documents specifically includes: for each topic word and each text document, calculating the relevance of that topic word to that document; then, for each text document, summing all the relevance values corresponding to that document to obtain the overall relevance between the topic words and that document. In this embodiment, the relevance may be calculated with the LDA, TF-IDF, or TextRank algorithms; the type of algorithm is not limited here.
For example: an article is split into sentences at periods (。), exclamation marks (!), and question marks (?). Sentences with fewer than 10 characters, and sentences ending with a question mark, are removed. Stop words (for example, common function words such as 也 and 的) are removed. Sentence vectors are then computed with the TF-IDF algorithm, and the similarity between the article and the topic words is computed with the cosine-similarity method.
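A minimal sketch of the relevance calculation of S202, assuming TF-IDF vectors and cosine similarity as in the example above (scikit-learn and jieba are illustrative choices; the per-topic-word relevances are summed per document, as described):

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topic_relevance(documents, topic_words, stop_words):
    """For each document, sum its cosine similarities to every topic word
    in a shared TF-IDF vector space."""
    def tokenize(text):
        return [w for w in jieba.cut(text) if w not in stop_words]

    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    matrix = vectorizer.fit_transform(list(documents) + list(topic_words))
    doc_vecs = matrix[: len(documents)]
    word_vecs = matrix[len(documents):]
    # Shape (num_documents, num_topic_words); sum across topic words.
    return cosine_similarity(doc_vecs, word_vecs).sum(axis=1)
```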
S203, selecting some of the text documents according to the relevance, and extracting at least one abstract sentence from the selected documents.
Selecting some of the text documents according to the relevance specifically includes: sorting the text documents by their relevance to the topic words, with more relevant documents first, and selecting the top L documents to merge into a merged document. If there are fewer than L documents under the topic words, all the documents are merged.
Extracting at least one abstract sentence from the selected documents specifically includes the following steps. First, the merged document is split into sentences, the sentences are filtered by length (sentences that are too short or too long are deleted), and sentences unsuitable as abstract sentences, such as interrogative sentences, are deleted. Second, the sentences are segmented into words and stop words are removed. Finally, the importance of each sentence is calculated.
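A sketch of the splitting and filtering step, under the assumptions that Chinese end-of-sentence punctuation delimits sentences and that the length bounds (10 and 100 characters here) are illustrative rather than prescribed:

```python
import re

def split_and_filter(merged_document, min_len=10, max_len=100):
    """Split at 。!? (keeping the delimiter attached), then drop sentences
    that are too short, too long, or interrogative."""
    kept = []
    for s in re.split(r"(?<=[。!?!?])", merged_document):
        s = s.strip()
        if not (min_len <= len(s) <= max_len):
            continue
        if s.endswith("?") or s.endswith("?"):  # interrogative sentence
            continue
        kept.append(s)
    return kept
```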
In this embodiment, sentence importance is calculated with the TextRank algorithm. Specifically, a graph model is constructed: the TextRank model can be represented as a directed weighted graph G = (V, E), where the vertices represent sentences, V = {v_1, v_2, …, v_v} is the set of vertices, and E = {e_1, e_2, …, e_e} is the set of edges. E is a subset of F, where F is the set of edges between any two vertices, containing v × v elements. The weight of the edge between any two vertices v_i and v_j is w_ij. For a given vertex v_i, In(v_i) is the set of vertices pointing to v_i, and Out(v_i) is the set of vertices that v_i points to. The score of vertex v_i is defined as follows:
WS(v_i) = (1 − d) + d · Σ_{v_j ∈ In(v_i)} ( w_{ji} / Σ_{v_k ∈ Out(v_j)} w_{jk} ) · WS(v_j)
where d is a damping coefficient in the range 0 to 1, representing the probability of jumping from a given vertex to any other vertex in the graph; its value is typically 0.85.
According to this formula, the score of each sentence is computed by iteratively propagating the weights, giving the importance of each sentence; the sentences with high importance are selected as abstract sentences.
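The iterative propagation can be sketched as follows, assuming the sentence graph is supplied as a weight matrix (in practice the edge weights would come from, e.g., sentence similarity, which the embodiment leaves open):

```python
import numpy as np

def textrank_scores(weights, d=0.85, max_iter=100, tol=1e-6):
    """Iterate WS(v_i) = (1 - d) + d * sum_{j in In(i)} w_ji / sum_k w_jk * WS(v_j).
    weights[j, i] is the weight of the edge from sentence j to sentence i."""
    n = weights.shape[0]
    out_sums = weights.sum(axis=1)           # sum_k w_jk for each source j
    out_sums[out_sums == 0] = 1.0            # guard against isolated vertices
    transition = weights / out_sums[:, None]
    scores = np.ones(n)
    for _ in range(max_iter):
        new_scores = (1 - d) + d * (transition.T @ scores)
        if np.abs(new_scores - scores).max() < tol:
            return new_scores
        scores = new_scores
    return scores
```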
In this embodiment, after the importance of each sentence is obtained, redundancy removal may be applied to the sentences. Specifically, the sentences are segmented into words, stop words are removed, and the post-segmentation word repetition rate between sentences is compared; if the repetition rate reaches a preset value, one of the two sentences is deleted. The sentences with high importance are then selected as the abstract sentences.
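A hedged sketch of the redundancy removal: the patent does not define "word repetition rate" precisely, so Jaccard overlap over the segmented word sets is assumed here, with the 50% threshold taken from the example below:

```python
def remove_redundant(sentences, tokenize, threshold=0.5):
    """Walk the sentences in descending importance and drop any sentence
    whose word overlap with an already-kept sentence reaches the threshold."""
    kept, kept_token_sets = [], []
    for s in sentences:
        tokens = set(tokenize(s))
        def overlap(other):
            return len(tokens & other) / max(len(tokens | other), 1)
        if all(overlap(t) < threshold for t in kept_token_sets):
            kept.append(s)
            kept_token_sets.append(tokens)
    return kept
```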
For example: the documents are sorted by the sum of their similarities to the topic words, the 3 most relevant documents are selected, and these are merged as the abstract article set. The selected news items are three reports on Stan Lee's death, for example '"Father of Marvel" Stan Lee dies', 'Stan Lee, father of Marvel: an immortal legend', and 'Stan Lee, father of Marvel: half angel, half devil'. The sentences in the 3 documents are scored and sorted with the TextRank algorithm, and 10 sentences are selected as the abstract sentence set. The 10 abstract sentences are segmented into words and the post-segmentation word repetition rates are compared; when the repetition rate between two sentences exceeds 50%, one of them is deleted. The abstract sentences ranked in the top 4 by importance are taken as the final abstract.
S204, fully permuting the at least one abstract sentence to generate a first candidate abstract set.
This step is the same as S102 in the embodiment shown in fig. 1, and is not repeated here.
S205, calculating the fluency of each first candidate abstract by using a language model, and outputting the first candidate abstract with the highest fluency as the abstract.
Because the abstract sentences come from different articles, they must be ordered to guarantee the coherence and readability of the final abstract. Using a language model turns the sentence-ordering problem into: calculating the fluency of the candidate abstracts formed by the different orderings of the abstract sentences. The candidate abstract with the highest fluency is the final abstract.
Calculating the fluency of each candidate abstract with the language model and outputting the first candidate abstract with the highest fluency as the abstract specifically includes the following steps:
s3001, for each first candidate summary, extracting partial summary sentences from the first candidate summary to generate a second candidate summary.
Extracting some abstract sentences from the first candidate abstract to generate a second candidate abstract specifically includes truncating part of the first candidate abstract. Optionally, the first M abstract sentences of the first candidate abstract are taken to generate the second candidate abstract, where M is a positive integer, M ≤ N, and N is the number of abstract sentences in the first candidate abstract.
For example: with 4 abstract sentences there are 24 first candidate abstracts in the first candidate abstract set, and for each first candidate abstract the first 2 abstract sentences are taken to form a second candidate abstract. The distinct second candidate abstracts are S1+S2, S2+S1, S2+S3, S2+S4, S1+S3, S1+S4, S3+S1, S3+S2, S3+S4, S4+S1, S4+S2, and S4+S3, where S1 to S4 denote the first to fourth abstract sentences.
S3002, processing the second candidate abstract with the language model to obtain the fluency of the second candidate abstract.
The language model is the same as in S103 of the embodiment shown in fig. 1 and is not described again here.
S3003, selecting some of the second candidate abstracts according to their fluency, and calculating, with the language model, the fluency of the first candidate abstracts corresponding to the selected second candidate abstracts, so as to output the first candidate abstract with the highest fluency as the abstract.
The second candidate abstracts are ranked by fluency from high to low, the top-ranked ones are kept, and the first candidate abstracts corresponding to the kept second candidate abstracts are determined. For example: first, the fluency of the 12 second candidate abstracts formed by the two-sentence combinations above is calculated and sorted from high to low, and the Z least fluent combinations are deleted. Assuming Z = 9, three orderings remain, say S1+S2, S2+S1, and S2+S3. The full orderings of all abstract sentences corresponding to these three prefixes are then determined; there are 6 of them: S1+S2 corresponds to S1+S2+S3+S4 and S1+S2+S4+S3; S2+S1 corresponds to S2+S1+S3+S4 and S2+S1+S4+S3; and S2+S3 corresponds to S2+S3+S1+S4 and S2+S3+S4+S1. The fluency of these 6 orderings is calculated, and the most fluent ordering is selected as the final abstract.
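The whole pruning scheme of S3001–S3003 can be sketched as follows; the prefix length (2) and the number of surviving prefixes (3) mirror the example above and are otherwise free parameters:

```python
from itertools import permutations

def best_abstract(sentences, fluency, prefix_len=2, keep=3):
    """Two-stage search: score every prefix_len-sentence prefix, keep the
    `keep` most fluent, then fully score only the permutations that begin
    with a surviving prefix."""
    prefixes = sorted(permutations(sentences, prefix_len),
                      key=lambda p: fluency(list(p)), reverse=True)
    survivors = set(prefixes[:keep])
    finalists = [list(p) for p in permutations(sentences)
                 if p[:prefix_len] in survivors]
    return max(finalists, key=fluency)
```

With 4 sentences this scores 12 two-sentence prefixes, keeps 3, and fully scores only 6 of the 24 permutations, matching the worked example.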
Calculating the fluency of each first candidate abstract with the language model specifically includes the following steps.
The first candidate abstract is processed with the language model to obtain N probabilities, and the N probabilities are summed to obtain the fluency of the first candidate abstract.
The fluency of the first candidate abstract is calculated according to the following formula:
P(S) = Σ_{i=1}^{N} P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N)
where P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N) is the probability that the i-th position holds abstract sentence x_i, and x_i denotes the i-th abstract sentence.
According to this formula, the larger the value of P(S), the more fluent the candidate abstract, and the better its logic and fluency.
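Abstracting over the concrete language model, the formula translates directly into code; `sentence_prob` is an assumed hook that returns P(x_i | all other sentences) (in practice one would sum log-probabilities for numerical stability):

```python
def candidate_fluency(sentences, sentence_prob):
    """P(S): sum over positions i of the probability that position i
    holds sentence x_i given all the other sentences."""
    return sum(
        sentence_prob(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    )
```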
Fig. 3 is a schematic diagram of the abstract generation method shown in fig. 2: a plurality of text documents first undergo text preprocessing, then topic-word relevance calculation is performed to screen out the documents related to the topic words. Abstract sentences are extracted from these documents to obtain candidate abstract sentences, redundancy removal is applied to obtain the selected candidate abstract sentences, the candidate abstract sentences are permuted, and the permuted results are fed into the language model for scoring to generate the abstract.
In the abstract generation method provided by this embodiment, second candidate abstracts with fewer abstract sentences are formed by truncating the first candidate abstracts. Because many first candidate abstracts share the same truncated prefix, the number of distinct second candidate abstracts is far smaller than the number of first candidate abstracts; the language model therefore scores the second candidate abstracts first, and full fluency is calculated only for the first candidate abstracts corresponding to the most fluent second candidate abstracts, greatly reducing the computation scale.
Using the abstract generation method provided by the invention, abstracts were extracted for groups of news articles on more than 40 topics, such as "Chow Yun-fat donates his fortune", "attention to depression", "League of Legends championship", "doctor-patient conflicts", and "college graduate employment". The automatically generated short-text abstracts were evaluated manually, using the fluency of the abstract and its coverage of the key points of the news group as criteria, and the results were compared with a TextRank-based algorithm (hereinafter "algorithm 1", whose core is abstract-sentence extraction plus abstract-sentence redundancy removal). For more than 85% of the news groups, the results were better than those of algorithm 1. One example is described in detail below.
A total of 33 news articles in the "father of Marvel" series were retrieved from the web. The abstract generated by algorithm 1 is: "In the hit film Venom, Stan Lee once again did not miss his cameo; at the end of the film he plays a passer-by walking a dog, warning the male lead not to give up. Stan Lee is a legendary figure in the comics field and has become one of the important symbols of American popular culture; he created 80% of Marvel's well-known characters and gave countless people their dream of heroes. Early yesterday, sad news came from the film world: Stan Lee, who single-handedly created the Marvel Universe and superhero comics such as Spider-Man, the X-Men, Thor, Iron Man, the Fantastic Four, the Hulk, and Black Panther, died at a medical center in Hollywood on Monday local time, at the age of 95."
The abstract generated by the abstract generation method provided by the invention is: "Stan Lee, the father of Marvel, has passed away, and the Marvel comics and film world has lost a familiar figure: the old man who made a cameo in every Marvel superhero movie, becoming a beloved Easter egg. Stan Lee is a legendary figure in the comics field and has become one of the important symbols of American popular culture; he created 80% of Marvel's well-known characters and gave countless people their dream of heroes. Years ago, Stan Lee launched the Marvel Universe: in this parallel universe, Iron Man, Captain America, and the Hulk became members of the Avengers, later joined by Thor, Ant-Man, and others; perhaps the Stan Lee of that time did not realize that this series of Marvel characters would generate enormous commercial value over the following decades."
Comparing the abstracts generated from multiple documents, the abstract generation method provided by the invention is better than algorithm 1 in content coherence and in generalization; it fits the scenario of writing short-text abstracts, is reasonable and effective, and has a prominent effect, good practical value, and good application prospects.
Fig. 4 is a schematic structural diagram of an abstract generation apparatus according to an exemplary embodiment of the present invention. As shown in fig. 4, the present invention provides an abstract generation apparatus 400, which includes:
an extraction module 401, configured to extract at least one abstract sentence from a plurality of text documents;
a permutation module 402, configured to fully permute the at least one abstract sentence to generate a first candidate abstract set, wherein the first candidate abstract set comprises at least one first candidate abstract;
and a calculation module 403, configured to calculate the fluency of each first candidate abstract in the first candidate abstract set by using a language model, so as to obtain the first candidate abstract with the highest fluency.
Optionally, the calculation module 403 is specifically configured to:
for each first candidate abstract, extract some of its abstract sentences to generate a second candidate abstract;
process the second candidate abstract with the language model to obtain the fluency of the second candidate abstract;
and select some of the second candidate abstracts according to their fluency, and calculate, with the language model, the fluency of the first candidate abstracts corresponding to the selected second candidate abstracts, so as to output the first candidate abstract with the highest fluency as the abstract.
Optionally, the calculation module 403 is specifically configured to:
take some of the abstract sentences of the first candidate abstract to generate the second candidate abstract.
Optionally, the calculation module 403 is specifically configured to:
process the first candidate abstract with the language model to obtain N probabilities;
sum the N probabilities to obtain the fluency of the first candidate abstract;
wherein the i-th of the N probabilities represents the probability that the i-th position is abstract sentence x_i, and 1 ≤ i ≤ N.
Optionally, the calculation module 403 is specifically configured to:
calculate the fluency of the first candidate abstract according to the following formula:
P(S) = Σ_{i=1}^{N} P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N)
wherein P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N) represents the probability that the i-th position is abstract sentence x_i.
Optionally, the extraction module 401 is specifically configured to:
extract at least one topic word from a plurality of text documents;
calculate the relevance between the topic words and the text documents;
and select some of the text documents according to the relevance, and extract at least one abstract sentence from the selected documents.
Optionally, the extraction module 401 is specifically configured to:
calculate, for each topic word and each text document, the relevance of that topic word to that document;
and sum, for each text document, all the relevance values corresponding to that document to obtain the overall relevance between the topic words and that document.
Fig. 5 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention. As shown in fig. 5, the electronic device 500 of the present embodiment includes: a processor 501 and a memory 502.
a memory 502 for storing computer-executable instructions;
a processor 501 for executing the computer-executable instructions stored in the memory, so as to implement the abstract generation method of the above embodiments; reference may be made to the description of the method embodiments above.
Optionally, the memory 502 may be separate from, or integrated with, the processor 501.
When the memory 502 is separately provided, the electronic device 500 further includes a bus 503 for connecting the memory 502 and the processor 501.
An embodiment of the present invention further provides a computer-readable storage medium in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the abstract generation method described above is implemented.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An abstract generation method, comprising:
extracting at least one abstract sentence from a plurality of text documents;
fully permuting the at least one abstract sentence to generate a first candidate abstract set, wherein the first candidate abstract set comprises at least one first candidate abstract;
and calculating the fluency of each first candidate abstract in the first candidate abstract set by using a language model, to obtain the first candidate abstract with the highest fluency.
2. The method according to claim 1, wherein calculating the fluency of each first candidate abstract in the first candidate abstract set by using a language model to obtain the first candidate abstract with the highest fluency specifically comprises:
for each first candidate abstract, extracting some of its abstract sentences to generate a second candidate abstract;
processing the second candidate abstract with the language model to obtain the fluency of the second candidate abstract;
selecting some of the second candidate abstracts according to their fluency;
and calculating, with the language model, the fluency of the first candidate abstracts corresponding to the selected second candidate abstracts, to obtain the first candidate abstract with the highest fluency.
3. The method according to claim 2, wherein extracting some abstract sentences from the first candidate abstract to generate a second candidate abstract specifically comprises:
taking the first M abstract sentences of the first candidate abstract to generate the second candidate abstract, wherein M is a positive integer, M ≤ N, and N is the number of abstract sentences in the first candidate abstract.
4. The method according to any one of claims 1 to 3, wherein calculating the fluency of each first candidate abstract in the first candidate abstract set by using a language model specifically comprises:
processing the first candidate abstract with the language model to obtain N probabilities;
summing the N probabilities to obtain the fluency of the first candidate abstract;
wherein the i-th of the N probabilities represents the probability that the i-th position is abstract sentence x_i, N represents the number of abstract sentences, and 1 ≤ i ≤ N.
5. The method according to claim 4, wherein summing the N probabilities to obtain the fluency of the first candidate abstract specifically comprises calculating the fluency of the first candidate abstract according to the following formula:
P(S) = Σ_{i=1}^{N} P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N)
wherein P(x_i | x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_N) represents the probability that the i-th position is abstract sentence x_i.
6. The method according to any one of claims 1 to 3, wherein extracting at least one abstract sentence from a plurality of text documents comprises:
extracting at least one topic word from the plurality of text documents;
calculating the relevance between the topic words and the text documents;
and selecting some of the text documents according to the relevance, and extracting at least one abstract sentence from the selected documents.
7. The method according to claim 6, wherein calculating the relevance between the topic words and the text documents comprises:
calculating, for each topic word and each text document, the relevance of that topic word to that document;
and summing, for each text document, all the relevance values corresponding to that document to obtain the overall relevance between the topic words and that document.
8. An abstract generation apparatus, comprising:
an extraction module, configured to extract at least one abstract sentence from a plurality of text documents;
a permutation module, configured to fully permute the at least one abstract sentence to generate a first candidate abstract set, wherein the first candidate abstract set comprises at least one first candidate abstract;
and a calculation module, configured to calculate the fluency of each first candidate abstract in the first candidate abstract set by using a language model, so as to obtain the first candidate abstract with the highest fluency.
9. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory, the processor being configured, when the program is executed, to perform the abstract generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the abstract generation method according to any one of claims 1 to 7.
CN201911182819.3A 2019-11-27 2019-11-27 Abstract generation method and device, electronic equipment and storage medium Pending CN112860881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911182819.3A CN112860881A (en) 2019-11-27 2019-11-27 Abstract generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911182819.3A CN112860881A (en) 2019-11-27 2019-11-27 Abstract generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112860881A true CN112860881A (en) 2021-05-28

Family

ID=75984674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911182819.3A Pending CN112860881A (en) 2019-11-27 2019-11-27 Abstract generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112860881A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664598A (en) * 2018-05-09 2018-10-16 北京理工大学 A kind of extraction-type abstract method based on integral linear programming with comprehensive advantage
CN109241536A (en) * 2018-09-21 2019-01-18 浙江大学 It is a kind of based on deep learning from the sentence sort method of attention mechanism

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114490976A (en) * 2021-12-30 2022-05-13 北京百度网讯科技有限公司 Method, device and equipment for generating dialogue abstract training data and storage medium

Similar Documents

Publication Publication Date Title
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
Nicosia et al. QCRI: Answer selection for community question answering-experiments for Arabic and English
Li et al. Markuplm: Pre-training of text and markup language for visually-rich document understanding
CN113553429B (en) Normalized label system construction and text automatic labeling method
Erdmann et al. Improving the extraction of bilingual terminology from Wikipedia
CN111680488A (en) Cross-language entity alignment method based on knowledge graph multi-view information
Tiwari et al. Ensemble approach for twitter sentiment analysis
Lins et al. The CNN-corpus: A large textual corpus for single-document extractive summarization
Qi et al. DuReadervis: A Chinese dataset for open-domain document visual question answering
Agarwal et al. Authorship clustering using tf-idf weighted word-embeddings
JP2007157006A (en) Question-answer device, question-answer method and question-answer program
Lee et al. Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources
CN112860881A (en) Abstract generation method and device, electronic equipment and storage medium
CN110929022A (en) Text abstract generation method and system
Tahrat et al. Text2geo: from textual data to geospatial information
Liu et al. An Efficient Machine-Generated Data Modeling Approach Based on Domain-Aware Knowledge for Intelligent Consumer Electronics
Huang et al. An effective method for constructing knowledge graph of online course
Yang et al. Exploring word similarity to improve chinese personal name disambiguation
Maylawati et al. Feature-based approach and sequential pattern mining to enhance quality of Indonesian automatic text summarization
Schneider et al. Golden retriever: A real-time multi-modal text-image retrieval system with the ability to focus
Sindhu et al. Plagiarism detection in Malayalam language text using a composition of similarity measures
Chen et al. The Chinese Persons Name Diambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News
Song et al. A Two-stage User Intent Detection Model on Complicated Utterances with Multi-task Learning
Mohtaj et al. PerPaDa: A Persian Paraphrase Dataset based on Implicit Crowdsourcing Data Collection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination