CN114611520A - Text abstract generating method - Google Patents


Publication number
CN114611520A
Authority
CN
China
Prior art keywords
layer
text
articles
key
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210380604.8A
Other languages
Chinese (zh)
Inventor
刘明童
王泽坤
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority to CN202210380604.8A priority Critical patent/CN114611520A/en
Publication of CN114611520A publication Critical patent/CN114611520A/en
Pending legal-status Critical Current


Classifications

    • G06F 16/345 — Information retrieval of unstructured textual data; summarisation for human users
    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/25 — Pattern recognition; fusion techniques
    • G06F 40/166 — Text processing; editing, e.g. inserting or deleting
    • G06F 40/216 — Natural language analysis; parsing using statistical methods
    • G06F 40/295 — Natural language analysis; named entity recognition
    • G06F 40/30 — Natural language analysis; semantic analysis
    • G06N 3/04 — Neural networks; architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of text abstract generation, and in particular to a text abstract generating method comprising the following steps: randomly combining at least two given articles to generate a balanced binary tree in which each leaf node represents one article; and fusing, layer by layer, every two articles connected in the balanced binary tree, to generate a target text abstract that fuses the key information of the at least two articles. Because the target text abstract fuses the key information of the articles, readers can quickly acquire that information, which improves their reading efficiency.

Description

Text abstract generating method
Technical Field
The invention relates to the technical field of text abstract generation, in particular to a text abstract generation method.
Background
Research reports in professional fields, such as industry development reports and securities analysis reports, are an important source of high-quality information. Because of their logical rigor and professional depth, a single research report often contains very rich information; moreover, for the same event or subject, many professional institutions and experts publish their own reports. As a result, people must read a large volume of report content to understand the object being analysed; for example, a financial investor must read all the information related to a target company to find the answers of concern and make more accurate decisions. Facing this information overload, technology that improves research-report reading and information processing in professional fields is indispensable for improving working efficiency.
Traditional intelligent report-reading systems focus on collecting and organising information, for example aggregating reports about the same company with a keyword-clustering algorithm so that they are convenient to read and search. However, a research report often contains tens of thousands of words, and simply aggregating information no longer satisfies the need to obtain the core information of interest quickly. Moreover, the technology used in existing intelligent report reading is mainly based on N-gram matching algorithms, such as keyword-based content retrieval and clustering; on that basis, a person who wants to understand the report content usually still has to read the whole report to find answers to the core questions of concern. This search for key information is done by humans and is time-consuming and labour-intensive.
Disclosure of Invention
The invention provides a text abstract generating method, which aims to solve the problem that the core information of concern in a research report cannot be quickly acquired when reading it.
To solve this technical problem, the invention provides a text abstract generating method comprising the following steps:
randomly combining at least two given articles to generate a balanced binary tree in which each leaf node represents one article;
and fusing, layer by layer, every two articles connected in the balanced binary tree, to generate a target text abstract that fuses the key information of the at least two articles.
Preferably, the balanced binary tree includes nodes in layers 1 to N, where the layer-N nodes are the leaf nodes and N is a positive integer, and fusing every two connected articles layer by layer includes:
fusing the layer-N node articles into layer-(N-1) node articles, where each layer-(N-1) node represents one layer-(N-1) node article, generated by fusing the layer-N node articles connected to that same layer-(N-1) node;
and fusing layer by layer from layer N-1 up to layer 1 to generate the layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node; the layer-1 node article is the target text abstract.
Preferably, fusing two articles specifically comprises the following steps:
determining the two connected articles through their common upper-layer node;
identifying the required information in the two connected articles with a named entity recognition technique and extracting key sentences from it, where a key sentence is a sentence containing key information of an article;
screening the extracted key sentences based on the similarity between them, where the similarity between key sentences is determined jointly by an anchor-based similarity and a semantic cosine similarity;
and splicing the screened key sentences to obtain the key-sentence set of the two fused articles.
Preferably, identifying the required information in the two connected articles with a named entity recognition technique and extracting key sentences from it specifically includes the following steps:
identifying the required information in the two connected articles with a named entity recognition technique, and filling it into a preset question template to generate a number of questions;
and extracting an answer fragment from each article for each question with a Mengzi-BERT model; the obtained answer fragments are the key sentences.
Preferably, screening the extracted key sentences based on the similarity between them specifically includes the following steps:
matching every two elements of the two key-sentence sets according to their similarity to form a bipartite graph;
and selecting, with a greedy algorithm, the key-sentence set with the most information and the fewest sentences.
Preferably, the similarity between key sentences is computed as

sim(sᵢ, sⱼ) = λ · sim_anchor(sᵢ, sⱼ) + (1 − λ) · sim_cos(sᵢ, sⱼ)

where sim_cos is the cosine similarity computed from semantics: the vector representation v_x of each key sentence is obtained with a Mengzi pre-training model, where x, a positive integer, is the number of the key sentence, and the cosine similarity between two vectors is

sim_cos(sᵢ, sⱼ) = (vᵢ · vⱼ) / (‖vᵢ‖ ‖vⱼ‖)

sim_anchor is the anchor-based similarity, which can be confirmed by number comparison or character comparison, and λ is the weight coefficient of the anchor-based similarity.
Preferably, the text abstract generating method further includes the following steps:
after all articles have been fused, ordering the key sentences of the final fused article with a Mengzi-BERT pre-training model to obtain an initial target text abstract; and generating transition texts between the key sentences of the initial target text abstract with a Mengzi-T5 model, to obtain a target text abstract with transition texts.
Preferably, generating transition texts between the key sentences of the initial target text abstract with a Mengzi-T5 model, to obtain the target text abstract with transition texts, specifically includes the following steps:
setting a mask between each pair of adjacent key sentences of the initial target text abstract;
and predicting the content of each mask with the Mengzi-T5 model to obtain the generated transition texts and thereby the target text abstract with transition texts, where the transition texts complete the logical relationship between adjacent key sentences of the initial target text abstract.
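The mask setting in this claim can be sketched as below, under the assumption that Mengzi-T5 follows the T5 span-infilling convention with `<extra_id_k>` sentinel tokens; the decoding step with the actual pre-trained model is omitted.

```python
from typing import List

def build_masked_input(key_sentences: List[str]) -> str:
    """Place one sentinel mask between each pair of adjacent key
    sentences; a T5-style model is then asked to predict the
    transition text at each sentinel."""
    parts = []
    for i, sent in enumerate(key_sentences):
        parts.append(sent)
        if i < len(key_sentences) - 1:
            parts.append(f"<extra_id_{i}>")   # T5 sentinel token (assumed)
    return " ".join(parts)

print(build_masked_input(["Revenue grew 12%.", "Margins narrowed."]))
```

The model's prediction for each sentinel is then inserted in its place, yielding the target text abstract with transition texts.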
Preferably, the text abstract generating method further includes the following step:
generating an overview and/or summary text (a paragraph or a sentence) for the target text abstract with transition texts, to obtain the final target text abstract.
Preferably, generating the overview and/or summary text for the target text abstract with transition texts, to obtain the final target text abstract, specifically includes the following steps:
generating and marking a topic word for each key sentence in turn with a Mengzi-T5 model;
designing prompt question-answer templates that ask for the main content related to each topic word in the target text abstract with transition texts, setting the answer of each question as a mask, and predicting the content of the mask with the Mengzi-T5 model to obtain the overview and/or summary text;
and combining the overview and/or summary text with the target text abstract with transition texts to obtain the final target text abstract.
Compared with the prior art, the text abstract generating method of the invention has the following advantages:
1. The method comprises the following steps: randomly combining at least two given articles to generate a balanced binary tree in which each leaf node represents one article; and fusing, layer by layer, every two articles connected in the balanced binary tree, to generate a target text abstract that fuses the key information of the at least two articles. Because the target text abstract fuses the key information of the articles, readers can quickly acquire that information, which improves their reading efficiency.
2. In the invention, the balanced binary tree includes nodes in layers 1 to N, where the layer-N nodes are leaf nodes and N is a positive integer. The layer-N node articles are fused into layer-(N-1) node articles, each layer-(N-1) node representing one article generated by fusing the layer-N node articles connected to it; fusion then proceeds layer by layer from layer N-1 up to layer 1, and the layer-1 node article, generated by fusing the layer-2 node articles connected to the layer-1 node, is the target text abstract. Obtaining the abstract body by fusing connected articles pairwise, layer by layer, guarantees that the final key-sentence set contains all the key information, while less important information is discarded during fusion; this reduces information redundancy while keeping coverage comprehensive, improving the reading experience.
3. In the invention, articles are fused pairwise as follows: the two connected articles are determined through their common upper-layer node; the required information in the two articles is identified with a named entity recognition technique and key sentences, i.e. sentences containing an article's key information, are extracted from it; the extracted key sentences are screened by the similarity between them, determined jointly by an anchor-based similarity and a semantic cosine similarity; and the screened key sentences are spliced into the key-sentence set of the two fused articles. Screening and splicing the key sentences by similarity helps guarantee that the resulting key-sentence set contains all the important information.
4. In the invention, the required information in the two connected articles is identified with a named entity recognition technique and filled into a preset question template, generating a number of questions; a Mengzi-BERT model then extracts an answer fragment from each article for each question, and the obtained answer fragments are the key sentences. The preset question template makes the generated questions controllable, letting users introduce their preference for the topics or entities they care about most; the extractive question-answering model, in turn, makes it convenient to locate the key-sentence fragments in a paragraph more accurately.
5. In the invention, every two elements of the two key-sentence sets are matched by similarity to form a bipartite graph, and a greedy algorithm selects the key-sentence set with the most information and the fewest sentences as the key-sentence set of the fused articles. The more compact the fused key-sentence set, the better: comprehensiveness of content is kept while the amount of reading is reduced, helping users obtain information quickly.
6. In the invention, the similarity between key sentences is determined jointly by an anchor-based similarity and a semantic cosine similarity. This design strengthens the reliability of the similarity comparison between key sentences, improving the readability of the ordered text and the user's reading experience.
7. In the invention, transition texts are generated between the key sentences of the initial target text abstract with a Mengzi-T5 model, yielding a target text abstract with transition texts. After the preceding steps, the initial target text abstract already contains essentially all the information an article abstract needs, but the sentences are simply concatenated, so coherence and readability suffer; generating transition texts between the key sentences further improves the readability of the target text abstract and the user's reading experience.
8. In the invention, the transition texts are generated by setting a mask between each pair of adjacent key sentences of the initial target text abstract and predicting the content of each mask with the Mengzi-T5 model; the transition texts complete the logical relationship between adjacent key sentences. Because the initial target text abstract already contains essentially all the information a report abstract needs, a transition text is usually a short logical phrase, connective, or subtitle that carries little new information; the generation of transition texts therefore reduces to generating logical phrases, connectives, or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so it generates well even when fine-tuned on few samples.
9. The text abstract generating method of the invention further generates an overview and/or summary text (a paragraph or a sentence) for the target text abstract with transition texts, and combines them to obtain the final target text abstract. The overview and/or summary text makes the final target text abstract more convenient to read, helps users obtain information faster, and further improves the reading experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart of a text summary generating method according to a first embodiment of the present invention.
Fig. 2 is a flowchart of step S2 of the text summary generating method according to the first embodiment of the present invention.
Fig. 3 is a flowchart of step S22 of the text summary generating method according to the first embodiment of the present invention.
Fig. 4 is a flowchart of step S221 of the text summary generating method according to the first embodiment of the present invention.
Fig. 5 is a flowchart of step S23 of the text summary generating method according to the first embodiment of the present invention.
Fig. 6 is a schematic diagram of coherence ordering in a text summary generation method according to a first embodiment of the present invention.
Fig. 7 is another flowchart of a text summary generating method according to the first embodiment of the present invention.
Fig. 8 is a flowchart of step S3 of the text summary generating method according to the first embodiment of the present invention.
Fig. 9 is a flowchart of step S4 of the text summary generating method according to the first embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The terms "vertical," "horizontal," "left," "right," "up," "down," "left up," "right up," "left down," "right down," and the like as used herein are for illustrative purposes only.
Referring to fig. 1, a first embodiment of the present invention provides a text summary generating method, which specifically includes the following steps:
step S1: randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein leaf nodes of the tree represent one article;
step S2: and fusing every two articles connected in the tree-shaped balanced binary tree layer by layer to generate a target text abstract fused with key information of at least two articles.
It can be understood that the target text abstract is generated by fusing the key information of the articles, so that the reader can quickly acquire the key information of the articles, and the reading efficiency of the reader is improved
Specifically, in this embodiment, the article is a newspaper.
Further, the balanced binary tree includes nodes in layers 1 to N, where the layer-N nodes are leaf nodes and N is a positive integer; fusing every two connected articles layer by layer specifically includes the following steps:
fusing the layer-N node articles into layer-(N-1) node articles, where each layer-(N-1) node represents one layer-(N-1) node article, generated by fusing the layer-N node articles connected to that same layer-(N-1) node;
fusing layer by layer from layer N-1 up to layer 1 to generate the layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node and is the target text abstract. Specifically: the layer-(N-1) node articles are fused into layer-(N-2) node articles, each layer-(N-2) node representing one layer-(N-2) node article generated by fusing the layer-(N-1) node articles connected to the same layer-(N-2) node; and so on, until the layer-3 node articles are fused into layer-2 node articles, and the layer-2 node articles are fused into the layer-1 node article, which is the target text abstract.
Obtaining the abstract body by fusing, layer by layer, every two articles connected in the balanced binary tree guarantees that the final key-sentence set contains all the key information, while less important information is discarded during fusion; this reduces information redundancy while keeping coverage comprehensive, improving the reading experience.
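The layer-by-layer fusion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `fuse` stands in for the pairwise fusion of steps S21 to S24, and the toy `fuse` used in the example simply keeps the deduplicated union of words.

```python
from typing import Callable, List

def fuse_tree(articles: List[str], fuse: Callable[[str, str], str]) -> str:
    """Fuse the leaf articles pairwise, layer by layer, until a single
    layer-1 node article (the target text abstract) remains."""
    layer = list(articles)                    # layer N: the leaf nodes
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer) - 1, 2):
            # two nodes connected to the same upper-layer node are fused
            next_layer.append(fuse(layer[i], layer[i + 1]))
        if len(layer) % 2 == 1:               # unpaired node is carried up
            next_layer.append(layer[-1])
        layer = next_layer
    return layer[0]

# toy fuse: deduplicated union of "key sentences" (here, single words)
toy_fuse = lambda a, b: " ".join(dict.fromkeys((a + " " + b).split()))
print(fuse_tree(["a b", "b c", "c d"], toy_fuse))
```

Because each fusion step discards redundant material, the content shared by several leaves survives only once in the layer-1 article.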
Referring to fig. 1 and fig. 2, fusing the articles pairwise in step S2 specifically includes the following steps:
Step S21: determining the two connected articles through their common upper-layer node;
Step S22: identifying the required information in the two connected articles with a named entity recognition technique and extracting key sentences from it, where a key sentence is a sentence containing key information of an article;
Step S23: screening the extracted key sentences based on the similarity between them, where the similarity between key sentences is determined jointly by an anchor-based similarity and a semantic cosine similarity;
Step S24: splicing the screened key sentences to obtain the key-sentence set of the two fused articles.
It can be understood that screening and splicing the key sentences based on their similarity helps guarantee that the resulting key-sentence set contains all the important information.
Referring to fig. 1 to 3, step S22 specifically includes the following steps:
Step S221: identifying the required information in the two connected articles with a named entity recognition technique, and filling it into a preset question template to generate a number of questions;
Step S222: extracting an answer fragment from each article for each question with a Mengzi-BERT model; the obtained answer fragments are the key sentences.
It can be understood that the generated questions can be controlled through the preset question template, which lets users introduce their preference for the topics or entities they care about most; on the other hand, the extractive question-answering model makes it convenient to locate the key-sentence fragments in a paragraph more accurately.
It should be noted that named entity recognition is prior art; for details see: Che, W., Feng, Y., Qin, L., & Liu, T. (2021). N-LTP: An Open-source Neural Language Technology Platform for Chinese.
Referring to fig. 1 to 4, step S221 specifically includes the following steps:
Step S2211: identifying the required information in each article with a named entity recognition technique and filling it into a preset question template to generate a number of questions; that is, the required information is determined by the blanks in the question template;
Step S2212: extracting an answer fragment from each article for each question with a Mengzi-BERT model; the obtained answer fragments are the key sentences.
It can be understood that the generated questions can be controlled through the preset question template, which lets users introduce their preference for the topics or entities they care about most; on the other hand, the extractive question-answering model makes it convenient to locate the key-sentence fragments in a paragraph more accurately.
Further, a preset question template may carry a number of question forms, for example "What is the business model of [company]?"; the entity names of the companies in a paragraph are then recognised with the named entity recognition technique and filled into the template, generating the questions.
Further, when answer fragments are extracted, the questions generated from each research report are numbered Q1, Q2, ..., QY in turn, where Y is a positive integer; for example, two questions generated from report A are denoted A.Q1 and A.Q2 and their corresponding answers A1 and A2, to facilitate subsequent matching.
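The question-generation step can be sketched as follows. The template text, the toy entity recogniser, and the labelling helper are all illustrative assumptions; a real system would use an NER model such as N-LTP.

```python
import re
from typing import Dict, List

QUESTION_TEMPLATE = "What is the business model of {entity}?"

def toy_ner(paragraph: str) -> List[str]:
    """Stand-in for a real NER system such as N-LTP: here, any
    capitalised token ending in 'Corp' is treated as a company name."""
    return re.findall(r"\b[A-Z]\w*Corp\b", paragraph)

def generate_questions(paragraph: str) -> List[str]:
    # fill each recognised entity into the preset question template
    return [QUESTION_TEMPLATE.format(entity=e) for e in toy_ner(paragraph)]

def label_questions(report_id: str, questions: List[str]) -> Dict[str, str]:
    # number questions Q1, Q2, ... per report for later answer matching
    return {f"{report_id}.Q{i + 1}": q for i, q in enumerate(questions)}

qs = generate_questions("AlphaCorp and BetaCorp both issued Q3 reports.")
print(label_questions("A", qs))
```

Each labelled question is then answered per article by the extractive question-answering model, and the answer fragments become the key sentences.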
Referring to fig. 2 and 5, step S23 specifically includes the following steps:
Step S231: matching every two elements of the two key-sentence sets according to their similarity to form a bipartite graph;
Step S232: selecting, with a greedy algorithm, the key-sentence set with the most information and the fewest sentences as the key-sentence set of the fused articles.
It can be understood that the more compact the fused key-sentence set, the better: comprehensiveness of content is kept while the amount of reading is reduced, which helps users obtain information quickly.
It should be noted that the bipartite-graph matching algorithm follows the prior art; for details see patent CN106547739B, 2019-04-02. The present invention uses key sentences rather than topics as nodes: two key sentences are connected if their similarity exceeds a set threshold. After iterating, a bipartite graph is generated whose nodes represent the sentence numbers in the reports and whose edges represent the similarity of two sentences. The degree of each node, i.e. the number of edges incident to it, is computed first and taken as the information content of the corresponding sentence; a greedy algorithm then selects the key-sentence set with the most information and the fewest sentences as the key-sentence set of the fused reports.
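A sketch of the degree-based greedy selection described above, with a toy Jaccard word-overlap similarity standing in for the anchor-plus-cosine similarity of the patent; the threshold and the coverage criterion are illustrative assumptions.

```python
from typing import Callable, List

def select_key_sentences(sentences: List[str],
                         sim: Callable[[str, str], float],
                         threshold: float = 0.5) -> List[str]:
    """Link sentences whose similarity exceeds the threshold, take a
    sentence's coverage (its graph neighbourhood) as its information
    content, and greedily keep high-coverage sentences until every
    sentence is represented by a selected one."""
    n = len(sentences)
    covers = {i: {i} for i in range(n)}       # each sentence covers itself
    for i in range(n):
        for j in range(i + 1, n):
            if sim(sentences[i], sentences[j]) > threshold:
                covers[i].add(j)
                covers[j].add(i)
    uncovered, chosen = set(range(n)), []
    while uncovered:
        # most informative = covers the most still-uncovered sentences
        best = max(sorted(uncovered), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return [sentences[i] for i in sorted(chosen)]

# toy similarity: Jaccard overlap of word sets
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

print(select_key_sentences(
    ["profit rose 10%", "profit rose 10% in Q3", "a new CEO was named"],
    jaccard))
```

Near-duplicate sentences collapse into one representative, so the selected set stays compact while covering all the content.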
Further, the similarity between key sentences is computed as

sim(sᵢ, sⱼ) = λ · sim_anchor(sᵢ, sⱼ) + (1 − λ) · sim_cos(sᵢ, sⱼ)

where sim_cos is the cosine similarity computed from semantics: the vector representation v_x of each key sentence is obtained with a Mengzi pre-training model, where x, a positive integer, is the number of the key sentence, and the cosine similarity between two vectors is

sim_cos(sᵢ, sⱼ) = (vᵢ · vⱼ) / (‖vᵢ‖ ‖vⱼ‖)

sim_anchor is the anchor-based similarity, which can be confirmed by number comparison or character comparison, and λ is the weight coefficient of the anchor-based similarity.
It can be understood that this design helps enhance the reliability of the similarity comparison between key sentences, thereby enhancing the readability of the sorted text and improving the user's reading experience.
Further, in the anchor-based similarity calculation, the number comparison method is used only when numbers are an important component of the article content; in that case, two sentences containing the same number are considered similar sentences.
Further, in the anchor-based similarity calculation, the character comparison method may compute similarity using the edit distance between two sentences, the length of their longest common subsequence, or their N-Gram similarity.
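A minimal sketch of the combined similarity, assuming `embed` stands in for the Mengzi sentence encoder and using number comparison as the anchor (the 0.3 weight is illustrative, not from the patent):

```python
import math
import re

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def anchor_similarity(s1, s2):
    """Number-comparison anchor: sentences sharing a number count as similar."""
    nums1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    nums2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return 1.0 if nums1 & nums2 else 0.0

def sentence_similarity(s1, s2, embed, weight=0.3):
    """Semantic cosine similarity plus weighted anchor-based similarity."""
    return cosine(embed(s1), embed(s2)) + weight * anchor_similarity(s1, s2)
```

The character-comparison anchor (edit distance, longest common subsequence, or N-Gram) would slot into `anchor_similarity` the same way.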
Further, in step S2, the arrangement of the key sentences is completed by a Mengzi-BERT pre-training model; the Mengzi-BERT pre-training model calculates the probability that a key sentence is arranged at the i-th position, where i is a positive integer, and the key sentences are sorted based on the calculation result.
The Mengzi-BERT pre-training model adopts sentence-order prediction as a pre-training task, so it adapts well to the downstream task of inter-sentence coherence sorting.
Referring to FIG. 6, an example of key sentence arrangement is given: three sentences s1, s2 and s3 are randomly shuffled into, for example, the order (s2, s3, s1). The input sample of the Mengzi encoder is constructed as shown in the figure, where "[CLS]" denotes the input-sample initiator, a sentence-start marker precedes the textual content of each sentence, and "[SEP]" denotes a separator (which may be regarded here as the end of the input sample). The sample is then encoded with the Mengzi encoder to obtain the sentence representations h1, h2 and h3. These representations serve as the key vectors K and value vectors V of the Mengzi decoder, and the decoder state at each step serves as its query vector Q; after decoding, a pointer network yields P(i, j), the probability that sentence j is arranged at position i. Thus the coherence ordering of the three sentences is completed; it will be appreciated that the ordering of more than three sentences works in the same way.
Details of how to use the Mengzi-BERT pre-training model to accomplish the coherence ordering are described in Lee, H., Hudson, D. A., Lee, K., & Manning, C. D. (2020). SLM: Learning a Discourse Language Representation with Sentence Unshuffling.
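The pointer-network decoding described above can be sketched with toy vectors; in the real system, `sentence_vecs` and `query_vecs` would come from the Mengzi encoder and decoder, which are assumptions here:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_order(sentence_vecs, query_vecs):
    """Greedy pointer-network decode: at each position i, score every
    not-yet-placed sentence j by dot(query_i, h_j), softmax over the
    remaining candidates, and place the argmax."""
    remaining = list(range(len(sentence_vecs)))
    order = []
    for q in query_vecs:
        scores = [sum(a * b for a, b in zip(q, sentence_vecs[j]))
                  for j in remaining]
        probs = softmax(scores)
        best = max(range(len(remaining)), key=lambda k: probs[k])
        order.append(remaining.pop(best))
    return order
```

With one-hot toy representations, the decode simply recovers whichever sentence each query attends to most.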
Referring to fig. 7, the summary generation method further includes the following steps:
step S3: after the fusion of all articles is completed, sorting the key sentences in the finally fused article with a Mengzi-BERT pre-training model to obtain an initial target text abstract; and generating transition texts between the key sentences of the initial target text abstract through the Mengzi-T5 model, so as to obtain the target text abstract with transition texts.
It can be understood that, through the previous steps, the initial target text abstract already contains essentially all the information necessary for an article abstract, but that information is simply spliced together directly, so coherence and readability suffer during reading; generating transition texts between the key sentences of the initial target text abstract further improves the readability of the target text abstract and the user's reading experience.
Referring to fig. 7 and 8, step S3 specifically includes the following steps:
step S31: respectively setting masks among key sentences of the initial target text abstract;
step S32: and predicting the content of the mask through a monte-T5 model to obtain a generated transition text so as to obtain a target text abstract with the transition text, wherein the transition text is used for perfecting the logic relationship between adjacent key sentences in the initial target text abstract.
It will be appreciated that, through the preceding algorithmic steps, the initial target text abstract already contains substantially all of the information necessary for a research-report abstract; the transition text is therefore often a relatively short logical phrase, conjunction, or subtitle and does not contain much informative text. The generation of transition text can thus be simplified into the generation of logical phrases, conjunctions, or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a good generation effect can be achieved under a few-shot fine-tuning setting.
For example, three ordered key sentences are input and recorded as s1, s2 and s3. The input template is set as: "s1 &lt;mask 1&gt; s2 &lt;mask 2&gt; s3 &lt;/s&gt;". Finally, the Mengzi-T5 model predicts the contents of &lt;mask 1&gt; and &lt;mask 2&gt; to obtain the generated transition texts, where "&lt;/s&gt;" represents the input end symbol. The target text abstract with the transition texts is recorded as: Sum.
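Building the masked input template from the ordered key sentences might look like this (the mask token names are illustrative; the actual Mengzi-T5 sentinel tokens may differ):

```python
def build_transition_input(sentences):
    """Interleave ordered key sentences with numbered mask tokens so a
    T5-style model can fill in the transition texts between them."""
    parts = []
    for i, s in enumerate(sentences):
        parts.append(s)
        if i < len(sentences) - 1:          # a mask between each pair
            parts.append(f"<mask {i + 1}>")
    return " ".join(parts) + " </s>"        # </s> ends the input sample
```

The model's predictions at the mask positions are then spliced back between the sentences to form the abstract with transitions.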
Further, the Mengzi-T5 model needs to be fine-tuned with a fine-tuning dataset before use; the fine-tuning dataset is constructed based on part-of-speech tagging and/or punctuation-mark recognition and/or subtitles.
It can be understood that, since the generation of transition text can be simplified into the generation of logical phrases, conjunctions, or subtitles between sentences, the targeted construction of the fine-tuning dataset effectively ensures the generation effect of the Mengzi-T5 model.
Specifically, the application scenario for constructing the fine-tuning dataset based on part-of-speech tagging is as follows: an article generally contains many words whose part of speech is preposition or conjunction; these prepositions, conjunctions and the like are replaced with masks, and when the model is fine-tuned, the replaced words serve as labels and generating the text at the mask positions serves as the training task.
The application scenario for constructing the fine-tuning dataset based on punctuation-mark recognition is as follows: research reports commonly contain constructions such as "Investment advice: ...", where a colon or similar mark is followed by a summary of the content. Thus, the text before the colon is replaced with a mask symbol, and the fine-tuning task is likewise to generate the text at the mask position.
The application scenario for constructing the fine-tuning dataset based on subtitles is as follows: research reports often contain a large number of subtitles in the form "subtitle text: paragraph text"; data are constructed from them by the same method as for the punctuation-mark-based fine-tuning dataset.
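The punctuation-based (and subtitle-based) construction can be sketched as follows; `colon_mask_examples` is an illustrative helper that turns "label: body" paragraphs into mask-prediction training pairs:

```python
import re

def colon_mask_examples(paragraphs, mask_token="<mask>"):
    """For each paragraph of the form 'label: body', replace the label with
    a mask token and keep the label as the training target, mimicking the
    punctuation-based dataset construction."""
    examples = []
    for p in paragraphs:
        m = re.match(r"^([^:：]+)[:：]\s*(.+)$", p)  # Western or CJK colon
        if m:
            label, body = m.group(1), m.group(2)
            examples.append({"input": f"{mask_token}: {body}",
                             "target": label})
    return examples
```

Paragraphs without a colon are simply skipped, so the pseudo-data stays clean.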
With continuing reference to fig. 7, the summary generation method herein further includes the following steps:
step S4: and generating a summary and/or a summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or a sentence, thereby obtaining the final target text abstract.
As can be appreciated, generating the summary and/or summary text can assist the user in obtaining information more quickly, further improving the reading experience of the user.
Referring to fig. 7 and 9, step S4 specifically includes the following steps:
step S41: sequentially generating a topic vocabulary for each key sentence through a Mengzi-T5 model and marking the topic vocabulary;
step S42: designing a prompt question-answer template for querying the main content related to each topic word in the target text abstract with the transition text, setting the answer corresponding to each question as a mask, and predicting the content of the mask with the Mengzi-T5 model to obtain the summary and/or summary text;
step S43: and combining the summary and/or summary text and the target text abstract with the transition text to obtain a final target text abstract.
As can be appreciated, generating the summary and/or the summary text can make the final target text abstract more convenient to read, assist the user to obtain information more quickly, and further improve the reading experience of the user.
Specifically, taking the summary as an example, three key sentences are input and recorded as s1, s2 and s3. First, a topic word is generated in turn for each key sentence by the Mengzi-T5 model; the topic words are recorded as t1, t2 and t3. Then the prompt template is designed as: "Sum. Question: regarding t1, t2 and t3, what does this section mainly say? Answer: &lt;mask&gt;". Finally, the Mengzi-T5 model predicts the content of &lt;mask&gt; to obtain the summary text. This step requires fine-tuning the model with a small amount of data; dataset-source references include DOU Z-Y, LIU P, HAYASHI H, et al., 2021, GSum: A General Framework for Guided Neural Abstractive Summarization, abs/2010.08014, and HE J, KRYSCINSKI W, MCCANN B, et al., 2020, CTRLsum: Towards Generic Controllable Text Summarization, abs/2012.04281.
Specifically, the topic-generation dataset is constructed as follows: for a paragraph in a research report and its corresponding subtitle, the longest common subsequence of the two is extracted as the candidate topic word, and a large amount of training data is constructed with the paragraph as the key sentence. The prompt template is designed as: "key sentence: paragraph text; topic word: &lt;mask&gt;"; by fine-tuning the Mengzi-T5 model with these training data, the corresponding topic word can be generated at the "&lt;mask&gt;" position.
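The longest-common-subsequence extraction of candidate topic words can be sketched with a standard dynamic program (whether the patent operates on characters or words is not specified; characters are assumed here):

```python
def longest_common_subsequence(a, b):
    """Character-level LCS via dynamic programming; the LCS of a paragraph
    and its subtitle serves as the candidate topic word."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS string of a[:i] and b[:j]
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]
```

Storing strings in the table keeps the sketch short; a production version would store lengths and backtrack.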
The dataset for answer generation is constructed as follows: it is basically consistent with the topic-generation dataset, but since a paragraph in the topic-generation dataset corresponds to only one subtitle and one topic word, several paragraphs, their corresponding subtitles, and the corresponding keywords are combined, and the combined subtitles are then replaced with mask marks to construct a large amount of pseudo data. Finally, fine-tuning the Mengzi-T5 model with the pseudo data allows the corresponding summary or summary text to be generated at the "&lt;mask&gt;" position.
Generally, the contents of the summary and the summary text do not differ much, and either or both may optionally be generated.
Specifically, in this embodiment, only the summary is generated.
Compared with the prior art, the text abstract generating method has the following advantages:
1. The text abstract generating method specifically includes the following steps: randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein each leaf node of the tree represents one article; and fusing, layer by layer, every two connected articles in the tree-shaped balanced binary tree to generate a target text abstract that fuses the key information of the at least two articles. It can be understood that, because the target text abstract is generated by fusing the key information of the articles, readers can quickly acquire the key information of the articles, and their reading efficiency is improved.
2. In the invention, the tree-shaped balanced binary tree comprises nodes of the 1st to N-th layers, the N-th-layer nodes being leaf nodes and N a positive integer, and fusing every two connected articles layer by layer comprises the following steps: fusing the N-th-layer node articles into (N-1)-th-layer node articles, wherein each node at the (N-1)-th layer represents one (N-1)-th-layer node article, each generated by fusing the N-th-layer node articles connected to that same (N-1)-th-layer node; and fusing layer by layer from the (N-1)-th layer up to the 1st layer to generate the layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node and which is the target text abstract. By fusing the two articles connected at each node layer by layer, the final key sentence set is guaranteed to contain all key information, while some less important information is removed during the layer-by-layer fusion; information redundancy is thus reduced while comprehensive information coverage is ensured, improving the reading experience.
3. In the invention, fusing two articles specifically includes the following steps: determining two connected articles through the same upper-layer node; identifying the demand information in the two connected articles by a named entity recognition technique and extracting key sentences from it, a key sentence being a sentence containing key information in the article; screening the extracted key sentences based on the similarity between key sentences, which is determined jointly by the anchor-based similarity and the semantic cosine similarity; and splicing the screened key sentences to obtain the key sentence set after the two articles are fused. It can be understood that screening and splicing the key sentences based on their similarity helps ensure that the resulting key sentence set contains all the important information.
4. In the invention, the demand information in two connected articles is identified by a named entity recognition technique and filled into a preset question template, thereby generating a plurality of questions; an answer segment is then extracted from each article for each question using a Mengzi-BERT model, and the obtained answer segments are the key sentences. It can be understood that the generated questions can be controlled through the preset question template, which helps the user introduce preferences for topics or entities of particular concern; on the other hand, the extractive question-answering model makes it convenient to locate the more accurate key sentence segments within a paragraph.
5. In the invention, pairwise elements in two key sentence sets are matched according to the similarity to form a bipartite graph; and selecting a key sentence set with most information content and least sentence number through a greedy algorithm as a key sentence set after the multiple articles are fused. It can be understood that the more compact the key sentence set after the multiple articles are fused, the better, the comprehensiveness of the content is ensured under the condition of reducing the reading amount, and the method is favorable for helping the user to quickly obtain information.
6. In the invention, the similarity between the key sentences is comprehensively determined by the similarity calculated based on the anchor points and the cosine similarity calculated based on the semantics. The design is beneficial to enhancing the reliability of the similarity comparison result between the key sentences so as to enhance the readability of the sequenced texts and improve the reading experience of the user.
7. In the invention, transition texts are generated between the key sentences of the initial target text abstract through the Mengzi-T5 model, so as to obtain the target text abstract with transition texts. It can be understood that, through the previous steps, the initial target text abstract already contains essentially all the information necessary for an article abstract, but that information is simply spliced together directly, so coherence and readability suffer during reading; generating transition texts between the key sentences further improves the readability of the target text abstract and the user's reading experience.
8. In the invention, the transition texts are generated by respectively setting masks between the key sentences of the initial target text abstract and predicting the content of each mask through the Mengzi-T5 model, the transition texts being used to perfect the logical relationship between adjacent key sentences. Through the preceding algorithmic steps, the initial target text abstract already contains substantially all the information necessary for a research-report abstract, so the transition text is often a short logical phrase, conjunction, or subtitle and does not contain much informative text. The generation of transition text can thus be simplified into the generation of logical phrases, conjunctions, or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a good generation effect can be achieved under a few-shot fine-tuning setting.
9. The text abstract generating method further comprises the following steps: and generating a summary and/or a summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or a sentence, and combining the summary and/or the summary text and the target text abstract with the transition text to obtain a final target text abstract. As can be appreciated, generating the summary and/or the summary text can make the final target text abstract more convenient to read, assist the user to acquire information more quickly, and further improve the reading experience of the user.
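The layer-by-layer pairwise fusion over the balanced binary tree described in advantages 1 and 2 can be sketched as follows, with `fuse_pair` standing in for the key-sentence fusion of the earlier steps; carrying an odd leftover article up unchanged is one possible balancing choice, assumed here:

```python
def fuse_layer_by_layer(articles, fuse_pair):
    """Treat the article list as the leaf layer of a balanced binary tree
    and fuse connected pairs layer by layer until one article remains —
    that single remaining article is the target text abstract."""
    layer = list(articles)
    while len(layer) > 1:
        next_layer = []
        for k in range(0, len(layer) - 1, 2):
            next_layer.append(fuse_pair(layer[k], layer[k + 1]))
        if len(layer) % 2 == 1:          # odd article is carried up unchanged
            next_layer.append(layer[-1])
        layer = next_layer
    return layer[0]
```

With a toy `fuse_pair` that concatenates strings, four leaves reduce through two layers to a single fused result.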
The text abstract generating method disclosed by the embodiments of the present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help understand the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this description should not be construed as limiting the present invention; any modification, equivalent replacement, or improvement made within the principle of the present invention shall fall within its protection scope.

Claims (10)

1. A text abstract generating method is characterized by comprising the following steps:
randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein leaf nodes of the tree represent one article;
and fusing every two connected articles in the tree-shaped balanced binary tree layer by layer to generate a target text abstract fused with key information of at least two articles.
2. The method for generating a text summary according to claim 1, wherein the tree-shaped balanced binary tree includes nodes in layers 1 to N, the node in layer N is a leaf node, N is a positive integer, and the step of fusing every two connected articles in the tree-shaped balanced binary tree includes:
fusing the N-th layer of node articles into N-1-th layer of node articles, wherein each node at the N-1-th layer represents one N-1-th layer of node articles, and each N-1-th layer of node articles is generated by fusing the N-th layer of node articles connected to the same N-1-th layer of node;
and fusing layer by layer from the N-1 to the 1 st layer to generate a layer 1 node article, wherein the layer 1 node article is generated by fusing layer 2 node articles connected to the layer 1 node, and the layer 1 node article is the target text abstract.
3. The text abstract generating method of claim 1, wherein the step of fusing every two connected articles in the tree-shaped balanced binary tree layer by layer specifically comprises the following steps:
determining two connected articles through the same upper-layer node;
identifying the demand information in two connected articles by a named entity identification technology, and extracting a key sentence from the demand information, wherein the key sentence is a sentence containing key information in the article;
screening the extracted key sentences based on the similarity between the key sentences, wherein the similarity between the key sentences is determined by the similarity calculated based on the anchor points and the cosine similarity calculated based on the semantics;
and splicing the screened key sentences to obtain a key sentence set after the two articles are fused.
4. The text abstract generating method as claimed in claim 3, wherein the step of identifying the requirement information in two connected articles by the named entity identification technology and extracting the key sentence therefrom comprises the following steps:
identifying the demand information in the two connected articles by a named entity identification technology, and filling the demand information into a preset problem template so as to generate a plurality of problems;
and extracting an answer segment from each article for each question by using a Mengzi-BERT model, wherein the obtained answer segments are the key sentences.
5. The method for generating a text summary according to claim 3, wherein the step of screening the extracted key sentences based on the similarity between the key sentences specifically comprises the steps of:
matching every two elements in the two key sentence sets according to the similarity to form a bipartite graph;
and selecting, through a greedy algorithm, the key sentence set with the most information content and the fewest sentences.
6. The text summary generation method of claim 3, wherein:
the similarity between key sentences is calculated by the formula sim(sa, sb) = cos(va, vb) + λ·sim_anchor(sa, sb), wherein cos(va, vb) is the cosine similarity based on semantic calculation: the vector representation vx of each key sentence sx is calculated by the Mengzi pre-training model, x being a positive integer numbering the key sentence, and the cosine similarity between the vectors is then calculated by the specific formula cos(va, vb) = (va · vb) / (‖va‖ ‖vb‖); sim_anchor(sa, sb) is the similarity calculated based on anchor points, which can be confirmed by a number comparison method or a character comparison method, and λ is a weight coefficient for the similarity calculated based on anchor points.
7. The text summary generation method of claim 1, further comprising the steps of:
after the fusion of all articles is completed, sorting the key sentences in the finally fused article with a Mengzi-BERT pre-training model to obtain an initial target text abstract; and generating transition texts between the key sentences of the initial target text abstract through a Mengzi-T5 model, so as to obtain the target text abstract with transition texts.
8. The method for generating a text abstract according to claim 7, wherein the step of generating transition texts between key sentences of the initial target text abstract through the Mengzi-T5 model to obtain the target text abstract with transition texts comprises the following steps:
respectively setting masks among key sentences of the initial target text abstract;
and predicting the content of the mask through the Mengzi-T5 model to obtain the generated transition text, so as to obtain the target text abstract with the transition text, wherein the transition text is used for perfecting the logical relationship between adjacent key sentences in the initial target text abstract.
9. The text summary generation method of claim 7, further comprising the steps of:
and generating a summary and/or a summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or a sentence, and the final target text abstract is obtained.
10. The method for generating a text abstract according to claim 9, wherein the generating of the summary and/or the summary text for the target text abstract with the transition text, the summary text being a paragraph or a sentence, thereby obtaining the final target text abstract, comprises the following steps:
sequentially generating a topic vocabulary for each key sentence through a Mengzi-T5 model and marking the topic vocabulary;
designing a prompt question-answer template for inquiring and generating main lecture contents related to each topic word in a target text abstract with a transition text, setting answer answers corresponding to questions as masks, and predicting the contents of the masks by using a Mengzi-T5 model to obtain an overview and/or summary text;
and combining the summary and/or summary text and the target text abstract with the transition text to obtain a final target text abstract.
CN202210380604.8A 2022-04-12 2022-04-12 Text abstract generating method Pending CN114611520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210380604.8A CN114611520A (en) 2022-04-12 2022-04-12 Text abstract generating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210380604.8A CN114611520A (en) 2022-04-12 2022-04-12 Text abstract generating method

Publications (1)

Publication Number Publication Date
CN114611520A true CN114611520A (en) 2022-06-10

Family

ID=81869041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210380604.8A Pending CN114611520A (en) 2022-04-12 2022-04-12 Text abstract generating method

Country Status (1)

Country Link
CN (1) CN114611520A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997143A (en) * 2022-08-04 2022-09-02 北京澜舟科技有限公司 Text generation model training method and system, text generation method and storage medium
CN114997143B (en) * 2022-08-04 2022-11-15 北京澜舟科技有限公司 Text generation model training method and system, text generation method and storage medium
CN116501862A (en) * 2023-06-25 2023-07-28 西安杰出科技有限公司 Automatic text extraction system based on dynamic distributed collection
CN116501862B (en) * 2023-06-25 2023-09-12 桂林电子科技大学 Automatic text extraction system based on dynamic distributed collection

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
Neculoiu et al. Learning text similarity with siamese recurrent networks
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN111259631B (en) Referee document structuring method and referee document structuring device
CN111324728A (en) Text event abstract generation method and device, electronic equipment and storage medium
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112926345B (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN114611520A (en) Text abstract generating method
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN113065356B (en) IT equipment operation and maintenance fault suggestion processing method based on semantic analysis algorithm
He English grammar error detection using recurrent neural networks
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113032552A (en) Text abstract-based policy key point extraction method and system
CN114764566B (en) Knowledge element extraction method for aviation field
CN115390806A (en) Software design mode recommendation method based on bimodal joint modeling
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
Tüselmann et al. Are end-to-end systems really necessary for NER on handwritten document images?
Cao Generating natural language descriptions from tables
Sheikh et al. Document level semantic context for retrieving OOV proper names
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN115630140A (en) English reading material difficulty judgment method based on text feature fusion
CN114626463A (en) Language model training method, text matching method and related device
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN111274354B (en) Referee document structuring method and referee document structuring device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination