CN114611520A - Text abstract generating method - Google Patents


Publication number
CN114611520A
Authority
CN
China
Prior art keywords
layer
text
articles
key
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210380604.8A
Other languages
Chinese (zh)
Inventor
刘明童
王泽坤
周明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lanzhou Technology Co ltd
Original Assignee
Beijing Lanzhou Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lanzhou Technology Co ltd filed Critical Beijing Lanzhou Technology Co ltd
Priority to CN202210380604.8A priority Critical patent/CN114611520A/en
Publication of CN114611520A publication Critical patent/CN114611520A/en
Pending legal-status Critical Current


Classifications

    • G06F 16/345 — Information retrieval of unstructured textual data; summarisation for human users
    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/25 — Pattern recognition; fusion techniques
    • G06F 40/166 — Text processing; editing, e.g. inserting or deleting
    • G06F 40/216 — Natural language analysis; parsing using statistical methods
    • G06F 40/295 — Natural language analysis; named entity recognition
    • G06F 40/30 — Natural language analysis; semantic analysis
    • G06N 3/04 — Neural networks; architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of text abstract generation, and in particular to a text abstract generating method comprising the following steps: randomly combining at least two given articles to generate a balanced binary tree in which each leaf node represents one article; and fusing, layer by layer, every two articles connected in the balanced binary tree, to generate a target text abstract that fuses the key information of the at least two articles. Because the target text abstract fuses the key information of the articles, readers can quickly acquire that information, which improves their reading efficiency.

Description

Text abstract generating method
Technical Field
The invention relates to the technical field of text abstract generation, in particular to a text abstract generation method.
Background
Research reports in professional fields, such as industry development reports and securities analysis reports, are an important source of high-quality information. Because of their logical rigor and professional depth, a single research report often contains very rich information; moreover, for the same event or subject, many professional institutions and experts publish their own reports. As a result, people must read a large volume of report content to understand the object being analysed; for example, a financial investor must read all the information related to a target company to find the answers of concern and make more accurate decisions. Facing this information overload, technology that improves research-report reading and information processing in professional fields is indispensable for improving working efficiency.
Traditional intelligent report-reading systems focus on collecting and organising information, for example aggregating reports about the same company with a keyword-clustering algorithm so that they are convenient to read and search. However, a research report often contains tens of thousands of words, and simply aggregating information no longer satisfies the need to obtain the core information of interest quickly. Moreover, the technology used in existing intelligent report reading is mainly based on N-gram matching algorithms, such as keyword-based content retrieval and clustering; on that basis, a person who wants to understand the report content usually still has to read the whole report to find answers to the core questions of concern. This search for key information is done by humans and is time-consuming and labour-intensive.
Disclosure of Invention
The invention provides a text abstract generating method, which aims to solve the problem that the core information of concern in a research report cannot be quickly acquired when reading it.
To solve this technical problem, the invention provides a text abstract generating method comprising the following steps:
randomly combining at least two given articles to generate a balanced binary tree in which each leaf node represents one article;
and fusing, layer by layer, every two articles connected in the balanced binary tree, to generate a target text abstract that fuses the key information of the at least two articles.
Preferably, the balanced binary tree includes nodes in layers 1 to N, where the layer-N nodes are the leaf nodes and N is a positive integer, and fusing every two connected articles layer by layer includes:
fusing the layer-N node articles into layer-(N-1) node articles, where each layer-(N-1) node represents one layer-(N-1) node article, generated by fusing the layer-N node articles connected to that same layer-(N-1) node;
and fusing layer by layer from layer N-1 up to layer 1 to generate the layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node; the layer-1 node article is the target text abstract.
Preferably, fusing two articles specifically comprises the following steps:
determining the two connected articles through their common upper-layer node;
identifying the required information in the two connected articles with a named entity recognition technique and extracting key sentences from it, where a key sentence is a sentence containing key information of an article;
screening the extracted key sentences based on the similarity between them, where the similarity between key sentences is determined jointly by an anchor-based similarity and a semantic cosine similarity;
and splicing the screened key sentences to obtain the key-sentence set of the two fused articles.
Preferably, identifying the required information in the two connected articles with a named entity recognition technique and extracting key sentences from it specifically includes the following steps:
identifying the required information in the two connected articles with a named entity recognition technique, and filling it into a preset question template to generate a number of questions;
and extracting an answer fragment from each article for each question with a Mengzi-BERT model; the obtained answer fragments are the key sentences.
Preferably, screening the extracted key sentences based on the similarity between them specifically includes the following steps:
matching every two elements of the two key-sentence sets according to their similarity to form a bipartite graph;
and selecting, with a greedy algorithm, the key-sentence set with the most information and the fewest sentences.
Preferably, the similarity between key sentences is computed as

sim(sᵢ, sⱼ) = λ · sim_anchor(sᵢ, sⱼ) + (1 − λ) · sim_cos(sᵢ, sⱼ)

where sim_cos is the cosine similarity computed from semantics: the vector representation v_x of each key sentence is obtained with a Mengzi pre-training model, where x, a positive integer, is the number of the key sentence, and the cosine similarity between two vectors is

sim_cos(sᵢ, sⱼ) = (vᵢ · vⱼ) / (‖vᵢ‖ ‖vⱼ‖)

sim_anchor is the anchor-based similarity, which can be confirmed by number comparison or character comparison, and λ is the weight coefficient of the anchor-based similarity.
Preferably, the text abstract generating method further includes the following steps:
after all articles have been fused, ordering the key sentences of the final fused article with a Mengzi-BERT pre-training model to obtain an initial target text abstract; and generating transition texts between the key sentences of the initial target text abstract with a Mengzi-T5 model, to obtain a target text abstract with transition texts.
Preferably, generating transition texts between the key sentences of the initial target text abstract with a Mengzi-T5 model, to obtain the target text abstract with transition texts, specifically includes the following steps:
setting a mask between each pair of adjacent key sentences of the initial target text abstract;
and predicting the content of each mask with the Mengzi-T5 model to obtain the generated transition texts and thereby the target text abstract with transition texts, where the transition texts complete the logical relationship between adjacent key sentences of the initial target text abstract.
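The mask setting in this claim can be sketched as below, under the assumption that Mengzi-T5 follows the T5 span-infilling convention with `<extra_id_k>` sentinel tokens; the decoding step with the actual pre-trained model is omitted.

```python
from typing import List

def build_masked_input(key_sentences: List[str]) -> str:
    """Place one sentinel mask between each pair of adjacent key
    sentences; a T5-style model is then asked to predict the
    transition text at each sentinel."""
    parts = []
    for i, sent in enumerate(key_sentences):
        parts.append(sent)
        if i < len(key_sentences) - 1:
            parts.append(f"<extra_id_{i}>")   # T5 sentinel token (assumed)
    return " ".join(parts)

print(build_masked_input(["Revenue grew 12%.", "Margins narrowed."]))
```

The model's prediction for each sentinel is then inserted in its place, yielding the target text abstract with transition texts.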
Preferably, the text abstract generating method further includes the following step:
generating an overview and/or summary text (a paragraph or a sentence) for the target text abstract with transition texts, to obtain the final target text abstract.
Preferably, generating the overview and/or summary text for the target text abstract with transition texts, to obtain the final target text abstract, specifically includes the following steps:
generating and marking a topic word for each key sentence in turn with a Mengzi-T5 model;
designing prompt question-answer templates that ask for the main content related to each topic word in the target text abstract with transition texts, setting the answer of each question as a mask, and predicting the content of the mask with the Mengzi-T5 model to obtain the overview and/or summary text;
and combining the overview and/or summary text with the target text abstract with transition texts to obtain the final target text abstract.
Compared with the prior art, the text abstract generating method of the invention has the following advantages:
1. The method comprises the following steps: randomly combining at least two given articles to generate a balanced binary tree in which each leaf node represents one article; and fusing, layer by layer, every two articles connected in the balanced binary tree, to generate a target text abstract that fuses the key information of the at least two articles. Because the target text abstract fuses the key information of the articles, readers can quickly acquire that information, which improves their reading efficiency.
2. In the invention, the balanced binary tree includes nodes in layers 1 to N, where the layer-N nodes are leaf nodes and N is a positive integer. The layer-N node articles are fused into layer-(N-1) node articles, each layer-(N-1) node representing one article generated by fusing the layer-N node articles connected to it; fusion then proceeds layer by layer from layer N-1 up to layer 1, and the layer-1 node article, generated by fusing the layer-2 node articles connected to the layer-1 node, is the target text abstract. Obtaining the abstract body by fusing connected articles pairwise, layer by layer, guarantees that the final key-sentence set contains all the key information, while less important information is discarded during fusion; this reduces information redundancy while keeping coverage comprehensive, improving the reading experience.
3. In the invention, articles are fused pairwise as follows: the two connected articles are determined through their common upper-layer node; the required information in the two articles is identified with a named entity recognition technique and key sentences, i.e. sentences containing an article's key information, are extracted from it; the extracted key sentences are screened by the similarity between them, determined jointly by an anchor-based similarity and a semantic cosine similarity; and the screened key sentences are spliced into the key-sentence set of the two fused articles. Screening and splicing the key sentences by similarity helps guarantee that the resulting key-sentence set contains all the important information.
4. In the invention, the required information in the two connected articles is identified with a named entity recognition technique and filled into a preset question template, generating a number of questions; a Mengzi-BERT model then extracts an answer fragment from each article for each question, and the obtained answer fragments are the key sentences. The preset question template makes the generated questions controllable, letting users introduce their preference for the topics or entities they care about most; the extractive question-answering model, in turn, makes it convenient to locate the key-sentence fragments in a paragraph more accurately.
5. In the invention, every two elements of the two key-sentence sets are matched by similarity to form a bipartite graph, and a greedy algorithm selects the key-sentence set with the most information and the fewest sentences as the key-sentence set of the fused articles. The more compact the fused key-sentence set, the better: comprehensiveness of content is kept while the amount of reading is reduced, helping users obtain information quickly.
6. In the invention, the similarity between key sentences is determined jointly by an anchor-based similarity and a semantic cosine similarity. This design strengthens the reliability of the similarity comparison between key sentences, improving the readability of the ordered text and the user's reading experience.
7. In the invention, transition texts are generated between the key sentences of the initial target text abstract with a Mengzi-T5 model, yielding a target text abstract with transition texts. After the preceding steps, the initial target text abstract already contains essentially all the information an article abstract needs, but the sentences are simply concatenated, so coherence and readability suffer; generating transition texts between the key sentences further improves the readability of the target text abstract and the user's reading experience.
8. In the invention, the transition texts are generated by setting a mask between each pair of adjacent key sentences of the initial target text abstract and predicting the content of each mask with the Mengzi-T5 model; the transition texts complete the logical relationship between adjacent key sentences. Because the initial target text abstract already contains essentially all the information a report abstract needs, a transition text is usually a short logical phrase, connective, or subtitle that carries little new information; the generation of transition texts therefore reduces to generating logical phrases, connectives, or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so it generates well even when fine-tuned on few samples.
9. The text abstract generating method of the invention further generates an overview and/or summary text (a paragraph or a sentence) for the target text abstract with transition texts, and combines them to obtain the final target text abstract. The overview and/or summary text makes the final target text abstract more convenient to read, helps users obtain information faster, and further improves the reading experience.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart of a text summary generating method according to a first embodiment of the present invention.
Fig. 2 is a flowchart of step S2 of the text summary generating method according to the first embodiment of the present invention.
Fig. 3 is a flowchart of step S22 of the text summary generating method according to the first embodiment of the present invention.
Fig. 4 is a flowchart of step S221 of the text summary generating method according to the first embodiment of the present invention.
Fig. 5 is a flowchart of step S23 of the text summary generating method according to the first embodiment of the present invention.
Fig. 6 is a schematic diagram of coherence ordering in a text summary generation method according to a first embodiment of the present invention.
Fig. 7 is another flowchart of a text summary generating method according to the first embodiment of the present invention.
Fig. 8 is a flowchart of step S3 of the text summary generating method according to the first embodiment of the present invention.
Fig. 9 is a flowchart of step S4 of the text summary generating method according to the first embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
The terms "vertical," "horizontal," "left," "right," "up," "down," "left up," "right up," "left down," "right down," and the like as used herein are for illustrative purposes only.
Referring to fig. 1, a first embodiment of the present invention provides a text summary generating method, which specifically includes the following steps:
step S1: randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein leaf nodes of the tree represent one article;
step S2: and fusing every two articles connected in the tree-shaped balanced binary tree layer by layer to generate a target text abstract fused with key information of at least two articles.
It can be understood that the target text abstract is generated by fusing the key information of the articles, so that the reader can quickly acquire the key information of the articles, and the reading efficiency of the reader is improved
Specifically, in this embodiment, the article is a newspaper.
Further, the balanced binary tree includes nodes in layers 1 to N, where the layer-N nodes are leaf nodes and N is a positive integer; fusing every two connected articles layer by layer specifically includes the following steps:
fusing the layer-N node articles into layer-(N-1) node articles, where each layer-(N-1) node represents one layer-(N-1) node article, generated by fusing the layer-N node articles connected to that same layer-(N-1) node;
fusing layer by layer from layer N-1 up to layer 1 to generate the layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node and is the target text abstract. Specifically: the layer-(N-1) node articles are fused into layer-(N-2) node articles, each layer-(N-2) node representing one layer-(N-2) node article generated by fusing the layer-(N-1) node articles connected to the same layer-(N-2) node; and so on, until the layer-3 node articles are fused into layer-2 node articles, and the layer-2 node articles are fused into the layer-1 node article, which is the target text abstract.
Obtaining the abstract body by fusing, layer by layer, every two articles connected in the balanced binary tree guarantees that the final key-sentence set contains all the key information, while less important information is discarded during fusion; this reduces information redundancy while keeping coverage comprehensive, improving the reading experience.
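The layer-by-layer fusion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: `fuse` stands in for the pairwise fusion of steps S21 to S24, and the toy `fuse` used in the example simply keeps the deduplicated union of words.

```python
from typing import Callable, List

def fuse_tree(articles: List[str], fuse: Callable[[str, str], str]) -> str:
    """Fuse the leaf articles pairwise, layer by layer, until a single
    layer-1 node article (the target text abstract) remains."""
    layer = list(articles)                    # layer N: the leaf nodes
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer) - 1, 2):
            # two nodes connected to the same upper-layer node are fused
            next_layer.append(fuse(layer[i], layer[i + 1]))
        if len(layer) % 2 == 1:               # unpaired node is carried up
            next_layer.append(layer[-1])
        layer = next_layer
    return layer[0]

# toy fuse: deduplicated union of "key sentences" (here, single words)
toy_fuse = lambda a, b: " ".join(dict.fromkeys((a + " " + b).split()))
print(fuse_tree(["a b", "b c", "c d"], toy_fuse))
```

Because each fusion step discards redundant material, the content shared by several leaves survives only once in the layer-1 article.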
Referring to fig. 1 and fig. 2, fusing the articles pairwise in step S2 specifically includes the following steps:
Step S21: determining the two connected articles through their common upper-layer node;
Step S22: identifying the required information in the two connected articles with a named entity recognition technique and extracting key sentences from it, where a key sentence is a sentence containing key information of an article;
Step S23: screening the extracted key sentences based on the similarity between them, where the similarity between key sentences is determined jointly by an anchor-based similarity and a semantic cosine similarity;
Step S24: splicing the screened key sentences to obtain the key-sentence set of the two fused articles.
It can be understood that screening and splicing the key sentences based on their similarity helps guarantee that the resulting key-sentence set contains all the important information.
Referring to fig. 1 to 3, step S22 specifically includes the following steps:
Step S221: identifying the required information in the two connected articles with a named entity recognition technique, and filling it into a preset question template to generate a number of questions;
Step S222: extracting an answer fragment from each article for each question with a Mengzi-BERT model; the obtained answer fragments are the key sentences.
It can be understood that the generated questions can be controlled through the preset question template, which lets users introduce their preference for the topics or entities they care about most; on the other hand, the extractive question-answering model makes it convenient to locate the key-sentence fragments in a paragraph more accurately.
It should be noted that named entity recognition is prior art; for details see: Che, W., Feng, Y., Qin, L., & Liu, T. (2021). N-LTP: An Open-source Neural Language Technology Platform for Chinese.
Referring to fig. 1 to 4, step S221 specifically includes the following steps:
Step S2211: identifying the required information in each article with a named entity recognition technique and filling it into a preset question template to generate a number of questions; that is, the required information is determined by the blanks in the question template;
Step S2212: extracting an answer fragment from each article for each question with a Mengzi-BERT model; the obtained answer fragments are the key sentences.
It can be understood that the generated questions can be controlled through the preset question template, which lets users introduce their preference for the topics or entities they care about most; on the other hand, the extractive question-answering model makes it convenient to locate the key-sentence fragments in a paragraph more accurately.
Further, a preset question template may carry a number of question forms, for example "What is the business model of [company]?"; the entity names of the companies in a paragraph are then recognised with the named entity recognition technique and filled into the template, generating the questions.
Further, when answer fragments are extracted, the questions generated from each research report are numbered Q1, Q2, ..., QY in turn, where Y is a positive integer; for example, two questions generated from report A are denoted A.Q1 and A.Q2 and their corresponding answers A1 and A2, to facilitate subsequent matching.
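The question-generation step can be sketched as follows. The template text, the toy entity recogniser, and the labelling helper are all illustrative assumptions; a real system would use an NER model such as N-LTP.

```python
import re
from typing import Dict, List

QUESTION_TEMPLATE = "What is the business model of {entity}?"

def toy_ner(paragraph: str) -> List[str]:
    """Stand-in for a real NER system such as N-LTP: here, any
    capitalised token ending in 'Corp' is treated as a company name."""
    return re.findall(r"\b[A-Z]\w*Corp\b", paragraph)

def generate_questions(paragraph: str) -> List[str]:
    # fill each recognised entity into the preset question template
    return [QUESTION_TEMPLATE.format(entity=e) for e in toy_ner(paragraph)]

def label_questions(report_id: str, questions: List[str]) -> Dict[str, str]:
    # number questions Q1, Q2, ... per report for later answer matching
    return {f"{report_id}.Q{i + 1}": q for i, q in enumerate(questions)}

qs = generate_questions("AlphaCorp and BetaCorp both issued Q3 reports.")
print(label_questions("A", qs))
```

Each labelled question is then answered per article by the extractive question-answering model, and the answer fragments become the key sentences.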
Referring to fig. 2 and 5, step S23 specifically includes the following steps:
Step S231: matching every two elements of the two key-sentence sets according to their similarity to form a bipartite graph;
Step S232: selecting, with a greedy algorithm, the key-sentence set with the most information and the fewest sentences as the key-sentence set of the fused articles.
It can be understood that the more compact the fused key-sentence set, the better: comprehensiveness of content is kept while the amount of reading is reduced, which helps users obtain information quickly.
It should be noted that the bipartite-graph matching algorithm follows the prior art; for details see patent CN106547739B, 2019-04-02. The present invention uses key sentences rather than topics as nodes: two key sentences are connected if their similarity exceeds a set threshold. After iterating, a bipartite graph is generated whose nodes represent the sentence numbers in the reports and whose edges represent the similarity of two sentences. The degree of each node, i.e. the number of edges incident to it, is computed first and taken as the information content of the corresponding sentence; a greedy algorithm then selects the key-sentence set with the most information and the fewest sentences as the key-sentence set of the fused reports.
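A sketch of the degree-based greedy selection described above, with a toy Jaccard word-overlap similarity standing in for the anchor-plus-cosine similarity of the patent; the threshold and the coverage criterion are illustrative assumptions.

```python
from typing import Callable, List

def select_key_sentences(sentences: List[str],
                         sim: Callable[[str, str], float],
                         threshold: float = 0.5) -> List[str]:
    """Link sentences whose similarity exceeds the threshold, take a
    sentence's coverage (its graph neighbourhood) as its information
    content, and greedily keep high-coverage sentences until every
    sentence is represented by a selected one."""
    n = len(sentences)
    covers = {i: {i} for i in range(n)}       # each sentence covers itself
    for i in range(n):
        for j in range(i + 1, n):
            if sim(sentences[i], sentences[j]) > threshold:
                covers[i].add(j)
                covers[j].add(i)
    uncovered, chosen = set(range(n)), []
    while uncovered:
        # most informative = covers the most still-uncovered sentences
        best = max(sorted(uncovered), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return [sentences[i] for i in sorted(chosen)]

# toy similarity: Jaccard overlap of word sets
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

print(select_key_sentences(
    ["profit rose 10%", "profit rose 10% in Q3", "a new CEO was named"],
    jaccard))
```

Near-duplicate sentences collapse into one representative, so the selected set stays compact while covering all the content.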
Further, the similarity between key sentences is computed as

sim(sᵢ, sⱼ) = λ · sim_anchor(sᵢ, sⱼ) + (1 − λ) · sim_cos(sᵢ, sⱼ)

where sim_cos is the cosine similarity computed from semantics: the vector representation v_x of each key sentence is obtained with a Mengzi pre-training model, where x, a positive integer, is the number of the key sentence, and the cosine similarity between two vectors is

sim_cos(sᵢ, sⱼ) = (vᵢ · vⱼ) / (‖vᵢ‖ ‖vⱼ‖)

sim_anchor is the anchor-based similarity, which can be confirmed by number comparison or character comparison, and λ is the weight coefficient of the anchor-based similarity.
It can be understood that this design helps enhance the reliability of the similarity comparison between key sentences, thereby enhancing the readability of the sorted text and improving the user's reading experience.
Further, in the anchor-based similarity calculation, the number comparison method is used only when numbers are an important component of the article content; in that case, two sentences containing the same number are considered similar sentences.
Further, in the anchor-based similarity calculation, the character comparison method may compute similarity using the edit distance between two sentences, the length of their longest common subsequence, or their N-Gram similarity.
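A minimal sketch of the combined similarity, assuming `embed` stands in for the Mengzi sentence encoder and using number comparison as the anchor (the 0.3 weight is illustrative, not from the patent):

```python
import math
import re

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def anchor_similarity(s1, s2):
    """Number-comparison anchor: sentences sharing a number count as similar."""
    nums1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    nums2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return 1.0 if nums1 & nums2 else 0.0

def sentence_similarity(s1, s2, embed, weight=0.3):
    """Semantic cosine similarity plus weighted anchor-based similarity."""
    return cosine(embed(s1), embed(s2)) + weight * anchor_similarity(s1, s2)
```

The character-comparison anchor (edit distance, longest common subsequence, or N-Gram) would slot into `anchor_similarity` the same way.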
Further, in step S2, the arrangement of the key sentences is completed by a Mengzi-BERT pre-training model; the Mengzi-BERT pre-training model calculates the probability that a key sentence is arranged at the i-th position, where i is a positive integer, and the key sentences are sorted based on the calculation result.
The Mengzi-BERT pre-training model adopts sentence-order prediction as a pre-training task, so it adapts well to the downstream task of inter-sentence coherence sorting.
Referring to FIG. 6, an example of key sentence arrangement is given: three sentences s1, s2 and s3 are randomly shuffled into, for example, the order (s2, s3, s1). The input sample of the Mengzi encoder is constructed as shown in the figure, where "[CLS]" denotes the input-sample initiator, a sentence-start marker precedes the textual content of each sentence, and "[SEP]" denotes a separator (which may be regarded here as the end of the input sample). The sample is then encoded with the Mengzi encoder to obtain the sentence representations h1, h2 and h3. These representations serve as the key vectors K and value vectors V of the Mengzi decoder, and the decoder state at each step serves as its query vector Q; after decoding, a pointer network yields P(i, j), the probability that sentence j is arranged at position i. Thus the coherence ordering of the three sentences is completed; it will be appreciated that the ordering of more than three sentences works in the same way.
Details of how to use the Mengzi-BERT pre-training model to accomplish the coherence ordering are described in Lee, H., Hudson, D. A., Lee, K., & Manning, C. D. (2020). SLM: Learning a Discourse Language Representation with Sentence Unshuffling.
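The pointer-network decoding described above can be sketched with toy vectors; in the real system, `sentence_vecs` and `query_vecs` would come from the Mengzi encoder and decoder, which are assumptions here:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_order(sentence_vecs, query_vecs):
    """Greedy pointer-network decode: at each position i, score every
    not-yet-placed sentence j by dot(query_i, h_j), softmax over the
    remaining candidates, and place the argmax."""
    remaining = list(range(len(sentence_vecs)))
    order = []
    for q in query_vecs:
        scores = [sum(a * b for a, b in zip(q, sentence_vecs[j]))
                  for j in remaining]
        probs = softmax(scores)
        best = max(range(len(remaining)), key=lambda k: probs[k])
        order.append(remaining.pop(best))
    return order
```

With one-hot toy representations, the decode simply recovers whichever sentence each query attends to most.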
Referring to fig. 7, the summary generation method further includes the following steps:
step S3: after the fusion of all articles is completed, sorting the key sentences in the finally fused article with a Mengzi-BERT pre-training model to obtain an initial target text abstract; and generating transition texts between the key sentences of the initial target text abstract through the Mengzi-T5 model, so as to obtain the target text abstract with transition texts.
It can be understood that, through the previous steps, the initial target text abstract already contains essentially all the information necessary for an article abstract, but that information is simply spliced together directly, so coherence and readability suffer during reading; generating transition texts between the key sentences of the initial target text abstract further improves the readability of the target text abstract and the user's reading experience.
Referring to fig. 7 and 8, step S3 specifically includes the following steps:
step S31: respectively setting masks among key sentences of the initial target text abstract;
step S32: and predicting the content of the mask through a monte-T5 model to obtain a generated transition text so as to obtain a target text abstract with the transition text, wherein the transition text is used for perfecting the logic relationship between adjacent key sentences in the initial target text abstract.
It will be appreciated that, through the preceding algorithmic steps, the initial target text abstract already contains substantially all of the information necessary for a research-report abstract; the transition text is therefore often a relatively short logical phrase, conjunction, or subtitle and does not contain much informative text. The generation of transition text can thus be simplified into the generation of logical phrases, conjunctions, or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a good generation effect can be achieved under a few-shot fine-tuning setting.
For example, three ordered key sentences are input and recorded as s1, s2 and s3. The input template is set as: "s1 &lt;mask 1&gt; s2 &lt;mask 2&gt; s3 &lt;/s&gt;". Finally, the Mengzi-T5 model predicts the contents of &lt;mask 1&gt; and &lt;mask 2&gt; to obtain the generated transition texts, where "&lt;/s&gt;" represents the input end symbol. The target text abstract with the transition texts is recorded as: Sum.
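Building the masked input template from the ordered key sentences might look like this (the mask token names are illustrative; the actual Mengzi-T5 sentinel tokens may differ):

```python
def build_transition_input(sentences):
    """Interleave ordered key sentences with numbered mask tokens so a
    T5-style model can fill in the transition texts between them."""
    parts = []
    for i, s in enumerate(sentences):
        parts.append(s)
        if i < len(sentences) - 1:          # a mask between each pair
            parts.append(f"<mask {i + 1}>")
    return " ".join(parts) + " </s>"        # </s> ends the input sample
```

The model's predictions at the mask positions are then spliced back between the sentences to form the abstract with transitions.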
Further, the Mengzi-T5 model needs to be fine-tuned with a fine-tuning dataset before use; the fine-tuning dataset is constructed based on part-of-speech tagging and/or punctuation-mark recognition and/or subtitles.
It can be understood that, since the generation of transition text can be simplified into the generation of logical phrases, conjunctions, or subtitles between sentences, the targeted construction of the fine-tuning dataset effectively ensures the generation effect of the Mengzi-T5 model.
Specifically, the application scenario for constructing the fine-tuning dataset based on part-of-speech tagging is as follows: an article generally contains many words whose part of speech is preposition or conjunction; these prepositions, conjunctions and the like are replaced with masks, and when the model is fine-tuned, the replaced words serve as labels and generating the text at the mask positions serves as the training task.
The application scenario for constructing the fine-tuning dataset based on punctuation-mark recognition is as follows: research reports commonly contain constructions such as "Investment advice: ...", where a colon or similar mark is followed by a summary of the content. Thus, the text before the colon is replaced with a mask symbol, and the fine-tuning task is likewise to generate the text at the mask position.
The application scenario for constructing the fine-tuning dataset based on subtitles is as follows: research reports often contain a large number of subtitles in the form "subtitle text: paragraph text"; data are constructed from them by the same method as for the punctuation-mark-based fine-tuning dataset.
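The punctuation-based (and subtitle-based) construction can be sketched as follows; `colon_mask_examples` is an illustrative helper that turns "label: body" paragraphs into mask-prediction training pairs:

```python
import re

def colon_mask_examples(paragraphs, mask_token="<mask>"):
    """For each paragraph of the form 'label: body', replace the label with
    a mask token and keep the label as the training target, mimicking the
    punctuation-based dataset construction."""
    examples = []
    for p in paragraphs:
        m = re.match(r"^([^:：]+)[:：]\s*(.+)$", p)  # Western or CJK colon
        if m:
            label, body = m.group(1), m.group(2)
            examples.append({"input": f"{mask_token}: {body}",
                             "target": label})
    return examples
```

Paragraphs without a colon are simply skipped, so the pseudo-data stays clean.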
With continuing reference to fig. 7, the summary generation method herein further includes the following steps:
step S4: and generating a summary and/or a summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or a sentence, thereby obtaining the final target text abstract.
As can be appreciated, generating the summary and/or summary text can assist the user in obtaining information more quickly, further improving the reading experience of the user.
Referring to fig. 7 and 9, step S4 specifically includes the following steps:
step S41: sequentially generating a topic vocabulary for each key sentence through a Mengzi-T5 model and marking the topic vocabulary;
step S42: designing a prompt question-answer template for querying the main content related to each topic word in the target text abstract with the transition text, setting the answer corresponding to each question as a mask, and predicting the content of the mask with the Mengzi-T5 model to obtain the summary and/or summary text;
step S43: and combining the summary and/or summary text and the target text abstract with the transition text to obtain a final target text abstract.
As can be appreciated, generating the summary and/or the summary text can make the final target text abstract more convenient to read, assist the user to obtain information more quickly, and further improve the reading experience of the user.
Specifically, taking the summary as an example, three key sentences are input and recorded as s1, s2 and s3. First, a topic word is generated in turn for each key sentence by the Mengzi-T5 model; the topic words are recorded as t1, t2 and t3. Then the prompt template is designed as: "Sum. Question: regarding t1, t2 and t3, what does this section mainly say? Answer: &lt;mask&gt;". Finally, the Mengzi-T5 model predicts the content of &lt;mask&gt; to obtain the summary text. This step requires fine-tuning the model with a small amount of data; dataset-source references include DOU Z-Y, LIU P, HAYASHI H, et al., 2021, GSum: A General Framework for Guided Neural Abstractive Summarization, abs/2010.08014, and HE J, KRYSCINSKI W, MCCANN B, et al., 2020, CTRLsum: Towards Generic Controllable Text Summarization, abs/2012.04281.
Specifically, the topic-generation dataset is constructed as follows: for a paragraph in a research report and its corresponding subtitle, the longest common subsequence of the two is extracted as the candidate topic word, and a large amount of training data is constructed with the paragraph as the key sentence. The prompt template is designed as: "key sentence: paragraph text; topic word: &lt;mask&gt;"; by fine-tuning the Mengzi-T5 model with these training data, the corresponding topic word can be generated at the "&lt;mask&gt;" position.
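The longest-common-subsequence extraction of candidate topic words can be sketched with a standard dynamic program (whether the patent operates on characters or words is not specified; characters are assumed here):

```python
def longest_common_subsequence(a, b):
    """Character-level LCS via dynamic programming; the LCS of a paragraph
    and its subtitle serves as the candidate topic word."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS string of a[:i] and b[:j]
    dp = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + a[i]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]
```

Storing strings in the table keeps the sketch short; a production version would store lengths and backtrack.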
The dataset for answer generation is constructed as follows: it is basically consistent with the topic-generation dataset, but since a paragraph in the topic-generation dataset corresponds to only one subtitle and one topic word, several paragraphs, their corresponding subtitles, and the corresponding keywords are combined, and the combined subtitles are then replaced with mask marks to construct a large amount of pseudo data. Finally, fine-tuning the Mengzi-T5 model with the pseudo data allows the corresponding summary or summary text to be generated at the "&lt;mask&gt;" position.
Generally, the contents of the summary and the summary text do not differ much, and either or both may optionally be generated.
Specifically, in this embodiment, only the summary is generated.
Compared with the prior art, the text abstract generating method has the following advantages:
1. The text abstract generating method specifically includes the following steps: randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein each leaf node of the tree represents one article; and fusing, layer by layer, every two connected articles in the tree-shaped balanced binary tree to generate a target text abstract that fuses the key information of the at least two articles. It can be understood that, because the target text abstract is generated by fusing the key information of the articles, readers can quickly acquire the key information of the articles, and their reading efficiency is improved.
2. In the invention, the tree-shaped balanced binary tree comprises nodes of the 1st to N-th layers, the N-th-layer nodes being leaf nodes and N a positive integer, and fusing every two connected articles layer by layer comprises the following steps: fusing the N-th-layer node articles into (N-1)-th-layer node articles, wherein each node at the (N-1)-th layer represents one (N-1)-th-layer node article, each generated by fusing the N-th-layer node articles connected to that same (N-1)-th-layer node; and fusing layer by layer from the (N-1)-th layer up to the 1st layer to generate the layer-1 node article, which is generated by fusing the layer-2 node articles connected to the layer-1 node and which is the target text abstract. By fusing the two articles connected at each node layer by layer, the final key sentence set is guaranteed to contain all key information, while some less important information is removed during the layer-by-layer fusion; information redundancy is thus reduced while comprehensive information coverage is ensured, improving the reading experience.
3. In the invention, fusing two articles specifically includes the following steps: determining two connected articles through the same upper-layer node; identifying the demand information in the two connected articles by a named entity recognition technique and extracting key sentences from it, a key sentence being a sentence containing key information in the article; screening the extracted key sentences based on the similarity between key sentences, which is determined jointly by the anchor-based similarity and the semantic cosine similarity; and splicing the screened key sentences to obtain the key sentence set after the two articles are fused. It can be understood that screening and splicing the key sentences based on their similarity helps ensure that the resulting key sentence set contains all the important information.
4. In the invention, the demand information in two connected articles is identified by a named entity recognition technique and filled into a preset question template, thereby generating a plurality of questions; an answer segment is then extracted from each article for each question using a Mengzi-BERT model, and the obtained answer segments are the key sentences. It can be understood that the generated questions can be controlled through the preset question template, which helps the user introduce preferences for topics or entities of particular concern; on the other hand, the extractive question-answering model makes it convenient to locate the more accurate key sentence segments within a paragraph.
5. In the invention, pairwise elements in two key sentence sets are matched according to the similarity to form a bipartite graph; and selecting a key sentence set with most information content and least sentence number through a greedy algorithm as a key sentence set after the multiple articles are fused. It can be understood that the more compact the key sentence set after the multiple articles are fused, the better, the comprehensiveness of the content is ensured under the condition of reducing the reading amount, and the method is favorable for helping the user to quickly obtain information.
6. In the invention, the similarity between the key sentences is comprehensively determined by the similarity calculated based on the anchor points and the cosine similarity calculated based on the semantics. The design is beneficial to enhancing the reliability of the similarity comparison result between the key sentences so as to enhance the readability of the sequenced texts and improve the reading experience of the user.
7. In the invention, transition texts are generated between the key sentences of the initial target text abstract through the Mengzi-T5 model, so as to obtain the target text abstract with transition texts. It can be understood that, through the previous steps, the initial target text abstract already contains essentially all the information necessary for an article abstract, but that information is simply spliced together directly, so coherence and readability suffer during reading; generating transition texts between the key sentences further improves the readability of the target text abstract and the user's reading experience.
8. In the invention, the transition texts are generated by respectively setting masks between the key sentences of the initial target text abstract and predicting the content of each mask through the Mengzi-T5 model, the transition texts being used to perfect the logical relationship between adjacent key sentences. Through the preceding algorithmic steps, the initial target text abstract already contains substantially all the information necessary for a research-report abstract, so the transition text is often a short logical phrase, conjunction, or subtitle and does not contain much informative text. The generation of transition text can thus be simplified into the generation of logical phrases, conjunctions, or subtitles between sentences. The Mengzi-T5 model is pre-trained on 300 GB of data and stores rich prior knowledge, so a good generation effect can be achieved under a few-shot fine-tuning setting.
9. The text abstract generating method further comprises the following steps: and generating a summary and/or a summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or a sentence, and combining the summary and/or the summary text and the target text abstract with the transition text to obtain a final target text abstract. As can be appreciated, generating the summary and/or the summary text can make the final target text abstract more convenient to read, assist the user to acquire information more quickly, and further improve the reading experience of the user.
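The layer-by-layer pairwise fusion over the balanced binary tree described in advantages 1 and 2 can be sketched as follows, with `fuse_pair` standing in for the key-sentence fusion of the earlier steps; carrying an odd leftover article up unchanged is one possible balancing choice, assumed here:

```python
def fuse_layer_by_layer(articles, fuse_pair):
    """Treat the article list as the leaf layer of a balanced binary tree
    and fuse connected pairs layer by layer until one article remains —
    that single remaining article is the target text abstract."""
    layer = list(articles)
    while len(layer) > 1:
        next_layer = []
        for k in range(0, len(layer) - 1, 2):
            next_layer.append(fuse_pair(layer[k], layer[k + 1]))
        if len(layer) % 2 == 1:          # odd article is carried up unchanged
            next_layer.append(layer[-1])
        layer = next_layer
    return layer[0]
```

With a toy `fuse_pair` that concatenates strings, four leaves reduce through two layers to a single fused result.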
The text abstract generating method disclosed by the embodiments of the present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help understand the method and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this description should not be construed as limiting the present invention; any modification, equivalent replacement, or improvement made within the principle of the present invention shall fall within its protection scope.

Claims (10)

1. A text abstract generating method is characterized by comprising the following steps:
randomly combining at least two preset articles to generate a tree-shaped balanced binary tree, wherein leaf nodes of the tree represent one article;
and fusing every two connected articles in the tree-shaped balanced binary tree layer by layer to generate a target text abstract fused with key information of at least two articles.
2. The method for generating a text summary according to claim 1, wherein the tree-shaped balanced binary tree includes nodes in layers 1 to N, the node in layer N is a leaf node, N is a positive integer, and the step of fusing every two connected articles in the tree-shaped balanced binary tree includes:
fusing the N-th layer of node articles into N-1-th layer of node articles, wherein each node at the N-1-th layer represents one N-1-th layer of node articles, and each N-1-th layer of node articles is generated by fusing the N-th layer of node articles connected to the same N-1-th layer of node;
and fusing layer by layer from the N-1 to the 1 st layer to generate a layer 1 node article, wherein the layer 1 node article is generated by fusing layer 2 node articles connected to the layer 1 node, and the layer 1 node article is the target text abstract.
3. The text abstract generating method of claim 1, wherein the step of fusing every two connected articles in the tree-shaped balanced binary tree layer by layer specifically comprises the following steps:
determining two connected articles through the same upper-layer node;
identifying the demand information in two connected articles by a named entity identification technology, and extracting a key sentence from the demand information, wherein the key sentence is a sentence containing key information in the article;
screening the extracted key sentences based on the similarity between the key sentences, wherein the similarity between the key sentences is determined by the similarity calculated based on the anchor points and the cosine similarity calculated based on the semantics;
and splicing the screened key sentences to obtain a key sentence set after the two articles are fused.
4. The text abstract generating method as claimed in claim 3, wherein the step of identifying the requirement information in two connected articles by the named entity identification technology and extracting the key sentence therefrom comprises the following steps:
identifying the demand information in the two connected articles by a named entity identification technology, and filling the demand information into a preset problem template so as to generate a plurality of problems;
and extracting an answer segment from each article for each question by using a Mengzi-BERT model, wherein the obtained answer segments are the key sentences.
5. The method for generating a text summary according to claim 3, wherein the step of screening the extracted key sentences based on the similarity between the key sentences specifically comprises the steps of:
matching every two elements in the two key sentence sets according to the similarity to form a bipartite graph;
and selecting, through a greedy algorithm, the key sentence set with the most information content and the fewest sentences.
6. The text summary generation method of claim 3, wherein:
the similarity between key sentences is calculated by the formula sim(sa, sb) = cos(va, vb) + λ·sim_anchor(sa, sb), wherein cos(va, vb) is the cosine similarity based on semantic calculation: the vector representation vx of each key sentence sx is calculated by the Mengzi pre-training model, x being a positive integer numbering the key sentence, and the cosine similarity between the vectors is then calculated by the specific formula cos(va, vb) = (va · vb) / (‖va‖ ‖vb‖); sim_anchor(sa, sb) is the similarity calculated based on anchor points, which can be confirmed by a number comparison method or a character comparison method, and λ is a weight coefficient for the similarity calculated based on anchor points.
7. The text summary generation method of claim 1, further comprising the steps of:
after the fusion of all articles is completed, sorting the key sentences in the finally fused article with a Mengzi-BERT pre-training model to obtain an initial target text abstract; and generating transition texts between the key sentences of the initial target text abstract through a Mengzi-T5 model, so as to obtain the target text abstract with transition texts.
8. The method for generating a text abstract according to claim 7, wherein the step of generating transition texts between key sentences of the initial target text abstract through the Mengzi-T5 model to obtain the target text abstract with transition texts comprises the following steps:
respectively setting masks among key sentences of the initial target text abstract;
and predicting the content of the mask through the Mengzi-T5 model to obtain the generated transition text, so as to obtain the target text abstract with the transition text, wherein the transition text is used for perfecting the logical relationship between adjacent key sentences in the initial target text abstract.
9. The text summary generation method of claim 7, further comprising the steps of:
and generating a summary and/or a summary text for the target text abstract with the transition text, wherein the summary text is a paragraph or a sentence, and the final target text abstract is obtained.
10. The method for generating a text abstract according to claim 9, wherein the generating of the summary and/or the summary text for the target text abstract with the transition text, the summary text being a paragraph or a sentence, thereby obtaining the final target text abstract, comprises the following steps:
sequentially generating a topic vocabulary for each key sentence through a Mengzi-T5 model and marking the topic vocabulary;
designing a prompt question-answer template for inquiring and generating main lecture contents related to each topic word in a target text abstract with a transition text, setting answer answers corresponding to questions as masks, and predicting the contents of the masks by using a Mengzi-T5 model to obtain an overview and/or summary text;
and combining the summary and/or summary text and the target text abstract with the transition text to obtain a final target text abstract.
CN202210380604.8A 2022-04-12 2022-04-12 Text abstract generating method Pending CN114611520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210380604.8A CN114611520A (en) 2022-04-12 2022-04-12 Text abstract generating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210380604.8A CN114611520A (en) 2022-04-12 2022-04-12 Text abstract generating method

Publications (1)

Publication Number Publication Date
CN114611520A true CN114611520A (en) 2022-06-10

Family

ID=81869041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210380604.8A Pending CN114611520A (en) 2022-04-12 2022-04-12 Text abstract generating method

Country Status (1)

Country Link
CN (1) CN114611520A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114997143A (en) * 2022-08-04 2022-09-02 北京澜舟科技有限公司 Text generation model training method and system, text generation method and storage medium
CN114997143B (en) * 2022-08-04 2022-11-15 北京澜舟科技有限公司 Text generation model training method and system, text generation method and storage medium
CN116501862A (en) * 2023-06-25 2023-07-28 西安杰出科技有限公司 Automatic text extraction system based on dynamic distributed collection
CN116501862B (en) * 2023-06-25 2023-09-12 桂林电子科技大学 Automatic text extraction system based on dynamic distributed collection

Similar Documents

Publication Publication Date Title
Jung Semantic vector learning for natural language understanding
Neculoiu et al. Learning text similarity with siamese recurrent networks
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN111259631B (en) Referee document structuring method and referee document structuring device
CN111324728A (en) Text event abstract generation method and device, electronic equipment and storage medium
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN112926345B (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN114611520A (en) Text abstract generating method
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN113065356B (en) IT equipment operation and maintenance fault suggestion processing method based on semantic analysis algorithm
He English grammar error detection using recurrent neural networks
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN113032552A (en) Text abstract-based policy key point extraction method and system
CN114764566B (en) Knowledge element extraction method for aviation field
CN115390806A (en) Software design mode recommendation method based on bimodal joint modeling
CN114661872A (en) Beginner-oriented API self-adaptive recommendation method and system
Tüselmann et al. Are end-to-end systems really necessary for NER on handwritten document images?
Cao Generating natural language descriptions from tables
Sheikh et al. Document level semantic context for retrieving OOV proper names
CN117332789A (en) Semantic analysis method and system for dialogue scene
CN115630140A (en) English reading material difficulty judgment method based on text feature fusion
CN114626463A (en) Language model training method, text matching method and related device
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN111274354B (en) Referee document structuring method and referee document structuring device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination