CN111400486B - Automatic text abstract generation system and method - Google Patents
Automatic text abstract generation system and method
- Publication number: CN111400486B
- Application number: CN202010175795.5A
- Authority
- CN
- China
- Prior art keywords
- sentence
- representing
- sentences
- abstract
- candidate
- Prior art date: 2020-03-13
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an automatic text abstract generation system and method. An original document is converted into a sentence sequence S = [S_1, S_2, …, S_n], and each sentence S_i is assigned a corresponding label y_i; features are extracted from each sentence to obtain sentence-level features, which are scored with the ROUGE-1 factor to obtain corresponding sentence scores and rank the sentences; the top K sentences are selected as first candidate sentences, refined words are extracted and salient words are retained using a non-key-information masking mechanism, and the abstract is then generated automatically. By combining the extraction and generation stages into a two-stage model, the invention can update part of the words while preserving the important information extracted from the original document, so that the automatically generated abstract accurately reflects the content information of the original document. Moreover, the automatic generation of the abstract is not simply two separate stages of generation or screening, but a highly unified process, which improves the integrity of the abstract.
Description
Technical Field
The invention relates to the technical field of document abstract generation, in particular to an automatic text abstract generation system and method.
Background
Automatic text abstract generation is a natural language generation task that identifies key information in an original document and filters out unimportant content, so as to preserve the document's key information. Automatic text abstract generation can be widely applied in many fields, such as news reporting, survey research, and education.
Prior-art methods for automatically generating text abstracts fall mainly into two categories: extractive methods and generative methods. Extractive methods select key sentences from the original document and splice them together to form the abstract. Rather than simple selection and copying, generative methods typically generate new sentences based on a semantic understanding of the original text. Both approaches have their own advantages and problems, so researchers have proposed deep-learning methods that combine the two into a unified abstract model. Existing unified models still suffer from two problems: 1) they refine the extracted sentences directly and may therefore fail to retain keywords; 2) re-decoding new words with a similar decoder may cause the refining process to predict repeated words.
Disclosure of Invention
Aiming at the prior-art problem that an automatically generated abstract cannot accurately reflect the content of the original document, the present automatic text abstract generation system and method generate the abstract by combining two stages, sentence extraction and generation, so as to accurately reflect the content information of the original document.
In order to achieve the above object, the present invention provides the following technical solutions:
the automatic text abstract generation system comprises a document sentence extraction module, a document sentence deliberation module, and an abstract generation module;
the original document is input into the document sentence extraction module to extract first candidate sentences; the extracted first candidate sentences and the original document are input into the document sentence deliberation module for refinement, yielding second candidate sentences; and the second candidate sentences are recombined in the abstract generation module to automatically generate the abstract.
Preferably, the document sentence extraction module comprises a decomposition module, a scoring module, and a first candidate sentence generation module; wherein:
the decomposition module is used for decomposing the input original document into a sentence sequence, each sentence being provided with a corresponding label;
the scoring module is used for scoring sentences with different labels;
and the first candidate sentence generation module is used for ranking the sentences in order from high score to low and screening out the first K sentences as the first candidate sentences.
Preferably, the scoring module comprises 3 Transformer layers and 4 attention heads.
Preferably, the document sentence deliberation module comprises 8 Transformer decoder layers and 12 attention heads.
The invention also provides an automatic text abstract generating method, which comprises the following steps:
S1: convert the original document into a sentence sequence S = [S_1, S_2, …, S_n] and assign each sentence a corresponding label y_i;
S2: extract features from each sentence to obtain sentence-level features, score the sentence-level features using the ROUGE-1 factor to obtain corresponding sentence scores, and rank the sentences;
S3: select the first K sentences as first candidate sentences, extract refined words and retain salient words for the first candidate sentences, and then automatically generate the abstract.
Preferably, the step S2 specifically includes the following steps:
S2-1: any sentence S_i is input into the BERT model to extract features, thereby obtaining the representation vector h_i: h_i = BERT(S_i), where i ∈ [1, n], i is an integer, and the dimension of h_i is m;
S2-2: the representation vector h_i is transformed to obtain sentence-level feature vectors:

h̃^l = LN(h^{l−1} + MHAtt(h^{l−1}))   (1)

h^l = LN(h̃^l + FFN(h̃^l))   (2)

in formulas (1) and (2), h̃^l denotes the transition sentence-level feature vector of the l-th Transformer layer, h^{l−1} denotes the sentence-level feature vector of the (l−1)-th Transformer layer, LN denotes layer normalization, MHAtt denotes multi-head attention, h^l denotes the sentence-level feature vector of the l-th Transformer layer, and FFN denotes the feed-forward layer;
S2-3: the output layer outputs the label y_i:

y_i = σ(W_0 · h_i^L + b_0)   (3)

in formula (3), y_i denotes the label of sentence S_i, σ denotes the Sigmoid function, W_0 denotes the weight of the output layer, h_i^L denotes the sentence-level feature vector of the L-th Transformer layer, and b_0 denotes the bias of the output layer;
S2-5: ROUGE-1 is used as a score to represent the information content of each sentence, and the sentences are sorted by score.
Preferably, the step of extracting the refined words in S3 includes:
S3-1: select the top K ranked sentences as first candidate sentences and perform word-by-word masking (i.e., input the candidate sentences with the masked words removed into the deliberation module) to obtain second candidate sentences; input the second candidate sentences and the original document into the BERT model to extract features, obtaining the feature H: H = BERT(W), where H denotes the context-dependent bidirectional feature;
S3-2: refine the feature H to generate refined words; the refining process can be expressed as

d_j = f_delib(r_{≠j}, H)   (5)

in formula (5), d_j denotes the refining result of the j-th masked word, f_delib denotes the deliberation function, r_{≠j} denotes the masked word sequence, and H denotes the context-dependent bidirectional feature;
all d_j form a vector, giving the refined word sequence D:

D = [d_1, d_2, …, d_m].
Preferably, the salient words in S3 are retained as follows:

y_j = (1 − a_j) · d_j + a_j · r_j   (6)

in formula (6), y_j denotes the retained value of the salient word, r_j denotes the extraction result, d_j denotes the refining result of the j-th masked word, and a_j denotes the weight.
In summary, owing to the adoption of the above technical solutions, the invention has at least the following beneficial effects compared with the prior art:
1. By combining the extraction and generation stages into a two-stage model, the invention can update part of the words while preserving the important information extracted from the original document, so that the automatically generated abstract accurately reflects the content information of the original document.
2. The automatic abstract generation process of the invention is not two simple separate stages of generation or screening, but a highly unified process, which improves the integrity of the abstract.
Description of the drawings:
fig. 1 is a schematic diagram of an automatic text summary generation system according to an exemplary embodiment of the present invention.
Fig. 2 is a schematic diagram of a document sentence extraction module according to an exemplary embodiment of the present invention.
Fig. 3 is a flowchart illustrating a text summary automatic generation method according to an exemplary embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to examples and embodiments. The scope of the above subject matter should not, however, be construed as limited to the following examples; all techniques implemented based on the present disclosure fall within the scope of the invention.
In the description of the present invention, it should be understood that terms such as "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer" indicate orientations or positional relationships based on those shown in the drawings, are used merely to facilitate and simplify the description of the invention, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the invention.
The invention provides an automatic text abstract generation system. Referring to fig. 1, the system comprises a document sentence extraction module 10, a document sentence deliberation module 20, and an abstract generation module 30. The original document is input into the document sentence extraction module 10, which screens out a plurality of first candidate sentences; the first candidate sentences and the original document are input into the document sentence deliberation module 20 for refinement, which corrects the information of the first candidate sentences (e.g., missing sentences, incorrect sentences) and yields a plurality of second candidate sentences; the abstract generation module 30 recombines the plurality of second candidate sentences to automatically generate an abstract that reflects the information content of the original document.
In this embodiment, referring to fig. 2, the document sentence extraction module 10 includes a decomposition module 101, a scoring module 102, and a first candidate sentence generation module 103.
The decomposition module 101 is used to decompose the input original document into a sentence sequence, e.g. S = [S_1, S_2, …, S_i, …, S_n], where S_i denotes the i-th sentence of the entire original document; each sentence S_i consists of words, S_i = [W_1, W_2, …, W_m], where W_m denotes the m-th word, and each sentence S_i has a corresponding label y_i ∈ {0,1}.
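For illustration only (not the patented implementation), the decomposition step can be sketched in Python as follows; the regex sentence splitter and the overlap-based oracle labeling rule are assumptions:

```python
import re

def decompose(document: str, reference_summary: str):
    """Split a document into S = [S_1, ..., S_n] and assign each sentence a
    binary label y_i. The splitter and labeling rule are illustrative."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    ref_words = set(reference_summary.lower().split())
    labels = []
    for sent in sentences:
        words = set(sent.lower().split())
        overlap = len(words & ref_words) / max(len(words), 1)
        labels.append(1 if overlap > 0.5 else 0)   # crude oracle label
    return sentences, labels

S, y = decompose("Rain fell all day. Stocks rose sharply. The match was delayed.",
                 "Stocks rose sharply on Monday.")
print(list(zip(S, y)))
```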
The scoring module 102 is used to score sentences with different labels. The scoring module 102 is configured with 3 Transformer layers and 4 attention heads. It is pre-trained for 500,000 steps, with gradient accumulation every two steps; a checkpoint is saved and evaluated on the validation set every 1,000 steps. The top three checkpoints are selected according to their evaluation loss on the validation set, and the averaged result is reported on the test set. The scoring module 102 thus trained can be used to score sentences.
The first candidate sentence generating module 103 is configured to sequentially arrange the sentences with scores from high to low, and filter the first K sentences as first candidate sentences.
In this embodiment, the document sentence deliberation module 20 is built from Transformer decoder layers and attention heads: the number of decoder layers is set to 8 and the number of attention heads (multi-head attention) to 12; for regularization, dropout and layer normalization are used, with the dropout rate set to 0.15.
For gradient accumulation during training of the document sentence deliberation module 20, the accumulation step size is set to 12, and 3 samples are input in each training step. A readable text sequence is generated using beam search with a beam size of 4 and a length penalty of 1.0.
In this embodiment, both the document sentence extraction module 10 and the document sentence deliberation module 20 employ the trigram trick. For the document sentence extraction module 10, trigram blocking is used: if a candidate sentence shares a trigram with an already selected sentence, the candidate sentence is skipped, as sketched below. A similar approach is used for the document sentence deliberation module 20: if a generated trigram already exists in the refined abstract, the model sets the corresponding word probability to zero to avoid repeated phrases and sentences.
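A minimal Python sketch of the extraction-side trigram blocking is given below; whitespace tokenization is a simplifying assumption:

```python
def trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_without_trigram_overlap(ranked_sentences, k):
    """Skip any candidate sentence that shares a trigram with the
    sentences already selected (trigram blocking)."""
    selected, seen = [], set()
    for sent in ranked_sentences:
        tri = trigrams(sent.split())
        if tri & seen:
            continue                # trigram overlap -> skip candidate
        selected.append(sent)
        seen |= tri
        if len(selected) == k:
            break
    return selected

ranked = ["the cat sat on the mat",
          "the cat sat by the door",   # shares "the cat sat" -> skipped
          "dogs bark at night"]
print(select_without_trigram_overlap(ranked, 2))
```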
All module programs of the invention are based on BERT-BASE, which fits in GPU memory (GTX 1080 Ti) with little performance penalty. Since the invention uses Transformers with the same stacked layers as BERT, most hyperparameters are kept equal; the module programs are trained with an Adam optimizer using a learning rate of 3e-4, β_1 = 0.9, β_2 = 0.999, and ε = 10^{-9}, with a dynamic learning rate during training and fine-tuning and a batch size of approximately 36. Owing to GPU memory limitations, training of the programs in all modules uses gradient accumulation, as sketched below.
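A minimal PyTorch sketch of this setup follows; the stand-in linear model and the synthetic data are assumptions, while the optimizer settings, batch of 3 samples, and accumulation step size of 12 come from the text above:

```python
import torch

model = torch.nn.Linear(768, 1)   # stand-in for a BERT-BASE-sized module
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), eps=1e-9)

accum_steps = 12                  # gradient-accumulation step size
optimizer.zero_grad()
for step in range(36):            # synthetic training steps
    x = torch.randn(3, 768)       # 3 samples per training step
    target = torch.rand(3, 1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), target)
    (loss / accum_steps).backward()      # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # apply the accumulated update
        optimizer.zero_grad()
```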
Referring to fig. 3, the invention further provides an automatic text abstract generating method, which specifically comprises the following steps:
s1: the original document is converted into a sequence of sentences and each sentence is assigned a corresponding tag.
For example, the sentence sequence S = [S_1, S_2, …, S_i, …, S_n], where S_i denotes the i-th sentence of the entire original document; each sentence S_i consists of words, S_i = [W_1, W_2, …, W_m], where m is the maximum word index and W_m denotes the m-th word; and each sentence S_i has a corresponding label y_i ∈ {0,1}.
S2: extract features from any sentence to obtain sentence-level features, and score the sentence-level features using the ROUGE-1 factor to obtain corresponding sentence scores and rank the sentences.
S2-1: in this embodiment, each sentence S_i with label y_i ∈ {0,1} is input into the BERT model to extract features, obtaining the representation vector h_i: h_i = BERT(S_i), where i ∈ [1, n], i is an integer, and the dimension of h_i is m.
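For illustration, h_i = BERT(S_i) can be realized with the Hugging Face transformers library as sketched below; taking the [CLS] hidden state as the sentence vector is an assumption, since the text only states that BERT produces h_i:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentence(sentence: str) -> torch.Tensor:
    """Representation vector h_i = BERT(S_i); the [CLS] pooling is an assumption."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # [CLS] vector, shape (1, 768)

h_i = encode_sentence("The model extracts key sentences from the document.")
print(h_i.shape)
```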
S2-2: the representation vector h_i is then transformed to obtain sentence-level feature vectors:

h̃^l = LN(h^{l−1} + MHAtt(h^{l−1}))   (1)

h^l = LN(h̃^l + FFN(h̃^l))   (2)

In formula (1), h̃^l denotes the transition sentence-level feature vector of the l-th Transformer layer, l denotes the index of the Transformer layer, h^{l−1} denotes the sentence-level feature vector of the (l−1)-th layer, LN denotes layer normalization, and MHAtt denotes multi-head attention; in formula (2), h^l denotes the sentence-level feature vector of the l-th Transformer layer, h̃^l denotes the transition sentence-level feature vector of the l-th layer, LN denotes layer normalization, and FFN denotes the feed-forward layer.
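For illustration, one Transformer layer realizing formulas (1) and (2) can be sketched in PyTorch as below; the model and feed-forward dimensions are illustrative assumptions, while the 4 attention heads follow the scoring-module configuration:

```python
import torch
from torch import nn

class SentenceEncoderLayer(nn.Module):
    """One Transformer layer realizing formulas (1) and (2):
    h_tilde^l = LN(h^{l-1} + MHAtt(h^{l-1})), h^l = LN(h_tilde^l + FFN(h_tilde^l))."""
    def __init__(self, d_model: int = 768, n_heads: int = 4, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(h, h, h)               # multi-head attention MHAtt
        h_tilde = self.ln1(h + attn_out)               # formula (1)
        return self.ln2(h_tilde + self.ffn(h_tilde))   # formula (2)

layer = SentenceEncoderLayer()
h = torch.randn(1, 5, 768)       # a batch of 5 sentence vectors
print(layer(h).shape)            # torch.Size([1, 5, 768])
```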
S2-3: the output layer outputs the label y_i:

y_i = σ(W_0 · h_i^L + b_0)   (3)

In formula (3), y_i denotes the label of the i-th sentence, σ denotes the Sigmoid function, W_0 denotes the weight of the output layer, h_i^L denotes the sentence-level feature vector of the L-th (i.e., last) Transformer layer, and b_0 denotes the bias of the output layer.
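Formula (3) amounts to a single linear layer followed by a Sigmoid; a minimal PyTorch sketch (the 768-dimensional input is an assumption carried over from BERT-BASE):

```python
import torch
from torch import nn

output_layer = nn.Linear(768, 1)        # holds weight W_0 and bias b_0
h_L = torch.randn(1, 768)               # sentence feature from the L-th layer
y_i = torch.sigmoid(output_layer(h_L))  # formula (3): y_i = sigma(W_0 h_i^L + b_0)
print(float(y_i))                       # probability-like extraction label
```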
S2-5: ROUGE-1 is used as a score to represent the information content of each sentence, and the sentences are sorted in descending order of ROUGE score.
In this embodiment, ROUGE-1 is a conventional method for evaluating automatic summarization, so a detailed description is omitted; a simplified sketch is given below. During sorting, sentences with high ROUGE scores are ranked first and sentences with low ROUGE scores are ranked last.
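For illustration, a simplified ROUGE-1 recall and the resulting ranking can be sketched as follows; production use would rely on the standard ROUGE toolkit rather than this stand-in:

```python
from collections import Counter

def rouge_1_recall(sentence: str, reference: str) -> float:
    """Unigram-overlap recall, a simplified stand-in for ROUGE-1."""
    cand = Counter(sentence.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

def rank_sentences(sentences, reference):
    """Sort sentences in descending order of ROUGE-1 score."""
    return sorted(sentences, key=lambda s: rouge_1_recall(s, reference),
                  reverse=True)

sents = ["stocks rose sharply", "rain fell all day", "the match was delayed"]
print(rank_sentences(sents, "stocks rose sharply on monday"))
```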
Meanwhile, to improve the matching rate, the top 3 ranked sentences are selected as correct labels and used to train the extractor, minimizing the binary cross-entropy loss function and improving the number of correct words:

L_ext = −(1/N) Σ_{n=1}^{N} [g_n · log y_n + (1 − g_n) · log(1 − y_n)]   (5)

In formula (5), L_ext denotes the binary cross-entropy loss function, g_n ∈ {0,1} denotes the correct label of the n-th sentence, N is the total number of sentences, and y_n denotes the predicted label of the n-th sentence.
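Formula (5) is the standard binary cross-entropy, available directly in PyTorch; a minimal sketch with made-up numbers:

```python
import torch

def extractor_loss(y_pred: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Formula (5): y_pred are the sigmoid outputs y_n, g the 0/1 oracle
    labels g_n (the top-3 ranked sentences are set to 1)."""
    return torch.nn.functional.binary_cross_entropy(y_pred, g)

y_pred = torch.tensor([0.9, 0.2, 0.7])   # predicted sentence labels
g = torch.tensor([1.0, 0.0, 1.0])        # oracle labels
print(extractor_loss(y_pred, g))
```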
S3: the first K sentences are selected as first candidate sentences and refined using a non-key-information masking mechanism (based on the attention values), thereby automatically generating the abstract.
In this embodiment, the deliberation includes two processes, masked-word refining and salient-word retention: the words in the first candidate sentences are re-decoded twice, missing information is found, and incorrect words are refined.
In the present embodiment, the original document has the form W = [w_1, w_2, …, w_{n×m}], where n is the index of the sentence and m is the maximum word index of sentence S_i. The invention denotes the masked word sequence in a first candidate sentence as r_{≠j} = [r_1, …, r_{j−1}, r_{j+1}, …, r_m].
The extraction process of the refined words comprises the following steps:
S3-1: select the top K sentences as first candidate sentences and perform word-by-word masking (i.e., input the candidate sentences with the masked word removed into the deliberation module) to obtain second candidate sentences; input the second candidate sentences and the original document into the BERT model to extract features, obtaining the vector H: H = BERT(W), where H denotes the context-dependent bidirectional feature.
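A sketch of this masking-and-encoding step using the transformers library is given below; feeding the masked candidate and the document as a two-sequence input is an assumption, as the text does not specify how the pair is concatenated:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def context_features(candidate: str, document: str, j: int) -> torch.Tensor:
    """Mask the j-th word of a candidate sentence (giving r_{!=j}) and encode
    it together with the original document: H = BERT(W)."""
    words = candidate.split()
    words[j] = tokenizer.mask_token              # word-by-word masking
    inputs = tokenizer(" ".join(words), document,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        H = bert(**inputs).last_hidden_state     # context-dependent feature H
    return H

H = context_features("stocks rose sharply", "Stocks rose sharply on Monday.", 1)
print(H.shape)
```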
S3-2: refine the vector H to generate refined words, combine the refined words with the second candidate sentences to obtain third candidate sentences, and combine these with the retained salient words to automatically generate the abstract.
In this embodiment, the refining process can be expressed as

d_j = f_delib(r_{≠j}, H)   (6)

In formula (6), d_j denotes the refining result of the j-th masked word, f_delib denotes the deliberation function, r_{≠j} denotes the masked word sequence, and H denotes the context-dependent bidirectional feature.

All d_j form a vector, giving the refined word sequence D:

D = [d_1, d_2, …, d_m]
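For illustration, f_delib can be imitated with an off-the-shelf masked language model as sketched below; this is only an analogy, since the patent trains its own deliberation decoder rather than reusing BERT's masked-LM head:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

def refine_word(masked_sentence: str) -> str:
    """One refining step d_j = f_delib(r_{!=j}, H): predict the masked word
    from its bidirectional context."""
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits
    d_j = logits[0, mask_pos].argmax().item()    # refined word id d_j
    return tokenizer.decode([d_j])

print(refine_word(f"the model generates a {tokenizer.mask_token} of the document"))
```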
In this embodiment, the salient-word retention process is specifically as follows:
Humans do not rewrite every word during refinement; they replace only some uninformative phrases and retain most of the salient words. The salient-word retention process extracts the three most informative sentences from the original document and obtains their self-attention distribution, which is used as the weights; the retained value of a salient word is then a weighted sum of the extraction result r and the refined result d:

y_j = (1 − a_j) · d_j + a_j · r_j   (7)

In formula (7), y_j denotes the retained value of the salient word, r_j denotes the extraction result, d_j denotes the refining result of the j-th masked word, and a_j denotes the weight. Words whose retained value is smaller than a preset retention threshold y_pre are removed, so that the retained words reflect the information of the original document.
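A minimal sketch of formula (7) and the threshold test follows; the example weights and the threshold y_pre = 0.5 are made-up numbers:

```python
import torch

def retain_salient(r: torch.Tensor, d: torch.Tensor, a: torch.Tensor,
                   y_pre: float = 0.5):
    """Formula (7): y_j = (1 - a_j) * d_j + a_j * r_j, with a_j taken from
    the self-attention distribution; words below y_pre are removed."""
    y = (1 - a) * d + a * r
    keep = y >= y_pre                 # retention test against the threshold
    return y, keep

r = torch.tensor([0.9, 0.3, 0.8])     # extraction results r_j
d = torch.tensor([0.7, 0.2, 0.6])     # refining results d_j
a = torch.tensor([0.8, 0.1, 0.5])     # self-attention weights a_j
print(retain_salient(r, d, a))        # y = [0.86, 0.21, 0.70]
```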
In this embodiment, the learning objective of the deliberation stage is therefore:

L_delib = −(1/M) Σ_{j=1}^{M} log P(ŷ_j)   (8)

In formula (8), L_delib denotes the loss function, M denotes the total number of words, y_j denotes the generated abstract word, and ŷ_j denotes the correct abstract word.
In this embodiment, L_delib and L_ext are combined to obtain the text abstract model L_model, i.e. L_model = L_delib + L_ext; minimizing the text abstract model L_model yields the automatically generated abstract.
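The combination is a plain unweighted sum of the two scalar losses; a one-line sketch:

```python
import torch

def model_loss(l_delib: torch.Tensor, l_ext: torch.Tensor) -> torch.Tensor:
    """L_model = L_delib + L_ext; equal weighting follows the text, any
    trade-off coefficient would be an assumption."""
    return l_delib + l_ext

print(model_loss(torch.tensor(0.31), torch.tensor(0.42)))   # tensor(0.7300)
```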
The present model was evaluated on the CNN/Daily Mail dataset, which contains news stories from the CNN and Daily Mail websites. Each article in the dataset is paired with a manually written multi-sentence summary. The non-anonymized version of the dataset was used, containing 287,113 training pairs, 13,368 validation pairs, and 11,490 test pairs. For evaluation, the standard ROUGE-1, ROUGE-2, and ROUGE-L criteria were used.
Table 1: ROUGE scores for different models on CNN/Daily Mail dataset
As shown in the last row of Table 1, the model of the invention achieves higher ROUGE-1, ROUGE-2, and ROUGE-L scores than the other models. Compared with the unified model, its performance improves by approximately 5%, and compared with the two-stage method, by 3%.
Meanwhile, the invention also provides a computer storage medium.
The computer storage medium according to the embodiments of the present invention stores a computer program which, when executed by a processor, implements the steps of any of the two-stage automatic text abstract generation methods with salient-word retention described above. The computer storage medium may be any combination of one or more computer-readable media; a computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of implementations of the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (5)
1. An automatic text abstract generation method, characterized by comprising the following steps:
S1: converting an original document into a sentence sequence S = [S_1, S_2, …, S_n] and assigning each sentence a corresponding label y_i;
S2: extracting features from each sentence to obtain sentence-level features, scoring the sentence-level features using the ROUGE-1 factor to obtain corresponding sentence scores, and ranking the sentences;
the step S2 specifically comprises the following steps:
S2-1: inputting any sentence S_i into the BERT model to extract features, thereby obtaining the representation vector h_i: h_i = BERT(S_i), where i ∈ [1, n], i is an integer, and the dimension of h_i is m;
S2-2: transforming the representation vector h_i to obtain sentence-level feature vectors:

h̃^l = LN(h^{l−1} + MHAtt(h^{l−1}))   (1)

h^l = LN(h̃^l + FFN(h̃^l))   (2)

in formulas (1) and (2), h̃^l denotes the transition sentence-level feature vector of the l-th Transformer layer, h^{l−1} denotes the sentence-level feature vector of the (l−1)-th Transformer layer, LN denotes layer normalization, MHAtt denotes multi-head attention, h^l denotes the sentence-level feature vector of the l-th Transformer layer, and FFN denotes the feed-forward layer;
S2-3: outputting the label y_i at the output layer:

y_i = σ(W_0 · h_i^L + b_0)   (3)

in formula (3), y_i denotes the label of sentence S_i, σ denotes the Sigmoid function, W_0 denotes the weight of the output layer, h_i^L denotes the sentence-level feature vector of the L-th Transformer layer, and b_0 denotes the bias of the output layer;
S2-5: using ROUGE-1 as a score to represent the information content of each sentence, and sorting the sentences by score;
S3: selecting the first K sentences as first candidate sentences, extracting refined words and retaining salient words for the first candidate sentences, and then automatically generating the abstract;
wherein the step of extracting the refined words in S3 is as follows:
S3-1: selecting the top K ranked sentences as first candidate sentences and performing word-by-word masking to obtain second candidate sentences; inputting the second candidate sentences and the original document into the BERT model to extract features, obtaining the feature H: H = BERT(W), where H denotes the context-dependent bidirectional feature;
S3-2: refining the feature H to generate refined words, the refining process being expressed as

d_j = f_delib(r_{≠j}, H)   (5)

in formula (5), d_j denotes the refining result of the j-th masked word, f_delib denotes the deliberation function, r_{≠j} denotes the masked word sequence, and H denotes the context-dependent bidirectional feature;
all d_j form a vector, giving the refined word sequence D:

D = [d_1, d_2, …, d_m];
the salient words in S3 are retained as follows:

y_j = (1 − a_j) · d_j + a_j · r_j   (6)

in formula (6), y_j denotes the retained value of the salient word, r_j denotes the extraction result, d_j denotes the refining result of the j-th masked word, and a_j denotes the weight;
L_delib and L_ext are combined to obtain the text abstract model L_model, i.e. L_model = L_delib + L_ext, and the text abstract model L_model is minimized, thereby automatically generating the abstract, where L_ext is the binary cross-entropy loss

L_ext = −(1/N) Σ_{n=1}^{N} [g_n · log y_n + (1 − g_n) · log(1 − y_n)]   (7)

in formula (7), L_ext denotes the binary cross-entropy loss function, g_n ∈ {0,1} denotes the correct label of the n-th sentence, N is the total number of sentences, and y_n denotes the predicted label of the n-th sentence.
2. An automatic text abstract generation system for implementing the method of claim 1, comprising a document sentence extraction module, a document sentence deliberation module, and an abstract generation module;
wherein the original document is input into the document sentence extraction module to extract first candidate sentences; the extracted first candidate sentences and the original document are input into the document sentence deliberation module for refinement, yielding second candidate sentences; and the second candidate sentences are recombined in the abstract generation module to automatically generate the abstract.
3. The automatic text abstract generation system of claim 2, wherein the document sentence extraction module comprises a decomposition module, a scoring module, and a first candidate sentence generation module; wherein:
the decomposition module is used for decomposing the input original document into a sentence sequence, each sentence being provided with a corresponding label;
the scoring module is used for scoring sentences with different labels;
and the first candidate sentence generation module is used for ranking the sentences in order from high score to low and screening out the first K sentences as the first candidate sentences.
4. The automatic text abstract generation system of claim 3, wherein the scoring module comprises 3 Transformer layers and 4 attention heads.
5. The automatic text abstract generation system of claim 2, wherein the document sentence deliberation module comprises 8 Transformer decoder layers and 12 attention heads.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010175795.5A (CN111400486B) | 2020-03-13 | 2020-03-13 | Automatic text abstract generation system and method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010175795.5A (CN111400486B) | 2020-03-13 | 2020-03-13 | Automatic text abstract generation system and method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111400486A | 2020-07-10 |
| CN111400486B | 2023-05-26 |
Family
ID=71432429
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010175795.5A (CN111400486B, Active) | Automatic text abstract generation system and method | 2020-03-13 | 2020-03-13 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN111400486B |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112732901B * | 2021-01-15 | 2024-05-28 | 联想(北京)有限公司 | Digest generation method, digest generation device, computer-readable storage medium, and electronic device |
| CN113377900A * | 2021-05-14 | 2021-09-10 | 中国电子科技集团公司第五十四研究所 | Method for abstracting pushed text based on rewriting and retaining salient words |
| CN113032552B * | 2021-05-25 | 2021-08-27 | 南京鸿程信息科技有限公司 | Text abstract-based policy key point extraction method and system |
| CN113705678B * | 2021-08-28 | 2023-04-28 | 重庆理工大学 | Specific target emotion analysis method for enhancing adversarial learning by using word masking data |
| CN113657097B * | 2021-09-03 | 2023-08-22 | 北京建筑大学 | Evaluation and verification method and system for abstract fact consistency |
| CN114996441B * | 2022-04-27 | 2024-01-12 | 京东科技信息技术有限公司 | Document processing method, device, electronic equipment and storage medium |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110737769A * | 2019-10-21 | 2020-01-31 | 南京信息工程大学 | Pre-training text abstract generation method based on neural topic memory |
| CN110795657A * | 2019-09-25 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Article pushing and model training method and device, storage medium and computer equipment |
- 2020-03-13: application CN202010175795.5A filed; patent granted as CN111400486B; status: Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110795657A * | 2019-09-25 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Article pushing and model training method and device, storage medium and computer equipment |
| CN110737769A * | 2019-10-21 | 2020-01-31 | 南京信息工程大学 | Pre-training text abstract generation method based on neural topic memory |
Non-Patent Citations (2)
| Title |
|---|
| TSPT: a three-stage composite text summarization model based on pre-training; Lü Rui et al.; Application Research of Computers; 2019-10-25 (No. 10); pp. 1-4 * |
| Named entity recognition based on ERNIE; Zhang Xiao et al.; Intelligent Computer and Applications; 2020-03-01 (No. 03); full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111400486A | 2020-07-10 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |