CN111400486B - Automatic text abstract generation system and method - Google Patents

Automatic text abstract generation system and method Download PDF

Info

Publication number
CN111400486B
CN111400486B (application CN202010175795.5A)
Authority
CN
China
Prior art keywords
sentence
representing
sentences
abstract
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010175795.5A
Other languages
Chinese (zh)
Other versions
CN111400486A (en)
Inventor
汪泽鸿
吕昱峰
钟将
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010175795.5A priority Critical patent/CN111400486B/en
Publication of CN111400486A publication Critical patent/CN111400486A/en
Application granted granted Critical
Publication of CN111400486B publication Critical patent/CN111400486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/34 Browsing; Visualisation therefor
    • G06F 16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic text abstract generation system and method. An original document is converted into a sentence sequence S = [S_1, S_2, …, S_n] and a corresponding label y_i is assigned to each sentence. Feature quantities are extracted from each sentence to obtain sentence-level features, which are scored with the ROUGE-1 factor to obtain a score for each sentence, and the sentences are ranked. The first K sentences are selected as first candidate sentences; refined words are extracted and salient words are retained using a non-key-information masking mechanism, and the abstract is then generated automatically. By combining the extraction and generation stages into a two-stage model, the invention can update part of the wording while preserving the important information extracted from the original document, so that the automatically generated abstract accurately reflects the content of the original document. Moreover, the automatic generation of the abstract is not simply a generation stage or a screening stage, but a highly unified process, which improves the completeness of the abstract.

Description

Automatic text abstract generation system and method
Technical Field
The invention relates to the technical field of document abstract generation, in particular to an automatic text abstract generation system and method.
Background
Automatic text abstract generation is a natural language generation task that identifies the key information in an original document and filters out unimportant content so that the key information of the document is preserved. Automatic text abstract generation can be widely applied in fields such as news reporting, survey and research, and education and teaching.
Prior-art methods for automatically generating text abstracts fall into two main categories: extractive methods and generative methods. Extractive methods select key sentences from the original document and splice them together to form the abstract. Generative methods, instead of simply selecting and copying, usually generate new sentences based on a semantic understanding of the original text. Each approach has its own advantages and problems, so researchers have proposed deep-learning methods that combine the two to obtain a unified abstract model. Existing unified models, however, have two problems: 1) they refine the extracted sentences directly and may therefore fail to preserve keywords; 2) using a similar decoder to re-decode new words may cause the refinement process to predict the same word repeatedly.
Disclosure of Invention
Aiming at the prior-art problem that an automatically generated abstract cannot accurately reflect the content of the original document, the present invention provides an automatic text abstract generation system and method that generate the abstract by combining the two stages of sentence extraction and generation, so that the abstract accurately reflects the content information of the original document.
In order to achieve the above object, the present invention provides the following technical solutions:
the automatic text abstract generation system comprises a document sentence extraction module, a document sentence push module and an abstract generation module;
the original document is input into the document sentence extraction module to extract first candidate sentences; the extracted first candidate sentences and the original document are input into the document sentence push module, which refines them to obtain second candidate sentences; and the second candidate sentences are recombined in the abstract generation module to automatically generate the abstract.
Preferably, the document sentence extraction module comprises a decomposition module, a scoring module and a first candidate sentence generation module, wherein:
the decomposition module is used for dividing the input original document into a sentence sequence, each sentence having a corresponding label;
the scoring module is used for scoring the sentences with their different labels;
and the first candidate sentence generation module is used for ranking the sentences from the highest score to the lowest and selecting the first K sentences as the first candidate sentences.
Preferably, the scoring module comprises 3 Transformer layers and 4 attention heads.
Preferably, the document sentence push module comprises 8 Transformer decoder layers and 12 attention heads.
The invention also provides an automatic text abstract generation method, which comprises the following steps:
S1: convert the original document into a sentence sequence S = [S_1, S_2, …, S_n] and assign a corresponding label y_i to each sentence;
S2: extract feature quantities from each sentence to obtain sentence-level features, score the sentence-level features with the ROUGE-1 factor to obtain a score for each sentence, and rank the sentences by score;
S3: select the first K sentences as first candidate sentences, extract refined words and retain salient words for the first candidate sentences, and thereby automatically generate the abstract.
Preferably, step S2 specifically includes the following steps:
S2-1: each sentence S_i is input into the BERT model to extract its feature quantity, giving a representation vector h_i: h_i = BERT(S_i), where i ∈ [1, n], i is an integer, and h_i has dimension m;
S2-2: the representation vector h_i is transformed to obtain sentence-level feature vectors:
h̃_i^l = LN(h_i^(l-1) + MHAtt(h_i^(l-1)))   (1)
h_i^l = LN(h̃_i^l + FFN(h̃_i^l))   (2)
In formulas (1) and (2), h̃_i^l denotes the transition sentence-level feature vector of the l-th Transformer layer, h_i^(l-1) denotes the sentence-level feature vector of the (l-1)-th Transformer layer, LN denotes layer normalization, MHAtt denotes multi-head attention, h_i^l denotes the sentence-level feature vector of the l-th Transformer layer, and FFN denotes the feed-forward layer;
S2-3: the output layer outputs the label y_i:
y_i = σ(W_0·h_i^L + b_0)   (3)
In formula (3), y_i denotes the label of sentence S_i, σ denotes the Sigmoid function, W_0 denotes the weights of the output layer, h_i^L denotes the sentence-level feature vector of the L-th (last) Transformer layer, and b_0 denotes the bias of the output layer;
S2-4: the Sigmoid function converts ŷ_i into an attention value:
a_i = σ(ŷ_i)   (4)
In formula (4), a_i denotes the attention value of sentence S_i, with a_i ∈ (0, 1);
S2-5: ROUGE-1 is used as a score to represent the information content of each sentence, and the sentences are ranked by this score.
Preferably, the extraction of refined words in S3 includes the following steps:
S3-1: the first K ranked sentences are selected as first candidate sentences and masked word by word to obtain second candidate sentences; the second candidate sentences and the original document are input into the BERT model to obtain the feature H, where H denotes the context-dependent bidirectional feature;
S3-2: the feature H is refined to generate refined words; the refinement process can be expressed as
d_j = f_delib(r_{≠j}, H)   (5)
In formula (5), d_j denotes the refinement result of the j-th masked word, f_delib denotes the push (deliberation) function, r_{≠j} denotes the masked-word sequence, and H denotes the context-dependent bidirectional feature;
all d_j together form a vector, giving the refined words D:
D = [d_1, d_2, …, d_m].
Preferably, the salient words in S3 are retained as follows:
y_j = (1 − a_j)·d_j + a_j·r_j   (6)
In formula (6), y_j denotes the retained value of the salient word, r_j denotes the extraction result, d_j denotes the refinement result of the j-th masked word, and a_j denotes the weight.
In summary, by adopting the above technical solution, the invention has at least the following beneficial effects compared with the prior art:
1. By combining the extraction and generation stages into a two-stage model, the invention can update part of the wording while preserving the important information extracted from the original document, so that the automatically generated abstract accurately reflects the content of the original document.
2. The automatic abstract generation process of the invention is not simply a generation stage or a screening stage, but a highly unified process, which improves the completeness of the abstract.
Description of the drawings:
fig. 1 is a schematic diagram of an automatic text summary generation system according to an exemplary embodiment of the present invention.
Fig. 2 is a schematic diagram of a document sentence extraction module according to an exemplary embodiment of the present invention.
Fig. 3 is a flowchart illustrating a text summary automatic generation method according to an exemplary embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to examples and specific embodiments. This should not be understood as limiting the scope of the above-described subject matter to the following examples; all techniques implemented based on the content of the present invention fall within the scope of the present invention.
In the description of the present invention, it should be understood that the terms "longitudinal," "transverse," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the invention.
The invention provides an automatic text abstract generation system which, referring to fig. 1, comprises a document sentence extraction module 10, a document sentence push module 20 and an abstract generation module 30. The original document is input into the document sentence extraction module 10, which screens out a plurality of first candidate sentences; the first candidate sentences and the original document are input into the document sentence push module 20, which refines them to correct information problems in the first candidate sentences (such as missing sentences or incorrect sentences) and obtains a plurality of second candidate sentences; and the abstract generation module 30 recombines the plurality of second candidate sentences to automatically generate an abstract reflecting the information content of the original document.
In this embodiment, referring to fig. 2, the document sentence extraction module 10 includes a decomposition module 101, a scoring module 102, and a first candidate sentence generation module 103.
The decomposition module 101 is used for decomposing the input original document into a sentence sequence, e.g. S = [S_1, S_2, …, S_i, …, S_n], where S_i denotes the i-th sentence of the original document. Each sentence S_i is composed of words, S_i = [W_1, W_2, …, W_m], where W_m denotes the m-th word, and each sentence S_i has a corresponding label y_i ∈ {0, 1}.
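For illustration, a minimal Python sketch of this decomposition step is given below; the regular-expression sentence splitting, whitespace word tokenization and oracle labelling against a reference summary are assumptions made for the example, not the disclosed implementation.

```python
import re

def decompose_document(document, reference_summary_sentences=()):
    """Split a document into a sentence sequence S = [S_1, ..., S_n] with
    word lists S_i = [W_1, ..., W_m] and labels y_i in {0, 1}."""
    # Naive end-of-sentence splitting; any sentence tokenizer could be used.
    raw = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    sentences = [s.split() for s in raw]              # S_i = [W_1, ..., W_m]
    labels = [1 if s in reference_summary_sentences else 0 for s in raw]  # y_i
    return sentences, labels
```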
The scoring module 102 is used for scoring sentences with different labels. The scoring module 102 is configured with 3 Transformer layers and 4 attention heads. It is pre-trained for 500,000 steps with gradient accumulation every two steps; a checkpoint is saved and evaluated on the validation set every 1,000 steps. The three checkpoints with the lowest evaluation loss on the validation set are selected, and their averaged result is reported on the test set. The trained scoring module 102 can then be used to score sentences.
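The training schedule just described (gradient accumulation every two steps, a checkpoint saved and evaluated on the validation set every 1,000 steps, and selection of the three best checkpoints) could be organized as in the following sketch; `model`, `optimizer`, `train_batches` and `eval_fn` are assumed interfaces provided by the surrounding code, not a fixed API.

```python
def train_scoring_module(model, optimizer, train_batches, eval_fn,
                         total_steps=500_000, accum_every=2, eval_every=1_000):
    """Skeleton of the pre-training loop for the scoring module."""
    checkpoints = []                                  # (validation loss, step)
    for step in range(1, total_steps + 1):
        loss = model.training_loss(next(train_batches))
        loss.backward()                               # accumulate gradients
        if step % accum_every == 0:                   # update every two steps
            optimizer.step()
            optimizer.zero_grad()
        if step % eval_every == 0:                    # checkpoint every 1000 steps
            checkpoints.append((eval_fn(model), step))
    # Keep the three checkpoints with the lowest validation loss; their
    # test-set results are averaged elsewhere.
    return sorted(checkpoints)[:3]
```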
The first candidate sentence generation module 103 is configured to rank the sentences from the highest score to the lowest and select the first K sentences as the first candidate sentences.
In this embodiment, the document sentence push module 20 is built from Transformer decoder layers with multi-head attention. The number of Transformer decoder layers is set to 8 and the number of attention heads (multi-head attention) to 12; for regularization, dropout and layer normalization are used, with the dropout rate set to 0.15.
During training of the document sentence push module 20, gradient accumulation is used with an accumulation step size of 12, and 3 samples are input at each training step. A readable text sequence is generated using beam search with a beam size of 4 and a length penalty of 1.0.
In this embodiment, both the document sentence extraction module 10 and the document sentence push module 20 employ a trigram-blocking technique. For the document sentence extraction module 10, trigram blocking means that if a trigram of a candidate sentence overlaps a trigram of an already selected sentence, the candidate sentence is skipped; a similar approach is used for the document sentence push module 20: if generating a word would produce a trigram that already exists in the refined abstract, the model sets that word's probability to zero, thereby avoiding repeated phrases and sentences.
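The trigram-blocking rule described above can be sketched as follows for the extraction side; the greedy top-K loop and whitespace tokenization are assumptions made for the example.

```python
def trigrams(words):
    """Set of word trigrams of a token list."""
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def select_with_trigram_blocking(ranked_sentences, k):
    """Greedy top-K selection: skip a candidate sentence if any of its
    trigrams already appears in an already selected sentence."""
    selected, seen = [], set()
    for sentence in ranked_sentences:                 # sentences ordered by score
        grams = trigrams(sentence.split())
        if grams & seen:                              # trigram overlap -> skip
            continue
        selected.append(sentence)
        seen |= grams
        if len(selected) == k:
            break
    return selected
```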
All module programs of the present invention are based on BERT-BASE, which fits within GPU memory (GTX 1080 Ti) with only a small performance penalty. Since the invention uses Transformers with the same stacked layers as BERT, most hyperparameters are shared. The programs in each module are trained with an Adam optimizer using a learning rate of 3e-4, β1 = 0.9, β2 = 0.999 and ε = 10^-9; a dynamic learning rate is used during training and fine-tuning, with a batch size of approximately 36. Because of GPU memory limitations, training in all modules uses gradient accumulation.
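A sketch of the optimizer configuration stated above is given below; the linear warm-up used as the "dynamic learning rate" is an assumption, since the exact schedule is not specified.

```python
import torch

def build_optimizer(model, warmup_steps=10_000):
    """Adam optimizer with lr = 3e-4, betas = (0.9, 0.999), eps = 1e-9,
    plus an assumed linear warm-up as the dynamic learning-rate schedule."""
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                                 betas=(0.9, 0.999), eps=1e-9)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler
```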
Referring to fig. 3, the invention further provides an automatic text abstract generating method, which specifically comprises the following steps:
s1: the original document is converted into a sequence of sentences and each sentence is assigned a corresponding tag.
For example, the sentence sequence is S = [S_1, S_2, …, S_i, …, S_n], where S_i denotes the i-th sentence of the original document; each sentence S_i is composed of words, S_i = [W_1, W_2, …, W_m], where m is the maximum word index and W_m denotes the m-th word, and each sentence S_i has a corresponding label y_i ∈ {0, 1}.
S2: feature quantities are extracted from each sentence to obtain sentence-level features, and the sentence-level features are scored using the ROUGE-1 factor to obtain a score for each sentence; the sentences are then ranked.
S2-1: in this embodiment, each sentence S_i with label y_i ∈ {0, 1} is input into the BERT model to extract its feature quantity, giving a representation vector h_i: h_i = BERT(S_i), where i ∈ [1, n], i is an integer, and h_i has dimension m.
S2-2: the representation vector h_i is then transformed to obtain sentence-level feature vectors:
h̃_i^l = LN(h_i^(l-1) + MHAtt(h_i^(l-1)))   (1)
h_i^l = LN(h̃_i^l + FFN(h̃_i^l))   (2)
In formula (1), h̃_i^l denotes the transition sentence-level feature vector of the l-th Transformer layer, l denotes the index of the Transformer layer, h_i^(l-1) denotes the sentence-level feature vector of the (l-1)-th Transformer layer, LN denotes layer normalization, and MHAtt denotes multi-head attention.
In formula (2), h_i^l denotes the sentence-level feature vector of the l-th Transformer layer, h̃_i^l denotes the transition sentence-level feature vector of the l-th Transformer layer, LN denotes layer normalization, and FFN denotes the feed-forward layer.
S2-3: the output layer outputs the label y_i:
y_i = σ(W_0·h_i^L + b_0)   (3)
In formula (3), y_i denotes the label of the i-th sentence, σ denotes the Sigmoid function, W_0 denotes the weights of the output layer, h_i^L denotes the sentence-level feature vector of the L-th (i.e. last) Transformer layer, and b_0 denotes the bias of the output layer.
S2-4: the Sigmoid function converts ŷ_i into an attention value:
a_i = σ(ŷ_i)   (4)
In formula (4), a_i denotes the attention value of sentence S_i, with a_i ∈ (0, 1).
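For illustration, the following Python (PyTorch) sketch shows one possible realization of the sentence-scoring stage described by formulas (1)-(4), together with the binary cross-entropy training loss given later in formula (5). The class and function names, the hidden dimension, and the mapping from ŷ_i to the attention value a_i are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    """Sketch of formulas (1)-(4): BERT sentence vectors h_i pass through
    stacked Transformer layers (multi-head attention and feed-forward, each
    with a residual connection and layer normalization), then a linear output
    layer with a Sigmoid produces the sentence label/score."""

    def __init__(self, dim_m, num_layers=3, num_heads=4):
        super().__init__()
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim_m, num_heads, batch_first=True)
             for _ in range(num_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim_m, 4 * dim_m), nn.GELU(),
                           nn.Linear(4 * dim_m, dim_m))
             for _ in range(num_layers)])
        self.ln1 = nn.ModuleList([nn.LayerNorm(dim_m) for _ in range(num_layers)])
        self.ln2 = nn.ModuleList([nn.LayerNorm(dim_m) for _ in range(num_layers)])
        self.out = nn.Linear(dim_m, 1)               # W_0, b_0

    def forward(self, h):                            # h: (batch, n_sentences, m)
        for attn, ffn, ln1, ln2 in zip(self.attn, self.ffn, self.ln1, self.ln2):
            mha, _ = attn(h, h, h)                   # MHAtt(h^{l-1})
            h_tilde = ln1(h + mha)                   # formula (1)
            h = ln2(h_tilde + ffn(h_tilde))          # formula (2)
        score = self.out(h).squeeze(-1)              # W_0 h_i^L + b_0
        y = torch.sigmoid(score)                     # formula (3): label y_i
        a = torch.sigmoid(score)                     # formula (4): attention value a_i (assumed mapping)
        return y, a

def extractor_loss(y_pred, g_true):
    """Binary cross-entropy of formula (5) below: y_pred are predicted
    sentence labels in (0, 1), g_true the gold labels in {0, 1}."""
    eps = 1e-12
    return -torch.mean(g_true * torch.log(y_pred + eps)
                       + (1 - g_true) * torch.log(1 - y_pred + eps))
```

In the disclosed embodiment the scoring module uses 3 Transformer layers and 4 attention heads, matching the defaults chosen above.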
S2-5: the information amount of each sentence is represented by ROUGE-1 as a score, and the sentences are ordered in this order according to the high and low ROUGE scores.
In this embodiment, ROUGE-1 is a conventional method for evaluating automatic summarization, so a detailed description is omitted here. When ranking, sentences with high ROUGE scores are placed toward the front and sentences with low ROUGE scores toward the back.
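As an illustration of this ROUGE-1-based ranking, the following sketch computes the unigram recall of each sentence against a reference summary and sorts the sentences by that score; the function names and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def rouge_1_recall(sentence, reference_summary):
    """ROUGE-1 (unigram) recall of a sentence against the reference summary."""
    cand = Counter(sentence.lower().split())
    ref = Counter(reference_summary.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(1, sum(ref.values()))

def rank_by_rouge_1(sentences, reference_summary):
    """Order sentences from the highest ROUGE-1 score to the lowest."""
    return sorted(sentences,
                  key=lambda s: rouge_1_recall(s, reference_summary),
                  reverse=True)
```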
Meanwhile, in order to improve the matching rate, the first 3 sentences in the ranking are selected as the correct labels and used to train the extractor, so that the binary cross-entropy loss function is minimized and the number of correct words is increased.
L_ext = −(1/N) Σ_{n=1}^{N} [g_n·log y_n + (1 − g_n)·log(1 − y_n)]   (5)
In formula (5), L_ext denotes the binary cross-entropy loss function, g_n ∈ {0, 1} denotes the correct label of the n-th sentence, N is the total number of sentences, and y_n denotes the predicted label of the n-th sentence.
S3: the first K sentences are selected as first candidate sentences, and the first candidate sentences are pushed by using a non-key information shielding mechanism (according to the attention value), so that the abstract is automatically generated.
In this embodiment, the push (deliberation) stage includes two processes, masked-word refinement and salient-word retention; that is, the words in the first candidate sentences are decoded a second time, missing information is found, and incorrect words are refined.
In the present embodiment, the original document takes the form W = [w_1, w_2, …, w_{n×m}], where n is the sentence index and m is the maximum word index of sentence S_i. The masked words of the first candidate sentence are denoted r_{≠j} = [r_1, …, r_{j−1}, r_{j+1}, …, r_m].
The extraction process of the refined words comprises the following steps:
S3-1: the first K sentences are selected as first candidate sentences and masked word by word (i.e., the candidate sentence with the masked word removed is input into the push module) to obtain second candidate sentences; the second candidate sentences and the original document are input into the BERT model to extract the feature quantity, giving a vector H: H = BERT(W), where H denotes the context-dependent bidirectional feature.
S3-2: the vector H is refined to generate refined words; the refined words are combined with the second candidate sentences to obtain third candidate sentences, which are combined with the retained salient words to automatically generate the abstract.
In this embodiment, the refinement process can be expressed as
d_j = f_delib(r_{≠j}, H)   (6)
In formula (6), d_j denotes the refinement result of the j-th masked word, f_delib denotes the push (deliberation) function, r_{≠j} denotes the masked-word sequence, and H denotes the context-dependent bidirectional feature.
All d_j together form a vector, giving the refined words D:
D = [d_1, d_2, …, d_m]
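For illustration, the word-by-word masking and refinement of formula (6) can be sketched as follows; `bert_encode` and `deliberation_decoder` are assumed callables standing in for the BERT encoder and the push (deliberation) decoder, not a fixed API.

```python
def refine_candidate(candidate_words, original_document, bert_encode,
                     deliberation_decoder):
    """For each position j, mask the j-th word (r_{!=j}), encode the original
    document into the context feature H, and let the push decoder predict a
    refined word d_j; the results form D = [d_1, ..., d_m]."""
    H = bert_encode(original_document)                         # H = BERT(W)
    refined = []
    for j in range(len(candidate_words)):
        r_not_j = candidate_words[:j] + candidate_words[j + 1:]  # mask word j
        refined.append(deliberation_decoder(r_not_j, H))       # d_j = f_delib(r_{!=j}, H)
    return refined                                             # D
```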
in this embodiment, the process of preserving the salient word specifically includes:
humans do not rewrite each word during the refinement process, they replace only some of the unused information phrases, and retain most of the salient words. The salient word retaining process is to extract three sentences with high information content from the original document and obtain self attention (self attention) distribution weight thereof, namely, self attention distribution
Figure BDA0002410778810000091
As the weight, the reserved value of the salient word is a snapshotTaking a weighted sum of the result r and the refined result d:
y j =(1-a j )d j +a j r j (7)
in the formula (7), y j Representing the reserved value of the salient word, r j Representing the extraction result, d j Representing the refining result of the jth mask word, a j Representing the weight. I.e. the retention value is smaller than the preset retention threshold y Pre-preparation The corresponding words of the (c) are removed so that the retained words can reflect the information of the original document.
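The salient-word retention of formula (7) together with the threshold y_pre can be sketched as follows; treating r_j and d_j as per-word scores and the threshold value 0.5 are assumptions made for the example.

```python
def retain_salient_words(extracted, refined, attention, threshold=0.5):
    """Weighted combination y_j = (1 - a_j) * d_j + a_j * r_j; words whose
    retained value falls below the preset threshold y_pre are dropped."""
    kept = []
    for r_j, d_j, a_j in zip(extracted, refined, attention):
        y_j = (1 - a_j) * d_j + a_j * r_j
        if y_j >= threshold:                 # keep only sufficiently salient words
            kept.append(y_j)
    return kept
```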
In this embodiment, therefore, the learning objective of the push (deliberation) stage is:
L_delib = −(1/M) Σ_{j=1}^{M} log P(ŷ_j | r_{≠j}, H)   (8)
In formula (8), L_delib denotes the loss function, M denotes the total number of words, y_j denotes the predicted abstract word, and ŷ_j denotes the correct abstract word.
In this embodiment, L_delib and L_ext are combined to obtain the text abstract model loss L_model, i.e. L_model = L_delib + L_ext, and the abstract is generated automatically by minimizing L_model.
The present model was evaluated on the CNN/Daily Mail dataset, which contains news stories from the CNN and Daily Mail websites. Each article in the dataset is paired with a manually written multi-sentence abstract. The non-anonymized version of the dataset was used, containing 287,113 training pairs, 13,368 validation pairs and 11,490 test pairs. For evaluation, the ROUGE-1, ROUGE-2 and ROUGE-L criteria were used.
Table 1: ROUGE scores for different models on CNN/Daily Mail dataset
As shown in the last row of Table 1, the model of the present invention achieves higher ROUGE-1, ROUGE-2 and ROUGE-L scores than the other models. Its performance is improved by approximately 5% compared with the unified model and by 3% compared with the two-stage method.
Meanwhile, the invention also provides a computer storage medium.
The computer storage medium according to the embodiments of the present invention stores a computer program which, when executed by a processor, implements any one of the steps of the two-stage automatic text abstract generation method with salient-word retention described above. The computer storage medium may be any combination of one or more computer-readable media, and a computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of implementations of the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (5)

1. An automatic text abstract generation method, characterized by comprising the following steps:
S1: converting an original document into a sentence sequence S = [S_1, S_2, …, S_n] and assigning a corresponding label y_i to each sentence;
S2: extracting feature quantities from each sentence to obtain sentence-level features, scoring the sentence-level features with the ROUGE-1 factor to obtain a score for each sentence, and ranking the sentences by score;
wherein step S2 specifically comprises the following steps:
S2-1: inputting each sentence S_i into the BERT model to extract its feature quantity, thereby obtaining a representation vector h_i: h_i = BERT(S_i), where i ∈ [1, n], i is an integer, and h_i has dimension m;
S2-2: transforming the representation vector h_i to obtain sentence-level feature vectors:
h̃_i^l = LN(h_i^(l-1) + MHAtt(h_i^(l-1)))   (1)
h_i^l = LN(h̃_i^l + FFN(h̃_i^l))   (2)
in formulas (1) and (2), h̃_i^l denotes the transition sentence-level feature vector of the l-th Transformer layer, h_i^(l-1) denotes the sentence-level feature vector of the (l-1)-th Transformer layer, LN denotes layer normalization, MHAtt denotes multi-head attention, h_i^l denotes the sentence-level feature vector of the l-th Transformer layer, and FFN denotes the feed-forward layer;
S2-3: outputting the label y_i from the output layer:
y_i = σ(W_0·h_i^L + b_0)   (3)
in formula (3), y_i denotes the label of sentence S_i, σ denotes the Sigmoid function, W_0 denotes the weights of the output layer, h_i^L denotes the sentence-level feature vector of the L-th Transformer layer, and b_0 denotes the bias of the output layer;
S2-4: converting ŷ_i into an attention value with the Sigmoid function:
a_i = σ(ŷ_i)   (4)
in formula (4), a_i denotes the attention value of sentence S_i, with a_i ∈ (0, 1);
S2-5: using ROUGE-1 as a score to represent the information content of each sentence, and ranking the sentences by this score;
S3: selecting the first K sentences as first candidate sentences, extracting refined words and retaining salient words for the first candidate sentences, and thereby automatically generating the abstract;
wherein the step of extracting the refined words in step S3 is as follows:
S3-1: selecting the first K ranked sentences as first candidate sentences and masking them word by word to obtain second candidate sentences; inputting the second candidate sentences and the original document into the BERT model to extract features, giving the feature H: H = BERT(W), where H denotes the context-dependent bidirectional feature;
S3-2: refining the feature H to generate refined words, wherein the refinement process can be expressed as
d_j = f_delib(r_{≠j}, H)   (5)
in formula (5), d_j denotes the refinement result of the j-th masked word, f_delib denotes the push (deliberation) function, r_{≠j} denotes the masked-word sequence, and H denotes the context-dependent bidirectional feature;
all d_j together form a vector, giving the refined words D:
D = [d_1, d_2, …, d_m];
the salient words in S3 are retained as follows:
y_j = (1 − a_j)·d_j + a_j·r_j   (6)
in formula (6), y_j denotes the retained value of the salient word, r_j denotes the extraction result, d_j denotes the refinement result of the j-th masked word, and a_j denotes the weight;
L_delib and L_ext are combined to obtain the text abstract model loss L_model, i.e. L_model = L_delib + L_ext, and the abstract is generated automatically by minimizing L_model, where
L_ext = −(1/N) Σ_{n=1}^{N} [g_n·log y_n + (1 − g_n)·log(1 − y_n)]   (7)
in formula (7), L_ext denotes the binary cross-entropy loss function, g_n ∈ {0, 1} denotes the correct label of the n-th sentence, N is the total number of sentences, and y_n denotes the predicted label of the n-th sentence;
L_delib = −(1/M) Σ_{j=1}^{M} log P(ŷ_j | r_{≠j}, H)   (8)
in formula (8), L_delib denotes the loss function, M denotes the total number of words, y_j denotes the predicted abstract word, and ŷ_j denotes the correct abstract word.
2. An automatic text abstract generation system for implementing the method of claim 1, comprising a document sentence extraction module, a document sentence push module and an abstract generation module;
wherein the original document is input into the document sentence extraction module to extract first candidate sentences; the extracted first candidate sentences and the original document are input into the document sentence push module, which refines them to obtain second candidate sentences; and the second candidate sentences are recombined in the abstract generation module to automatically generate the abstract.
3. The automatic text abstract generation system of claim 2, wherein the document sentence extraction module comprises a decomposition module, a scoring module and a first candidate sentence generation module, wherein:
the decomposition module is used for dividing the input original document into a sentence sequence, each sentence having a corresponding label;
the scoring module is used for scoring the sentences with their different labels;
and the first candidate sentence generation module is used for ranking the sentences from the highest score to the lowest and selecting the first K sentences as the first candidate sentences.
4. The automatic text abstract generation system of claim 3, wherein the scoring module comprises 3 Transformer layers and 4 attention heads.
5. The automatic text abstract generation system of claim 2, wherein the document sentence push module comprises 8 Transformer decoder layers and 12 attention heads.
CN202010175795.5A 2020-03-13 2020-03-13 Automatic text abstract generation system and method Active CN111400486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010175795.5A CN111400486B (en) 2020-03-13 2020-03-13 Automatic text abstract generation system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010175795.5A CN111400486B (en) 2020-03-13 2020-03-13 Automatic text abstract generation system and method

Publications (2)

Publication Number Publication Date
CN111400486A CN111400486A (en) 2020-07-10
CN111400486B true CN111400486B (en) 2023-05-26

Family

ID=71432429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010175795.5A Active CN111400486B (en) 2020-03-13 2020-03-13 Automatic text abstract generation system and method

Country Status (1)

Country Link
CN (1) CN111400486B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732901B (en) * 2021-01-15 2024-05-28 联想(北京)有限公司 Digest generation method, digest generation device, computer-readable storage medium, and electronic device
CN113377900A (en) * 2021-05-14 2021-09-10 中国电子科技集团公司第五十四研究所 Method for abstracting pushed text based on rewriting and retaining salient words
CN113032552B (en) * 2021-05-25 2021-08-27 南京鸿程信息科技有限公司 Text abstract-based policy key point extraction method and system
CN113705678B (en) * 2021-08-28 2023-04-28 重庆理工大学 Specific target emotion analysis method for enhancing antagonism learning by using word shielding data
CN113657097B (en) * 2021-09-03 2023-08-22 北京建筑大学 Evaluation and verification method and system for abstract fact consistency
CN114996441B (en) * 2022-04-27 2024-01-12 京东科技信息技术有限公司 Document processing method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737769A (en) * 2019-10-21 2020-01-31 南京信息工程大学 pre-training text abstract generation method based on neural topic memory
CN110795657A (en) * 2019-09-25 2020-02-14 腾讯科技(深圳)有限公司 Article pushing and model training method and device, storage medium and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795657A (en) * 2019-09-25 2020-02-14 腾讯科技(深圳)有限公司 Article pushing and model training method and device, storage medium and computer equipment
CN110737769A (en) * 2019-10-21 2020-01-31 南京信息工程大学 pre-training text abstract generation method based on neural topic memory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TSPT: A three-stage composite text summarization model based on pre-training; 吕瑞 et al.; Application Research of Computers (《计算机应用研究》); 2019-10-25 (No. 10); pp. 1-4 *
Named entity recognition based on ERNIE; 张晓 et al.; Intelligent Computer and Applications (《智能计算机与应用》); 2020-03-01 (No. 03); full text *

Also Published As

Publication number Publication date
CN111400486A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111400486B (en) Automatic text abstract generation system and method
Bordes et al. Large-scale simple question answering with memory networks
US9519634B2 (en) Systems and methods for determining lexical associations among words in a corpus
Brodsky et al. Characterizing motherese: On the computational structure of child-directed language
Ismail et al. Bangla word clustering based on n-gram language model
KR101333485B1 (en) Method for constructing named entities using online encyclopedia and apparatus for performing the same
JP5812534B2 (en) Question answering apparatus, method, and program
Kübler et al. Fast domain adaptation for part of speech tagging for dialogues
CN113673241A (en) Text abstract generation framework and method based on example learning
Hao et al. SCESS: a WFSA-based automated simplified chinese essay scoring system with incremental latent semantic analysis
Post et al. Bayesian tree substitution grammars as a usage-based approach
Sababa et al. A classifier to distinguish between cypriot greek and standard modern greek
CN112115256A (en) Method and device for generating news text abstract integrated with Chinese stroke information
Siddique et al. Bilingual word embeddings for cross-lingual personality recognition using convolutional neural nets
Al-Sultany et al. Enriching tweets for topic modeling via linking to the wikipedia
Li et al. HMM-based address parsing: efficiently parsing billions of addresses on MapReduce
JP2008282328A (en) Text sorting device, text sorting method, text sort program, and recording medium with its program recorded thereon
Xu et al. Historical changes in semantic weights of sub-word units
Khan et al. A corpus based sql formation from bangla language using neural machine translation
Reddy et al. Text Summarization of Telugu Scripts
CN117150002B (en) Abstract generation method, system and device based on dynamic knowledge guidance
Sati et al. Arabic text question answering from an answer retrieval point of view: A survey
Udagedara et al. Language model-based spell-checker for sri lankan names and addresses
Reyes-Barragán et al. INAOE at QAST 2009: Evaluating the Usefulness of a Phonetic Codification of Transcriptions.
Mohapatra et al. Incorporating Localised Context in Wordnet for Indic Languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant