CN110765768A - Optimized text abstract generation method

Info

Publication number
CN110765768A
CN110765768A (application CN201910981470.3A)
Authority
CN
China
Prior art keywords
cnn
text
decoder
extracted
model
Prior art date
Legal status
Pending
Application number
CN201910981470.3A
Other languages
Chinese (zh)
Inventor
刘博
申利彬
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201910981470.3A
Publication of CN110765768A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

An optimized text abstract generation method belongs to the field of natural language generation, and particularly relates to sequence-to-sequence text abstract generation. Chinese data is first preprocessed by cleaning and the like; the article is then fed into an AS-CNN model at the Encoder end to extract features, and the features are passed to a Decoder end composed of Transformer layers. The network not only exploits the parallel capability of the CNN and the Transformer, making full use of the hardware and accelerating training, but also, by using a CNN at the Encoder end, reduces the model's parameters, alleviates over-fitting, and broadens the model's range of application.

Description

Optimized text abstract generation method
Technical field:
The invention belongs to the field of natural language generation, and particularly relates to a sequence-to-sequence text abstract generation method.
Background art:
with the rapid development of information technology, an explosion of information is reshaping people's lives. On the one hand, the internet now hosts a huge number of web pages and texts, yet texts on related topics contain a great deal of redundant content, and reading the repetitions costs people considerable time and energy. On the other hand, social development has quickened the pace of life, and increasingly fragmented time pushes people to obtain content through the internet rather than from traditional paper materials such as books. How to extract the main content from massive text information has therefore become a hot spot of academic research.
With regard to text summarization, many scholars at home and abroad have studied the field in depth and contributed a number of usable techniques. The earliest scholars proposed Extractive Text Summarization (ETS), which mainly uses traditional statistical methods to extract the segments that best summarize the subject matter of the content. Although this approach can capture the primary content to some extent, its main problem is that the extracted summary may be semantically incoherent. Subsequently, researchers proposed Abstractive Text Summarization (ATS), which effectively addresses the semantic incoherence of summaries generated by ETS. ATS uses deep learning (DL): a neural network is trained to imitate human writing habits and then generates the text abstract. The classic network architecture here is Sequence-to-Sequence (Seq2Seq), first proposed by Cho et al., consisting of an Encoder that encodes the source text input and a Decoder that decodes and outputs the target text. This architecture was originally based on a Recurrent Neural Network (RNN), but because the RNN consumes input and produces output sequentially, training cannot be parallelized and is time-consuming. Jonas et al. therefore proposed a Seq2Seq model based on a Convolutional Neural Network (CNN) to speed up training. A convolutional network, however, is weaker at encoding sequential language information; in 2017 the Transformer model proposed by Ashish et al. both models language information well and trains in parallel. The Transformer, though, is a self-attention model with a 6-layer Encoder and a 6-layer Decoder; it has many parameters and an oversized overall structure, which makes efficient laboratory research difficult.
Disclosure of Invention
The invention mainly solves the technical problem of reducing the parameters of the Encoder module and increasing the training speed without affecting performance. A CNN model suited to text summarization, the abstractive-summarization convolutional neural network (AS-CNN), is provided; it improves on the TextCNN proposed by Yoon, and the AS-CNN encoding results are sent to the Decoder module of a Transformer to generate the summary.
The invention provides a method for quickly training a summary generation model on massive text data. Spaces and special characters are removed from the text data, the text is cleaned according to word frequency, and the required dictionary is then constructed, where the keys of the dictionary are words and the values are the corresponding ids of the words. The article to be processed is converted into the corresponding ids according to the dictionary; a word vector matrix is initialized at the Embedding layer of the model, and the word vector corresponding to each word is then looked up by its id. The word vectors are fed into the Encoder end of the model for feature extraction. Different models produce very different numbers of parameters during feature extraction, and for some models the parameter count grows exponentially, placing high demands on computing hardware; replacing the feature extraction method at the Encoder stage therefore reduces the number of model parameters while still obtaining rich features. A minimal sketch of this preprocessing is given below.
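A minimal sketch of this preprocessing, assuming Python with PyTorch and the jieba segmenter (the patent names no specific tools; all identifiers below are illustrative):

```python
import re
from collections import Counter

import jieba                     # a common Chinese word-segmentation library
import torch
import torch.nn as nn

def clean(text: str) -> str:
    # Remove spaces and special characters from the text data.
    return re.sub(r"[^\w]", "", text)

def build_vocab(docs, min_freq=2):
    # Keys of the dictionary are words, values are the corresponding ids;
    # low-frequency words are cleaned out according to word frequency.
    freq = Counter(word for doc in docs for word in doc)
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, count in freq.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab

raw_articles = ["随着信息技术的快速发展，互联网上出现了海量文本。随着时间推移，文本越来越多。"]
docs = [jieba.lcut(clean(article)) for article in raw_articles]
vocab = build_vocab(docs, min_freq=1)

# Convert the article into ids, then look up each word's vector in a
# randomly initialized word vector matrix (the Embedding layer).
ids = torch.tensor([[vocab.get(w, vocab["<unk>"]) for w in docs[0]]])
embedding = nn.Embedding(len(vocab), 300)
word_vectors = embedding(ids)    # shape: (1, seq_len, 300)
```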
In order to achieve this purpose, the invention adopts the following technical scheme: to avoid an excessive number of Encoder-end parameters in the feature extraction stage while preserving parallel training, the AS-CNN algorithm is adopted as the Encoder end of the model, and effective text features are extracted with convolution kernels of sizes chosen according to article length. The extracted text features are then input to the Decoder end, which adopts the self-attention mechanism of the Transformer model, so that the strengths of Transformer text generation are retained while the parameter count is reduced. A text abstract generation framework based on the AS-CNN and Transformer architectures is thus obtained.
A method for optimizing text summary generation comprises the following steps:
Step 1, acquiring the relevant text data from which an abstract is to be generated, and performing the necessary text data processing and word segmentation.
Step 2, constructing the relevant dictionary for the processed text, setting the word vector dimension and randomly initializing all word vectors, wherein each word corresponds to a unique id.
Step 3, inputting the article's word vectors into the AS-CNN at the Encoder end of the model for feature extraction.
Step 4, sending the feature vectors extracted by the AS-CNN into the Decoder end of a Transformer for decoding, to generate an abstract of the article (an end-to-end sketch follows this list).
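The four steps can be wired together as below: a hedged end-to-end sketch in PyTorch, where the AS-CNN encoder is abbreviated to a single convolution and all names are illustrative rather than the patent's own:

```python
import torch
import torch.nn as nn

class ASCNNTransformerSketch(nn.Module):
    # Encoder: a stand-in for the AS-CNN of steps 3.1-3.5;
    # Decoder: a Transformer Decoder that cross-attends to the CNN features.
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Step 3: extract text features with the CNN encoder.
        memory = self.encoder(self.embed(src_ids).transpose(1, 2)).transpose(1, 2)
        # Step 4: decode, using the CNN features as the Decoder's memory.
        hidden = self.decoder(self.embed(tgt_ids), memory)
        return self.out(hidden)   # per-position scores over the vocabulary

model = ASCNNTransformerSketch(vocab_size=10000)
scores = model(torch.randint(0, 10000, (1, 7)), torch.randint(0, 10000, (1, 5)))
print(scores.shape)               # torch.Size([1, 5, 10000])
```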
Preferably, step 3 specifically comprises the following steps:
Step 3.1, setting the sizes of the convolution kernels and the number of kernels of each size according to the length of the article;
Step 3.2, extracting sentence features of different lengths with the different convolution kernels;
Step 3.3, padding the sentence features of different lengths so that the sentence lengths are consistent, generally taking the longest sentence length as the standard;
Step 3.4, performing feature fusion on the features extracted by the different convolution kernels;
Step 3.5, mapping the fused feature vectors through a fully connected network.
Preferably, step 4 specifically comprises the following steps (a sketch follows this list):
Step 4.1, performing dimension conversion on the text feature vectors extracted by the AS-CNN so that they can be input to the Decoder end of the Transformer;
Step 4.2, taking the AS-CNN feature vectors as the keys and values matrices in the self-attention mechanism at the Decoder end, computing the attention weights, and applying them to the queries matrix from the Decoder-end input;
Step 4.3, finding the words to be generated through a Softmax layer according to the semantic vectors produced by the Decoder.
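A hedged sketch of steps 4.1-4.3 in PyTorch (single-head, masking omitted; the class and parameter names are illustrative, not the patent's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderCrossAttentionSketch(nn.Module):
    def __init__(self, d_model: int = 512, vocab_size: int = 10000):
        super().__init__()
        # Step 4.1 is assumed done: encoder features arrive as (batch, src_len, d_model).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)   # feeds the Softmax layer

    def forward(self, decoder_input, encoder_features):
        q = self.q_proj(decoder_input)       # queries from the Decoder-end input
        k = self.k_proj(encoder_features)    # keys from the AS-CNN features
        v = self.v_proj(encoder_features)    # values from the AS-CNN features
        # Step 4.2: attention weights over the source positions, applied to the values.
        weights = F.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        context = weights @ v                # semantic vectors, (batch, tgt_len, d_model)
        # Step 4.3: the word to generate is found through the Softmax layer.
        return F.softmax(self.out(context), dim=-1)

attn = DecoderCrossAttentionSketch()
probs = attn(torch.randn(1, 5, 512), torch.randn(1, 6, 512))
print(probs.shape)                           # torch.Size([1, 5, 10000])
```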
Compared with the prior art, the invention has the following clear advantages:
When generating the text abstract, AS-CNN is used to extract the text feature information, and the abstract is then generated by a self-attention mechanism. Compared with other methods this brings two advantages. First, the Encoder end uses AS-CNN for feature extraction rather than a Transformer self-attention mechanism or a recurrent neural network; this change can reduce the parameter count to a hundredth or even a thousandth of the original, which saves hardware memory and markedly raises the iteration speed, while also making full use of the hardware to accelerate training. Second, the convolution kernel sizes of AS-CNN can be chosen freely, which helps with the long-text dependence problem. In summary, the abstract generation method based on AS-CNN and Transformer provided by the invention accelerates training, reduces model parameters, and addresses long-text dependence.
Description of the drawings:
FIG. 1 is a flow chart of the method according to the invention.
FIG. 2 is a schematic diagram of the AS-CNN module.
FIG. 3 is a schematic diagram of the interaction between the AS-CNN and the Transformer Decoder module.
The specific implementation is as follows:
The invention is described in further detail below with reference to a specific network model example and the accompanying drawings.
The hardware used by the invention comprises one PC (personal computer) and one 1080 graphics card.
In this section we conduct extensive experiments to investigate the effect of the proposed method. The operation flow of the network architecture designed by the invention is shown in FIG. 1 and specifically comprises the following steps:
step 1, processing a text data set, removing special symbols, removing low-frequency words according to word frequency, and constructing a dictionary for training. The key in the dictionary is a word, and the value is the id of the word.
And 2, randomly initializing an Embedding layer matrix, and selecting a word vector corresponding to each word according to the id in the dictionary.
Step 3, as shown in FIG. 2, selecting convolution kernels of different sizes to extract text features, with 512 kernels of each size.
Step 3.1, a text of 7 × 300 is input, where the sentence length is 7 and the word vector dimension is 300.
Step 3.2, convolution kernels of three sizes are selected, namely 4 × 300, 3 × 300 and 2 × 300, with 512 kernels of each size.
Step 3.3, taking the 4 × 300 kernels as an example, the feature dimension extracted by one kernel is 4 × 1, so the feature dimension extracted by 512 kernels is 4 × 512; the 3 × 300 kernels extract features of dimension 5 × 512; the 2 × 300 kernels extract features of dimension 6 × 512.
Step 3.4, the features extracted by kernels of different sizes are padded to the same dimension, 6 × 512, and fused into feature vectors of dimension 6 × 1536.
Step 3.5, the convolution-extracted features are mapped to dimension 6 × 512 with a fully connected network. A sketch reproducing these dimensions follows.
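A sketch reproducing the dimensions above, assuming a PyTorch layout with a batch of one (the patent text itself names no framework):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 7, 300)                 # step 3.1: sentence length 7, word vectors of 300

convs = nn.ModuleList([
    nn.Conv2d(1, 512, kernel_size=(4, 300)),  # step 3.2: 512 kernels of size 4 x 300
    nn.Conv2d(1, 512, kernel_size=(3, 300)),  # 512 kernels of size 3 x 300
    nn.Conv2d(1, 512, kernel_size=(2, 300)),  # 512 kernels of size 2 x 300
])

feats = []
for conv in convs:
    f = conv(x).squeeze(-1)                   # step 3.3: (1, 512, L) with L = 4, 5, 6
    f = F.pad(f, (0, 6 - f.size(-1)))         # pad every feature map to the longest length, 6
    feats.append(f.transpose(1, 2))           # (1, 6, 512)

fused = torch.cat(feats, dim=-1)              # step 3.4: (1, 6, 1536)
out = nn.Linear(1536, 512)(fused)             # step 3.5: fully connected mapping
print(out.shape)                              # torch.Size([1, 6, 512])
```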
Step 4, sending the features extracted by the AS-CNN into the Decoder end of the Transformer, where they serve as the keys and values of the self-attention model for computing the attention weights.
Step 5, training the network model, evaluating the quality of the generated abstracts with the BLEU metric, and comparing against a native Transformer on abstract quality and model parameter count to reach the final conclusion.
Step 5.1, training the network model until the loss converges on the validation set; the loss function used is the cross-entropy loss:

Loss = -\frac{1}{N}\sum_{i=1}^{N} y^{(i)} \log \hat{y}^{(i)}

where y^{(i)} is the true value and \hat{y}^{(i)} is the predicted value. A minimal sketch of this criterion follows.
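A minimal sketch of the training criterion using standard PyTorch cross entropy (the vocabulary size and pad id are assumptions):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)        # id 0 assumed to be <pad>
logits = torch.randn(1, 6, 10000, requires_grad=True)  # (batch, tgt_len, vocab) predictions
target = torch.randint(1, 10000, (1, 6))               # reference summary ids
loss = criterion(logits.view(-1, 10000), target.view(-1))
loss.backward()                                        # train until the loss converges
```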
As shown in FIG. 3, which illustrates the interaction between the AS-CNN and the Decoder end, the AS-CNN extracts text features that serve as the keys and values of the self-attention model and are sent to the Decoder end, while the Decoder-end input serves as the queries; attention is computed over the three to form the final decoding vector.
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit it; the scope of the present invention is defined by the claims. Those skilled in the art may make various modifications and equivalents within the spirit and scope of the present invention, and such modifications and equivalents should also be considered to fall within its scope.

Claims (3)

1. A method for optimizing text summary generation, comprising the steps of:
step 1, acquiring related text data needing to generate an abstract, and processing the text data;
step 2, constructing a relevant dictionary for the processed text, setting word vector dimensions and randomly initializing all word vectors, wherein each word corresponds to a unique id;
step 3, inputting the article's word vectors into the AS-CNN at the Encoder end of the model for feature extraction;
step 4, sending the feature vectors extracted by the AS-CNN into the Decoder end of a Transformer for decoding, to generate an abstract of the article.
2. The method according to claim 1, characterized in that step 3 comprises in particular the steps of:
step 3.1, setting the sizes of the convolution kernels and the number of kernels of each size according to the length of the article;
step 3.2, extracting sentence features of different lengths with the different convolution kernels;
step 3.3, padding the sentence features of different lengths so that the sentence lengths are consistent, taking the longest sentence length as the standard;
step 3.4, performing feature fusion on the features extracted by the different convolution kernels;
step 3.5, mapping the fused feature vectors through a fully connected network.
3. The method according to claim 1, wherein step 4 comprises the steps of:
step 4.1, performing dimension conversion on the text feature vectors extracted by the AS-CNN so that they can be input to the Decoder end of the Transformer;
step 4.2, using the AS-CNN feature vectors as the keys and values matrices in the Decoder-end self-attention mechanism, computing the attention weights, and applying them to the queries matrix from the Decoder-end input;
step 4.3, finding the words to be generated through a Softmax layer according to the semantic vectors produced by the Decoder.
CN201910981470.3A (priority date 2019-10-16, filing date 2019-10-16): Optimized text abstract generation method; status: Pending; published as CN110765768A.

Priority Applications (1)

CN201910981470.3A, priority date 2019-10-16, filing date 2019-10-16: Optimized text abstract generation method (CN110765768A)


Publications (1)

CN110765768A, published 2020-02-07

Family

ID: 69331275

Family Applications (1)

CN201910981470.3A (filed 2019-10-16, priority 2019-10-16): Optimized text abstract generation method, pending; published as CN110765768A

Country Status (1)

CN: CN110765768A



Patent Citations (3)

* Cited by examiner, † Cited by third party

US20180300400A1 *, priority 2017-04-14, published 2018-10-18, Salesforce.com, Inc.: Deep Reinforced Model for Abstractive Summarization
CN109492232A *, priority 2018-10-22, published 2019-03-19, 内蒙古工业大学: A Transformer-based Mongolian-Chinese machine translation method with enhanced semantic feature information
CN109885673A *, priority 2019-02-13, published 2019-06-14, 北京航空航天大学: An automatic text summarization method based on a pre-trained language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
SHENGLI SONG et al.: "Abstractive text summarization using LSTM-CNN based deep learning" *

Cited By (6)

* Cited by examiner, † Cited by third party

CN112733498A *, priority 2020-11-06, published 2021-04-30, 北京工业大学: Method for improving self-attention calculation in automatic Chinese text summarization (granted as CN112733498B, 2024-04-16)
CN113449489A *, priority 2021-07-22, published 2021-09-28, 深圳追一科技有限公司: Punctuation mark labeling method and device, computer equipment and storage medium (granted as CN113449489B, 2023-08-08)
CN117763140A *, priority 2024-02-22, published 2024-03-26, 神州医疗科技股份有限公司: Accurate medical information conclusion generation method based on a computing feature network (granted as CN117763140B, 2024-05-28)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination