CN117875268A - Extraction type text abstract generation method based on clause coding - Google Patents

Extraction type text abstract generation method based on clause coding

Info

Publication number
CN117875268A
CN117875268A · CN202410281932.1A
Authority
CN
China
Prior art keywords
text
abstract
sentence
layer
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410281932.1A
Other languages
Chinese (zh)
Other versions
CN117875268B (en)
Inventor
赵卫东
刘彦言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN202410281932.1A priority Critical patent/CN117875268B/en
Priority claimed from CN202410281932.1A external-priority patent/CN117875268B/en
Publication of CN117875268A publication Critical patent/CN117875268A/en
Application granted granted Critical
Publication of CN117875268B publication Critical patent/CN117875268B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for generating an extractive text summary based on clause coding, which belongs to the field of deep learning for natural language processing and comprises the following steps: step 1, acquiring a training data set and processing it with a sentence-splitting algorithm and an abstract conversion algorithm; step 2, encoding the text sentence set with the BERT model to obtain sentence vectors and article representations; step 3, constructing an improved text classification model based on dilated convolution and gated convolution and training the model; step 4, obtaining the text whose summary is to be generated, splitting the text into sentences with the sentence-splitting algorithm, and extracting from the text with the trained improved text classification model to generate the extractive summary. The invention solves the problem of overlong input text, designs an improved text classification model based on dilated convolution and gated convolution, learns finer-grained features within text sentences, and generates the final summary.

Description

Extraction type text abstract generation method based on clause coding
Technical Field
The invention belongs to the field of deep learning for natural language processing, and in particular relates to a method for generating an extractive text summary based on clause coding.
Background
Input text in the summarization task that is too long and exceeds the maximum input length of the pre-trained language model is a challenge facing current automatic text summarization technology. With the advent of the big-data age, text data such as news, papers and microblogs has exploded and contains a great deal of redundant information, which makes it extremely important to obtain key summary information from it. Automatic text summarization improves reading efficiency by compressing and simplifying long texts and distilling the key information of short texts.
Currently, text summarization tasks fall into two categories: extractive summarization and abstractive summarization. Extractive summarization selects several important sentences from the original text to form the summary through specific scoring rules and ranking methods, which guarantees that the summary content comes from the original text and preserves a certain degree of syntactic accuracy and information integrity. Abstractive summarization aims to produce a new summary by understanding the main idea of the text and fusing the text information. However, abstractive summarization faces the difficult problems of parsing the source document and fusing text information, and no mature solution exists at present.
Text representation is the fundamental work of natural language processing and has a tremendous impact on overall system performance. Traditional machine learning adopts discrete, sparse text representations such as the bag-of-words model and one-hot encoding. Neural text representation techniques such as Word2Vec and GloVe can represent complex contexts but do not solve the polysemy problem. The recent large pre-trained language model BERT addresses polysemy through pre-training and fine-tuning and can capture the semantics of long texts.
After the vector representation of the text is obtained, the summarization task essentially becomes a text classification task. In natural language processing, a widely used text classification model is TextCNN, which borrows the idea of the convolutional neural network (CNN): local and global features of the text are extracted through convolution and pooling operations, and classification is finally performed by a fully connected layer.
However, the input text is often too long and exceeds the maximum input length of the pre-trained language model. As a result, long texts cannot be fed directly into the pre-trained model during the summarization task, which limits the model's understanding and expressive power over the entire text. The present invention aims to solve this problem so that long texts can also participate effectively in automatic text summarization tasks.
Disclosure of Invention
In order to solve the above problems, the invention provides a method for generating an extractive text summary based on clause coding, which processes sentences with a sentence-splitting algorithm and the BERT model to solve the problem of overlong input text; meanwhile, an improved text classification model, ImprovedTextCNN, is designed based on dilated convolution and gated convolution to learn finer-grained features within text sentences and generate the final summary.
The technical scheme of the invention is as follows:
A method for generating an extractive text summary based on clause coding comprises the following steps:
step 1, acquiring a training data set, and processing the training data set with a sentence-splitting algorithm and an abstract conversion algorithm;
step 2, encoding the text sentence set with the BERT model to obtain sentence vectors and article representations;
step 3, constructing an improved text classification model based on dilated convolution and gated convolution and training the model;
step 4, obtaining the text whose summary is to be generated, splitting the text into sentences with the sentence-splitting algorithm, and extracting from the text with the trained improved text classification model to generate the extractive summary.
Further, in the step 1, the training data set adopts two publicly available Chinese summarization data sets: news2016zh and a WeChat official account data set; the data in the training data set are in article format, and each article comprises a body part and an abstract part; the body part is split with the sentence-splitting algorithm, and the abstract part is processed with the abstract conversion algorithm;
the training data set contains a plurality of data items; the body part of each item is in the format of an original text-label pair, and each pair is split with the sentence-splitting algorithm, so that the body of the article is divided into a sentence set and a summary-label set; the sentence set of the $i$-th text is $S_i=\{s_{i1},s_{i2},\dots,s_{in}\}$ and the summary-label set of the $i$-th text is $Y_i=\{y_{i1},y_{i2},\dots,y_{in}\}$, where $s_{ij}$ denotes the $j$-th sentence of the $i$-th text after splitting, $y_{ij}$ denotes the summary label of the $j$-th sentence of the $i$-th text and takes the value 0 or 1, and $n$ is the total number of sentences appearing in the text;
the abstract conversion algorithm adopts ROUGE-score matching, where ROUGE is a set of metrics for automatic summary evaluation; the specific process is as follows: traverse every sentence in the abstract and compute its ROUGE score, delete the sentence with the highest ROUGE score from the abstract at the end of each iteration, and repeat until the iterations end, completing the de-duplication of sentences; sort the de-duplicated abstract content by the filtered score from high to low and take the sorted abstract as the labels of the extractive summary, finally obtaining for each article $A$ a de-duplicated summary set $D_A=\{d_1,d_2,\dots,d_m\}$, where $d_k$ denotes the $k$-th sentence in the de-duplicated and sorted summary set.
Further, in the step 2, the BERT model comprises an embedding layer, a Transformer layer and an average-pooling layer connected in sequence; the specific encoding process of the BERT model is as follows:
step 2.1, the sentence set $S_i$ of the $i$-th text is input into the BERT model, and the embedding layer of the BERT model converts $S_i$ into a three-dimensional input vector $X$ of dimension $n\times L\times d$, where $n$ is the number of text sentences, $L$ is the maximum length of a text sentence, and $d$ is the hidden-layer dimension; the three-dimensional input vector $X$ is as follows:

$X = E_{\mathrm{token}} + E_{\mathrm{segment}} + E_{\mathrm{position}}$

where $E_{\mathrm{token}}$ is the character embedding vector, $E_{\mathrm{segment}}$ is the sentence embedding vector, and $E_{\mathrm{position}}$ is the position embedding vector;
step 2.2, the three-dimensional input vector $X$ enters the Transformer layer, which is formed by stacking a plurality of blocks; the input of the next block is the output of the previous block, and the output of the last block is the encoded representation of the text;
step 2.3, an average-pooling operation is performed on the sentence vectors of each article along the sentence-length dimension to obtain the vector representation of one article, and the dimension becomes $n\times d$;
step 2.4, different articles are padded along the sentence-number dimension with a packing operation, so that the dimension of a batch of articles becomes $B\times N\times d$, where $B$ denotes the number of articles in a batch and $N$ denotes the number of text sentences in an article;
the training process of the BERT model includes two key phases: pre-training and fine-tuning; in the pre-training phase, the BERT model performs unsupervised learning through the masked language model task; in the fine-tuning phase, the BERT model is fine-tuned on a specific task through supervised learning.
Further, in the step 2.2, each block comprises a multi-head self-attention layer, a residual connection layer, a normalization layer and a feed-forward network layer connected in sequence; the data processing of each block is as follows:
first, the multi-head attention layer combined with the residual connection layer and the normalization layer is expressed as:

$X_{\mathrm{attn}} = \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(XW^{Q}, XW^{K}, XW^{V})\big)$

where $X_{\mathrm{attn}}$ is the output of this network layer, $\mathrm{LayerNorm}$ is the normalization layer, $W^{Q}$, $W^{K}$ and $W^{V}$ are weight variables, $\mathrm{MultiHead}$ is the multi-head attention dot-product calculation, and $XW^{Q}$, $XW^{K}$ and $XW^{V}$ are the first, second and third network-layer inputs;
then the data pass through a feed-forward network layer comprising two fully connected layers; the feed-forward layer combined with the residual connection layer and the normalization layer is expressed as:

$Z = \mathrm{LayerNorm}\big(X_{\mathrm{attn}} + \mathrm{FFN}(X_{\mathrm{attn}})\big)$

where $Z$ is the final output and $\mathrm{FFN}$ is the feed-forward neural network;
finally, the data are output from the block and used as the input of the next block, and the input vector is encoded cyclically; the blocks of all layers are stacked to obtain the final sentence vectors and the article representation; the sentence-vector set of the $i$-th text is $V_i=\{v_{i1},v_{i2},\dots,v_{in}\}$, where $v_{ij}$ denotes the vector representation of the $j$-th sentence of the $i$-th text; all sentence vectors are concatenated to obtain the vector representation of the whole article.
Further, in the step 3, the improved text classification model comprises dilated convolution and gated convolution; the specific data flow is as follows:
step 3.1, the dilated convolution is computed first, with the calculation formula:

$y_t=\sum_{k=-\lfloor K/2\rfloor}^{\lfloor K/2\rfloor} w_k\, x_{t+r\cdot k}$

where $x$ is the input, $r$ is the dilation rate, $w$ is the convolution kernel, $K$ is the size of the convolution kernel, $-\lfloor K/2\rfloor$ and $\lfloor K/2\rfloor$ are the lower and upper limits of the convolution, and $x_{t+r\cdot k}$ is the value of $x$ at position $t+r\cdot k$;
step 3.2, the gated convolution combined with a residual connection is then computed, with the calculation formula:

$Y = X\otimes(1-\sigma)+\mathrm{Conv1D}_{1}(X)\otimes\sigma$

where $Y$ is the probability output by the gated convolution, $\mathrm{Conv1D}_{1}$ is a convolution layer, $X$ is the variable value, $\sigma$ is the gating coefficient, $\mathrm{Conv1D}_{2}$ is the second convolution layer, and $\sigma$ is the intermediate variable $\sigma=\mathrm{Sigmoid}\big(\mathrm{Conv1D}_{2}(X)\big)$;
step 3.3, after the output of the article has been computed by the dilated convolution and the gated convolution, the label of each sentence is computed directly with a fully connected linear transformation to obtain the sequence-label output; the output sequence of the $i$-th text is $\hat{Y}_i=\{\hat{y}_{i1},\hat{y}_{i2},\dots,\hat{y}_{in}\}$, where $\hat{y}_{ij}$ is the label prediction result of the $j$-th sentence of the $i$-th text and each label takes the value 0 or 1;
step 3.4, the sentences whose label prediction result is 1 are spliced together to form the final extractive summary.
The beneficial technical effects brought by the invention are as follows.
The sentence-splitting algorithm combined with the BERT model solves, from the sentence perspective, the problem of long text exceeding the maximum input length of the BERT model, while taking both the semantic understanding ability of the pre-trained model and model performance into account.
Through the multi-head self-attention mechanism and the encoding layers of the pre-trained BERT model, the contextual semantics of the text can be captured accurately and effectively, sentences are mapped into vectors, fine-grained features are extracted, and an original text vector representation carrying the valuable information is formed.
The improved text classification model ImprovedTextCNN can further learn local and global features of the text: the added dilated convolution enlarges the receptive field of the neurons and captures multi-scale information by selecting different dilation rates, while combining the gated convolution output with a residual connection reduces the risk of vanishing gradients, allowing the model to acquire more effective information features. The fully connected linear transformation of the output layer computes the label of each sentence to obtain the sequence-label output.
The extractive text summary based on clause coding has high practicability and efficiency in practical applications and is particularly suitable for processing long texts. Note, however, that extractive summaries may ignore the links between parts of the text and cannot generate new information.
Drawings
FIG. 1 is a flow chart of the method for generating an extractive text summary based on clause coding.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and detailed description:
As shown in fig. 1, a method for generating an extractive text summary based on clause coding comprises the following steps:
Step 1, acquire a training data set and process it with a sentence-splitting algorithm and an abstract conversion algorithm.
The training data set adopts two publicly available Chinese summarization data sets: news2016zh and a WeChat official account data set. The news2016zh data set comprises 2,317,427 news items; the average number of words in a news abstract is 20, with standard deviation 6, maximum 196 and minimum 4; the average number of words in a news body is 1,250, with standard deviation 1,735, maximum 356,749 and minimum 31. The WeChat official account data set comprises 712,826 articles; the average number of words in an article abstract is 22, with standard deviation 11, maximum 4,984 and minimum 4; the average number of words in an article body is 1,499, with standard deviation 1,754, maximum 34,665 and minimum 107.
The data in the training data set are in article format, and each article comprises a body part and an abstract part; both parts need to be divided into finer-grained sentences in a suitable way. For the body part, a sentence-splitting algorithm is used to split the text into a sentence sequence; the text formed by the sentence sequence is then encoded with the BERT model, sentence vectors are obtained through an average-pooling operation, and the sentence vectors of all sentences are combined to obtain the text vector representation of the body. For the abstract part, an abstract conversion algorithm is used to convert the abstract into an extractive summary, which serves as the sequence-labelling labels for training and optimizing the improved text classification model constructed later.
The core idea of the sentence-splitting algorithm is to split sentences on different punctuation marks; punctuation and line-break characters must be handled carefully during this operation to ensure accurate segmentation. The training data set contains a plurality of data items; the body part of each item is in the format of an original text-label pair, and each pair is split with the sentence-splitting algorithm, so that the body of the article is divided into a sentence set and a summary-label set. The sentence set of the $i$-th text is $S_i=\{s_{i1},s_{i2},\dots,s_{in}\}$ and the summary-label set of the $i$-th text is $Y_i=\{y_{i1},y_{i2},\dots,y_{in}\}$, where $s_{ij}$ denotes the $j$-th sentence of the $i$-th text after splitting, the index $i$ corresponds to different articles in the data set, $y_{ij}$ denotes the summary label of the $j$-th sentence of the $i$-th text and takes the value 0 or 1, and $n$ is the total number of sentences appearing in the text.
the clause algorithm is one of the basic tasks in natural language processing, and aims to divide a body text into independent sentences. In a specific implementation, there are a series of rules and steps to ensure accurate sentence segmentation. The specific process of the clause algorithm is as follows:
first, the basic sentence rules are to segment the text into candidate sentences according to periods ('times), question marks (' times) and exclamation marks ('|'); this is the most basic clause rule; next, the processing of an ellipsis (a.) needs to be considered, as an ellipsis generally indicates that text is truncated, but not necessarily indicates the end of a sentence. When processing the ellipses, whether the text behind the ellipses should be used as a new sentence or not is judged according to the context; the content in quotation marks ("", "and" respectively ") is usually a quotation or quotation sentence, and should not be divided into sentences. Therefore, the influence of quotation marks needs to be considered in the sentence; meanwhile, abbreviations such as "mr", "Dr.", and the like also need to be processed. These abbreviations are typically in the middle of a sentence, but do not represent the end of the sentence.
The number is also a special case in which special handling is required, as the number may be followed by a period or other symbol, but does not indicate the end of a sentence, e.g. "3.14" or "20%". Other punctuation marks such as a semicolon (";") and a colon (":") are also contemplated, which may appear in the text in the middle of a sentence.
In addition, other special characters, such as line breaks, tabs, etc., need to be processed. These characters typically do not represent the end of a sentence, but may need to be noted in the sentence.
For more accurate splitting, context information such as the words at the beginning and end of a sentence and the grammatical structures around it may be considered. After the above rules are applied some erroneous segmentation may still occur, so contextual judgement is key to improving accuracy.
Finally, for more complex body-text structures, machine learning models or other advanced algorithms may be used for sentence splitting to further improve accuracy. In general, the sentence-splitting algorithm combines rules with contextual judgement and must consider multiple language structures and special cases to segment text accurately.
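As an illustration of these rules, the following minimal Python sketch splits body text on Chinese/English end punctuation while keeping quoted material attached to its sentence; the exact regular expression, and the omission of abbreviation and number handling, are simplifying assumptions rather than the patent's full rule set.

```python
import re

def split_sentences(text):
    """Split body text into sentences on end punctuation.

    Line breaks and tabs are stripped first; a split point is placed after
    。？！?!… only when it is not followed by a closing quote or bracket,
    so quoted sentences stay attached to the sentence that introduces them.
    """
    text = re.sub(r"[\r\n\t]+", "", text)
    pieces = re.split(r'(?<=[。？！?!…])(?![”’"\')）])', text)
    return [p.strip() for p in pieces if p.strip()]

print(split_sentences("他说：“今天下雨了。”大家都带了伞！明天呢？"))
```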
The abstract conversion algorithm adopts ROUGE-score matching; ROUGE is a set of metrics for automatic summary evaluation, and the ROUGE score measures the degree of overlap between a generated summary and the reference summary to evaluate summary quality. The specific process is as follows:
Traverse every sentence in the abstract and compute its ROUGE score; at the end of each iteration delete the sentence with the highest ROUGE score from the abstract, and repeat the process until the iterations end, completing the de-duplication of sentences. Sort the de-duplicated abstract content by the filtered score from high to low and take the sorted abstract as the labels of the extractive summary, finally obtaining for each article $A$ a de-duplicated summary set $D_A=\{d_1,d_2,\dots,d_m\}$, where $d_k$ denotes the $k$-th sentence in the de-duplicated and sorted summary set.
Step 2, encode the text sentence set with the BERT model to obtain sentence vectors and article representations.
The BERT model comprises an embedding layer, a Transformer layer and an average-pooling layer connected in sequence. The Transformer is a model that uses the attention mechanism to increase training speed. The specific encoding process of the BERT model is as follows:
step 2.1, the first stepThe sentence set of the individual texts is input into the BERT model, the embedded layer of the BERT model will be +.>Sentence set of individual texts->Conversion to three-dimensional input vector->Its dimension is->;/>Is->Text sentences, which are equivalent to batch sizes; />Is +.>Maximum length of each text sentence; />Is the hidden layer dimension; the embedded layer of the BERT model is initially loaded into the model in a fixed form, but as the word vector enters the model structure to interact with other parts, the word vector is automatically adjusted according to the context to obtain a meaning more conforming to the context, and the ambiguous word can be distinguished at the moment. The embedded layer of the BERT model is also an input layer of the BERT model, and three-dimensional input vectors after passing through the embedded layer are +.>The method comprises three embedded vectors of character embedding, sentence embedding and position embedding, namely:
wherein,is a character embedding vector, ">Is a sentence embedded vector, ">Is a position embedding vector.
The character embedding process is a process of extracting a trained word vector, each character is mapped into a vector with a fixed dimension through embedding matrix mapping, and the vector value uses a value stored in BERT model weights which are trained based on a data set.
The sentence embedding is designed for enabling the model to have the capability of distinguishing different sentences under the scene of multiple sentences, and the extraction type abstract of the invention codes each sentence by independently using the model, so that different sentences do not need to be distinguished, and therefore, the value of sentence embedding is set to be zero.
The purpose of the position embedding is to add the position information of each word to the transducer layer of the model, because the transducer cannot naturally preserve the timing information of sentences while calculating the attention due to its own structural features. The position coding of the BERT model is randomly initialized before pre-training and then learned during the pre-training process.
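A minimal PyTorch sketch of this embedding layer is given below; the vocabulary size, maximum length and hidden dimension are illustrative values, and the segment index is fixed to 0 because each sentence is encoded on its own, as described above.

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """Sum of token, segment and position embeddings: X = E_tok + E_seg + E_pos."""
    def __init__(self, vocab_size=21128, max_len=512, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(2, hidden)
        self.pos = nn.Embedding(max_len, hidden)

    def forward(self, token_ids):                        # (n_sents, seq_len)
        n, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device).expand(n, seq_len)
        segments = torch.zeros_like(token_ids)           # single-sentence input
        return self.tok(token_ids) + self.seg(segments) + self.pos(positions)
```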
Step 2.2, the three-dimensional input vector $X$ enters the Transformer layer, which is formed by stacking a plurality of blocks; the input of the next block is the output of the previous block, the output of the last block is the encoded representation of the text, and the contextual semantics are learned through this encoding.
The number of blocks is determined by the model during training, and different blocks correspond to semantic information at different levels. The model input flows through the blocks; each block comprises a multi-head self-attention layer, a residual connection layer, a normalization layer and a feed-forward network layer connected in sequence.
The data processing of each block proceeds as follows: first, the attention scores of all neurons are calculated through the multi-head self-attention layer and the attention mechanism, then the neuron vectors are adjusted dynamically while a residual connection is applied, and finally layer normalization is carried out on all features of the neurons. Specifically:
The multi-head self-attention layer is the most critical part. The attention function uses a query to probe all keys, computes the similarity between the query and every key, determines the attention paid to each key from the similarity result, extracts the value of each key and performs a weighted summation. The corresponding formulas are as follows:

$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O}$

$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V})$

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

where $\mathrm{MultiHead}$ is the multi-head attention dot product, $\mathrm{Concat}$ concatenates several vectors, $\mathrm{head}_i$ is the $i$-th attention head, $h$ is the number of attention heads, $\mathrm{Attention}$ is the attention function, $W^{O}$, $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are weight matrices, $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, $\mathrm{softmax}$ is the normalized exponential function, $\top$ is the transpose symbol, and $d_k$ is the key dimension.
The residual connection lets the data skip a certain structure directly while also propagating forward normally, i.e. a network structure of the form $\mathrm{output}=F(x)+x$; its advantage is that during back-propagation part of the gradient can skip this structure and flow directly back to earlier layers, which effectively alleviates the vanishing-gradient problem in deep models. The normalization layer normalizes all feature values within a neuron (layer normalization) rather than normalizing the same feature across all neurons of a batch, which works better in language models. Here $F(\cdot)$ denotes the wrapped network layer and $x$ its input.
first, the multi-head attention layer in combination with the residual connection layer and the normalization layer can be expressed as:
wherein,the network layer result is output; />Is a normalization layer; />Is a weight variable;calculating a multi-head attention dot product; />Is the first network layer; />Is the second network layer; />Is the third network layer;
then passing through a feedforward network layer comprising two full-connection layers; wherein the output dimension of the first fully connected layer is typically larger and then the second fully connected layer maps the output dimension back to the original dimension. The feed forward layer in combination with residual connection and normalization can be expressed as:
wherein,the final output result is obtained; />Is a feedforward neural network; />Is a weight variable;
and finally, outputting the data from the block, and circularly encoding the input vector as the input of the next block. Since the shape of the input vector and the output vector of each block is the same, the output of the last block can be directly used as the input of the next block, and the blocks of a plurality of layers can be directly stacked, so that the final sentence vector and the article representation are obtained; first, theSentence vector set of individual text is +.>The method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>Indicate->The>The vector representations of the individual sentences stitch all sentence vectors to obtain the vector representation of the entire article.
Step 2.3, an average-pooling operation is performed on the sentence vectors within each article along the sentence-length dimension to obtain the vector representation of one article. For different articles the sentence length is not fixed, so averaging along the sentence-length dimension with the average-pooling layer smooths out the differences in sentence length; at this point the dimension of an article vector becomes $n\times d$.
Step 2.4, for batch training, different articles are padded along the sentence-number dimension with a packing operation, so that the dimension of a batch of articles becomes $B\times N\times d$, where $B$ denotes the number of articles in a batch and $N$ denotes the number of text sentences in an article.
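Steps 2.1-2.4 can be sketched with an off-the-shelf BERT encoder as below; the bert-base-chinese checkpoint, the 128-token truncation length and the mask-aware mean pooling are assumptions made for the sketch.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def encode_article(sentences):
    """Encode each sentence with BERT and mean-pool over tokens: (n_sents, 768)."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state             # (n_sents, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1)              # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

def batch_articles(articles):
    """Pad articles to a common sentence count: (batch, max_n_sents, 768)."""
    return pad_sequence([encode_article(s) for s in articles], batch_first=True)
```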
The training process of the BERT model includes two key phases: pre-training and fine-tuning. First, in the pre-training phase, BERT performs unsupervised learning through the masked language model (MLM) task. In this task, a portion of the tokens of the input text is masked at random and the model must predict these masked tokens, so that the context around the missing tokens is learned. BERT adopts a multi-layer Transformer-encoder structure with a bi-directional attention mechanism, which allows a better understanding of language.
The next is a fine tuning phase, at which the BERT performs fine tuning on specific tasks, such as text classification or named entity recognition, through supervised learning. The fine-tuning task provides a tokenized text input and a loss function for the corresponding task. By back propagation and gradient descent, the model adjusts the parameters to suit the particular task. Since BERT has learned a generic language representation during the pre-training phase, the fine-tuning phase generally requires less annotation data while achieving excellent performance in various natural language processing tasks. These two phases combine to form a comprehensive training process for the BERT model.
Step 3, construct an improved text classification model based on dilated convolution and gated convolution, and train the model.
The improved text classification model constructed by the invention takes the TextCNN model as its basic architecture, and a dilation-rate parameter and a gating structure are integrated into the one-dimensional convolution used by TextCNN. The one-dimensional convolution becomes a dilated convolution once the dilation-rate parameter is integrated, and becomes a gated convolution once the gating structure is integrated; that is, the improved text classification model comprises dilated convolution and gated convolution.
The dilated convolution takes the following form:

$y_t=\sum_{k=-\lfloor K/2\rfloor}^{\lfloor K/2\rfloor} w_k\, x_{t+r\cdot k}$

where $x$ is the input, $r$ is the dilation rate, $w$ is the convolution kernel, $K$ is the size of the convolution kernel, $-\lfloor K/2\rfloor$ and $\lfloor K/2\rfloor$ are the lower and upper limits of the convolution, and $x_{t+r\cdot k}$ is the value of $x$ at position $t+r\cdot k$.
The gated convolution takes the following form:

$Y = \mathrm{Conv1D}_{1}(X)\otimes\mathrm{Sigmoid}\big(\mathrm{Conv1D}_{2}(X)\big)$

where $Y$ is the probability output by the gated convolution, $\mathrm{Conv1D}_{1}$ is a convolution layer, $X$ is the variable value, and $\mathrm{Conv1D}_{2}$ is the second convolution layer. $\mathrm{Conv1D}_{1}$ and $\mathrm{Conv1D}_{2}$ have the same dimensions but different weights, so the number of parameters is twice that of an ordinary convolution. One of the convolution outputs is activated with the Sigmoid function and used as the gating unit, the other is used directly as an ordinary convolution output, and the two are multiplied element-wise.
Since the dimensions of the input and the output are identical, a residual connection is used. The formula combining the residual connection with the gated convolution is as follows:

$Y = X\otimes(1-\sigma)+\mathrm{Conv1D}_{1}(X)\otimes\sigma$

Because $\mathrm{Conv1D}_{1}$ uses no activation function it is only a linear transformation of $X$, and adding $X$ to it is still a linear transformation, so $\mathrm{Conv1D}_{1}(X)-X$ can be absorbed into $\mathrm{Conv1D}_{1}(X)$; the probability formula of the gated convolution output can therefore be written equivalently as:

$Y = X+\mathrm{Conv1D}_{1}(X)\otimes\sigma$

where $\sigma=\mathrm{Sigmoid}\big(\mathrm{Conv1D}_{2}(X)\big)$.
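A compact PyTorch rendering of the dilated gated convolution with the residual gate is given below; producing both branches from a single convolution with doubled output channels mirrors the "twice the parameters" remark, and the 1/2/4 dilation schedule is an assumed example of "selecting different dilation rates".

```python
import torch
import torch.nn as nn

class DilatedGatedConv1d(nn.Module):
    """Y = X * (1 - sigma) + Conv1D_1(X) * sigma,  sigma = Sigmoid(Conv1D_2(X))."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2            # keep sequence length
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, seq_len)
        h, g = self.conv(x).chunk(2, dim=1)                # linear branch and gate branch
        gate = torch.sigmoid(g)
        return x * (1 - gate) + h * gate                   # gated residual output

# Stacking blocks with growing dilation rates enlarges the receptive field.
encoder = nn.Sequential(DilatedGatedConv1d(768, dilation=1),
                        DilatedGatedConv1d(768, dilation=2),
                        DilatedGatedConv1d(768, dilation=4))
```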
after the output of the article is calculated through the expansion convolution and the gating convolution, directly calculating the label of each sentence by using a full-linear transformation to obtain the output of the sequence label; first, theThe output sequence of each text isWherein->Is->The>Tag prediction results of the sentences, wherein the value of each tag is 0 or 1;
the tag of each sentence may indicate whether the sentence needs to be extracted into the abstract, and for each article's sequence label, a sentence with a tag prediction result of 1 is spliced to form the final extracted abstract.
After the model is built, the model parameters are initialized and a batch of samples is drawn from the training set; during training, the loss of the whole model is computed over the sentence labels, an Adam optimizer is selected, and the model parameters are updated by the back-propagation algorithm until the model converges or the number of training epochs reaches a threshold, at which point training stops.
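The training loop can be sketched as follows; only the fully connected label head and the optimization step are shown (the dilated/gated encoder from the previous sketch would sit in front of it), and the binary cross-entropy loss, learning rate and sentence-padding mask are assumptions, since the patent names only the Adam optimizer and a convergence/epoch threshold.

```python
import torch
import torch.nn as nn

class SentenceLabelHead(nn.Module):
    """Fully connected linear transformation producing one score per sentence."""
    def __init__(self, hidden=768):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, sent_vecs):                 # (B, N, hidden) -> (B, N)
        return self.score(sent_vecs).squeeze(-1)

def train(model, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss(reduction="none")
    for _ in range(epochs):
        for sent_vecs, labels, mask in loader:    # mask marks real (non-padded) sentences
            loss = (bce(model(sent_vecs), labels.float()) * mask).sum() / mask.sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```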
Step 4, obtain the text whose summary is to be generated and split it directly into sentences with the sentence-splitting algorithm; then use the trained improved text classification model to extract from the text and generate the extractive summary.
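Putting the pieces together, step 4 might look like the sketch below, which reuses split_sentences and encode_article from the earlier sketches; the 0.5 decision threshold is an assumption.

```python
import torch

def generate_summary(text, model, threshold=0.5):
    """Split the text, score every sentence, and splice the sentences
    predicted as 1 (probability above the threshold) in their original order."""
    sentences = split_sentences(text)                    # earlier sketch
    sent_vecs = encode_article(sentences).unsqueeze(0)   # (1, n_sents, 768)
    with torch.no_grad():
        probs = torch.sigmoid(model(sent_vecs))[0]       # (n_sents,)
    return "".join(s for s, p in zip(sentences, probs) if p >= threshold)
```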
In order to demonstrate the feasibility and superiority of the present invention, the following verification experiments and comparative experiments were performed.
The verification experiment and the comparison experiment again use the two publicly available Chinese summarization data sets, news2016zh and the WeChat official account data set, to evaluate the performance of the model.
Experimental environment configuration: Windows 11 operating system, 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz (1.38 GHz), 16 GB RAM, Python 3.8, PyTorch 2.0.
In the verification experiment, the ROUGE-N and ROUGE-L algorithms are selected as evaluation metrics. This choice is based on the needs of the automatic summarization task: the ROUGE metrics effectively measure the overlap between the generated summary and the reference summary and provide a comprehensive assessment of summary quality. The formulas of the ROUGE-N and ROUGE-L algorithms are as follows:

$$\mathrm{ROUGE\text{-}N}=\frac{\sum_{S\in\{\mathrm{Ref}\}}\sum_{\mathrm{gram}_N\in S}\mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_N)}{\sum_{S\in\{\mathrm{Ref}\}}\sum_{\mathrm{gram}_N\in S}\mathrm{Count}(\mathrm{gram}_N)}\qquad F_1=\frac{2PR}{P+R}$$

where ROUGE-N measures the overlap between the generated summary and the reference summary, specifically the recall of contiguous $N$-gram items; $\{\mathrm{Ref}\}$ is the set of reference summaries, $S$ is one reference summary, $\mathrm{gram}_N$ is an $N$-gram, $\mathrm{Count}(\mathrm{gram}_N)$ denotes the number of $N$-grams appearing in the reference summary, $\mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_N)$ denotes the number of $N$-grams appearing in both the candidate summary and the reference summary, $P$ is the ROUGE-N precision, $R$ is the recall, and $F_1$ is the harmonic mean of the ROUGE-N precision and recall;

$$R_{lcs}=\frac{\mathrm{LCS}(X,Y)}{m},\qquad P_{lcs}=\frac{\mathrm{LCS}(X,Y)}{n},\qquad F_{lcs}=\frac{(1+\beta^{2})\,R_{lcs}\,P_{lcs}}{R_{lcs}+\beta^{2}P_{lcs}}$$

where $P_{lcs}$ measures the precision of the longest common subsequence, $\beta$ is a weighting parameter, $R_{lcs}$ is the recall, $Y$ denotes the candidate summary automatically generated by the machine, $X$ denotes the reference summary, $\mathrm{LCS}(X,Y)$ denotes the length of the longest common subsequence of the candidate and reference summaries, $m$ denotes the length of the reference summary, and $n$ denotes the length of the candidate summary. In this experiment $\beta$ is set to 1, so that recall and precision are weighted equally.
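For reference, the ROUGE-L F-score above can be computed with a straightforward dynamic-programming LCS, as in the sketch below; character-level tokens are assumed, as is common for Chinese evaluation.

```python
def lcs_length(x_tokens, y_tokens):
    """Length of the longest common subsequence (classic dynamic programming)."""
    m, n = len(x_tokens), len(y_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x_tokens[i-1] == y_tokens[j-1] \
                       else max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def rouge_l_f(candidate, reference, beta=1.0):
    """ROUGE-L F-score: (1 + beta^2) * P * R / (R + beta^2 * P)."""
    lcs = lcs_length(list(candidate), list(reference))
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta**2) * p * r / (r + beta**2 * p)

print(rouge_l_f("模型生成的摘要句子", "参考摘要句子"))
```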
The experimental process comprises the following key steps:
1. Data preparation: an appropriate test data set is selected, including the original texts and the corresponding reference abstracts.
2. Model application: and applying the trained automatic abstract generation model to test data to generate an abstract.
3. Reference abstract generation: the reference abstract in the test data was used as a standard, as a reference for the ROUGE evaluation.
4. ROUGE calculation: and calculating the coincidence degree between the generated abstract and the reference abstract by using a ROUGE algorithm, in particular ROUGE-N and ROUGE-L. ROUGE-N focuses on the coincidence of the N-gram models, while ROUGE-L uses the Longest Common Subsequence (LCS) to consider word order relationships.
5. Comparison of results: scores of the ROUGE-N and ROUGE-L are compared, as well as their behavior in terms of different N values (for ROUGE-N) and text length. By comparing the results, the accuracy and grammar consistency of the model generated abstract can be evaluated.
6. Statistical analysis: statistical analysis is performed, such as calculating averages, standard deviations, etc., to obtain a more comprehensive assessment.
Such an experimental design helps to fully evaluate the performance of an automatic summary model in generating summary tasks. Based on the comparison of the results, advantages and limitations of the model can be obtained, and targeted improvement suggestions can be provided.
In all experiments of the invention, N in the ROUGE-N evaluation is set to 1 and 2, and the F1 value of ROUGE-1, ROUGE-2 and ROUGE-L is selected as the performance reference.
The comparison experiment selects mainstream deep-learning models as baselines to compare against the method of the invention (BERT + ImprovedTextCNN), including the word-vector model Word2Vec, the convolutional neural network model CNN and the recurrent neural network RNN. The performance comparison results of the different methods are shown in Table 1; each result is the average over the two data sets.
Table 1 results of performance comparisons for different methods;
Four experiments were carried out. The 1st experiment adopts the fusion of Word2Vec and CNN; the 2nd experiment adopts the fusion of Word2Vec and RNN and the fusion of Word2Vec with the improved text classification model of the invention, respectively; the 3rd experiment adopts the fusion of BERT and RNN and the fusion of BERT and CNN, respectively; the 4th experiment uses the fusion of BERT and ImprovedTextCNN proposed by the invention. As can be seen from Table 1, the extractive text summaries based on Word2Vec show the worst performance, while the method fusing BERT and ImprovedTextCNN provided by the invention performs best on all three metrics ROUGE-1, ROUGE-2 and ROUGE-L; the comparison of experimental results demonstrates the effectiveness of the proposed method.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; the invention is intended to cover the modifications, adaptations, additions and alternatives falling within its spirit and scope.

Claims (5)

1. An extractive text summary generation method based on clause coding, characterized by comprising the following steps:
step 1, acquiring a training data set, and processing the training data set with a sentence-splitting algorithm and an abstract conversion algorithm;
step 2, encoding the text sentence set with the BERT model to obtain sentence vectors and article representations;
step 3, constructing an improved text classification model based on dilated convolution and gated convolution and training the model;
step 4, obtaining the text whose summary is to be generated, splitting the text into sentences with the sentence-splitting algorithm, and extracting from the text with the trained improved text classification model to generate the extractive summary.
2. The extractive text summary generation method based on clause coding according to claim 1, characterized in that, in said step 1, the training data set adopts two publicly available Chinese summarization data sets: news2016zh and a WeChat official account data set; the data in the training data set are in article format, and each article comprises a body part and an abstract part; the body part is split with the sentence-splitting algorithm, and the abstract part is processed with the abstract conversion algorithm;
the training data set contains a plurality of data items; the body part of each item is in the format of an original text-label pair, and each pair is split with the sentence-splitting algorithm, so that the body of the article is divided into a sentence set and a summary-label set; the sentence set of the $i$-th text is $S_i=\{s_{i1},s_{i2},\dots,s_{in}\}$ and the summary-label set of the $i$-th text is $Y_i=\{y_{i1},y_{i2},\dots,y_{in}\}$, where $s_{ij}$ denotes the $j$-th sentence of the $i$-th text after splitting, $y_{ij}$ denotes the summary label of the $j$-th sentence of the $i$-th text and takes the value 0 or 1, and $n$ is the total number of sentences appearing in the text;
the abstract conversion algorithm adopts ROUGE-score matching, where ROUGE is a set of metrics for automatic summary evaluation; the specific process is as follows: traverse every sentence in the abstract and compute its ROUGE score, delete the sentence with the highest ROUGE score from the abstract at the end of each iteration, and repeat until the iterations end, completing the de-duplication of sentences; sort the de-duplicated abstract content by the filtered score from high to low and take the sorted abstract as the labels of the extractive summary, finally obtaining for each article $A$ a de-duplicated summary set $D_A=\{d_1,d_2,\dots,d_m\}$, where $d_k$ denotes the $k$-th sentence in the de-duplicated and sorted summary set.
3. The extractive text summary generation method based on clause coding according to claim 1, characterized in that, in said step 2, the BERT model comprises an embedding layer, a Transformer layer and an average-pooling layer connected in sequence; the specific encoding process of the BERT model is as follows:
step 2.1, the sentence set $S_i$ of the $i$-th text is input into the BERT model, and the embedding layer of the BERT model converts $S_i$ into a three-dimensional input vector $X$ of dimension $n\times L\times d$, where $n$ is the number of text sentences, $L$ is the maximum length of a text sentence, and $d$ is the hidden-layer dimension; the three-dimensional input vector $X$ is as follows:

$X = E_{\mathrm{token}} + E_{\mathrm{segment}} + E_{\mathrm{position}}$

where $E_{\mathrm{token}}$ is the character embedding vector, $E_{\mathrm{segment}}$ is the sentence embedding vector, and $E_{\mathrm{position}}$ is the position embedding vector;
step 2.2, the three-dimensional input vector $X$ enters the Transformer layer, which is formed by stacking a plurality of blocks; the input of the next block is the output of the previous block, and the output of the last block is the encoded representation of the text;
step 2.3, an average-pooling operation is performed on the sentence vectors of each article along the sentence-length dimension to obtain the vector representation of one article, and the dimension becomes $n\times d$;
step 2.4, different articles are padded along the sentence-number dimension with a packing operation, so that the dimension of a batch of articles becomes $B\times N\times d$, where $B$ denotes the number of articles in a batch and $N$ denotes the number of text sentences in an article;
the training process of the BERT model includes two key phases: pre-training and fine-tuning; in the pre-training phase, the BERT model performs unsupervised learning through the masked language model task; in the fine-tuning phase, the BERT model is fine-tuned on a specific task through supervised learning.
4. The extractive text summary generation method based on clause coding according to claim 3, characterized in that, in said step 2.2, each block comprises a multi-head self-attention layer, a residual connection layer, a normalization layer and a feed-forward network layer connected in sequence; the data processing of each block is as follows:
first, the multi-head attention layer combined with the residual connection layer and the normalization layer is expressed as:

$X_{\mathrm{attn}} = \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(XW^{Q}, XW^{K}, XW^{V})\big)$

where $X_{\mathrm{attn}}$ is the output of this network layer, $\mathrm{LayerNorm}$ is the normalization layer, $W^{Q}$, $W^{K}$ and $W^{V}$ are weight variables, $\mathrm{MultiHead}$ is the multi-head attention dot-product calculation, and $XW^{Q}$, $XW^{K}$ and $XW^{V}$ are the first, second and third network-layer inputs;
then the data pass through a feed-forward network layer comprising two fully connected layers; the feed-forward layer combined with the residual connection layer and the normalization layer is expressed as:

$Z = \mathrm{LayerNorm}\big(X_{\mathrm{attn}} + \mathrm{FFN}(X_{\mathrm{attn}})\big)$

where $Z$ is the final output and $\mathrm{FFN}$ is the feed-forward neural network;
finally, the data are output from the block and used as the input of the next block, and the input vector is encoded cyclically; the blocks of all layers are stacked to obtain the final sentence vectors and the article representation; the sentence-vector set of the $i$-th text is $V_i=\{v_{i1},v_{i2},\dots,v_{in}\}$, where $v_{ij}$ denotes the vector representation of the $j$-th sentence of the $i$-th text; all sentence vectors are concatenated to obtain the vector representation of the whole article.
5. The extractive text summary generation method based on clause coding according to claim 1, characterized in that, in said step 3, the improved text classification model comprises dilated convolution and gated convolution; the specific data flow is as follows:
step 3.1, the dilated convolution is computed first, with the calculation formula:

$y_t=\sum_{k=-\lfloor K/2\rfloor}^{\lfloor K/2\rfloor} w_k\, x_{t+r\cdot k}$

where $x$ is the input, $r$ is the dilation rate, $w$ is the convolution kernel, $K$ is the size of the convolution kernel, $-\lfloor K/2\rfloor$ and $\lfloor K/2\rfloor$ are the lower and upper limits of the convolution, and $x_{t+r\cdot k}$ is the value of $x$ at position $t+r\cdot k$;
step 3.2, the gated convolution combined with a residual connection is then computed, with the calculation formula:

$Y = X\otimes(1-\sigma)+\mathrm{Conv1D}_{1}(X)\otimes\sigma$

where $Y$ is the probability output by the gated convolution, $\mathrm{Conv1D}_{1}$ is a convolution layer, $X$ is the variable value, $\sigma$ is the gating coefficient, $\mathrm{Conv1D}_{2}$ is the second convolution layer, and $\sigma$ is the intermediate variable $\sigma=\mathrm{Sigmoid}\big(\mathrm{Conv1D}_{2}(X)\big)$;
step 3.3, after the output of the article has been computed by the dilated convolution and the gated convolution, the label of each sentence is computed directly with a fully connected linear transformation to obtain the sequence-label output; the output sequence of the $i$-th text is $\hat{Y}_i=\{\hat{y}_{i1},\hat{y}_{i2},\dots,\hat{y}_{in}\}$, where $\hat{y}_{ij}$ is the label prediction result of the $j$-th sentence of the $i$-th text and each label takes the value 0 or 1;
step 3.4, the sentences whose label prediction result is 1 are spliced together to form the final extractive summary.
CN202410281932.1A 2024-03-13 Extraction type text abstract generation method based on clause coding Active CN117875268B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410281932.1A CN117875268B (en) 2024-03-13 Extraction type text abstract generation method based on clause coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410281932.1A CN117875268B (en) 2024-03-13 Extraction type text abstract generation method based on clause coding

Publications (2)

Publication Number Publication Date
CN117875268A true CN117875268A (en) 2024-04-12
CN117875268B CN117875268B (en) 2024-05-31


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413986A (en) * 2019-04-12 2019-11-05 上海晏鼠计算机技术股份有限公司 A kind of text cluster multi-document auto-abstracting method and system improving term vector model
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method
WO2021243706A1 (en) * 2020-06-05 2021-12-09 中山大学 Method and apparatus for cross-language question generation
WO2021135469A1 (en) * 2020-06-17 2021-07-08 平安科技(深圳)有限公司 Machine learning-based information extraction method, apparatus, computer device, and medium
CN114169312A (en) * 2021-12-08 2022-03-11 湘潭大学 Two-stage hybrid automatic summarization method for judicial official documents
KR20230103782A (en) * 2021-12-30 2023-07-07 국민대학교산학협력단 Transformer-based text summarization method and device using pre-trained language model
CN115033659A (en) * 2022-05-26 2022-09-09 华东理工大学 Clause-level automatic abstract model system based on deep learning and abstract generation method
CN116049394A (en) * 2022-12-22 2023-05-02 广西师范大学 Long text similarity comparison method based on graph neural network
CN116756266A (en) * 2023-01-31 2023-09-15 西安工程大学 Clothing text abstract generation method based on external knowledge and theme information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
岳一峰; 黄蔚; 任祥辉: "A BERT-based Automatic Text Summarization Model Construction Method" (一种基于BERT的自动文本摘要模型构建方法), Computer and Modernization (计算机与现代化), no. 01, 15 January 2020 (2020-01-15) *
徐如阳; 曾碧卿; 韩旭丽; 周武: "Reinforced Automatic Summarization Model with Convolutional Self-Attention Encoding and Filtering" (卷积自注意力编码过滤的强化自动摘要模型), Journal of Chinese Computer Systems (小型微型计算机系统), no. 02, 15 February 2020 (2020-02-15) *
李维; 闫晓东; 解晓庆: "Extractive Summary Generation for Tibetan Based on Improved TextRank" (基于改进TextRank的藏文抽取式摘要生成), Journal of Chinese Information Processing (中文信息学报), no. 09, 15 September 2020 (2020-09-15) *

Similar Documents

Publication Publication Date Title
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111709243B (en) Knowledge extraction method and device based on deep learning
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110532328B (en) Text concept graph construction method
CN110825848B (en) Text classification method based on phrase vectors
CN110619034A (en) Text keyword generation method based on Transformer model
CN112989834A (en) Named entity identification method and system based on flat grid enhanced linear converter
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN111027595A (en) Double-stage semantic word vector generation method
CN110781290A (en) Extraction method of structured text abstract of long chapter
Zhang et al. A hybrid text normalization system using multi-head self-attention for mandarin
CN112232053A (en) Text similarity calculation system, method and storage medium based on multi-keyword pair matching
CN114169312A (en) Two-stage hybrid automatic summarization method for judicial official documents
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN115712731A (en) Multi-modal emotion analysis method based on ERNIE and multi-feature fusion
CN114564563A (en) End-to-end entity relationship joint extraction method and system based on relationship decomposition
CN115238697A (en) Judicial named entity recognition method based on natural language processing
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
Jung et al. Learning to embed semantic correspondence for natural language understanding
Ma et al. Joint pre-trained Chinese named entity recognition based on bi-directional language model
CN117875268B (en) Extraction type text abstract generation method based on clause coding
CN115965027A (en) Text abstract automatic extraction method based on semantic matching
CN115759090A (en) Chinese named entity recognition method combining soft dictionary and Chinese character font features
CN113312903B (en) Method and system for constructing word stock of 5G mobile service product
CN115687939A (en) Mask text matching method and medium based on multi-task learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant