Method for extracting a structured text abstract from a long document
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to a method for extracting a structured text abstract from a long document.
Background
At present, abstracting a long text generally involves word embedding, abstract extraction, and discourse structure analysis. Word embedding converts the words of the text into numerical vectors that a machine can learn from. Traditionally, the words of the text are first one-hot encoded and then fed into a Word2Vec model for learning, which completes the mapping from text to numerical vectors.
Abstract extraction is the process by which a machine, having learned the features of a text, selects its important sentences as the abstract. It is in essence a binary classification problem: each sentence is classified as important or not, and the important sentences form the abstract. The current mainstream extraction methods are based on neural network models and consist of two parts, encoding and decoding. Encoding is the process of learning text features, including sentence encoding, position encoding, and article encoding, using methods such as CNNs, RNNs, and BERT; decoding is essentially a classification process, in which a classifier is trained on the encoder output and the given labels.
However, current abstract extraction mainly suffers from the following problems. (1) Existing extraction models do not handle long texts well during encoding: the prior art mainly truncates the text directly and encodes the truncated data, which can lose much of the important information in a long text. Some techniques add encoded representations between paragraphs during encoding, but these have limitations, for example when the input text is not segmented or when there is no modeled correlation between adjacent paragraphs. (2) The publicly available data for Chinese abstract extraction covers few domains and consists of short individual texts, which is unsuitable for training long-text abstract extraction in a specialized domain.
Discourse structure analysis identifies the semantic relations between different text blocks, so that the text can be understood from a global perspective, which in turn can further optimize automatic abstract extraction. In an automatic abstract extraction system for long texts, discourse structure analysis identifies relations such as cause-and-effect and contrast between the sentences of the text and distinguishes primary from secondary content.
Discourse analysis currently faces the following open questions: how is the discourse structure analyzed when no discourse connectives are present, and how is discourse structure analysis applied to the downstream task of automatic summarization? In view of this situation, many problems remain to be solved.
Disclosure of Invention
In order to solve the problems in the prior art of ambiguous words, of direct truncation in long-text abstract extraction, and of the absence of both discourse structure analysis and multi-domain long-text abstract extraction, the invention provides a method for extracting a structured abstract from a long document, comprising the following steps:
(1) Conversion into numerical information
Splitting the input long text into sentences according to punctuation marks, and converting each sentence into a vector matrix, i.e., numerical information a computer can learn from, by means of BERT dynamic word embedding;
(2) Discourse structure analysis
Carrying out implicit discourse relation analysis on each pair of adjacent clauses: the two clauses are fed into two bidirectional GRU models, the hidden states of the two models are concatenated, and the concatenated result is passed to a multilayer perceptron for classification to obtain predicted class probabilities; the class with the highest probability is taken as the relation label, and the long text is segmented according to the identified labels;
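The final classification step of the pairwise relation analysis can be sketched as follows; the single-layer perceptron, the label set, and all parameter shapes are illustrative assumptions (the BiGRU clause encoders and the full multilayer perceptron of the invention are not reproduced here):

```python
import numpy as np

# The four PDTB top-level relation classes used by the method.
LABELS = ["Expansion", "Temporal", "Comparison", "Contingency"]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_relation(u, v, W, b):
    """Concatenate the encodings u, v of two adjacent clauses, score the
    four relation classes, and return the highest-probability label."""
    logits = W @ np.concatenate([u, v]) + b
    probs = softmax(logits)
    return LABELS[int(np.argmax(probs))], probs
```

In practice u and v would be the concatenated hidden states of the two bidirectional GRUs, and W, b would be learned jointly with them.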
(3) Abstract extraction
Performing abstract extraction on each paragraph obtained in step (2) in two ways, model-based and rule-based, the final abstract being the fused output of the two.
As an improvement, in step (3), model-based abstract extraction feeds each paragraph into the model; the model encodes each sentence of the paragraph, i.e., learns its features, and then decodes the learned features, i.e., classifies each sentence, thereby extracting the abstract sentences.
The encoder consists of two layers of bidirectional GRUs. The first layer takes the sentence's word-vector matrix as input; after the forward and backward GRU passes, the hidden vectors of the two directions are concatenated and max-pooled, and the result serves as the input to the second layer. The concatenated hidden state w_i of the first layer represents the i-th word of the sentence together with its position information. The second layer operates in the same way as the first, and its concatenated hidden state h_j represents the j-th sentence of the paragraph. The whole paragraph p is represented by formula (1):

    p = tanh( W_p · (1/N_p) · Σ_{j=1}^{N_p} h_j + b )    (1)

where W_p and b denote the weight and bias of each sentence, N_p denotes the number of sentences in the paragraph, and i and j are positive integers 1, 2, 3, ….
As an improvement, the decoding layer further calculates, from the information obtained during encoding, the probability that a sentence of the text belongs to the abstract, expressed by formula (2):

    P(y_j = 1 | h_j, s_j, p) = σ( W_1·h_j + W_2·s_j + W_3·p )    (2)

where y_j = 1 indicates that the j-th sentence of the paragraph is an abstract sentence, W_1, W_2, W_3 are model parameters, σ is the sigmoid function, and s_j is the dynamic abstract representation: a weighted sum of the hidden states of the sentences already visited, the weights being the probabilities that those sentences belong to the abstract, expressed by formula (3):

    s_j = Σ_{n=1}^{j-1} P(y_n = 1) · h_n    (3)

where n and j denote the n-th and j-th sentences of the paragraph, n and j are positive integers 1, 2, 3, …, and P(y_n = 1) denotes the probability that the visited sentence n belongs to the abstract, calculated as in formula (2).
As an improvement, in step (3), rule-based abstract extraction formulates rules according to the text characteristics of different domains: keywords and specific patterns characteristic of the domain are matched, the sentences around the matched keywords and patterns are recalled, and the recalled sentences are taken as the rule-extracted abstract.
Advantageous effects: in the method for extracting a structured abstract from a long document provided by the invention, dynamic word embedding obtains a word's vector from its surrounding words, solving the problem of ambiguous words in the text; discourse structure analysis segments the text into paragraphs according to the recognized inter-sentence relations, enabling the computer to understand the text from a global perspective; the abstract extraction model extracts an abstract from each paragraph on the basis of the discourse structure analysis, solving the direct-truncation problem of traditional long-text summarization; and rule-based abstract extraction performs feature matching and recalls abstract sentences according to the text characteristics of each domain, solving the problem of abstract extraction from multi-domain texts.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a structural diagram of the discourse structure analysis of the present invention;
FIG. 3 is a structural diagram of the abstract extraction model of the present invention.
Detailed Description
The invention is further described below with reference to the figures and the embodiments. The invention provides a method for extracting a structured abstract from a long document, whose flow chart is shown in FIG. 1. The specific implementation is as follows:
First, the input long text is split into sentences according to punctuation marks, and each sentence is converted into a vector matrix by BERT dynamic word embedding.
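As a minimal sketch of the punctuation-based sentence splitting (the function name and the exact delimiter set are illustrative assumptions; the subsequent BERT embedding step is not reproduced here):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split a long text into sentences on end-of-sentence punctuation.

    The lookbehind keeps each delimiter attached to the sentence it ends;
    each resulting sentence would then be passed to BERT word embedding.
    """
    parts = re.split(r"(?<=[。！？!?;])", text)
    return [p.strip() for p in parts if p.strip()]
```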
Second, discourse structure analysis is performed on the text; the model structure of this part is shown in FIG. 2. Every two adjacent clauses are fed into two bidirectional GRU models, the hidden states of the two models are concatenated, and the concatenated result is passed to a multilayer perceptron for classification to obtain predicted class probabilities; the class with the highest probability is taken as the relation label. The data set adopted by the invention is the PDTB, and the relation labels studied are Expansion, Temporal, Comparison, and Contingency. The final output of this part is the discourse structure analysis of the input long text, according to which the text is segmented. The specific segmentation rule is as follows: sentence pairs in an Expansion or Comparison relation are split into separate paragraphs, while pairs in a Temporal or Contingency relation are not split.
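The segmentation rule above can be sketched as follows, assuming one predicted relation label per pair of adjacent sentences (function and variable names are illustrative):

```python
# Relations that trigger a paragraph break vs. those that keep
# adjacent sentences together, per the rule described above.
SPLIT_RELATIONS = {"Expansion", "Comparison"}

def segment_by_relations(sentences, relations):
    """Group sentences into paragraphs; relations[i] links sentence i and i+1."""
    assert len(relations) == len(sentences) - 1
    paragraphs, current = [], [sentences[0]]
    for sent, rel in zip(sentences[1:], relations):
        if rel in SPLIT_RELATIONS:
            paragraphs.append(current)   # break before this sentence
            current = [sent]
        else:
            current.append(sent)         # Temporal/Contingency: keep together
    paragraphs.append(current)
    return paragraphs
```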
Next, abstract extraction is performed paragraph by paragraph on the segmented text; the model structure of this part is shown in FIG. 3. The word-vector matrix of each sentence in the paragraph is fed into the first-layer bidirectional GRU; the forward and backward hidden states are concatenated and max-pooled, and the result is used as the input of the next layer. The concatenated first-layer hidden state w_i represents the i-th word of the sentence together with its position information. The second layer operates in the same way as the first, and its concatenated hidden state h_j represents the j-th sentence of the paragraph. The paragraph representation is obtained by passing the concatenated second-layer hidden states h_j through a nonlinear activation function, formula (1):

    p = tanh( W_p · (1/N_p) · Σ_{j=1}^{N_p} h_j + b )    (1)

where W_p and b denote the weight and bias of each sentence, which are learned model parameters, and N_p denotes the number of sentences in the paragraph.
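Formula (1) can be sketched in NumPy as follows; averaging the sentence hidden states and applying tanh follows the description above, and all shapes are illustrative assumptions:

```python
import numpy as np

def paragraph_representation(H, W_p, b):
    """Formula (1): p = tanh(W_p · (1/N_p) Σ_j h_j + b).

    H   : (N_p, d) matrix whose rows are the sentence hidden states h_j
    W_p : (d, d) weight matrix, b : (d,) bias -- learned model parameters
    """
    return np.tanh(W_p @ H.mean(axis=0) + b)
```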
The decoding process further calculates, from the information obtained during encoding, the probability that a sentence of the paragraph belongs to the abstract, formula (2):

    P(y_j = 1 | h_j, s_j, p) = σ( W_1·h_j + W_2·s_j + W_3·p )    (2)

where y_j = 1 indicates that the j-th sentence of the paragraph is an abstract sentence, W_1, W_2, W_3 are model parameters, σ is the sigmoid function, and s_j is the dynamic abstract representation: a weighted sum of the hidden states of the sentences already visited, the weights being the probabilities that those sentences belong to the abstract, formula (3):

    s_j = Σ_{n=1}^{j-1} P(y_n = 1) · h_n    (3)

where n and j denote the n-th and j-th sentences of the paragraph, n and j are positive integers 1, 2, 3, …, and P(y_n = 1) denotes the probability that the visited sentence n belongs to the abstract, calculated as in formula (2).
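Formulas (2) and (3) applied sequentially over the sentences of a paragraph can be sketched as follows; treating W_1, W_2, W_3 as weight vectors that map each term to a scalar score is an assumption made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_sentences(H, p, W1, W2, W3):
    """Return P(y_j = 1) for every sentence of a paragraph.

    H: (N_p, d) sentence hidden states h_j; p: (d,) paragraph vector.
    W1, W2, W3: (d,) weight vectors (model parameters).
    """
    s = np.zeros_like(p)          # dynamic abstract representation, s_1 = 0
    probs = []
    for h_j in H:
        p_j = sigmoid(W1 @ h_j + W2 @ s + W3 @ p)   # formula (2)
        probs.append(p_j)
        s = s + p_j * h_j         # formula (3): weighted sum of visited h_n
    return np.array(probs)
```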
This extraction is completed on the basis of the discourse structure analysis; compared with traditional methods that summarize the long text directly, the method provided by the invention extracts the abstract from both global and local perspectives, improving the extraction accuracy for long texts.
The loss function used when training the abstract extraction model is cross entropy, and the optimizer is Adam. The final output of this part is the set of sentences the model predicts to be abstract sentences. Finally, the method of the invention applies rule-based abstract extraction, with rules formulated according to the text characteristics of different domains: keywords and specific patterns characteristic of the domain are first matched, the sentences around the matched keywords and patterns are then recalled, and the recalled sentences are taken as the rule-extracted abstract. The invention takes the fusion of the model-extracted and rule-extracted abstracts as the final extraction result.
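The rule-based recall and the model/rule fusion described above can be sketched as follows; the keywords, the pattern, and the union-based fusion are illustrative assumptions, since the actual rules are domain-specific:

```python
import re

# Illustrative domain cues; real rules would be drawn from the target domain.
KEYWORDS = ["综上所述", "结论", "in conclusion"]
PATTERNS = [re.compile(r"本文(提出|认为)")]

def rule_extract(sentences):
    """Recall the indices of sentences matching a keyword or pattern."""
    return [i for i, s in enumerate(sentences)
            if any(k in s for k in KEYWORDS)
            or any(p.search(s) for p in PATTERNS)]

def fuse(model_ids, rule_ids):
    """Final abstract: union of model- and rule-extracted sentences, in order."""
    return sorted(set(model_ids) | set(rule_ids))
```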
The above-mentioned embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.