CN115906805A - Long text abstract generating method based on word fine granularity
- Publication number: CN115906805A (Application CN202211609887.5A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- sentences
- weight
- text
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to a long text abstract generation method based on word fine granularity, belonging to the technical field of information processing. It combines a traditional algorithm with a deep learning algorithm to address the difficulty of condensing a text into a concise and complete abstract under big-data conditions. The method comprises the following steps: preprocessing the original text; loading the resulting sentence set into a trained, improved NEZHA coding model; loading the sentence vectors carrying semantic information into the TextRank algorithm and ranking and scoring each sentence to obtain its importance score; and finally filtering out highly coupled sentences with an improved MMR algorithm to obtain a low-correlation, high-score set of abstract sentences, i.e., the abstract of the original text. The invention does not limit the number of words in the input text, mines the internal information of the text at the granularity of characters, words and sentences, and takes the structural characteristics of the whole text into account to ensure the quality, accuracy and reliability of the generated abstract.
Description
Technical Field
The invention relates to the technical field of information processing, in particular to a long text abstract generating method based on word fine granularity.
Background
Nowadays, people can easily obtain the related information they want through a search engine and can arrange and write it into their own blogs, which makes information richer, more diverse, more public and more personalized; however, people must spend a great deal of time extracting the desired content from the mass of related information. Automatic text summarization, which aims to improve the efficiency of quickly acquiring important information from massive information, is therefore one of the hot topics in information processing research.
The task of automatic text summarization is to condense a given document so that the summary expresses the main content of the document in as few words as possible. Text summarization methods fall mainly into extractive and generative (abstractive) approaches. An extractive summary is formed by selecting important sentences from the original text according to certain rules, which guarantees the semantic and grammatical correctness of each sentence but lacks logical coherence across the whole summary. Extractive summarization is mainly based on feature extraction, graph ranking and traditional machine learning methods.
In recent years, with the rapid development of deep learning and its expansion into various fields, generative summarization algorithms have been proposed one after another. Generative summarization makes the computer imitate human thinking: on the basis of understanding the document, it expresses the information of the original document in concise and complete content, and its logical coherence is higher than that of extractive methods. Generative summarization models are mainly built on the encoder-decoder baseline. Rush et al. proposed a window-based summarization model in which the encoder and the generative model are jointly trained on a sentence summarization task. To avoid enlarging the decoder, an attention mechanism was introduced, and Nallapati et al. first proposed a generative summarization model based on the attention mechanism. See et al. used a pointer-generator network to copy words from the original text through pointers while retaining the ability to generate new words, so the original text can also be rephrased accurately. Another line of work improves on the Transformer: by sharing the encoder's parameters into the decoder, the encoder becomes part of the decoder, and a gating network screens the input sequence, improving summary scores as well as training and inference speed. Peng et al. used an LSTM and a Transformer to form dual encoders that capture more semantics, plus global gating to further screen key information. With the advent of a series of pre-trained models such as BERT, pre-trained models began to achieve excellent results in natural language understanding. Zhang et al. proposed a BERT-based natural language generation model that makes full use of the pre-trained model during encoding and designs a two-stage decoder to further improve text summarization. Zhang et al. also proposed a new self-supervised training objective, GSG, pre-training a Transformer-based encoder-decoder model on a massive text corpus; the results show performance comparable to human-written abstracts on several text summarization datasets. Facebook proposed BART, a denoising autoencoder for pre-training Seq2Seq models, at ACL 2020; the text-infilling method in BART makes the model learn to consider the overall length of sentences and applies broader transformations to the input, unifying the MLM and NSP objectives of BERT and increasing the difficulty of model learning.
Although generative summarization algorithms based on the BERT model are now mainstream, their data requirements and computational cost are much higher than those of extractive methods; extractive methods in most cases do not depend on an additional training dataset, and the semantic and grammatical accuracy of the extracted content is high. However, traditional extractive algorithm models fail to take into account text representations that carry context and semantic information, consider only the similarity of the summary content, and ignore the redundancy of the summary. Therefore, the invention proposes a summarization method that combines an improved NEZHA pre-trained language model with the TextRank graph model. To handle text of unlimited length and capture deep semantic information of sentences, an improved NEZHA model is introduced to map each sentence into the same vector space, so that semantically similar sentences lie close to each other; the sentence vectors obtained from the improved NEZHA pre-trained model are combined with the TextRank model to compute sentence weights, which are fused with the weights derived from the structural features of the text to finally obtain more accurate and reliable sentence scores; finally, to ensure low redundancy of the abstract, the MMR algorithm is used to eliminate redundancy.
Therefore, a method for generating the long text abstract based on word fine granularity is provided.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a long text abstract generation method based on word fine granularity. The method first converts the text into vectors with an improved NEZHA algorithm, so that the sentence vectors carry not only the semantic and contextual information of characters but also the semantic information of words; it then uses the TextRank algorithm together with the structural features of the text to obtain sentence scores that are more accurate and reliable than those based on a single feature. To ensure low redundancy of the abstract, the improved MMR algorithm is finally used to eliminate redundancy. The method improves the quality of the abstract content and the accuracy and reliability of the scores, and its effectiveness is verified on the LCSTS dataset and the judicial summarization dataset of the 2020 national research cup competition.
In order to achieve the purpose, the invention provides the following technical scheme:
the method for generating the long text abstract based on the word fine granularity comprises the following specific steps:
the method comprises the following steps: performing word segmentation and stop word removal text preprocessing operation on an input text, and splitting the text into sentences according to a text logic framework so as to obtain a clean sentence set;
step two: loading the processed sentences into an improved NEZHA model, and converting the sentences into word vectors;
step three: loading sentence vectors with semantic information into a TextRank algorithm, sequencing and scoring each sentence, and simultaneously considering the similarity between the sentences and the title, the position of the sentence in the paragraph and the context information of the sentence length to obtain the importance score of each sentence;
step four: and finally, filtering out the high-coupling sentences by using an improved MMR algorithm to obtain a low-correlation high-score abstract sentence set, namely the abstract of the original text.
Preferably, in step one, word segmentation and stop-word removal are performed on the input text data, and the text is then split into sentences according to its logical framework, specifically:
(1) Removing special characters: special characters are removed, mainly basic punctuation marks and irregular formats, such as braces, brackets, ellipses and similar symbols;
(2) Word segmentation: each sentence in the text is segmented into characters and words with the jieba word segmentation tool, and stop words are removed.
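As an illustration of this preprocessing, the sketch below cleans a text, splits it into sentences at the full stop, segments each sentence with jieba and filters stop words. The stop-word file name, the character class in the regular expression and the sample text are assumptions for illustration, not part of the patent.

```python
# Minimal preprocessing sketch: clean special characters, split into sentences,
# segment with jieba and remove stop words (file name and regex are assumptions).
import re
import jieba

def load_stopwords(path="stopwords.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def clean_text(text):
    # Strip special characters and irregular formatting before segmentation.
    return re.sub(r"[{}\[\]()<>【】…]+", "", text)

def split_sentences(text):
    # The Chinese full stop marks the end of a sentence.
    return [s.strip() for s in text.split("。") if s.strip()]

def tokenize_sentence(sentence, stopwords):
    # jieba segments the sentence into words; stop words are filtered out.
    return [w for w in jieba.lcut(sentence) if w not in stopwords]

if __name__ == "__main__":
    stopwords = load_stopwords()
    doc = "原告与被告签订劳动合同。被告未支付工资。"
    sentences = split_sentences(clean_text(doc))
    print([tokenize_sentence(s, stopwords) for s in sentences])
```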
Preferably, in the second step, the processed sentences are loaded into the improved NEZHA model, and the sentences are converted into word vectors, specifically:
s1, loading text data into an input layer of an improved NEZHA model to obtain input word Embedding (Token Embedding), input Segment Embedding (Segment Embedding) and Position Embedding (Position Embedding), and adding the three to finally obtain an output vector of the input layer;
S2, after the input layer, the data enters the training layer of the improved NEZHA model. Each hidden layer is composed of Transformers, and each Transformer consists of an attention layer, an intermediate layer and an output layer. The attention mechanism used here is a multi-head attention mechanism with 12 heads: for each head, the corresponding query, key and value vectors are obtained from the query, key and value weight matrices; the query and key vectors are multiplied and the result is scaled to obtain a preliminary attention weight matrix;
S3, the output of the attention layer is fed into a fully connected layer, and the output of the intermediate layer is obtained through the GELU activation function; after a further fully connected layer and a Dropout layer, the overall output is obtained through the Norm layer. Because there are 12 hidden layers, this operation is repeated 12 times, finally yielding the sentence vectors.
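A hedged sketch of how sentences could be encoded into vectors with a pre-trained NEZHA-style encoder is shown below. The Hugging Face transformers library and the checkpoint name are assumptions for illustration; the patent's improved NEZHA weights are not publicly specified, and mean pooling is used here simply as one common way to collapse token states into a sentence vector.

```python
# Sketch: sentence vectors from a NEZHA-style encoder (checkpoint name assumed).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sijunhe/nezha-cn-base")  # assumption
encoder = AutoModel.from_pretrained("sijunhe/nezha-cn-base")        # assumption
encoder.eval()

def sentence_vectors(sentences):
    # Token, segment and position embeddings are added inside the model's
    # input layer; the stacked Transformer layers produce contextual states.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # (B, T, 1)
    # Mean-pool the token states of each sentence into one sentence vector.
    return (hidden * mask).sum(1) / mask.sum(1)

vectors = sentence_vectors(["原告请求支付工资。", "被告辩称已经支付。"])
print(vectors.shape)   # (2, hidden_size)
```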
Preferably, the improved NEZHA model is specifically as follows:
Firstly, Chinese words are added to vocab.txt; an input sentence S is first segmented once with pre_tokenize to obtain [w_1, w_2, ..., w_n]; each w_i is then traversed: if w_i is in the vocabulary it is kept, otherwise w_i is split again with NEZHA's built-in tokenize function; the tokenize results of all w_i are concatenated in order as the final tokenize result.
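A minimal sketch of this word-then-character tokenization scheme follows. The toy vocabulary, the jieba-based pre_tokenize and the character-level fallback are illustrative assumptions standing in for the real vocab.txt and NEZHA's own tokenizer.

```python
# Sketch of the improved tokenization: keep whole words that are in the
# vocabulary, otherwise fall back to the base tokenizer (approximated here
# as a character split), then concatenate the pieces in order.
import jieba

vocab = {"原告", "被告", "工资"}            # toy vocabulary including added words

def pre_tokenize(sentence):
    # First pass: split the sentence into words.
    return jieba.lcut(sentence)

def base_tokenize(word):
    # Stand-in for NEZHA's built-in tokenizer: split into single characters.
    return list(word)

def improved_tokenize(sentence):
    tokens = []
    for w in pre_tokenize(sentence):        # traverse each w_i
        if w in vocab:
            tokens.append(w)                # keep the whole word
        else:
            tokens.extend(base_tokenize(w)) # otherwise split it again
    return tokens                           # concatenated in order

print(improved_tokenize("原告要求被告支付工资"))
```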
Preferably, in the third step, the sentence vectors with semantic information are loaded into the TextRank algorithm, each sentence is ranked and scored, and the importance score of each sentence is obtained by considering the similarity between the sentence and the title, the position of the sentence in the paragraph, and the context information of the sentence length, specifically:
Step 3.1: the cosine of the angle between two vectors is used as the measure of the difference between two individuals, with the formula:
cos(X_i, X_j) = (X_i · X_j) / (||X_i|| · ||X_j||)
where X_i and X_j denote the i-th and j-th sentence vectors respectively, and cos(X_i, X_j) denotes the similarity of the two sentences;
step 3.2: the TextRank algorithm calculates sentence weights:
Similarity between sentences defines the edges: if two sentences are similar, an edge connects them and the similarity value is used as the edge weight, forming an undirected weighted TextRank network graph; the weight of each sentence, i.e., of each node, is then calculated as follows:
The initial weight of each node is set to 1/|D|, i.e., B_0 = (1, ..., 1)^T; after several iterations the weights converge according to B_i = SD_{n×m} · B_{i-1};
After several iterations, the resulting B_{n×1} = [b_1, b_2, ..., b_n]^T contains the weight of each sentence node, where b_i denotes the score of the i-th sentence;
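The following sketch illustrates this TextRank step on sentence vectors: cosine similarities supply the edge weights and the node weights are obtained by power iteration. The damping factor value is the conventional PageRank choice and is an assumption, not a value stated in the patent.

```python
# TextRank sketch: build a cosine-similarity graph over sentence vectors and
# iterate until the node weights (sentence scores) converge.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def textrank_weights(vectors, d=0.85, iters=100, tol=1e-6):
    n = len(vectors)
    S = np.array([[cosine(vectors[i], vectors[j]) if i != j else 0.0
                   for j in range(n)] for i in range(n)])
    # Row-normalise so each node distributes its weight over its neighbours.
    S = S / (S.sum(axis=1, keepdims=True) + 1e-12)
    b = np.ones(n) / n                      # initial weight of every node
    for _ in range(iters):
        b_new = (1 - d) / n + d * S.T.dot(b)
        if np.abs(b_new - b).sum() < tol:   # converged
            break
        b = b_new
    return b                                # b[i] is the score of sentence i

scores = textrank_weights(np.random.rand(5, 768))
print(scores)
```

In practice the input vectors would be the improved-NEZHA sentence vectors produced in step two rather than random arrays.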
step 3.3: sentence and title similarity characteristics:
The title-similarity weight is calculated. The title is generally a highly condensed summary of the article's content, so the words appearing in the title are likely to be feature words of the article. The higher the similarity between a sentence and the title, the more important the sentence is and the higher the probability that it will appear in the abstract. Therefore, the similarity between each sentence of the article and the title is calculated to adjust the sentence weight;
Suppose the title sentence is S_0 and its corresponding sentence vector is X_0; the similarity between each sentence and the title sentence is again calculated with the cosine function given above. Traversing the sentences of the text and calculating their similarity to the title sentence finally yields the title-similarity weight matrix P_{n×1};
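A small sketch of this title-similarity feature follows: each sentence vector is compared with the title vector by cosine similarity, producing one weight per sentence. Function and variable names are illustrative.

```python
# Title-similarity weights: cosine similarity of each sentence vector to the
# title vector, collected as one weight per sentence (P in the text).
import numpy as np

def title_similarity_weights(sentence_vectors, title_vector):
    sims = []
    for v in sentence_vectors:
        sims.append(np.dot(v, title_vector) /
                    (np.linalg.norm(v) * np.linalg.norm(title_vector) + 1e-12))
    return np.array(sims)        # shape (n,): title-similarity weight per sentence
```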
Step 3.4: characteristics of sentence position in paragraph:
Sentences nearer the beginning of the first paragraph are given larger weights, and sentences nearer the end of the last paragraph are given smaller weights, so the weight adjustment coefficient is:
where j denotes the position of the sentence within its paragraph and h denotes the total number of sentences in the paragraph; this calculation yields the sentence-position weight matrix Q_{n×1} = [w_S(S_1), w_S(S_2), ..., w_S(S_n)]^T;
Step 3.5: characteristics of sentence length:
Long sentences tend to contain key information as well as modifying content, because core content needs more words for description and qualification, especially in strongly descriptive and logically rigorous texts such as legal documents and file descriptions; short sentences may be summary content but may also be purely modifying sentences. A length coefficient is therefore introduced so that the sentence weight is adjusted reasonably according to sentence length, giving the weight adjustment coefficient:
where C_i denotes the contrast coefficient of the i-th sentence, i.e., the ratio of its length L_i to the length of the longest sentence in the text, and C_avg denotes the average contrast coefficient; when the contrast coefficient is less than 0.1 the sentence is not taken as an abstract candidate, i.e., its length adjustment coefficient is 0; otherwise its adjustment coefficient is calculated;
Step 3.6: sentence weight fusion:
Integrating the inter-sentence weight, the sentence-position weight, the title-similarity weight and the sentence-length weight, the final sentence score is calculated as:
Score_{n×1} = λ_B·B_{n×1} + λ_O·O_{n×1} + λ_P·P_{n×1} + λ_L·L_{n×1}
where each λ is the influence factor of the corresponding feature: the larger the influence of a feature on the sentence weight, the larger its coefficient, and vice versa; λ_B + λ_O + λ_P + λ_L = 1, and each λ lies between 0 and 1; B_{n×1} is the inter-sentence weight, O_{n×1} the sentence-position weight, P_{n×1} the title-similarity weight, and L_{n×1} the sentence-length feature weight.
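A sketch of this multi-feature fusion is shown below: the four per-sentence feature vectors are combined with coefficients that sum to 1. The coefficient values and function names are assumptions chosen only for illustration.

```python
# Fuse the four per-sentence feature vectors into one score per sentence.
import numpy as np

def fuse_scores(B, O, P, L, lambdas=(0.4, 0.2, 0.2, 0.2)):
    lam_B, lam_O, lam_P, lam_L = lambdas
    assert abs(lam_B + lam_O + lam_P + lam_L - 1.0) < 1e-9
    # B: TextRank inter-sentence weights, O: position weights,
    # P: title-similarity weights, L: length weights (all length-n arrays).
    return (lam_B * np.asarray(B) + lam_O * np.asarray(O)
            + lam_P * np.asarray(P) + lam_L * np.asarray(L))
```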
Preferably, the improved MMR algorithm in step four filters out the highly coupled sentences to obtain a set of summary sentences with low correlation and high score, and the specific steps are as follows:
After the multi-feature sentence weights are fused, a weight is obtained for each sentence, which can be regarded as the sentence score; the sentences are sorted from high to low by score and denoted s_1, s_2, ..., s_n in turn, where s_i is the i-th ranked sentence, with corresponding scores score_1, score_2, ..., score_n;
The MMRS(s_i) value of each abstract-candidate sentence is then determined according to the improved MMR algorithm, with the following formula:
MMRS(s_i) = λ·score_i - (1 - λ)·Maxsim(s_i, s_j)
where s_i denotes the i-th ranked sentence, λ ∈ [0, 1], score_i denotes the score of the i-th ranked sentence, and Maxsim(s_i, s_j) denotes the maximum similarity between the i-th sentence and the sentences already in the candidate set; λ·score_i expresses the importance of the sentence score, and (1 - λ)·Maxsim(s_i, s_j) expresses the difference between the i-th sentence and the candidate sentences. λ is set to 0.75 and the MMRS redundancy threshold is set to t = 0.75; when MMRS(s_i) ≤ t and the number of candidate abstract sentences is smaller than the preset number of sentences, the sentence meeting these conditions is added to the candidate abstract set; finally, the sentences in the candidate abstract set are arranged in their original order, giving a concise and complete abstract.
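The sketch below applies this filtering rule as literally described: candidates are visited in descending score order, and a sentence is added while its MMRS value stays at or below the threshold and the summary is not yet full. Function and parameter names are illustrative.

```python
# Improved-MMR sketch: high-score, low-coupling selection of summary sentences.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select(sentences, vectors, scores, lam=0.75, t=0.75, max_sentences=5):
    order = np.argsort(scores)[::-1]            # s_1 ... s_n by descending score
    selected = []
    for i in order:
        if len(selected) >= max_sentences:
            break
        max_sim = max((cosine(vectors[i], vectors[j]) for j in selected),
                      default=0.0)              # Maxsim(s_i, s_j)
        mmrs = lam * scores[i] - (1 - lam) * max_sim
        if mmrs <= t:                           # low-redundancy condition
            selected.append(i)
    selected.sort()                             # restore original sentence order
    return [sentences[i] for i in selected]
```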
Preferably, the stop words include not only modal particles but also personal pronouns, for example: "aiyo", "you", "your".
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention adopts an extractive summarization method oriented to legal judgment documents; it integrates an improved NEZHA algorithm, the TextRank algorithm, an improved MMR algorithm and the structural features of text sentences, and provides a long text abstract generation algorithm suitable for single documents of unlimited length;
(2) The invention improves the NEZHA algorithm, so that the NEZHA algorithm can process the characteristics of words, thereby mining more semantic information and context information and leading the output sentence vector to be more representative and reliable;
(3) The method not only uses the TextRank algorithm to calculate the score of the sentence, but also corrects the score of the sentence by using the similarity between the sentence and the title, the position of the sentence, the length of the sentence and other structural information, thereby improving the accuracy and reliability of the score of the sentence;
(4) The invention uses the improved MMR algorithm to reduce the redundancy of the abstract according to the principle of high score and low coupling, remedies the situation in which sentences with relatively high scores were removed, and improves the content quality of the abstract.
Drawings
FIG. 1 is a flow chart of a long text summary generation method based on word granularity;
fig. 2 is a model framework diagram of a long text abstract generation method based on word fine granularity.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Examples
Referring to fig. 1 to fig. 2, the present invention provides a technical solution: the method for generating the long text abstract comprises the following specific steps:
the method comprises the following steps: performing word segmentation and stop word removal text preprocessing operation on an input text, and splitting the text into sentences according to a text logic framework so as to obtain a clean sentence set;
It should be noted that this long text abstract generation method mainly addresses the problems of insufficient capture of semantic and contextual information and of redundant abstract content in long-text summarization. The CAIL2020 summarization dataset is therefore selected as the experimental data; it contains 9,848 civil first-instance judgment documents. Each judgment document is divided into sentences, each sentence is labelled as important or not, and a corresponding full-text reference abstract is provided. The documents cover civil disputes such as tort liability, labour contracts and inheritance contracts; the average document length is 2,568 words, the maximum 13,064 words and the minimum 866 words, with a 90th-percentile length of 4,863 words; the average abstract length is 283 words, the maximum 474 and the minimum 66, so the dataset is a long-text summarization dataset;
First, the documents and abstracts need to be processed to remove special punctuation and irregular formats, since such content would affect the subsequent word segmentation and sentence splitting; then a regular expression is used to extract and identify the document category from the symbolic character strings in the document; the abstract is then split at periods; next, a predefined dictionary of legal nouns is imported and the jieba word segmentation tool is used for word segmentation; finally, sentence splitting is performed using the full stop as the end-of-sentence mark.
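As a brief illustration of importing the predefined legal-noun dictionary, the sketch below loads a user dictionary into jieba so that domain terms are kept as whole words during segmentation; the file name and its contents are illustrative assumptions.

```python
# Load a custom legal-term dictionary so domain nouns survive segmentation.
import jieba

# Each line of the user dictionary is: word [frequency] [POS tag]
jieba.load_userdict("legal_terms.txt")   # file name is an assumption

print(jieba.lcut("双方因劳动合同纠纷诉至法院"))
```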
In this embodiment, the stop words include not only modal particles but also personal pronouns, for example: "aiyo", "you", "your".
Step two: loading the processed sentences into an improved NEZHA model, and converting the sentences into word vectors;
The BERT pre-trained model used in this embodiment is widely applied in natural language processing, but the mainstream BERT model can process at most 512 tokens. The root cause of this bottleneck is that BERT uses absolute position encodings trained from random initialization, with the maximum position generally set to 512; at most 512 tokens can therefore be processed, and any content beyond that has no position encoding available and does not participate in learning. In 2019, Huawei's Noah's Ark Lab proposed the NEZHA model on the basis of BERT and added relative position encoding, which breaks the limit on input length and allows the model to better learn the interactive transfer of information. The invention therefore takes the NEZHA model as the reference model for vector training, removing the word-count limit on the input text. However, the NEZHA model uses only characters as the basic unit during training, i.e., a Chinese sentence is split into individual characters; this cannot make full use of word-level information, captures fewer sentence features, loosens the connection between grammar and semantics, and cannot fully express the sentence semantics. For example, if the word "notopterygium" is split at character granularity, it becomes two characters whose individual meanings ("notopterygium" and "alive") differ markedly from the meaning of the word they form together. Sentence features captured from character-granularity information alone are therefore fewer, the problem of "loss of text semantics" arises, and an abstract generated entirely from characters is of poor quality;
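To make the relative-position idea concrete, the sketch below builds a sinusoidal encoding table indexed by clipped pairwise distances, the general construction behind functional relative position encoding; the dimension and clipping value are illustrative assumptions and the code is not the patent's implementation.

```python
# Sketch: a sinusoidal relative-position table indexed by clipped distances,
# so no trained absolute-position table caps the input length.
import numpy as np

def relative_position_encoding(seq_len, dim, max_dist=64):
    # rel[i, j] = clipped signed distance between positions i and j, shifted >= 0
    rel = np.clip(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None],
                  -max_dist, max_dist) + max_dist
    table = np.zeros((2 * max_dist + 1, dim))
    pos = np.arange(2 * max_dist + 1)[:, None]
    div = np.power(10000.0, np.arange(0, dim, 2) / dim)
    table[:, 0::2] = np.sin(pos / div)
    table[:, 1::2] = np.cos(pos / div)
    return table[rel]                     # shape (seq_len, seq_len, dim)

print(relative_position_encoding(8, 16).shape)
```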
Therefore, NEZHA is improved to take words as units, with the following specific modifications: firstly, Chinese words are added to vocab.txt; an input sentence S is segmented once with pre_tokenize to obtain [w_1, w_2, ..., w_n]; each w_i is then traversed: if w_i is in the vocabulary it is kept, otherwise it is split again with NEZHA's built-in tokenize function; the tokenize results of all w_i are concatenated in order as the final tokenize result.
Loading the preprocessed text data into an input layer of an improved NEZHA model to obtain input word Embedding (Token Embedding), input Segment Embedding (Segment Embedding) and Position Embedding (Position Embedding), and adding the three to obtain an output vector of the input layer;
After the input layer, the data enters the training layer of the improved NEZHA model. Each hidden layer is composed of Transformers, each of which consists of an attention layer, an intermediate layer and an output layer. The attention mechanism used here is a multi-head attention mechanism with 12 heads. For each head, the corresponding query, key and value vectors are obtained from the query, key and value weight matrices of the attention mechanism; the query and key vectors are multiplied and the result is scaled to obtain a preliminary attention weight matrix;
The output of the attention layer is fed into a fully connected layer, and the output of the intermediate layer is obtained through the GELU activation function; after a further fully connected layer and a Dropout layer, the overall output is obtained through the Norm layer. Because there are 12 hidden layers, this operation is repeated 12 times, finally yielding the sentence vectors.
Step three: loading sentence vectors with semantic information into a TextRank algorithm, sequencing and scoring each sentence, and simultaneously considering the similarity between the sentences and the title, the position of the sentence in the paragraph and the context information of the sentence length to obtain the importance score of each sentence; however, there is a room for improving the accuracy and reliability of sentence score, because it only calculates the sentence score by the inherent features of the text and the unilateral similarity, but ignores the extrinsic features of the text, such as the similarity between the sentence and the title, the position of the sentence in the paragraph, and the extrinsic structural information of the text, the invention corrects the sentence score by using the extrinsic structural information of the text, thereby improving the accuracy and reliability of sentence score, and the specific operation steps are as follows:
Step 3.1: the cosine of the angle between two vectors is used as the measure of the difference between two individuals, with the formula:
cos(X_i, X_j) = (X_i · X_j) / (||X_i|| · ||X_j||)
where X_i and X_j denote the i-th and j-th sentence vectors respectively, and cos(X_i, X_j) denotes the similarity of the two sentences;
step 3.2: the TextRank algorithm calculates sentence weights:
Among unsupervised summarization methods, determining the importance of sentences in a document with the graph-based ranking algorithm TextRank is widely used. The main idea is to treat a text (or several texts) as a graph model: the sentences of the document correspond to the nodes of the graph, and the similarity between sentences is taken as the weight of the edges. The TextRank graph algorithm draws on the PageRank algorithm proposed by Google, which is mainly used to rank web pages in online search results, placing pages of high importance first. The importance of a node in TextRank can therefore be computed by iteratively calculating the TextRank value of each sentence in the manner of the PageRank ranking algorithm, and the highly ranked sentences are finally extracted to form the text abstract.
The initial weight of each node is set to 1/|D|, i.e., B_0 = (1, ..., 1)^T; after several iterations the weights converge according to B_i = SD_{n×m} · B_{i-1};
After several iterations, the resulting B_{n×1} = [b_1, b_2, ..., b_n]^T contains the weight of each sentence node, where b_i denotes the score of the i-th sentence;
step 3.3: sentence and title similarity characteristics:
The title-similarity weight is calculated. The title is generally a highly condensed summary of the article's content, so the words appearing in the title are likely to be feature words of the article. The higher the similarity between a sentence and the title, the more important the sentence is and the higher the probability that it will appear in the abstract. Therefore, the similarity between each sentence of the article and the title is calculated to adjust the sentence weight;
Suppose the title sentence is S_0 and its corresponding sentence vector is X_0; the similarity between each sentence and the title sentence is again calculated with the cosine function given above. Traversing the sentences of the text and calculating their similarity to the title sentence finally yields the title-similarity weight matrix P_{n×1};
Step 3.4: characteristics of sentence position in paragraph:
The importance of sentences varies with their position in the document, and sentences appearing at the beginning and end of a paragraph are relatively more important. Expert research shows that in manual abstract extraction the first sentence of a paragraph is selected about 85% of the time and the last sentence close to 7% of the time. For news articles with an inverted-pyramid structure, the gist of the article is mostly placed in the first paragraph or the first sentence, so it is particularly important to appropriately increase the weight of paragraphs or sentences close to the beginning of the article. The weight of a sentence is therefore adjusted according to the paragraph it belongs to and its position within that paragraph: sentences nearer the beginning of the first paragraph are given larger weights, and sentences nearer the end of the last paragraph are given smaller weights, so the weight adjustment coefficient is:
where j denotes the position of the sentence within its paragraph and h denotes the total number of sentences in the paragraph; this calculation yields the sentence-position weight matrix Q_{n×1} = [w_S(S_1), w_S(S_2), ..., w_S(S_n)]^T;
Step 3.5: characteristics of sentence length:
Long sentences tend to contain key information as well as modifying content, because core content needs more words for description and qualification, especially in strongly descriptive and logically rigorous texts such as legal documents and file descriptions; short sentences may be summary content but may also be purely modifying sentences. A length coefficient is therefore introduced so that the sentence weight is adjusted reasonably according to sentence length, giving the weight adjustment coefficient:
where C_i denotes the contrast coefficient of the i-th sentence, i.e., the ratio of its length L_i to the length of the longest sentence in the text, and C_avg denotes the average contrast coefficient; when the contrast coefficient is less than 0.1 the sentence is not taken as an abstract candidate, i.e., its length adjustment coefficient is 0; otherwise its adjustment coefficient is calculated;
Step 3.6: sentence weight fusion:
Integrating the inter-sentence weight, the sentence-position weight, the title-similarity weight and the sentence-length weight, the final sentence score is calculated as:
Score_{n×1} = λ_B·B_{n×1} + λ_O·O_{n×1} + λ_P·P_{n×1} + λ_L·L_{n×1}
where each λ is the influence factor of the corresponding feature: the larger the influence of a feature on the sentence weight, the larger its coefficient, and vice versa; λ_B + λ_O + λ_P + λ_L = 1, and each λ lies between 0 and 1; B_{n×1} is the inter-sentence weight, O_{n×1} the sentence-position weight, P_{n×1} the title-similarity weight, and L_{n×1} the sentence-length feature weight.
Step four: finally, filtering out the high-coupling sentences by using an improved MMR algorithm to obtain a low-correlation high-score abstract sentence set, namely an abstract of the original text;
Specifically, after the multi-feature sentence weights are fused, a weight is obtained for each sentence, which can be regarded as the sentence score; the sentences are sorted from high to low by score and denoted s_1, s_2, ..., s_n in turn, where s_i is the i-th ranked sentence, with corresponding scores score_1, score_2, ..., score_n;
The MMRS(s_i) value of each abstract-candidate sentence is then determined according to the improved MMR algorithm, with the following formula:
MMRS(s_i) = λ·score_i - (1 - λ)·Maxsim(s_i, s_j)
where s_i denotes the i-th ranked sentence, λ ∈ [0, 1], score_i denotes the score of the i-th ranked sentence, and Maxsim(s_i, s_j) denotes the maximum similarity between the i-th sentence and the sentences already in the candidate set; λ·score_i expresses the importance of the sentence score, and (1 - λ)·Maxsim(s_i, s_j) expresses the difference between the i-th sentence and the candidate sentences. λ is set to 0.75 and the MMRS redundancy threshold is set to t = 0.75; when MMRS(s_i) ≤ t and the number of candidate abstract sentences is smaller than the preset number of sentences, the sentence meeting these conditions is added to the candidate abstract set; finally, the sentences in the candidate abstract set are arranged in their original order, giving a concise and complete abstract.
Step one, performing word segmentation and word stop removal on input text data, and splitting a text into sentences according to a text logic framework, specifically:
(1) Removing special characters: special characters are removed, mainly basic punctuation marks and irregular formats, such as braces, brackets, ellipses and similar symbols;
(2) Word segmentation: each sentence in the text is segmented into characters and words with the jieba word segmentation tool, and stop words are removed.
Step two, the processed sentences are loaded into an improved NEZHA model, and the sentences are converted into word vectors, specifically:
s1, loading text data into an input layer of an improved NEZHA model to obtain input word Embedding (Token Embedding), input Segment Embedding (Segment Embedding) and Position Embedding (Position Embedding), and adding the three to finally obtain an output vector of the input layer;
S2, after the input layer, the data enters the training layer of the improved NEZHA model. Each hidden layer is composed of Transformers, and each Transformer consists of an attention layer, an intermediate layer and an output layer. The attention mechanism used here is a multi-head attention mechanism with 12 heads: for each head, the corresponding query, key and value vectors are obtained from the query, key and value weight matrices; the query and key vectors are multiplied and the result is scaled to obtain a preliminary attention weight matrix;
S3, the output of the attention layer is fed into a fully connected layer, and the output of the intermediate layer is obtained through the GELU activation function; after a further fully connected layer and a Dropout layer, the overall output is obtained through the Norm layer. Because there are 12 hidden layers, this operation is repeated 12 times, finally yielding the sentence vectors.
The improved NEZHA model specifically comprises the following components:
Firstly, Chinese words are added to vocab.txt; an input sentence S is segmented once with pre_tokenize to obtain [w_1, w_2, ..., w_n]; each w_i is then traversed: if w_i is in the word list it is kept, otherwise it is split again with NEZHA's built-in tokenize function; the tokenize results of all w_i are concatenated in order as the final tokenize result.
Loading the sentence vectors with semantic information into a TextRank algorithm, sorting and scoring each sentence, and simultaneously considering the similarity between the sentences and the titles, the positions of the sentences in the paragraphs and the context information of the sentence lengths to obtain the importance score of each sentence, wherein the steps are specifically as follows:
Step 3.1: the cosine of the angle between two vectors is used as the measure of the difference between two individuals, with the formula:
cos(X_i, X_j) = (X_i · X_j) / (||X_i|| · ||X_j||)
where X_i and X_j denote the i-th and j-th sentence vectors respectively, and cos(X_i, X_j) denotes the similarity of the two sentences;
step 3.2: the TextRank algorithm calculates sentence weights:
Similarity between sentences defines the edges: if two sentences are similar, an edge connects them and the similarity value is used as the edge weight, forming an undirected weighted TextRank network graph; the weight of each sentence, i.e., of each node, is then calculated as follows:
The initial weight of each node is set to 1/|D|, i.e., B_0 = (1, ..., 1)^T; after several iterations the weights converge according to B_i = SD_{n×m} · B_{i-1};
After several iterations, the resulting B_{n×1} = [b_1, b_2, ..., b_n]^T contains the weight of each sentence node, where b_i denotes the score of the i-th sentence;
step 3.3: sentence and title similarity characteristics:
The title-similarity weight is calculated. The title is generally a highly condensed summary of the article's content, so the words appearing in the title are likely to be feature words of the article. The higher the similarity between a sentence and the title, the more important the sentence is and the higher the probability that it will appear in the abstract. Therefore, the similarity between each sentence of the article and the title is calculated to adjust the sentence weight;
Suppose the title sentence is S_0 and its corresponding sentence vector is X_0; the similarity between each sentence and the title sentence is again calculated with the cosine function given above. Traversing the sentences of the text and calculating their similarity to the title sentence finally yields the title-similarity weight matrix P_{n×1};
Step 3.4: characteristics of sentence position in paragraph:
Sentences nearer the beginning of the first paragraph are given larger weights, and sentences nearer the end of the last paragraph are given smaller weights, so the weight adjustment coefficient is:
where j denotes the position of the sentence within its paragraph and h denotes the total number of sentences in the paragraph; this calculation yields the sentence-position weight matrix Q_{n×1} = [w_S(S_1), w_S(S_2), ..., w_S(S_n)]^T;
Step 3.5: characteristics of sentence length:
Long sentences tend to contain key information as well as modifying content, because core content needs more words for description and qualification, especially in strongly descriptive and logically rigorous texts such as legal documents and file descriptions; short sentences may be summary content but may also be purely modifying sentences. A length coefficient is therefore introduced so that the sentence weight is adjusted reasonably according to sentence length, giving the weight adjustment coefficient:
where C_i denotes the contrast coefficient of the i-th sentence, i.e., the ratio of its length L_i to the length of the longest sentence in the text, and C_avg denotes the average contrast coefficient; when the contrast coefficient is less than 0.1 the sentence is not taken as an abstract candidate, i.e., its length adjustment coefficient is 0; otherwise its adjustment coefficient is calculated;
Step 3.6: sentence weight fusion:
Integrating the inter-sentence weight, the sentence-position weight, the title-similarity weight and the sentence-length weight, the final sentence score is calculated as:
Score_{n×1} = λ_B·B_{n×1} + λ_O·O_{n×1} + λ_P·P_{n×1} + λ_L·L_{n×1}
where each λ is the influence factor of the corresponding feature: the larger the influence of a feature on the sentence weight, the larger its coefficient, and vice versa; λ_B + λ_O + λ_P + λ_L = 1, and each λ lies between 0 and 1; B_{n×1} is the inter-sentence weight, O_{n×1} the sentence-position weight, P_{n×1} the title-similarity weight, and L_{n×1} the sentence-length feature weight.
Fourthly, the improved MMR algorithm filters out the sentences with high coupling to obtain a summary sentence set with low correlation and high score, and the specific steps are as follows:
After the multi-feature sentence weights are fused, a weight is obtained for each sentence, which can be regarded as the sentence score; the sentences are sorted from high to low by score and denoted s_1, s_2, ..., s_n in turn, where s_i is the i-th ranked sentence, with corresponding scores score_1, score_2, ..., score_n;
The MMRS(s_i) value of each abstract-candidate sentence is then determined according to the improved MMR algorithm, with the following formula:
MMRS(s_i) = λ·score_i - (1 - λ)·Maxsim(s_i, s_j)
where s_i denotes the i-th ranked sentence, λ ∈ [0, 1], score_i denotes the score of the i-th ranked sentence, and Maxsim(s_i, s_j) denotes the maximum similarity between the i-th sentence and the sentences already in the candidate set; λ·score_i expresses the importance of the sentence score, and (1 - λ)·Maxsim(s_i, s_j) expresses the difference between the i-th sentence and the candidate sentences. λ is set to 0.75 and the MMRS redundancy threshold is set to t = 0.75; when MMRS(s_i) ≤ t and the number of candidate abstract sentences is smaller than the preset number of sentences, the sentence meeting these conditions is added to the candidate abstract set; finally, the sentences in the candidate abstract set are arranged in their original order, giving a concise and complete abstract.
The ROUGE metric, widely used in summarization tasks, is adopted as the evaluation index. ROUGE evaluates an abstract based on the co-occurrence of n-grams and is a recall-oriented automatic summary evaluation method: the abstract automatically generated by the algorithm is compared with the standard abstract of the test dataset, and the quality of the abstract is evaluated by counting the overlapping basic units between the two. Three indices, ROUGE-1, ROUGE-2 and ROUGE-L, are selected to evaluate the quality of the abstracts generated by the algorithm.
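For illustration, the sketch below computes a bare-bones ROUGE-N recall on character n-grams, the kind of co-occurrence statistic described above; a real evaluation would use a full ROUGE implementation that also reports F-scores and ROUGE-L.

```python
# Minimal ROUGE-N recall on character n-grams (illustrative only).
from collections import Counter

def ngrams(text, n):
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(cand[g], ref[g]) for g in ref)   # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)

ref = "被告应向原告支付工资"
cand = "被告支付原告工资"
print(rouge_n_recall(cand, ref, 1), rouge_n_recall(cand, ref, 2))
```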
The algorithm provided by the invention is compared with the classic TextRank algorithm, the BERT-TextRank algorithm, the NEZHA-TextRank algorithm with weight correction, and the NEZHA-TextRank algorithm with weight correction and redundancy processing on single long Chinese texts from the legal domain, and the reasonableness and practicality of the proposed algorithm on single long texts are verified by evaluating against manually annotated standard abstracts. The results obtained are shown in Table 1; every index of the proposed algorithm is higher than that of the other algorithms, which demonstrates its superiority and its suitability for abstract extraction from single long texts.
Table 1 summary evaluation results of each algorithm:
the foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (7)
1. The method for generating the long text abstract based on the word fine granularity is characterized by comprising the following steps of: the method for generating the long text abstract comprises the following specific steps:
the method comprises the following steps: performing word segmentation and stop word removal text preprocessing operation on an input text, and splitting the text into sentences according to a text logic framework so as to obtain a clean sentence set;
step two: loading the processed sentences into an improved NEZHA model, and converting the sentences into word vectors;
step three: loading sentence vectors with semantic information into a TextRank algorithm, sequencing and scoring each sentence, and simultaneously considering the similarity between the sentences and the title, the position of the sentence in the paragraph and the context information of the sentence length to obtain the importance score of each sentence;
step four: and finally, filtering out the high-coupling sentences by using an improved MMR algorithm to obtain a low-correlation high-score abstract sentence set, namely the abstract of the original text.
2. The method for generating a long text summary based on word fine granularity according to claim 1, wherein: step one, performing word segmentation and word stop removal on input text data, and splitting a text into sentences according to a text logic framework, specifically:
(1) Removing special characters: special characters are removed, mainly basic punctuation marks and irregular formats, such as braces, brackets, ellipses and similar symbols;
(2) Word segmentation: each sentence in the text is segmented into characters and words with the jieba word segmentation tool, and stop words are removed.
3. The method for generating a long text summary based on word fine granularity according to claim 1, wherein: step two, the processed sentences are loaded into an improved NEZHA model, and the sentences are converted into word vectors, specifically:
s1, loading text data into an input layer of an improved NEZHA model to obtain input word embedding, input segment embedding and position embedding, and adding the three to finally obtain an output vector of the input layer;
S2, after the input layer, the data enters the training layer of the improved NEZHA model. Each hidden layer is composed of Transformers, and each Transformer consists of an attention layer, an intermediate layer and an output layer. The attention mechanism used here is a multi-head attention mechanism with 12 heads: for each head, the corresponding query, key and value vectors are obtained from the query, key and value weight matrices; the query and key vectors are multiplied and the result is scaled to obtain a preliminary attention weight matrix;
S3, the output of the attention layer is fed into a fully connected layer, and the output of the intermediate layer is obtained through the GELU activation function; after a further fully connected layer and a Dropout layer, the overall output is obtained through the Norm layer. Because there are 12 hidden layers, this operation is repeated 12 times, finally yielding the sentence vectors.
4. The method of claim 3 for generating a long text abstract based on word fine granularity, wherein: the improved NEZHA model specifically comprises the following steps:
Firstly, Chinese words are added to vocab.txt; an input sentence S is segmented once with pre_tokenize to obtain [w_1, w_2, ..., w_n]; each w_i is then traversed: if w_i is in the vocabulary it is kept, otherwise it is split again with NEZHA's built-in tokenize function; the tokenize results of all w_i are concatenated in order as the final tokenize result.
5. The method for generating a long text summary based on word fine granularity according to claim 1, wherein: loading the sentence vectors with semantic information into a TextRank algorithm, sorting and scoring each sentence, and simultaneously considering the similarity between the sentences and the titles, the positions of the sentences in the paragraphs and the context information of the sentence lengths to obtain the importance score of each sentence, wherein the steps are specifically as follows:
Step 3.1: the cosine of the angle between two vectors is used as the measure of the difference between two individuals, with the formula:
cos(X_i, X_j) = (X_i · X_j) / (||X_i|| · ||X_j||)
where X_i and X_j denote the i-th and j-th sentence vectors respectively, and cos(X_i, X_j) denotes the similarity of the two sentences;
step 3.2: the TextRank algorithm calculates sentence weights:
Similarity between sentences defines the edges: if two sentences are similar, an edge connects them and the similarity value is used as the edge weight, forming an undirected weighted TextRank network graph; the weight of each sentence, i.e., of each node, is then calculated as follows:
The initial weight of each node is set to 1/|D|, i.e., B_0 = (1, ..., 1)^T; after several iterations the weights converge according to B_i = SD_{n×m} · B_{i-1};
After several iterations, the resulting B_{n×1} = [b_1, b_2, ..., b_n]^T contains the weight of each sentence node, where b_i denotes the score of the i-th sentence;
step 3.3: sentence and title similarity characteristics:
the title similarity weight is calculated; the title is generally a condensed summary of the article's content, so words appearing in the title are likely to be feature words of the article; the higher the similarity between a sentence in the article and the title, the more important the sentence is and the higher the probability that it becomes a sentence of the abstract; therefore the similarity between the sentences of the article and the title is calculated to adjust the weight of the sentences;
suppose the title sentence is S_0 and its corresponding sentence vector is X_0; the similarity between a sentence and the title sentence is again calculated with the cosine function, with the specific formula:

cos(X_i, X_0) = (X_i · X_0) / (|X_i| · |X_0|)

by traversing the sentences in the text and calculating their similarity with the title sentence through the above formula, the title similarity weight matrix P_{n×1} is obtained;
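A sketch of the title similarity weight in step 3.3, computing the cosine similarity of every sentence vector against the title vector; the vectors are placeholders.

```python
import numpy as np

def title_similarity_weights(X, x_title):
    # cosine similarity of every sentence vector with the title sentence vector
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(x_title)
    return X @ x_title / norms               # P, shape (n,)

X = np.random.rand(5, 768)                   # placeholder sentence vectors
x_title = np.random.rand(768)                # placeholder title sentence vector
P = title_similarity_weights(X, x_title)
print(P.round(3))
```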
Step 3.4: characteristics of sentence position in paragraph:
sentences appearing earlier in a paragraph, such as the first sentence, are given larger weight, and sentences appearing later, such as the last sentence, are given smaller weight, so the weight adjustment coefficient is:
wherein j denotes the position of the sentence in the paragraph and h denotes the total number of sentences in the paragraph; the sentence position weight matrix Q_{n×1} = [w_S(S_1), w_S(S_2), ..., w_S(S_n)]^T is obtained by calculation;
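The exact position formula is not reproduced above, so the sketch below assumes a simple linear decay from the first to the last sentence of a paragraph; it only illustrates the intent of step 3.4 (earlier sentences receive larger weight).

```python
def position_weight(j, h):
    # ASSUMED linear decay: first sentence -> 1.0, last sentence -> 1/h.
    # The patent's actual adjustment coefficient formula is not shown here.
    return (h - j + 1) / h

h = 5                                        # sentences in the paragraph
Q = [position_weight(j, h) for j in range(1, h + 1)]
print(Q)   # [1.0, 0.8, 0.6, 0.4, 0.2]
```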
Step 3.5: characteristics of sentence length:
long sentences tend to contain key information as well as modifying content, because core content needs more words for description and modification, especially in strongly descriptive and logical texts such as legal documents and file descriptions, while short sentences may be summarizing content or purely modifying sentences; therefore a length coefficient is introduced, and the weight of the sentence is calculated reasonably from the characteristic of its length, so the weight adjustment coefficient is:
wherein the contrast coefficient compares the length L_i of the ith sentence with the longest sentence in the text, and C_avg denotes the average contrast coefficient; when the contrast coefficient is less than 0.1, the sentence is not considered as a candidate abstract sentence, i.e. its length adjustment coefficient is 0; otherwise its adjustment coefficient is calculated;
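A sketch of the length adjustment in step 3.5, assuming the contrast coefficient is the ratio of a sentence's length to the longest sentence in the text; the coefficient returned when the ratio is at least 0.1 (here, the ratio itself) is an assumption, since the claim does not reproduce that formula.

```python
def length_weights(sentences):
    """Length adjustment coefficients per step 3.5 (assumed form).

    contrast coefficient = len(sentence) / len(longest sentence);
    below 0.1 the sentence is excluded (coefficient 0)."""
    max_len = max(len(s) for s in sentences)
    weights = []
    for s in sentences:
        contrast = len(s) / max_len
        weights.append(0.0 if contrast < 0.1 else contrast)
    return weights

docs = ["a very long descriptive legal sentence with many many words",
        "short summary", "ok"]
print(length_weights(docs))
```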
Step 3.6: sentence weight fusion:
the inter-sentence weight, the sentence position weight, the title similarity weight and the sentence length weight are fused, so the final calculation formula of the sentence score is:

score_{n×1} = λ_B·B_{n×1} + λ_O·O_{n×1} + λ_P·P_{n×1} + λ_L·L_{n×1}

wherein λ denotes the weight influence factor of each part: the larger the influence of a part on the sentence weight, the larger its coefficient, and vice versa; λ_B + λ_O + λ_P + λ_L = 1, and each λ ranges from 0 to 1; B_{n×1} is the inter-sentence weight, O_{n×1} is the sentence position weight, P_{n×1} is the title similarity weight, and L_{n×1} is the sentence length feature weight.
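A sketch of the weight fusion in step 3.6: the four weight vectors are combined with coefficients summing to 1; the example coefficient values are illustrative, not values fixed by the claim.

```python
import numpy as np

def fuse_scores(B, O, P, L, lambdas=(0.4, 0.2, 0.2, 0.2)):
    """score = λ_B·B + λ_O·O + λ_P·P + λ_L·L, with the λs summing to 1."""
    lb, lo, lp, ll = lambdas
    assert abs(lb + lo + lp + ll - 1.0) < 1e-9
    return (lb * np.asarray(B) + lo * np.asarray(O)
            + lp * np.asarray(P) + ll * np.asarray(L))

n = 5
B = np.random.rand(n)   # inter-sentence (TextRank) weights
O = np.random.rand(n)   # sentence position weights
P = np.random.rand(n)   # title similarity weights
L = np.random.rand(n)   # sentence length weights
print(fuse_scores(B, O, P, L).round(3))
```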
6. The method for generating a long text abstract based on word fine granularity according to claim 1, wherein in step four the improved MMR algorithm filters out sentences with high coupling to obtain a summary sentence set with low correlation and high scores, specifically:
after the multi-feature sentence weights are fused, the weight of each sentence is obtained and can be regarded as the score of the sentence; the sentences are sorted from high to low by score and denoted in order as s_1, s_2, ..., s_n, wherein s_i denotes the sentence ranked ith, with corresponding scores score_1, score_2, ..., score_n;

the MMRS(s_i) value of each candidate abstract sentence is determined according to the improved MMR algorithm, with the calculation formula:

MMRS(s_i) = λ·score_i - (1-λ)·Maxsim(s_i, s_j)

wherein s_i denotes the sentence ranked ith, λ ∈ [0,1], score_i denotes the score of the ith ranked sentence, Maxsim(s_i, s_j) denotes the maximum similarity between the ith sentence and the sentences in the candidate sentence set, λ·score_i represents the importance of the sentence score, and (1-λ)·Maxsim(s_i, s_j) represents the difference between the ith sentence and the candidate sentences; λ is set to 0.75 and the redundancy threshold of MMRS is set to t = 0.75; when MMRS(s_i) ≤ t and the number of candidate abstract sentences is less than the set number of sentences, the sentences meeting these conditions are added to the candidate abstract sentence set; finally, the sentences in the candidate abstract sentence set are arranged in the original sentence order to obtain a concise and complete abstract.
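A sketch of this MMR-based selection, with λ = 0.75 and t = 0.75 as in the claim. The acceptance test MMRS(s_i) ≤ t follows the claim text as written, the pairwise similarity is assumed to be the cosine similarity from step 3.1, and seeding the candidate set with the top-ranked sentence is an assumption.

```python
import numpy as np

def select_summary(scores, X, max_sentences=3, lam=0.75, t=0.75):
    """Pick low-redundancy, high-score sentences by MMRS, then restore
    the original sentence order.

    scores: fused sentence scores; X: sentence vectors (for similarity)."""
    order = np.argsort(scores)[::-1]                 # rank sentences by score, high to low
    chosen = []
    for i in order:
        if len(chosen) >= max_sentences:
            break
        if not chosen:
            chosen.append(i)                         # top-ranked sentence seeds the set
            continue
        sims = [float(X[i] @ X[j] / (np.linalg.norm(X[i]) * np.linalg.norm(X[j])))
                for j in chosen]
        mmrs = lam * scores[i] - (1 - lam) * max(sims)
        if mmrs <= t:                                # acceptance condition as stated in the claim
            chosen.append(i)
    return sorted(chosen)                            # restore original sentence order

scores = np.random.rand(6)
X = np.random.rand(6, 768)
print(select_summary(scores, X))
```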
7. The method for generating a long text abstract based on word fine granularity according to claim 2, wherein the stop words include not only modal particles but also personal pronouns, for example: a, aiya, aiyo, an, you.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211609887.5A CN115906805A (en) | 2022-12-12 | 2022-12-12 | Long text abstract generating method based on word fine granularity |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115906805A true CN115906805A (en) | 2023-04-04 |
Family
ID=86492356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211609887.5A Pending CN115906805A (en) | 2022-12-12 | 2022-12-12 | Long text abstract generating method based on word fine granularity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115906805A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116976290A (en) * | 2023-06-19 | 2023-10-31 | 珠海盈米基金销售有限公司 | Multi-scene information abstract generation method and device based on autoregressive model |
CN116976290B (en) * | 2023-06-19 | 2024-03-19 | 珠海盈米基金销售有限公司 | Multi-scene information abstract generation method and device based on autoregressive model |
CN116501861A (en) * | 2023-06-25 | 2023-07-28 | 知呱呱(天津)大数据技术有限公司 | Long text abstract generation method based on hierarchical BERT model and label migration |
CN116501861B (en) * | 2023-06-25 | 2023-09-22 | 知呱呱(天津)大数据技术有限公司 | Long text abstract generation method based on hierarchical BERT model and label migration |
CN118227781A (en) * | 2024-05-24 | 2024-06-21 | 辽宁人人畅享科技有限公司 | Intelligent educational administration management method for intelligent teaching platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||