CN114169312A

CN114169312A - Two-stage hybrid automatic summarization method for judicial official documents

Info

Publication number: CN114169312A
Application number: CN202111494073.7A
Authority: CN
Inventors: 李波; 欧阳建权; 黄文鹏
Original assignee: Hunan Hailong International Intelligent Technology Co ltd; Xiangtan University
Current assignee: Hunan Hailong International Intelligent Technology Co ltd; Xiangtan University
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2022-03-11

Abstract

A two-stage hybrid automatic summarization method for judicial official documents comprises the following steps: 1) calculating the similarity of key sentences in the referee document, coding and classifying abstract models of the key sentences, and finally extracting abstract key sentences; 2) extracting sentences from the referee document to combine into a key sentence collection; 3) and 2) taking the key sentence collection in the step 2) as the input of a generative model, and generating a text abstract through model coding and decoding. The invention can concentrate and refine long texts in many referee documents to obtain accurate useful information to generate the abstract. The abstract generated by the method provided by the invention has strong readability, strong continuity and high identification degree, and ensures the fidelity between the text and the abstract.

Description

Two-stage hybrid automatic summarization method for judicial official documents

Technical Field

The invention belongs to the technical field of official document data processing, and particularly designs a two-stage hybrid automatic summarization method for judicial official documents.

Background

With the rapid development of the information age, the data volume on the internet is exponentially increased. The text abstract technology abstracts and summarizes text information to extract the gist of articles, and the abstract is used for replacing an original text chapter to participate in indexing, so that the retrieval time can be effectively shortened, redundant information in a retrieval result can be reduced, and a user can efficiently acquire required information from a large amount of data.

Existing intelligent systems such as internet courts are generally used as auxiliary work for legal workers, for example, extracting information from referee documents by techniques such as semantic analysis, or constructing relationships between legal elements by manual processing. The official document is in a standard writing, but the content is exhaustive and lengthy, at present, the abstract is generated by extracting words, phrases and sentences with larger weight from the official document and combining the words, the phrases and the sentences, and the semantic coherence of the generated abstract is poor, so that the law and the official knowledge are not effectively fused, and the generated abstract is inconsistent and inaccurate. Therefore, a method for generating a referee document abstract is needed to ensure the consistency and accuracy of the referee document abstract.

The judicial official documents are the final carriers of judicial activities, and the existing judicial official documents are important bases for assisting criminal decision-making and standardizing the officials' scales. However, the number of official documents which are disclosed at present is as large as 1.2 hundred million, and how to acquire useful information from a plurality of official documents is an urgent problem to be solved. The automatic summarization technology can concentrate and refine long texts, and the short summaries are used for representing the long original texts, so that the automatic summarization technology is an important means for solving the problem of information overload.

The text automatic summarization technology can be divided into an abstract summary and a generated summary according to different summary generation modes. The extraction method is to take the text summarization task as a classification problem and judge whether the sentence is a summarization sentence, and the method keeps the loyalty with the original text, but because the text is directly extracted and spliced from the original text, the readability and the continuity of the generated summary are poor. Compared with the extraction method, the generation method is closer to the process of artificial summarization, a deep learning model is used for learning a large amount of text data, the text is encoded and decoded, and the summary of the extracted content is generated by a rephrasing and replacing method. Instead of extracting sentences directly from the source document, the generative abstract replaces the original text sentences by generating new sentences. Although the generating method can generate a new sentence, the generated sentence is easily contrary to the original consciousness, the fidelity is not guaranteed, and the generating method is easy to lose information for long text. The above disadvantages are more prominent when the judicial official document is used as a text with an ultra-long space and a single extraction or generation method is applied to the judicial official document. Therefore, the present invention provides a two-stage hybrid automatic summarization method combining an extraction method and a generation method, which effectively solves the above problems.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a two-stage hybrid automatic summarization method for a judicial official document. Firstly, forming a key sentence collection by adopting an extraction mode, secondly, taking the sentence collection as the input of a generation mode, and generating a text abstract by model coding and decoding; the text of the whole referee document is concentrated and refined, so that the space of the abstract text is reduced, the fidelity, readability and continuity of the generated abstract and the meaning of the original text are ensured, and the number of characters of the abstract generated manually is reduced, and the reliability is low.

In order to solve the problems, the following technical scheme is provided:

a two-stage hybrid automatic summarization method for judicial official documents comprises the following steps:

1) and calculating the similarity of the key sentences in the referee document, coding and classifying abstract models of the key sentences, and finally extracting the abstract key sentences.

2) Sentences are extracted from the referee documents and combined into a key sentence collection.

3) And 2) taking the key sentence collection in the step 2) as the input of a generative model, and generating a text abstract through model coding and decoding.

Preferably, the calculating the similarity of the key sentences in the step 1) includes:

step 1.1) sentence division is carried out on the referee document, then an artificial standard sentence is found in the referee document, and then a sentence with the highest similarity is found from the original text and is used as a tag data set of the extraction abstract. And calculating the similarity score between the sentences in the artificial abstract and the sentences in the source document through cosine similarity, and selecting the sentences with the highest score in the source document, namely the key sentences.

Preferably, step 1) further comprises:

and step 1.2) vectorizing the text, wherein sentences obtained after similarity calculation and original texts in the referee document are in the same line, and the source text, the label data and the artificial abstract are subjected to word segmentation by adopting jieba. In the word segmentation process, legal nouns are crawled to be used as supplement of a word bank, and then word vectorization is carried out by using a BERT model.

Preferably, the encoding of the abstract model of the key sentence in step 1) includes:

and (5) extracting model coding. At the coding layer, word embedding adopts target word embedding vector, and for a text with n sentences, D ═ S₁，S₂，……，S_nPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text. On the basis of word embedding, input position embedding and segmentation embedding are further arranged.

Preferably, the location is embedded. The position information of the word is coded into a feature vector, and the position vector adopts a scheme in the Attention islalyouneed:

PE_(pos,2i)＝sin(pos/10000^2i/dmodel)。

PE_(pos,2i+1)＝cos(pos/10000^2i/dmodel)。

in the formula, pos represents the position of a word in a sentence, and the numeric area is [0, n ]. i refers to the dimension of the word vector. The input to the BERT is dmodel 128-1024, preferably 256-512.

Preferably, the segments are embedded. For distinguishing two sentences, the different sentences are preceded by a and B labels, respectively, so that the input sentence is represented as (E)_A，E_B，E_A，E_B… …). The word embedding, position embedding and segment embedding representations are stitched as BERT model inputs. Sentence vector X ═ X (X) obtained after pre-training layer via BERT model₁,X₂,……,X_n)＝BERT(sent₁,sent₂,sent₃,……,sent_n) Wherein sent_iThe i-th sentence, X, represented as the original referee document_iCorresponding send_iBERT-coded vector, X_iThe ith vector sequence that needs to be processed.

Preferably, the classifying the abstract model of the key sentence in step 1) includes:

and (4) a classification layer, wherein an expansion residual gated convolutional neural network structure is adopted, namely the expansion residual gated convolutional neural network structure is DRGCNN. The key sentence extraction of the abstract is carried out by stacking a plurality of layers of DRGCNN networks, the number of the layers of the DRGCNN is 6-10, preferably 7-8, and the expansion coefficient of each layer is 1, 2, 4, 8, 1 and 1 respectively. Original input sequence for convolutional network X ═ X (X)₁,X₂,……,X_n) With convolution kernel W, signature C of arbitrary convolution operation_iThe calculation formula of (2) is as follows:

in the formula, W_cRepresenting a one-dimensional volumeThe product-kernel, also called weight coefficient, is a learnable parameter. k denotes a distance from the input identity i. n represents the number of words in the sentence. x is the number of_i±kA word vector representing k words forward or backward from the ith word. The resulting signature graph may represent input X_iThe degree of association with the context.

Preferably, the convolution width is expanded by adding the expansion coefficient α. When α is 1, the dilation convolution operation corresponds to a full convolution operation. Alpha is alpha>1, the dilated convolution can learn more distant context information, feature map C_iThe calculation formula of (2) is as follows:

wherein α is an expansion coefficient, W_cRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k represents the distance from the input identifier i, and the resulting feature map may represent the input X_iThe degree of association with the context. In the feature map C_iOn the basis of the method, a gate control mechanism convolutional neural network is introduced, and an output calculation formula is as follows:

in the formula, convD₁，convD₂Representing a one-dimensional convolution function. X denotes a sentence vector.

Representing point-by-point multiplication. σ is a gating function. convD₁And convD₂Operate for two convolution functions and the weights are not shared.

Preferably, a residual structure is introduced on the basis of a gating mechanism, and the output calculation formula is as follows:

Representing point-by-point multiplication. σ is a gating function.

Preferably, the sentence is further classified into two categories by the full link layer. During training, cross entropy is selected as a loss function and is expressed as:

in the formula (I), the compound is shown in the specification,

the label data representing sample i has a positive class of 1 and a negative class of 0. y represents the probability that sample i is predicted as a positive class. Loss represents a Loss function.

Preferably, the sentence combination key sentence collection is extracted from the referee document as the input of the generation model in the step 2) through coding and classification in the extraction abstract model.

Preferably, the formula model generated in step 3) includes: and combining the key sentence collection as the input of a generation model, and generating a text abstract by performing model coding and decoding on the input. The model coding adopts a Unilm pre-training language model, and the input of the model consists of word embedding, segment embedding and position embedding.

Preferably, the words are embedded as n sentences for a text D ═ S₁，S₂，……，S_nPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text.

Preferably, segment embedding is used to distinguish two sentences, different sentences being preceded by a and B labels, respectively, so that the input sentence is represented as (E)_A，E_B，E_A，E_B,……)。

Preferably, the position in the model input is embedded as a position code of hierarchical decomposition. The position coding vector trained by using BERT is p₁,p₂，p₃,…,p_nConstructing a new set of position codes q by formula₁,q₂，q₃，…，q_mIn the formula, the structural formula is as follows:

in the formula, q_(i-1)×n+jIs position-coded.

The value is a hyperparameter and is 0.4. q is the position code of the (i-1) × n + j position. i is the ith word. j is the jth word. u is a vector, the base vector of the q vector is represented by the trained position p vector

The pos represents the position of the word in the sentence and has the value range of 0, n]. The position code of (i-1) × n + j is hierarchically represented as (i, j) by the formula. The position codes corresponding to i and j are respectively

And

because q is₁＝p₁,q₂＝p₂,……,q_n＝p_nCalculate u_i，

Word embedding, position embedding and segmented embedding are spliced into input of a Unilm model, and a sentence vector X obtained after a pre-training layer of the Unilm model is equal to (X)₁，x₂,…,x_n)＝Unilm(sent₁,sent₂,sent₃,…,sent_n)。

Preferably, the decoding of the generated text summary in the model generated in step 3) includes: the abstract generation learns the characteristics of the document level through a Transfoamer layer of a multi-layer attention mechanism. A copy mechanism is introduced in the decoding process of the model, and the copy mechanism comprises copying and generating. For a multi-layer transform backbone network, a text sequence X with an input length n is given as (X)₁,x₂,…,x_n). Transformer layer output H of first layer⁰The calculation formula of (2) is as follows: h⁰＝Transformer₀(x) In that respect Output H of Transformer passing through l layer^lThe calculation formula is as follows:

H^l＝Transformer₁(H^l-1)。

h is a layer of a Transformer. Final input result H^lThe calculation formula of (2) is as follows:

wherein L represents the number of layers, and is within [1, L ]]. L represents the total number of layers of the Transformer,

denotes x_iThe context of the input.

Preferably, in each transform module, a multi-head attention mechanism is added to aggregate output modules, and parts of output sequences needing attention are marked. Transformer self-attention for layer I A_lThe calculation formula is as follows:

in the formula, A_lIs a self-attention weight. The softmax function is a normalized exponential function. Q, K, V is derived from input X_iAnd linear transformation is carried out. V_lM represents a Mask matrix for Value of the l-th layer. d_kIs the number of columns of the Q, K matrix, i.e. the dimension of the vector, preventing QAnd the inner product of K is too large, so that the regulation effect is realized. T is denoted as transpose. In the formula (I), the compound is shown in the specification,

is a linear projection of the previous layer to Queries, Values and Keys and the parameters of the projection are respectively

The Mask matrix M controls whether the Token is allowed to be added or not, different Mask matrixes M are used for controlling to pay attention to different contexts, and a copy mechanism is introduced to solve the problems of unregistered words and repeated words brought in the generation process.

Preferably, the generating the text summary further comprises: when the decoding time is t, according to the last layer H of the Transformer_tAnd output O of Decoder_jThe calculation method of the correlation weight is that

Wherein W_cTo initialize the matrix, the formula for the attention distribution of the j-th word is calculated at the same time as:

in the formula, N is the number of words in the sentence. Exp is an exponential function with a natural constant e as the base. u is a hyperparameter. t is time. k represents the input sequence identification and has the value range of [1, N]. j is denoted as the ith word. The attention distribution can be interpreted as the attention degree of the ith word in context query, and the information weighted average is carried out on the attention distribution by the following formula to obtain a context expression vector h'_t：

Wherein h'_tAlso called context vectors, indicate that the information of interest is obtained from the distribution of attention.

The output of the last layer of the Transformer at the time t.

The attention distribution for the jth word. i is the value of the sequence identity. And N is the number of words in the sentence.

Preferably, the context vector is compared with the output O of the Decoder_jConnected and generate a vocabulary distribution through two linear layers

The calculation formula is as follows:

wherein V ', V, b ' are learnable parameters h '_tFor context representation of vectors, O_jIs the output of the Decoder and is the output of the Decoder,

is the probability distribution of all words in the vocabulary.

Preferably, a copy gating function g is reintroduced_t∈[0,1]To decide whether the current output chooses to copy from the source document or generate new words from the vocabulary. g_tThe calculation formula of (2) is as follows:

in the formula, W_g、b_gAre all learnable parameters.

The output of the last layer of the Transformer at the time t. O is_jIs the output of Decoder. The formula shows that at the time t, whether the next word is to generate a new word or to directly copy the new word is determined according to the attention weights of the j word and other words.

Preferably, for each document, the words in the vocabulary are combined with all the occurring words of the source document to form a new word stock, i.e., an extended word stock. Therefore, whether the copy is selected from the source document or generated from the vocabulary, the generation is performed from the extended thesaurus, so the final probability calculation formula is:

wherein the content of the first and second substances,

indicating the probability that the current word w was generated in a given vocabulary.

Representing the probability of selecting a copy from the source document based on the attention distribution.

Preferably, if w is an out-of-vocabulary word, then

If w does not appear in the source document, then

The final loss function for expanding the probability distribution of the word bank is as follows:

wherein T is the total time. P_t(w) is the distribution of words andthe attention distribution calculation generates a probability.

Calculating and generating probability P according to vocabulary distribution and attention distribution_tAnd (w), finally, automatically generating the text abstract according to the generation probability and the vocabulary distribution.

In the prior art, an existing intelligent system such as an internet court is generally used as an auxiliary work of legal workers, for example, information is extracted from a referee document through a technology such as semantic analysis, or a relationship between legal elements is constructed through a manual processing mode. The official document is in a standard writing, but the content is exhaustive and lengthy, at present, the abstract is generated by extracting words, phrases and sentences with larger weight from the official document and combining the words, the phrases and the sentences, and the semantic coherence of the generated abstract is poor, so that the law and the official knowledge are not effectively fused, and the generated abstract is inconsistent and inaccurate. Therefore, a method for generating a referee document abstract is needed to ensure the consistency and accuracy of the referee document abstract.

In the invention, firstly, the referee document is divided into sentences, and then the sentences with the highest similarity are found out in the referee document through the manual standard sentences and are used as the data of the extraction model; and calculating the similarity score between the sentence in the artificial abstract and the sentence in the source document (referee document library) by using cosine similarity, and selecting the sentence with the highest score in the source document as a key sentence according to the similarity score.

In the invention, text vectorization is carried out, sentences obtained after similarity calculation and original texts in a referee document are in the same line, and a source text, label data and an artificial abstract are subjected to word segmentation by adopting jieba. In the word segmentation process, legal nouns are crawled to be used as supplements of a word bank, and word vectorization is carried out by using a BERT model. The method comprises the steps of obtaining a primary key sentence after similarity calculation for text vectorization, performing word segmentation on a source text, tag data and an artificial abstract by adopting a jieba to determine a first word (a central word), crawling legal nouns (legal nouns in a mesoscale network) in the word segmentation process to be used as a supplement of a word bank, adding an incidence relation between the first word and a target word into an initial word embedding vector after determining the first word, obtaining a fused word embedding vector capable of reflecting the incidence relation between the target word (the target word) and the first word, and determining the fused word embedding vector as the target word embedding vector of the target word.

In the invention, the input of the extraction mode is based on word embedding, and also input position embedding and segmented embedding are carried out; for referee documents, there are n sentences D ═ S₁，S₂，……，S_nBy inserting [ CLS ] at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]Tag composition input, [ CLS]The token represents the vector of the current sentence, [ SEP ]]The mark represents a clause used for segmenting a sentence in the text; [ CLS]Identification sum [ SEP]The marks are used for segmenting sentences, so that the semantics of each sentence can be better captured, and the accuracy of information extraction is improved; position embedding encodes the position information of the word into a feature vector, and the position vector adopts a scheme in Attentions AllYouNeed:

PE_(pos,2i)＝sin(pos/10000^2i/dmodel)。

PE_(pos,2i+1)＝cos(pos/10000^2i/dmodel)。

in the formula, pos represents the position of a word in a sentence and has a value range of [0, n]. i refers to the dimension of the word vector. The input to the BERT is dmodel 128-1024, preferably 256-512. Segment embedding for distinguishing two sentences, different sentences are respectively marked with A and B before, so that the input sentence is expressed as (E)_A，E_B，E_A，E_B,……)。

In the present invention, word embedding, position embedding, and segment embedding represent concatenations as inputs to the BERT model, and the sentence vector X obtained after the BERT model pre-training layer is equal to (X)₁,X₂,……,X_n)＝BERT(sent₁,sent₂,sent₃,……,sent_n) Wherein sent_iThe i-th sentence, X, represented as the original referee document_iCorresponding send_iBERT-coded vector, X_iThe ith vector sequence that needs to be processed. Coding process is carried out through a BERT preprocessing model (word embedding, position embedding and segmentation are carried out on coded sentences through BERT + global average poolingEmbedding together, outputting to a dense layer, extracting the correlation among the features through nonlinear change of the features extracted in the encoding process, and finally mapping to an output space. ). And embedding each character, splitting Chinese into individual characters for learning, and classifying by a full connection layer and a softmax layer to obtain a classification result.

In the invention, the key sentences are subjected to feature learning through a Dilated Residual Gated Convolutional Neural Network (DRGCNN); compared with the more traditional convolutional neural network, the DRGCNN enhances the ability of a model to learn long-distance context semantic information; a gating mechanism (DGCNN) is introduced to control the flow direction of information, and a residual error mechanism is introduced to solve the problem of gradient disappearance and increase the multi-channel transmission of the information; extracting key abstract sentences by stacking 6-10 layers, preferably 7-8 layers of DRGCNN networks, wherein the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively; before the coding sequence is processed by adopting a self-attention mechanism, a residual error network and a gate control convolution are adopted to process data, and the coding sequence with a text relation is obtained. Original input sequence for convolutional network X ═ X (X)₁,X₂,……,X_n) The convolution kernel is W, and the arbitrary convolution operation obtains a feature map C_iThe calculation formula of (2) is as follows:

in the formula, W_cRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k denotes a distance from the input identity i. n represents the number of words in the sentence. x is the number of_i±kA word vector representing k words forward or backward from the ith word. The resulting signature graph may represent input X_iThe degree of association with the context. Expanding the convolution width by adding an expansion coefficient alpha, and increasing the network depth by maturing a stacked expansion convolution neural network, thereby solving the problem of long-distance dependence of a text sequence and the problem of extracting global effective information; when α is 1, the dilation convolution operation is equivalent to a full convolution operation; when alpha is>1, the dilated convolution can learn more distant context informationAt this time, the characteristic diagram C_iThe calculation formula of (2) is as follows:

wherein α is an expansion coefficient, W_cRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k represents the distance from the input identifier i, and the resulting feature map may represent the input X_iThe degree of association with the context. In the feature map C_iThe convolutional neural network of a gating mechanism (DGCNN) is introduced on the basis, and the output calculation formula is as follows:

in the formula, convD₁And convD₂Representing a one-dimensional convolution function, X representing a vector of sentences,

representing point-by-point multiplication, σ being a gating function, conv₁And conv₂Two convolution operations are performed and weights are not shared; activating one convolution and calculating an outer product between the two convolutions, so that gradient disappearance of the neural network can be relieved; if a Plain network (Plain network) similar to a vgg (visual Geometry group) network is used, there is no residual error, and it is empirically found that as the depth of the network increases, the training errors decrease and then increase (and the increase in the errors is not caused by overfitting, but is difficult to train due to the fact that the network becomes deeper). The deeper the network depth the better, but in practice, the deeper the depth means that it is harder to train with an optimization algorithm for a normal network if there is no residual network. In fact, as the depth of the network increases, training errors increase, which is described as network degradation. The residual error network is helpful for solving the problems of gradient disappearance, gradient explosion and network degradation, so that the good information can be ensured while a deeper network is trained. Therefore, a residual error net is introduced on the basis of a gating mechanismThe output calculation formula of the network structure is as follows:

Representing point-by-point multiplication. σ is a gating function. And judging whether the sentences are classified into two categories of key sentences or not through the full connection layer. During training, cross entropy (cross entropy is information for measuring difference between two probability distributions) is selected as a loss function, and is expressed as follows:

in the formula (I), the compound is shown in the specification,

label representing sample i has a positive class of 1 and a negative class of 0. y represents the probability that sample i is predicted as a positive class. Loss represents a Loss function.

In the invention, the coding and classification in the extraction abstract model extracts a sentence combination key sentence collection from a referee document as the input of a generation model, and the input is coded and decoded by the model to generate a text abstract; the model coding adopts a Unilm pre-training language model, a pre-training data set is constructed through Unilm, a target detection method is used for carrying out target detection on a text, a result is used as key text information, and the key text information is input in a keyword embedding mode, wherein the input of the model consists of word embedding, position embedding and segment embedding; the word embedding and segment embedding modes in the word embedding and segment embedding extraction type model are the same; the position embedding is position coding of hierarchical decomposition, and a position coding vector trained by BERT is used as p₁,p₂,p₃,…,p_nConstructing a new set of position codes q by formula₁,q₂,q₃,…,q_mWherein m is>n; the structural formula is as follows:

in the formula, q_(i-1)×n+jIs position-coded.

And

because q is₁＝p₁,q₂＝p₂,……,q_n＝p_nCalculate u_i，

Word embedding, position embedding and segmented embedding are spliced into input of a Unilm model, and a sentence vector X obtained after a pre-training layer of the Unilm model is equal to (X)₁,x₂,…,x_n)＝Unilm(sent₁,sent₂,sent₃,…,sent_n). Inputting the key paragraphs and key sentence information into an encoder in a Unilm model for encoding to form comprehensive semantic representation; and finally, decoding is carried out through transformations, which is helpful for ensuring that the key information finally generated by the formed sentence generation model has the diversity of language organization modes and the comprehensiveness of knowledge point coverage.

In the invention, the generated text learns the characteristics of the document level through a Transformer layer of a multi-layer attention mechanism, and a copying mechanism is introduced in the decoding process of the model and comprises copying and generating; and adding the position vector and the word vector to obtain a word vector containing word sequence information, wherein all the word vectors containing the word sequence information in each paragraph form a word vector set containing context information and with known context. For a multi-layer transform backbone network, a text sequence X with an input length n is given as (X)₁,x₂,…,x_n) (ii) a Then the transform layer output H of the first layer⁰The calculation formula of (2) is as follows: h⁰＝Transformer₀(x) In that respect Output H of Transformer passing through l layer^lThe calculation formula is as follows:

H^l＝Transformer₁(H^l-1)。

denotes x_iThe context of the input.

in the formula, A_lIs a self-attention weight. The softmax function is a normalized exponential function. Q, K, V is derived from input X_iAnd linear transformation is carried out. V_lIs the first layerValue, M, of (a) represents a Mask matrix. d_kThe column number of Q, K matrix, namely the dimension of vector, prevents Q, K inner product from being too large and plays a role in regulation. T is denoted as transpose. In the formula (I), the compound is shown in the specification,

In the invention, when the decoding time is t, the last layer H according to the Transformer_tAnd output O of Decoder_jThe calculation method of the correlation weight is that

in the formula, N is the number of words in the sentence. Exp is an exponential function with a natural constant e as the base. u is a hyperparameter. t is time. k represents the input sequence identification and has the value range of [1, N]. j is denoted as the ith word. The attention distribution can be interpreted as the attention degree of the ith word in the context query, and the attention is divided by the following formulaCarrying out information weighted average to obtain a context expression vector h'_t：

The output of the last layer of the Transformer at the time t.

The attention distribution for the jth word. i is the value of the sequence identity. And N is the number of words in the sentence. The context vector is compared with the output O of Decoder_jConnected and generate a vocabulary distribution through two linear layers

The calculation formula is as follows:

is the probability distribution of all words in the vocabulary. Reintroducing copy gating function g_t∈[0,1]To decide whether the current output chooses to copy from the source document or generate new words from the vocabulary. g_tThe calculation formula of (2) is as follows:

in the formula, W_g、b_gAre all learnable parameters.

The output of the last layer of the Transformer at the time t. O is_jIs the output of Decoder. The formula shows that at the time t, whether the next word is to generate a new word or to directly copy the new word is determined according to the attention weights of the j word and other words. For each document, the words in the vocabulary are combined with all the appearing words in the source document to form a new word stock, namely an extended word stock. Therefore, whether the copy is selected from the source document or generated from the vocabulary, the generation is performed from the extended thesaurus, so the final probability calculation formula is:

in the formula (I), the compound is shown in the specification,

Representing the probability of selecting a copy from the source document based on the attention distribution. If w is an out-of-vocabulary word, then

If w does not appear in the source document, then

wherein T is the total time. P_t(w) generating probabilities for the vocabulary distribution and the attention distribution calculations. Calculating and generating probability P according to vocabulary distribution and attention distribution_t(w) finally according to the generationThe probability and lexical distribution automatically generate a text summary. The method for generating the abstract by the Transformer can learn the dependency relationship in each text first and then model the relationship among the texts, so that the sequence length of single input is greatly shortened, cross-text association can be conveniently learned, and the abstract generation is quick and accurate.

By adopting the method, the data can be better migrated to various specific fields such as tourism, medicine, news, natural science and the like through fine adjustment.

Compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:

1. in the invention, through combining the extraction mode and the generation mode, the problems of poor readability and poor continuity of the abstract formed by the single extraction mode, and the problems of contradictory meaning and low fidelity of the abstract formed by the single generation mode and the original meaning are solved.

2. In the invention, the abstract of the referee document is formed by adopting two stages, the first stage extracts sentences from the referee document to be combined into key sentences, the second stage takes the extracted key sentences as the input of a generation mode, and the text abstract is formed by model coding and decoding, so that the accuracy and the fidelity degree of the text abstract can be ensured by the two stages.

3. In the invention, key information is extracted from the source document to form key sentences, and the key sentences are coded and combined to form the abstract with the same meaning as the source document through the generation mode, thereby greatly reducing the space of manual abstract text.

Drawings

FIG. 1 is a schematic structural diagram of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.

FIG. 2 is a schematic diagram of an abstraction model structure of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.

FIG. 3 is a schematic diagram of a generative model structure of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.

Detailed Description

The technical solution of the present invention is illustrated below, and the claimed scope of the present invention includes, but is not limited to, the following examples.

Preferably, step 1) further comprises:

and (5) extracting model coding. At the coding layer, word embedding adopts target word embedding vector, and for a text with n sentences, D ═ S₁，S₂，……，S_nPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]Marking compositionAnd (4) inputting. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text. On the basis of word embedding, input position embedding and segmentation embedding are further arranged.

PE_(pos,2i)＝sin(pos/10000^2i/dmodel)。

PE_(pos，2i+1)＝cos(pos/10000^2i/dmodel)。

in the formula, W_cRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k denotes a distance from the input identity i. n represents the number of words in the sentence. x is the number of_i±kA word vector representing k words forward or backward from the ith word. The resulting signature graph may represent input X_iThe degree of association with the context.

alpha is the coefficient of expansion, W_cRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k represents the distance from the input identifier i, and the resulting feature map may represent the input X_iThe degree of association with the context. In the feature map C_iOn the basis of the method, a gate control mechanism convolutional neural network is introduced, and an output calculation formula is as follows:

Representing point-by-point multiplication. σ is a gating function.

in the formula (I), the compound is shown in the specification,

Preferably, the words are embedded as n sentences for a text D ═ S₁，S₂，……，S_nPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The marks representing clausesFor segmenting sentences in the text.

Preferably, the position in the model input is embedded as a position code of hierarchical decomposition. The position coding vector trained by using BERT is p₁,p₂，p₃,…,p_nConstructing a new set of position codes q by formula₁,q₂，q₃，…,q_mIn the formula, the structural formula is as follows:

q_(i-1)×n+jis position-coded.

And

because q is₁＝p₁,q₂＝p₂,……,q_n＝p_nCalculate u_i，

Word embedding, position embedding and segmented embedding are spliced into input of a Unilm model, and a sentence vector X obtained after a pre-training layer of the Unilm model is equal to (X)₁,x₂,…,x_n)＝Unilm(sent₁,sent₂,sent₃,…,sent_n)。

H^l＝Transformer₁(H^l-1)。

denotes x_iThe context of the input.

in the formula, A_lTo self-attention weight. The softmax function is a normalized exponential function. Q, K, V is derived from input X_iAnd linear transformation is carried out. V_lM represents a Mask matrix for Value of the l-th layer. d_kThe column number of Q, K matrix, namely the dimension of vector, prevents Q, K inner product from being too large and plays a role in regulation. T is denoted as transpose. In the formula (I), the compound is shown in the specification,

in the formula, N is the number of words in the sentence. Exp is an exponential function with a natural constant e as the base. u is a hyperparameter. t is time. k represents the sequence identifier of the input and has the value range of[1,N]. j is denoted as the ith word. The attention distribution can be interpreted as the attention degree of the ith word in context query, and the information weighted average is carried out on the attention distribution by the following formula to obtain a context expression vector h'_t：

h'_tAlso called context vectors, indicate that the information of interest is obtained from the distribution of attention.

The output of the last layer of the Transformer at the time t.

The calculation formula is as follows:

is the probability distribution of all words in the vocabulary.

in the formula, W_g、b_gAre all learnable parameters.

in the formula (I), the compound is shown in the specification,

Preferably, if w is an out-of-vocabulary word, then

If w does not appear in the source document, then

wherein T is the total time. P_t(w) generating probabilities for the vocabulary distribution and the attention distribution calculations.

Example 1

As shown in fig. 1, a two-stage hybrid automatic summarization method for judicial official documents includes the following steps:

Example 2

Repeating the embodiment 1, as shown in fig. 2, in the step 1.1), sentence division is performed on the referee document, then an artificial standard sentence is found in the referee document, and then a sentence with the highest similarity is found from the original text and is used as a tag data set of the abstract. And calculating the similarity score between the sentences in the artificial abstract and the sentences in the source document through cosine similarity, and selecting the sentences with the highest score in the source document, namely the key sentences.

Example 3

The embodiment 2 is repeated, except that the sentences with the highest scores are selected from the source documents in the step 1) for vectorization. The sentences obtained after similarity calculation and the original texts in the referee documents are in the same line, and the source texts, the label data and the artificial abstracts are subjected to word segmentation by adopting jieba. In the word segmentation process, legal nouns are crawled to be used as supplement of a word bank, and then word vectorization is carried out by using a BERT model.

Example 4

Repeating the embodiment 3, namely extracting and coding the key sentence in the step 1), wherein in a coding layer, the word embedding adopts a target word embedding vector, and on the basis of the word embedding, the input position embedding and the segmented embedding are carried out.

In the sentence coding layer, firstly, the sentence is divided into words to obtain word-level information for word embedding representation, and the word-level information is converted into a sentence vector as input:

for a text with n sentences, D ═ S₁，S₂，……，S_nPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text.

Position embedding: the position information of the word is encoded into a feature vector, and the position vector adopts the scheme in AttentinisAllyOutneed:

PE_(pos,2i)＝sin(pos/10000^2i/dmodel)。

PE_(pos,2i+1)＝cos(pos/10000^2i/dmodel)。

in the formula, pos represents the position of a word in a sentence, and the numeric area is [0, n ]. i refers to the dimension of the word vector. The input to the dmodel for BERT is 256.

Segment embedding for distinguishing two sentences, different sentences are respectively marked with A and B before, so that the input sentence is expressed as (E)_A，E_B，E_A，E_B… …). The word embedding, position embedding and segment embedding representations are stitched as BERT model inputs. Sentence vector X ═ X (X) obtained after pre-training layer via BERT model₁,X₂,……,X_n)＝BERT(sent₁,sent₂,sent₃,……,sent_n) Wherein sent_iThe i-th sentence, X, represented as the original referee document_iCorresponding send_iBERT-coded vector, X_iThe ith vector sequence that needs to be processed.

Each sentence vector is represented by word embedding, position embedding and segment embedding, so that the text vectorization work is completed.

Example 5

Example 4 was repeated, with the dmodel having a BERT input of 512.

Example 6

Example 5 was repeated, with the dmodel having a BERT input of 1024.

Example 7

The embodiment 6 is repeated, except that the words, positions and segments are embedded together by the coded sentences in the step 1) through bert + global average pooling, the words, the positions and the segments are output to a dense layer, the features extracted in the coding process are subjected to nonlinear change in dense, the association among the features is extracted, and finally the features are mapped to an output space.

Example 8

Example 7 is repeated, except that step 1) further comprises a classification layer, and a dilation residual gating convolutional neural network structure, namely DRGCNN, is adopted after the first dense layer is passed. And (3) abstracting key sentences by stacking the number of DRGCNN network layers, wherein the number of DRGCNN layers is 6, and the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively. Original input sequence for convolutional network X ═ X (X)₁,X₂,……,X_n) The convolution kernel is W, and the arbitrary convolution operation obtains a feature map C_iThe calculation formula of (2) is as follows:

in the formula, Wc represents a one-dimensional convolution kernel, also called a weight coefficient, and is a learnable parameter. k denotes a distance from the input identity i. n represents the number of words in the sentence. x is the number of_i±kA word vector representing k words forward or backward from the ith word. The resulting feature map may represent the degree of association between the input Xi and the context.

Example 9

Example 8 is repeated, except that the number of layers of the DRGCNN network is stacked in step 1) to extract the key abstract sentence, the number of layers of the DRGCNN is 8, and the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively.

Example 10

The embodiment 9 is repeated, the number of layers of the DRGCNN network is stacked to extract key sentences of the summary, the number of layers of the DRGCNN is 10, and the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively.

Example 11

Example 10 was repeated except that in step 1) the convolution width was expanded by adding the expansion coefficient α. And the network depth is increased through the maturity of the stacked expansion convolution neural network, the problem of long-distance dependence of the text sequence is solved, and global effective information is extracted. When α is 1, the dilation convolution operation corresponds to a full convolution operation. Alpha is alpha>1, the dilated convolution can learn more distant context information, feature map C_iThe calculation formula of (2) is as follows:

Introducing a residual error structure on the basis of a gating mechanism, wherein an output calculation formula is as follows:

Representing point-by-point multiplication. σ is a gating function.

And judging whether the sentences are classified into two categories of key sentences or not through the full connection layer. During training, cross entropy is selected as a loss function and is expressed as:

in the formula (I), the compound is shown in the specification,

Example 12

The embodiment 11 is repeated, and as shown in fig. 3, the sentence combination key sentence collection extracted from the referee document by the encoding and classification in the abstract model in the step 2) is used as the input of the generation model.

Example 13

The embodiment 12 is repeated, except that the key sentence collection is combined as the input of generating the model in the step 3), and the text abstract is generated by model coding and decoding the input. The model coding adopts a Unilm pre-training language model, and the input of the model consists of word embedding, segment embedding and position embedding. The words are embedded as text D ═ S for a piece of n sentences₁，S₂，……，S_nPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text. Segment embedding is used to distinguish two sentences, different sentences are preceded by a and B labels respectively,so that the input sentence is represented as (E)_A，E_B，E_A，E_B… …). The position in the model input is embedded into the position code of the hierarchical decomposition. The position coding vector trained by using BERT is p₁,p₂，p₃,…,p_nConstructing a new set of position codes q by formula₁，q₂,q₃，…，q_mIn the formula, the structural formula is as follows:

q_(i-1)×n+jis position-coded.

And

because q is₁＝p₁,q₂＝p₂,……,q_n＝p_nCalculate u_i，

Example 14

The embodiment 13 is repeated, except that the step 3) of generating the text summary by decoding the generated model comprises the following steps: the abstract generation learns the characteristics of the document level through a Transfoamer layer of a multi-layer attention mechanism. A copy mechanism is introduced in the decoding process of the model, and the copy mechanism comprises copying and generating. For a multi-layer transform backbone network, a text sequence X with an input length n is given as (X)₁,x₂,…,x_n). Transformer layer output H of first layer⁰The calculation formula of (2) is as follows: h⁰＝Transformer₀(x) In that respect Output H of Transformer passing through l layer^lThe calculation formula is as follows:

H^l＝Transformer₁(H^l-1)。

denotes x_iThe context of the input. In each Transformer module, a multi-head attention mechanism is added to aggregate output modules, and parts of output sequences needing attention are marked. Transformer self-attention for layer I A_lThe calculation formula is as follows:

in the formula, A_lIs a self-attention weight. The softmax function is a normalized exponential function. Q, K, V is derived from input X_iAnd linear transformation is carried out. V_lM represents a Mask matrix for Value of the l-th layer. d_kIs the column number of the Q, K matrix, i.e.The vector dimension prevents Q, K from being excessively large, and plays a role in regulation. T is denoted as transpose. In the formula (I), the compound is shown in the specification,

Example 15

Example 14 is repeated except that generating the text excerpt in step 3) further includes generating the text excerpt according to the last layer H of the Transformer at the decoding time t_tAnd output O of Decoder_jThe calculation method of the correlation weight is that

in the formula, N is the number of words in the sentence. Exp is an exponential function with a natural constant e as the base. u is a hyperparameter. t is time. k represents the input sequence identification and has the value range of [1, N]. j is denoted as the ith word. The attention distribution can be interpreted as the degree of attention of the ith word in the context query, as in the following formula pairCarrying out information weighted average on the attention distribution to obtain a context expression vector h'_t：

The output of the last layer of the Transformer at the time t.

The context vector is compared with the output O of Decoder_jConnected and generate a vocabulary distribution through two linear layers

The calculation formula is as follows:

is the probability distribution of all words in the vocabulary.

Reintroducing copy gating function g_t∈[0,1]To decide whether the current output chooses to copy from the source document or generate new words from the vocabulary. g_tThe calculation formula of (2) is as follows:

in the formula, W_g、b_gAre all learnable parameters.

in the formula (I), the compound is shown in the specification,

If w does not appear in the source document, then

Example 16

Example 15 is repeated, as shown in fig. 3, for example, input (five thousand dollars of pay to original notice within seven days of notice), prediction sentence is (pay to original notice of pay) and [ CLS ] is set before input sentence]As the current sentence vector, input [ SEP ] at the end of the input sentence]The tokens represent clauses used to segment sentences in the text. Sentence vector X ═ X (X) obtained after pre-training layer of Unilm model₁,x₂,…,x_n)＝Unilm(sent₁,sent₂,sent₃,…,sent_n) And inputting the data into a Transfoamer layer, and calculating a formula by using a final probability as follows:

derived (reported to original pay-Per-Payment [ SEP)])。

Claims

1. A two-stage hybrid automatic summarization method for judicial official documents is characterized by comprising the following steps:

1) calculating the similarity of key sentences in the referee document, coding and classifying abstract models of the key sentences, and finally extracting abstract key sentences;

2) extracting sentences from the referee document to combine into a key sentence collection;

2. The two-stage hybrid automatic summarization method for judicial official documents according to claim 1, wherein the calculating the similarity of key sentences in step 1) comprises:

step 1.1) sentence division is carried out on a referee document, then an artificial standard sentence is found in the referee document, and then a sentence with the highest similarity is found from an original text and is used as a tag data set of an abstract; and calculating the similarity score between the sentences in the artificial abstract and the sentences in the source document through cosine similarity, and selecting the sentences with the highest score in the source document, namely the key sentences.

3. The two-stage hybrid automatic summarization method for judicial official documents according to claim 1 or 2, wherein step 1) further comprises:

step 1.2) vectorizing the text, wherein sentences obtained after similarity calculation and original texts in a referee document are in the same line, and a source text, label data and an artificial abstract are subjected to word segmentation by adopting jieba; in the word segmentation process, legal nouns are crawled to be used as supplement of a word bank, and then word vectorization is carried out by using a BERT model.

4. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-3, wherein the encoding of the summarization model of key sentences in step 1) comprises:

extracting model coding; at the coding layer, word embedding adopts target word embedding vector, and for a text with n sentences, D ═ S₁，S₂，……，S_nPreprocessing by two special marks; first, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]Marking composition input; [ CLS]The token represents the vector of the current sentence, [ SEP ]]The mark represents a clause used for segmenting a sentence in the text; on the basis of word embedding, input position embedding and segmented embedding are also arranged;

the position is embedded; the position information of the word is coded into a feature vector, and the position vector adopts a scheme in the Attention is All You Need:

PE_(pos,2i)＝sin(pos/10000^2i/dmodel)；

PE_(pos,2i+1)＝cos(pos/10000^2i/dmodel)；

in the formula, pos represents the position of a word in a sentence, and the numeric area is [0, n ]; i refers to the dimension of the word vector; the input to the dmodel for BERT is 128-;

the segment embedding; for distinguishing two sentences, the different sentences are preceded by a and B labels, respectively, so that the input sentence is represented as (E)_A，E_B，E_A，E_B… …); concatenating word embedding, position embedding and segment embedding representations into BERT model input; sentence vector X ═ X (X) obtained after pre-training layer via BERT model₁,X₂,……,X_n)＝BERT(sent₁,sent₂,sent₃,……,sent_n) Wherein sent_iThe i-th sentence, X, represented as the original referee document_iCorresponding send_iBERT-coded vector, X_iThe ith vector sequence that needs to be processed.

5. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-4, wherein classifying the summarization models of key sentences in step 1) comprises:

a classification layer adopts an expansion residual gated convolutional neural network structure, namely the expansion residual gated convolutional neural network structure is DRGCNN; extracting key abstract sentences by stacking a plurality of layers of DRGCNN networks, wherein the number of the layers of the DRGCNN is 6-10, preferably 7-8, and the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively; original input sequence for convolutional network X ═ X (X)₁,X₂,……,X_n) With convolution kernel W, signature C of arbitrary convolution operation_iThe calculation formula of (2) is as follows:

in the formula, W_cA one-dimensional convolution kernel, also called a weight coefficient, is represented and is a learnable parameter; k represents the distance from the input identifier i; n represents the number of words in the sentence; x is the number of_i±kA word vector representing k words forward or backward from the ith word; the resulting signature graph may represent input X_iDegree of association with context;

expanding the convolution width by adding an expansion coefficient alpha; when α is 1, the dilation convolution operation is equivalent to a full convolution operation; alpha is alpha>1, the dilated convolution can learn more distant context information, feature map C_iThe calculation formula of (2) is as follows:

wherein α is an expansion coefficient, W_cA one-dimensional convolution kernel, also called a weight coefficient, is represented and is a learnable parameter; k represents the distance from the input identifier i, and the resulting feature map may represent the input X_iDegree of association with context; in the feature map C_iOn the basis of the method, a gate control mechanism convolutional neural network is introduced, and an output calculation formula is as follows:

in the formula, convD₁，convD₂Representing a one-dimensional convolution function; x represents a sentence vector;

represents point-by-point multiplication; sigma is a gating function; convD₁And convD₂Operating for two convolution functions and not sharing weight;

represents point-by-point multiplication; sigma is a gating function;

judging whether the sentences are classified into two categories of key sentences or not through the full-connection layer; during training, cross entropy is selected as a loss function and is expressed as:

in the formula (I), the compound is shown in the specification,

label data representing a sample i, the positive class being 1 and the negative class being 0; y represents the probability that sample i is predicted as a positive class; loss represents a Loss function.

6. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 2-4 wherein in step 2) the sentence combination key sentence collections are extracted from the official documents as input to the generation model by encoding and classification in the abstract model.

7. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-6, wherein the generating of the formula model in step 3) comprises: combining the key sentence collection as the input of a generation model, and generating a text abstract by model coding and decoding the input; the model coding adopts a Unilm pre-training language model, and the input of the model consists of word embedding, segment embedding and position embedding;

the words are embedded as text D ═ S for a piece of n sentences₁，S₂，……，S_nPreprocessing by two special marks; first, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]Marking composition input; [ CLS]The token represents the vector of the current sentence, [ SEP ]]The mark represents a clause used for segmenting a sentence in the text;

segment embedding is used to distinguish two sentences, the different sentences being preceded by a and B labels, respectively, so that the input sentence is represented as (E)_A，E_B，E_A，E_B,……)。

8. The two-stage hybrid automatic summarization method for judicial official documents according to claim 7, wherein: the position in the model input is embedded into a position code of hierarchical decomposition; the position coding vector trained by using BERT is p₁，p₂，p₃，…，p_nConstructing a new set of position codes q by formula₁，q₂，q₃，…，q_mIn the formula, the structural formula is as follows:

in the formula, q_(i-1)×n+jCoding the position;

the value is 0.4 for the hyper-parameter; q is the position code of the (i-1) x n + j position; i is the ith word; j is the jth word; u is a vector, the base vector of the q vector is represented by the trained position p vector

The pos represents the position of the word in the sentence and has the value range of 0, n](ii) a Hierarchically expressing the position code of (i-1) × n + j as (i, j) by a formula; the position codes corresponding to i and j are respectively

And

because q is₁＝p₁,q₂＝p₂,……,q_n＝p_nCalculate u_i，

9. The two-stage hybrid automatic summarization method for judicial official documents according to claim 7 or 8, wherein the decoding of the generated text summary in the generation model in step 3) comprises: the abstract generation learns the characteristics of the document level through a Transfoamer layer of a multi-layer attention mechanism; introducing a copying mechanism in the decoding process of the model, wherein the copying mechanism comprises copying and generating; for a multi-layer transform backbone network, a text sequence X with an input length n is given as (X)₁,x₂,…,x_n) (ii) a Transformer layer output H of first layer⁰The calculation formula of (2) is as follows: h⁰＝Transformer₀(x) (ii) a Output H of Transformer passing through l layer^lThe calculation formula is as follows:

H^l＝Transformer₁(H^l-1)；

h is a layer of a Transformer; final input result H^lThe calculation formula of (2) is as follows:

wherein L represents the number of layers, and is within [1, L ]](ii) a L represents the total number of layers of the Transformer,

denotes x_iThe context of the input;

adding a multi-head attention mechanism to each Transformer module to aggregate output modules, and marking the parts of the output sequences needing attention; transformer self-attention for layer I A_lThe calculation formula is as follows:

in the formula, A_lIs a self-attention weight; the softmax function is a normalized exponential function; q, K, V is derived from input X_iLinear transformation is obtained; v_lValue of the l-th layer, M represents a Mask matrix; d_kThe column number of the Q, K matrix, namely the dimension of the vector, prevents Q, K inner product from being too large and plays a role in regulation; t is denoted as transpose; in the formula (I), the compound is shown in the specification,

H_l-1∈

10. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 7-9, wherein generating a text summary further comprises: when the decoding time is t, according to the last layer H of the Transformer_tAnd output O of Decoder_jThe calculation method of the correlation weight is that

in the formula, N is the number of words in the sentence; exp is an exponential function with a natural constant e as the base; u is a hyperparameter; t is time; k represents an input sequence identifier, and the value range is [1, N ]; j is denoted as the ith word;

the attention distribution can be interpreted as the attention degree of the ith word in context query, and the information weighted average is carried out on the attention distribution by the following formula to obtain a context expression vector h'_t：

Wherein h'_tAlso called context vectors, which represent the acquisition of information of interest according to the distribution of attention;

the output of the last layer of the Transformer at the time t;

attention distribution for the jth word; i is the value of the sequence identifier; n is the number of words in the sentence;

The calculation formula is as follows:

wherein V ', V, b ' are learnable parameters h '_tFor context representation of vectors, O_jIs DThe output of the ecoder is then output,

is the probability distribution of all words in the vocabulary;

reintroducing copy gating function g_t∈[0,1]To decide whether the current output chooses to copy from the source document or generate new words from the vocabulary; g_tThe calculation formula of (2) is as follows:

in the formula, W_g、b_gAre all learnable parameters;

the output of the last layer of the Transformer at the time t; o is_jIs the output of Decoder; the formula shows that at the time t, whether the next word is a new word or is directly copied is determined according to the attention weight of the j word and other words;

for each document, combining the words in the vocabulary with all the appearing words of the source document to form a new word stock, namely an extended word stock; therefore, whether the copy is selected from the source document or generated from the vocabulary, the generation is performed from the extended thesaurus, so the final probability calculation formula is:

in the formula (I), the compound is shown in the specification,

representing the probability that the current word w is generated in a given vocabulary;

representing a probability of selecting a copy from the source document based on the attention distribution;

if w is an out-of-vocabulary word, then

If w does not appear in the source document, then

wherein T is the total time; p_t(w) generating probabilities for the lexical distribution and the attention distribution calculations;