CN114169312A - Two-stage hybrid automatic summarization method for judicial official documents - Google Patents

Two-stage hybrid automatic summarization method for judicial official documents

Info

Publication number: CN114169312A
Application number: CN202111494073.7A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李波, 欧阳建权, 黄文鹏
Assignees: Hunan Hailong International Intelligent Technology Co., Ltd.; Xiangtan University
Filed by Hunan Hailong International Intelligent Technology Co., Ltd. and Xiangtan University; priority to CN202111494073.7A

Classifications

    • G06F40/211 Handling natural language data; Natural language analysis; Parsing; Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F16/35 Information retrieval of unstructured textual data; Clustering; Classification
    • G06F40/126 Handling natural language data; Text processing; Use of codes for handling textual entities; Character encoding
    • G06N3/045 Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
    • G06N3/047 Computing arrangements based on biological models; Neural networks; Architecture; Probabilistic or stochastic networks
    • G06N3/08 Computing arrangements based on biological models; Neural networks; Learning methods

Abstract

A two-stage hybrid automatic summarization method for judicial official documents comprises the following steps: 1) compute sentence similarity within the judgment document, encode and classify the sentences with an extractive summarization model, and extract the key summary sentences; 2) combine the sentences extracted from the judgment document into a key-sentence set; 3) use the key-sentence set from step 2) as the input of a generative model, which encodes and decodes it to produce the text summary. The invention condenses and refines the long text of judgment documents into accurate, useful information from which the summary is generated. The summaries produced by the method are readable, coherent and distinctive, and remain faithful to the original text.

Description

Two-stage hybrid automatic summarization method for judicial official documents
Technical Field
The invention belongs to the technical field of official document data processing, and particularly relates to a two-stage hybrid automatic summarization method for judicial official documents.
Background
With the rapid development of the information age, the volume of data on the internet grows exponentially. Text summarization condenses and abstracts textual information to extract the gist of an article; using the summary instead of the full text for indexing shortens retrieval time, reduces redundant information in retrieval results, and lets users obtain the information they need from large amounts of data efficiently.
Existing intelligent systems such as internet courts generally serve as auxiliary tools for legal workers, for example extracting information from judgment documents with techniques such as semantic analysis, or building relationships between legal elements through manual processing. Judgment documents follow a standard form of writing, but their content is exhaustive and lengthy. At present, summaries are generated by extracting and concatenating the words, phrases and sentences with the largest weights, which yields poor semantic coherence; legal and documentary knowledge is not effectively integrated, so the generated summaries are incoherent and inaccurate. A method for generating judgment-document summaries that guarantees their coherence and accuracy is therefore needed.
Judicial documents are the final carriers of judicial activity, and existing judgment documents are an important basis for assisting criminal decision-making and standardizing sentencing scales. However, about 120 million judgment documents have already been published, and obtaining useful information from such a large collection is an urgent problem. Automatic summarization condenses and refines long texts, representing a long original text with a short summary, and is therefore an important means of coping with information overload.
Automatic text summarization methods fall into extractive and generative (abstractive) approaches according to how the summary is produced. Extractive methods treat summarization as a classification problem that decides whether each sentence is a summary sentence; they stay faithful to the original text, but because sentences are extracted and spliced directly from the source, the resulting summaries read poorly and lack coherence. Generative methods are closer to how humans summarize: a deep learning model is trained on large amounts of text, encodes and decodes the document, and produces the summary by rephrasing and substitution, generating new sentences rather than copying them from the source. Although generative methods can produce new sentences, the generated sentences easily contradict the original meaning, faithfulness is not guaranteed, and information is easily lost on long texts. These drawbacks are even more pronounced when a single extractive or generative method is applied to judicial judgment documents, which are extremely long. The present invention therefore provides a two-stage hybrid automatic summarization method that combines an extractive stage with a generative stage and effectively addresses these problems.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a two-stage hybrid automatic summarization method for judicial judgment documents. First, a key-sentence set is formed in an extractive manner; second, this sentence set is used as the input of a generative model, which encodes and decodes it to produce the text summary. The full text of the judgment document is condensed and refined, which shortens the summary, ensures that the generated summary is faithful to the original meaning as well as readable and coherent, and avoids the large character counts and low reliability of manually written summaries.
In order to solve the problems, the following technical scheme is provided:
a two-stage hybrid automatic summarization method for judicial official documents comprises the following steps:
1) Compute the sentence similarity within the judgment document, encode and classify the sentences with an extractive summarization model, and extract the key summary sentences.
2) Combine the sentences extracted from the judgment document into a key-sentence set.
3) Use the key-sentence set from step 2) as the input of a generative model, and generate the text summary through model encoding and decoding.
Preferably, calculating the sentence similarity in step 1) includes:
Step 1.1) The judgment document is split into sentences. For each sentence of the manually written reference summary, the most similar sentence is found in the original text and used to build the label data set of the extractive summarization model: the cosine similarity between each reference-summary sentence and each source-document sentence is computed, and the highest-scoring source sentence is selected as a key sentence.
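As an illustration of step 1.1), the following sketch builds the extractive labels by greedy cosine matching. The `encode` function is a stand-in for whatever sentence encoder is used (the method itself uses BERT-based vectors), so the names and shapes here are assumptions rather than the patented implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def label_key_sentences(summary_sents, source_sents, encode):
    """For each sentence of the human-written reference summary, mark the most
    similar source sentence as a positive (key-sentence) label."""
    src_vecs = [encode(s) for s in source_sents]
    labels = [0] * len(source_sents)
    for ref in summary_sents:
        ref_vec = encode(ref)
        scores = [cosine(ref_vec, v) for v in src_vecs]
        labels[int(np.argmax(scores))] = 1      # highest-scoring source sentence = key sentence
    return labels
```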
Preferably, step 1) further comprises:
Step 1.2) Text vectorization: the sentences obtained from the similarity calculation are aligned with the original text of the judgment document, and the source text, the label data and the reference summary are segmented into words with jieba. During word segmentation, crawled legal terms are added as a supplement to the lexicon, and the words are then vectorized with a BERT model.
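A minimal sketch of step 1.2), assuming a `legal_terms.txt` user dictionary of crawled legal nouns and the `bert-base-chinese` checkpoint; both names are illustrative and not specified by the patent.

```python
import jieba
import torch
from transformers import BertModel, BertTokenizer

jieba.load_userdict("legal_terms.txt")          # crawled legal nouns supplement the lexicon

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def vectorize(sentence: str) -> torch.Tensor:
    words = jieba.lcut(sentence)                # jieba word segmentation
    inputs = tokenizer(" ".join(words), return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.mean(dim=1)    # one vector per sentence
```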
Preferably, the encoding of key sentences by the extractive summarization model in step 1) includes:
Extractive model encoding. At the encoding layer, word embedding uses the target-word embedding vectors. A text with n sentences, D = {S_1, S_2, ..., S_n}, is preprocessed with two special tokens: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundaries used to segment sentences in the text. On top of word embedding, position embedding and segment embedding are also provided as input.
Preferably, position embedding. The position information of each word is encoded into a feature vector; the position vectors follow the scheme of "Attention Is All You Need":
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position of the word in the sentence, with values in [0, n]; i is the dimension index of the word vector; and d_model, the input dimension of BERT, is 128-1024, preferably 256-512.
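For reference, a small sketch of the sinusoidal position encoding above (assuming an even d_model):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position encoding from "Attention Is All You Need"."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]              # word position
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                    # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                    # PE(pos, 2i+1)
    return pe
```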
Preferably, segment embedding. To distinguish adjacent sentences, alternating sentence labels A and B are assigned, so the input sequence is represented as (E_A, E_B, E_A, E_B, ...). The word-embedding, position-embedding and segment-embedding representations are concatenated as the input of the BERT model. The sentence vectors obtained after the BERT pre-training layer are X = (X_1, X_2, ..., X_n) = BERT(sent_1, sent_2, sent_3, ..., sent_n), where sent_i is the i-th sentence of the original judgment document and X_i is the BERT-encoded vector corresponding to sent_i, i.e. the i-th vector in the sequence to be processed.
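A sketch of how such a [CLS]/[SEP]-wrapped input with alternating segment ids could be assembled; the tokenizer checkpoint is an assumption.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

def build_inputs(sentences):
    """Wrap every sentence in [CLS] ... [SEP] and assign alternating A/B segments."""
    token_ids, segment_ids, cls_positions = [], [], []
    for idx, sent in enumerate(sentences):
        ids = tokenizer.encode(sent, add_special_tokens=False)
        cls_positions.append(len(token_ids))        # position of this sentence's [CLS] vector
        piece = [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
        token_ids.extend(piece)
        segment_ids.extend([idx % 2] * len(piece))  # alternating A/B segment labels
    return token_ids, segment_ids, cls_positions
```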
Preferably, the classification of key sentences in the extractive summarization model in step 1) includes:
Classification layer: a dilated residual gated convolutional neural network (DRGCNN) structure is used. Key summary sentences are extracted by stacking several DRGCNN layers; the number of layers is 6-10, preferably 7-8, and the dilation coefficients of the layers are 1, 2, 4, 8, 1 and 1 respectively. For an original input sequence X = (X_1, X_2, ..., X_n) and a convolution kernel W, the feature map C_i of an ordinary convolution is computed as:
C_i = Σ_{m=-k}^{k} W_c[m] · x_{i+m}
where W_c is the one-dimensional convolution kernel, also called the weight coefficients, a learnable parameter; k is the distance from input position i; n is the number of words in the sentence; and x_{i±k} is the word vector k words before or after the i-th word. The resulting feature map represents how strongly input X_i is associated with its context.
Preferably, the receptive field of the convolution is widened by adding a dilation coefficient α. When α = 1, the dilated convolution is equivalent to an ordinary convolution; when α > 1, the dilated convolution can learn more distant context. The feature map C_i is then computed as:
C_i = Σ_{m=-k}^{k} W_c[m] · x_{i+α·m}
where α is the dilation coefficient, W_c is the one-dimensional convolution kernel (a learnable weight coefficient), and k is the distance from input position i; the resulting feature map again represents how strongly input X_i is associated with its context. On top of the feature map C_i, a gated convolutional neural network is introduced, whose output is computed as:
Y = convD1(X) ⊗ σ(convD2(X))
where convD1 and convD2 are one-dimensional convolution functions, X is the sentence-vector sequence, ⊗ denotes point-wise multiplication, σ is the gating function, and convD1 and convD2 are two convolutions whose weights are not shared.
Preferably, a residual structure is introduced on top of the gating mechanism, and the output is computed as:
Y = X ⊗ (1 - σ(convD2(X))) + convD1(X) ⊗ σ(convD2(X))
where convD1 and convD2 are one-dimensional convolution functions, X is the sentence-vector sequence, ⊗ denotes point-wise multiplication, and σ is the gating function.
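A minimal PyTorch sketch of one dilated gated convolution block with a residual connection, under the assumption that the residual enters through the gate as in the formula above; the kernel size and hidden width are illustrative.

```python
import torch
import torch.nn as nn

class DRGCNNBlock(nn.Module):
    """One dilated residual gated convolution block (weights of the two convolutions are not shared)."""
    def __init__(self, dim: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2                 # keep sequence length unchanged
        self.conv_value = nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation)
        self.conv_gate = nn.Conv1d(dim, dim, kernel_size, padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (batch, dim, seq_len)
        gate = torch.sigmoid(self.conv_gate(x))
        return x * (1 - gate) + self.conv_value(x) * gate       # gated residual output

# Stacked blocks with the dilation schedule 1, 2, 4, 8, 1, 1 described above.
encoder = nn.Sequential(*[DRGCNNBlock(256, d) for d in (1, 2, 4, 8, 1, 1)])
```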
Preferably, the sentences are then classified into two classes (key sentence or not) by a fully connected layer. During training, cross entropy is used as the loss function:
Loss = -(1/N) Σ_i [ŷ_i · log(y_i) + (1 - ŷ_i) · log(1 - y_i)]
where ŷ_i is the label of sample i (1 for the positive class, 0 for the negative class), y_i is the predicted probability that sample i belongs to the positive class, and Loss is the loss value.
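A small sketch of the sentence-level binary classification head and its cross-entropy loss; the dimension and batch values are dummies.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(256, 1)                   # fully connected layer (dimension assumed 256)
criterion = nn.BCEWithLogitsLoss()               # binary cross-entropy loss

sent_vecs = torch.randn(8, 256)                  # 8 sentence vectors (dummy batch)
labels = torch.randint(0, 2, (8,)).float()       # 1 = key sentence, 0 = not a key sentence

loss = criterion(classifier(sent_vecs).squeeze(-1), labels)
```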
Preferably, in step 2) the sentences extracted from the judgment document by the encoding and classification of the extractive summarization model are combined into the key-sentence set, which serves as the input of the generative model.
Preferably, the generative model in step 3) includes: the key-sentence set is used as the input of the generative model, and the text summary is produced by encoding and decoding this input. The encoder uses a UniLM pre-trained language model, whose input consists of word embedding, segment embedding and position embedding.
Preferably, word embedding: a text with n sentences, D = {S_1, S_2, ..., S_n}, is preprocessed with two special tokens: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the boundaries used to segment sentences in the text.
Preferably, segment embedding is used to distinguish adjacent sentences: alternating labels A and B are assigned, so the input sequence is represented as (E_A, E_B, E_A, E_B, ...).
Preferably, the position embedding in the model input uses a hierarchically decomposed position code. Let p_1, p_2, p_3, ..., p_n be the position-encoding vectors trained by BERT; a new set of position codes q_1, q_2, q_3, ..., q_m (m > n) is constructed by the formula:
q_{(i-1)·n+j} = α · u_i + (1 - α) · u_j
where q_{(i-1)·n+j} is the position code of position (i-1)·n + j, α is a hyperparameter set to 0.4, i indexes the i-th word, j indexes the j-th word, and u_1, ..., u_n are base vectors expressed in terms of the trained position vectors p. Here pos, the position of a word in its sentence, ranges over [0, n]. By this formula the position (i-1)·n + j is represented hierarchically as the pair (i, j), whose components correspond to the base position codes u_i and u_j respectively. Since q_1 = p_1, q_2 = p_2, ..., q_n = p_n, the base vectors can be computed as:
u_i = (p_i - α · p_1) / (1 - α)
Word embedding, position embedding and segment embedding are concatenated as the input of the UniLM model, and the sentence vectors obtained after the UniLM pre-training layer are X = (x_1, x_2, ..., x_n) = UniLM(sent_1, sent_2, sent_3, ..., sent_n).
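A sketch of the hierarchically decomposed position embedding described above, with α = 0.4; the tensor shapes are assumptions.

```python
import torch

def hierarchical_positions(p: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """p: (n, d) trained BERT position embeddings -> (n*n, d) extended embeddings,
    with q[(i-1)*n + j] = alpha * u_i + (1 - alpha) * u_j and q_k = p_k for k <= n."""
    n, d = p.shape
    u = (p - alpha * p[:1]) / (1.0 - alpha)                   # base vectors, u_1 == p_1
    q = alpha * u[:, None, :] + (1 - alpha) * u[None, :, :]   # q[i, j] = alpha*u_i + (1-alpha)*u_j
    return q.reshape(n * n, d)
```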
Preferably, decoding and summary generation in the generative model of step 3) include: the summary generator learns document-level features through multiple Transformer layers with self-attention, and a copy mechanism, comprising copying and generating, is introduced in the decoding process. For the multi-layer Transformer backbone, given an input text sequence X = (x_1, x_2, ..., x_n) of length n, the output of the first Transformer layer is H^0 = Transformer_0(X), and the output after l layers is:
H^l = Transformer_l(H^{l-1})
where l indexes the Transformer layers. The final output is:
H^L = [h_1^L, h_2^L, ..., h_n^L]
where l ∈ [1, L], L is the total number of Transformer layers, and h_i^L is the contextualized representation of the input x_i.
Preferably, in each Transformer block a multi-head attention mechanism aggregates the outputs and marks the parts of the output sequence that need attention. The self-attention A_l of the l-th Transformer layer is computed as:
A_l = softmax(Q · K^T / sqrt(d_k) + M) · V_l
where A_l is the self-attention weight, softmax is the normalized exponential function, and Q, K, V are obtained from the input X_i by linear transformations; V_l is the Value of the l-th layer, M is the mask matrix, d_k is the number of columns of the Q and K matrices (i.e. the vector dimension), which scales the result and prevents the inner product of Q and K from becoming too large, and T denotes the transpose. Q, K and V are linear projections of the previous layer onto Queries, Keys and Values, with projection parameters W_Q, W_K and W_V respectively. The mask matrix M controls which tokens may be attended to; different mask matrices M expose different contexts. A copy mechanism is introduced to address the out-of-vocabulary and repeated-word problems that arise during generation.
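A compact sketch of the masked self-attention A = softmax(Q·K^T / sqrt(d_k) + M)·V and of a UniLM-style sequence-to-sequence mask; the projection matrices and the mask construction are illustrative assumptions.

```python
import math
import torch

def masked_self_attention(x, Wq, Wk, Wv, mask):
    """x: (seq, d); Wq/Wk/Wv: (d, d_k) projection matrices;
    mask: (seq, seq) additive mask with 0 = visible, -1e9 = blocked."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)) + mask
    return torch.softmax(scores, dim=-1) @ V       # A = softmax(QK^T / sqrt(d_k) + M) V

def seq2seq_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """UniLM-style mask: source tokens see the whole source; summary tokens see
    the source plus previously generated summary tokens (left to right)."""
    size = src_len + tgt_len
    m = torch.full((size, size), -1e9)
    m[:, :src_len] = 0.0                           # every position may attend to the source
    causal = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))
    m[src_len:, src_len:][causal] = 0.0            # causal attention within the summary
    return m
```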
Preferably, generating the text summary further comprises: at decoding time t, a relevance weight is computed from the output H_t of the last Transformer layer and the decoder output O_j as
e_j^t = u^T · tanh(W_c · [H_t ; O_j])
where W_c is an initialized (learnable) matrix and u a parameter vector. The attention distribution over the j-th word is computed at the same time as:
a_j^t = exp(e_j^t) / Σ_{k=1}^{N} exp(e_k^t)
where N is the number of words in the sentence, exp is the exponential function with the natural constant e as its base, t is the decoding step, k ∈ [1, N] indexes the input sequence, and j indexes the j-th word. The attention distribution can be interpreted as how much attention the context query pays to the j-th word. A weighted average of the information under this attention distribution gives the context representation vector h'_t:
h'_t = Σ_{j=1}^{N} a_j^t · H_j
where h'_t, also called the context vector, is the information gathered according to the attention distribution, H_j is the output of the last Transformer layer for position j at time t, a_j^t is the attention distribution for the j-th word, and N is the number of words in the sentence.
Preferably, the context vector is concatenated with the decoder output O_j and passed through two linear layers to produce the vocabulary distribution P_vocab:
P_vocab = softmax(V'(V [h'_t ; O_j] + b) + b')
where V', V, b and b' are learnable parameters, h'_t is the context representation vector, O_j is the output of the decoder, and P_vocab is the probability distribution over all words in the vocabulary.
Preferably, a copy gate g_t ∈ [0, 1] is then introduced to decide whether the current output should be copied from the source document or generated as a new word from the vocabulary. g_t is computed as:
g_t = σ(W_g [H_t ; O_j] + b_g)
where W_g and b_g are learnable parameters, H_t is the output of the last Transformer layer at time t, and O_j is the output of the decoder. That is, at time t the model decides, according to the attention weights of the j-th word and the other words, whether the next word should be newly generated or copied directly.
Preferably, for each document the words of the vocabulary are combined with all words appearing in the source document to form a new word list, the extended vocabulary. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended vocabulary, so the final probability is:
P(w) = g_t · P_vocab(w) + (1 - g_t) · Σ_{j: w_j = w} a_j^t
where P_vocab(w) is the probability that the current word w is generated from the given vocabulary, and Σ_{j: w_j = w} a_j^t is the probability of copying w from the source document according to the attention distribution.
Preferably, if w is an out-of-vocabulary word then P_vocab(w) = 0, and if w does not appear in the source document then Σ_{j: w_j = w} a_j^t = 0. The final loss function over the extended-vocabulary probability distribution is:
Loss = -(1/T) Σ_{t=1}^{T} log P_t(w_t*)
where T is the total number of decoding steps, w_t* is the reference word at step t, and P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution.
The generation probability P_t(w) is computed from the vocabulary distribution and the attention distribution, and the text summary is finally generated automatically according to this generation probability and the vocabulary distribution.
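A condensed sketch of one decoding step of the copy mechanism described above: attention over the encoder states gives the context vector h'_t, two linear layers give the vocabulary distribution, and the copy gate g_t mixes it with the attention distribution over an extended vocabulary. All layer names, shapes and the id-mapping convention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyStep(nn.Module):
    """One decoding step of the copy (pointer-generator) mechanism."""
    def __init__(self, d: int, vocab_size: int):
        super().__init__()
        self.W_c = nn.Linear(d, d, bias=False)       # relevance-weight matrix
        self.proj1 = nn.Linear(2 * d, d)             # first linear layer
        self.proj2 = nn.Linear(d, vocab_size)        # second linear layer -> vocabulary
        self.gate = nn.Linear(2 * d, 1)              # copy gate g_t
        self.vocab_size = vocab_size

    def forward(self, H, O_j, src_ext_ids, n_extra):
        """H: (n, d) last-layer encoder states; O_j: (d,) decoder output at step t;
        src_ext_ids: (n,) source-token ids in the extended vocabulary;
        n_extra: number of extra (out-of-vocabulary) ids in the extended vocabulary."""
        e = self.W_c(H) @ O_j                        # relevance weights e_j^t
        attn = F.softmax(e, dim=-1)                  # attention distribution a^t
        context = attn @ H                           # context vector h'_t
        feat = torch.cat([context, O_j])
        p_vocab = F.softmax(self.proj2(torch.tanh(self.proj1(feat))), dim=-1)
        g = torch.sigmoid(self.gate(feat))           # g_t in [0, 1]

        p_ext = torch.zeros(self.vocab_size + n_extra)
        p_ext[: self.vocab_size] = g * p_vocab                # generate from the vocabulary
        p_ext.index_add_(0, src_ext_ids, (1 - g) * attn)      # copy from the source document
        return p_ext                                 # P(w) = g*P_vocab(w) + (1-g)*sum_j a_j^t
```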
In the invention, the judgment document is first split into sentences; for each manually written reference sentence, the most similar sentence is found in the judgment document and used as data for the extractive model. The cosine similarity between each sentence of the reference summary and each sentence of the source document (the judgment-document corpus) is computed, and the source-document sentence with the highest score is selected as a key sentence.
In the invention, text vectorization is performed: the sentences obtained from the similarity calculation are aligned with the original text of the judgment document, and the source text, the label data and the reference summary are segmented with jieba; crawled legal terms supplement the lexicon during segmentation, and word vectorization is done with a BERT model. After the similarity calculation yields the preliminary key sentences, jieba segmentation of the source text, label data and reference summary determines the head word (central word), with crawled legal terms supplementing the lexicon during segmentation. Once the head word is determined, the association between the head word and the target word is added to the initial word-embedding vector, producing a fused word-embedding vector that reflects this association; this fused vector is taken as the embedding vector of the target word.
In the invention, the input of the extractive stage is based on word embedding together with position embedding and segment embedding. For a judgment document with n sentences, D = {S_1, S_2, ..., S_n}, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input; the [CLS] token represents the vector of the current sentence and the [SEP] token marks the clause boundaries used to segment sentences in the text. Segmenting sentences with [CLS] and [SEP] markers captures the semantics of each sentence better and improves the accuracy of information extraction. Position embedding encodes the position information of each word into a feature vector, using the scheme of "Attention Is All You Need":
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position of the word in the sentence, with values in [0, n]; i is the dimension index of the word vector; and d_model, the BERT input dimension, is 128-1024, preferably 256-512. Segment embedding distinguishes adjacent sentences by alternating A and B labels, so the input sequence is represented as (E_A, E_B, E_A, E_B, ...).
In the present invention, the word-embedding, position-embedding and segment-embedding representations are concatenated as the input of the BERT model, and the sentence vectors obtained after the BERT pre-training layer are X = (X_1, X_2, ..., X_n) = BERT(sent_1, sent_2, sent_3, ..., sent_n), where sent_i is the i-th sentence of the original judgment document and X_i is the BERT-encoded vector corresponding to sent_i, i.e. the i-th vector in the sequence to be processed. Encoding is performed by the BERT preprocessing model: BERT with global average pooling encodes the sentences (word embedding, position embedding and segment embedding together), the result is passed to a dense layer, a nonlinear transformation of the features extracted during encoding captures the correlations among them, and the output is finally mapped to the output space. Each character is embedded, the Chinese text being split into individual characters for learning, and classification through a fully connected layer and a softmax layer yields the classification result.
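A sketch of the extractive classification pipeline just described (BERT encoding, global average pooling, a dense layer, and a sigmoid head); the checkpoint name and hidden width are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class ExtractiveClassifier(nn.Module):
    def __init__(self, checkpoint: str = "bert-base-chinese", hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.dense = nn.Linear(self.bert.config.hidden_size, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask,
                      token_type_ids=token_type_ids).last_hidden_state
        pooled = h.mean(dim=1)                   # global average pooling over the sequence
        feat = torch.tanh(self.dense(pooled))    # nonlinear feature transformation
        return torch.sigmoid(self.out(feat))     # probability of being a key sentence
```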
In the invention, feature learning over the key sentences is performed by a dilated residual gated convolutional neural network (DRGCNN). Compared with a conventional convolutional neural network, the DRGCNN strengthens the model's ability to learn long-range contextual semantics: a gating mechanism (as in DGCNN) controls the flow of information, and a residual mechanism alleviates vanishing gradients and adds multi-path information transfer. Key summary sentences are extracted by stacking 6-10, preferably 7-8, DRGCNN layers with dilation coefficients 1, 2, 4, 8, 1 and 1 respectively. Before the encoded sequence is processed by the self-attention mechanism, the data are processed by the residual network and the gated convolutions to obtain an encoded sequence that carries the textual relations. For an original input sequence X = (X_1, X_2, ..., X_n) and convolution kernel W, the feature map C_i of an ordinary convolution is computed as:
C_i = Σ_{m=-k}^{k} W_c[m] · x_{i+m}
where W_c is the one-dimensional convolution kernel (a learnable weight coefficient), k is the distance from input position i, n is the number of words in the sentence, and x_{i±k} is the word vector k words before or after the i-th word; the resulting feature map represents how strongly input X_i is associated with its context. The receptive field of the convolution is widened by adding a dilation coefficient α, and stacking dilated convolution layers increases the network depth, addressing the long-distance dependency problem of text sequences and the extraction of globally useful information. When α = 1 the dilated convolution is equivalent to an ordinary convolution; when α > 1 it can learn more distant context, and the feature map C_i is computed as:
C_i = Σ_{m=-k}^{k} W_c[m] · x_{i+α·m}
where α is the dilation coefficient, W_c is the one-dimensional convolution kernel (a learnable weight coefficient), and k is the distance from input position i; the resulting feature map again represents how strongly input X_i is associated with its context. On top of the feature map C_i, a gated convolutional neural network (DGCNN) is introduced, whose output is computed as:
Y = convD1(X) ⊗ σ(convD2(X))
where convD1 and convD2 are one-dimensional convolution functions, X is the sentence-vector sequence, ⊗ denotes point-wise multiplication, σ is the gating function, and the two convolutions do not share weights. Gating one convolution and taking the element-wise product between the two convolutions alleviates vanishing gradients in the neural network. If a plain network in the style of a VGG (Visual Geometry Group) network is used, i.e. without residual connections, experience shows that as the depth increases the training error first decreases and then increases (and the increase is not caused by overfitting but by the network becoming harder to train as it gets deeper). Deeper networks are in principle better, but in practice, without residual connections, greater depth means an ordinary network becomes harder to optimize; as the depth grows the training error rises, a phenomenon known as network degradation. Residual networks help with vanishing gradients, exploding gradients and network degradation, so a deeper network can be trained while preserving useful information. A residual network structure is therefore introduced on top of the gating mechanism, with the output computed as:
Y = X ⊗ (1 - σ(convD2(X))) + convD1(X) ⊗ σ(convD2(X))
where convD1 and convD2 are one-dimensional convolution functions, X is the sentence-vector sequence, ⊗ denotes point-wise multiplication, and σ is the gating function. Whether a sentence is a key sentence is then decided by binary classification through the fully connected layer. During training, cross entropy (a measure of the difference between two probability distributions) is used as the loss function:
Loss = -(1/N) Σ_i [ŷ_i · log(y_i) + (1 - ŷ_i) · log(1 - y_i)]
where ŷ_i is the label of sample i (1 for the positive class, 0 for the negative class), y_i is the predicted probability that sample i belongs to the positive class, and Loss is the loss value.
In the invention, the sentences extracted from the judgment document by the encoding and classification of the extractive summarization model are combined into the key-sentence set and used as the input of the generative model, which encodes and decodes this input to produce the text summary. The encoder uses a UniLM pre-trained language model: a pre-training data set is constructed with UniLM, key content is detected in the text, and the result is fed in as key text information through keyword embeddings; the model input consists of word embedding, position embedding and segment embedding. Word embedding and segment embedding are the same as in the extractive model. Position embedding uses a hierarchically decomposed position code: with p_1, p_2, p_3, ..., p_n the position-encoding vectors trained by BERT, a new set of position codes q_1, q_2, q_3, ..., q_m (m > n) is constructed by the formula:
q_{(i-1)·n+j} = α · u_i + (1 - α) · u_j
where q_{(i-1)·n+j} is the position code of position (i-1)·n + j, α is a hyperparameter set to 0.4, i indexes the i-th word, j indexes the j-th word, and u_1, ..., u_n are base vectors expressed in terms of the trained position vectors p. Here pos, the position of a word in its sentence, ranges over [0, n]. By this formula the position (i-1)·n + j is represented hierarchically as the pair (i, j), whose components correspond to the base position codes u_i and u_j respectively. Since q_1 = p_1, q_2 = p_2, ..., q_n = p_n, the base vectors can be computed as:
u_i = (p_i - α · p_1) / (1 - α)
Word embedding, position embedding and segment embedding are concatenated as the input of the UniLM model, and the sentence vectors obtained after the UniLM pre-training layer are X = (x_1, x_2, ..., x_n) = UniLM(sent_1, sent_2, sent_3, ..., sent_n). The key paragraphs and key-sentence information are fed into the encoder of the UniLM model to form a comprehensive semantic representation, and decoding is finally performed with Transformers, which helps ensure that the key information in the generated sentences covers the relevant knowledge points comprehensively while retaining diverse ways of organizing the language.
In the invention, the generator learns document-level features through multiple Transformer layers with self-attention, and a copy mechanism, comprising copying and generating, is introduced in the decoding process. The position vector and the word vector are added to obtain word vectors that contain word-order information; all such word vectors in a paragraph form a context-aware set of word vectors. For the multi-layer Transformer backbone, given an input text sequence X = (x_1, x_2, ..., x_n) of length n, the output of the first Transformer layer is H^0 = Transformer_0(X), and the output after l layers is:
H^l = Transformer_l(H^{l-1})
where l indexes the Transformer layers. The final output is:
H^L = [h_1^L, h_2^L, ..., h_n^L]
where l ∈ [1, L], L is the total number of Transformer layers, and h_i^L is the contextualized representation of the input x_i.
In each Transformer block a multi-head attention mechanism aggregates the outputs and marks the parts of the output sequence that need attention. The self-attention A_l of the l-th Transformer layer is computed as:
A_l = softmax(Q · K^T / sqrt(d_k) + M) · V_l
where A_l is the self-attention weight, softmax is the normalized exponential function, Q, K and V are obtained from the input X_i by linear transformations, V_l is the Value of the l-th layer, M is the mask matrix, d_k is the number of columns of the Q and K matrices (i.e. the vector dimension), which scales the result and prevents the inner product of Q and K from becoming too large, and T denotes the transpose. Q, K and V are linear projections of the previous layer onto Queries, Keys and Values, with projection parameters W_Q, W_K and W_V respectively. The mask matrix M controls which tokens may be attended to; different mask matrices M expose different contexts. A copy mechanism is introduced to address the out-of-vocabulary and repeated-word problems that arise during generation.
In the invention, at decoding time t a relevance weight is computed from the output H_t of the last Transformer layer and the decoder output O_j as
e_j^t = u^T · tanh(W_c · [H_t ; O_j])
where W_c is an initialized (learnable) matrix and u a parameter vector. The attention distribution over the j-th word is computed at the same time as:
a_j^t = exp(e_j^t) / Σ_{k=1}^{N} exp(e_k^t)
where N is the number of words in the sentence, exp is the exponential function with the natural constant e as its base, t is the decoding step, k ∈ [1, N] indexes the input sequence, and j indexes the j-th word. The attention distribution can be interpreted as how much attention the context query pays to the j-th word; a weighted average of the information under this distribution gives the context representation vector h'_t:
h'_t = Σ_{j=1}^{N} a_j^t · H_j
where h'_t, also called the context vector, is the information gathered according to the attention distribution, H_j is the output of the last Transformer layer for position j at time t, a_j^t is the attention distribution for the j-th word, and N is the number of words in the sentence. The context vector is concatenated with the decoder output O_j and passed through two linear layers to produce the vocabulary distribution P_vocab:
P_vocab = softmax(V'(V [h'_t ; O_j] + b) + b')
where V', V, b and b' are learnable parameters, h'_t is the context representation vector, O_j is the output of the decoder, and P_vocab is the probability distribution over all words in the vocabulary. A copy gate g_t ∈ [0, 1] is then introduced to decide whether the current output should be copied from the source document or generated as a new word from the vocabulary:
g_t = σ(W_g [H_t ; O_j] + b_g)
where W_g and b_g are learnable parameters, H_t is the output of the last Transformer layer at time t, and O_j is the output of the decoder; that is, at time t the model decides, according to the attention weights of the j-th word and the other words, whether the next word should be newly generated or copied directly. For each document, the words of the vocabulary are combined with all words appearing in the source document to form a new word list, the extended vocabulary. Whether a word is copied from the source document or generated from the vocabulary, it is drawn from the extended vocabulary, so the final probability is:
P(w) = g_t · P_vocab(w) + (1 - g_t) · Σ_{j: w_j = w} a_j^t
where P_vocab(w) is the probability that the current word w is generated from the given vocabulary and Σ_{j: w_j = w} a_j^t is the probability of copying w from the source document according to the attention distribution. If w is an out-of-vocabulary word then P_vocab(w) = 0; if w does not appear in the source document then Σ_{j: w_j = w} a_j^t = 0. The final loss function over the extended-vocabulary probability distribution is:
Loss = -(1/T) Σ_{t=1}^{T} log P_t(w_t*)
where T is the total number of decoding steps, w_t* is the reference word at step t, and P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution. The generation probability P_t(w) is computed from the vocabulary and attention distributions, and the text summary is finally generated automatically from this generation probability and the vocabulary distribution. Generating the summary with Transformers in this way first learns the dependencies within each text and then models the relations among texts, which greatly shortens the length of a single input sequence, makes it easy to learn cross-text associations, and makes summary generation fast and accurate.
With this method, the model can also be migrated, through fine-tuning, to various specific domains such as tourism, medicine, news and natural science.
Compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
1. By combining the extractive and generative approaches, the invention overcomes the poor readability and coherence of summaries produced by a purely extractive method, as well as the contradictions with the original meaning and the low faithfulness of summaries produced by a purely generative method.
2. The judgment-document summary is formed in two stages: the first stage extracts sentences from the judgment document and combines them into key sentences, and the second stage uses the extracted key sentences as the input of the generative stage, forming the text summary through model encoding and decoding; together, the two stages ensure the accuracy and faithfulness of the text summary.
3. Key information is extracted from the source document to form key sentences, which are encoded and combined by the generative stage into a summary with the same meaning as the source document, greatly reducing the length compared with manual summarization.
Drawings
FIG. 1 is a schematic structural diagram of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.
FIG. 2 is a schematic diagram of the extractive model structure of the two-stage hybrid automatic summarization method for judicial official documents according to the present invention.
FIG. 3 is a schematic diagram of a generative model structure of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.
Detailed Description
The technical solution of the present invention is illustrated below; the claimed scope of the invention includes, but is not limited to, the following examples.
A two-stage hybrid automatic summarization method for judicial official documents comprises the following steps:
1) Compute the sentence similarity within the judgment document, encode and classify the sentences with an extractive summarization model, and extract the key summary sentences.
2) Combine the sentences extracted from the judgment document into a key-sentence set.
3) Use the key-sentence set from step 2) as the input of a generative model, and generate the text summary through model encoding and decoding.
Preferably, the calculating the similarity of the key sentences in the step 1) includes:
step 1.1) sentence division is carried out on the referee document, then an artificial standard sentence is found in the referee document, and then a sentence with the highest similarity is found from the original text and is used as a tag data set of the extraction abstract. And calculating the similarity score between the sentences in the artificial abstract and the sentences in the source document through cosine similarity, and selecting the sentences with the highest score in the source document, namely the key sentences.
Preferably, step 1) further comprises:
and step 1.2) vectorizing the text, wherein sentences obtained after similarity calculation and original texts in the referee document are in the same line, and the source text, the label data and the artificial abstract are subjected to word segmentation by adopting jieba. In the word segmentation process, legal nouns are crawled to be used as supplement of a word bank, and then word vectorization is carried out by using a BERT model.
Preferably, the encoding of the abstract model of the key sentence in step 1) includes:
and (5) extracting model coding. At the coding layer, word embedding adopts target word embedding vector, and for a text with n sentences, D ═ S1,S2,……,SnPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]Marking compositionAnd (4) inputting. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text. On the basis of word embedding, input position embedding and segmentation embedding are further arranged.
Preferably, the location is embedded. The position information of the word is coded into a feature vector, and the position vector adopts a scheme in the Attention islalyouneed:
PE(pos,2i)=sin(pos/100002i/dmodel)。
PE(pos,2i+1)=cos(pos/100002i/dmodel)。
in the formula, pos represents the position of a word in a sentence, and the numeric area is [0, n ]. i refers to the dimension of the word vector. The input to the BERT is dmodel 128-1024, preferably 256-512.
Preferably, the segments are embedded. For distinguishing two sentences, the different sentences are preceded by a and B labels, respectively, so that the input sentence is represented as (E)A,EB,EA,EB… …). The word embedding, position embedding and segment embedding representations are stitched as BERT model inputs. Sentence vector X ═ X (X) obtained after pre-training layer via BERT model1,X2,……,Xn)=BERT(sent1,sent2,sent3,……,sentn) Wherein sentiThe i-th sentence, X, represented as the original referee documentiCorresponding sendiBERT-coded vector, XiThe ith vector sequence that needs to be processed.
Preferably, the classifying the abstract model of the key sentence in step 1) includes:
and (4) a classification layer, wherein an expansion residual gated convolutional neural network structure is adopted, namely the expansion residual gated convolutional neural network structure is DRGCNN. The key sentence extraction of the abstract is carried out by stacking a plurality of layers of DRGCNN networks, the number of the layers of the DRGCNN is 6-10, preferably 7-8, and the expansion coefficient of each layer is 1, 2, 4, 8, 1 and 1 respectively. Original input sequence for convolutional network X ═ X (X)1,X2,……,Xn) With convolution kernel W, signature C of arbitrary convolution operationiThe calculation formula of (2) is as follows:
Figure BDA0003399495640000141
in the formula, WcRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k denotes a distance from the input identity i. n represents the number of words in the sentence. x is the number ofi±kA word vector representing k words forward or backward from the ith word. The resulting signature graph may represent input XiThe degree of association with the context.
Preferably, the convolution width is expanded by adding the expansion coefficient α. When α is 1, the dilation convolution operation corresponds to a full convolution operation. Alpha is alpha>1, the dilated convolution can learn more distant context information, feature map CiThe calculation formula of (2) is as follows:
Figure BDA0003399495640000142
alpha is the coefficient of expansion, WcRepresenting a one-dimensional convolution kernel, also called a weight coefficient, is a learnable parameter. k represents the distance from the input identifier i, and the resulting feature map may represent the input XiThe degree of association with the context. In the feature map CiOn the basis of the method, a gate control mechanism convolutional neural network is introduced, and an output calculation formula is as follows:
Figure BDA0003399495640000143
in the formula, convD1,convD2Representing a one-dimensional convolution function. X denotes a sentence vector.
Figure BDA0003399495640000144
Representing point-by-point multiplication. σ is a gating function. convD1And convD2Operate for two convolution functions and the weights are not shared.
Preferably, a residual structure is introduced on the basis of a gating mechanism, and the output calculation formula is as follows:
Figure BDA0003399495640000145
in the formula, convD1,convD2Representing a one-dimensional convolution function. X denotes a sentence vector.
Figure BDA0003399495640000146
Representing point-by-point multiplication. σ is a gating function.
Preferably, the sentence is further classified into two categories by the full link layer. During training, cross entropy is selected as a loss function and is expressed as:
Figure BDA0003399495640000147
in the formula (I), the compound is shown in the specification,
Figure BDA0003399495640000148
the label data representing sample i has a positive class of 1 and a negative class of 0. y represents the probability that sample i is predicted as a positive class. Loss represents a Loss function.
Preferably, the sentence combination key sentence collection is extracted from the referee document as the input of the generation model in the step 2) through coding and classification in the extraction abstract model.
Preferably, the formula model generated in step 3) includes: and combining the key sentence collection as the input of a generation model, and generating a text abstract by performing model coding and decoding on the input. The model coding adopts a Unilm pre-training language model, and the input of the model consists of word embedding, segment embedding and position embedding.
Preferably, the words are embedded as n sentences for a text D ═ S1,S2,……,SnPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The marks representing clausesFor segmenting sentences in the text.
Preferably, segment embedding is used to distinguish two sentences, different sentences being preceded by a and B labels, respectively, so that the input sentence is represented as (E)A,EB,EA,EB,……)。
Preferably, the position in the model input is embedded as a position code of hierarchical decomposition. The position coding vector trained by using BERT is p1,p2,p3,…,pnConstructing a new set of position codes q by formula1,q2,q3,…,qmIn the formula, the structural formula is as follows:
Figure BDA0003399495640000151
q(i-1)×n+jis position-coded.
Figure BDA0003399495640000152
The value is a hyperparameter and is 0.4. q is the position code of the (i-1) × n + j position. i is the ith word. j is the jth word. u is a vector, the base vector of the q vector is represented by the trained position p vector
Figure BDA0003399495640000153
The pos represents the position of the word in the sentence and has the value range of 0, n]. The position code of (i-1) × n + j is hierarchically represented as (i, j) by the formula. The position codes corresponding to i and j are respectively
Figure BDA0003399495640000154
And
Figure BDA0003399495640000155
because q is1=p1,q2=p2,……,qn=pnCalculate ui
Figure BDA0003399495640000156
Word embedding, position embedding and segmented embedding are spliced into input of a Unilm model, and a sentence vector X obtained after a pre-training layer of the Unilm model is equal to (X)1,x2,…,xn)=Unilm(sent1,sent2,sent3,…,sentn)。
Preferably, the decoding that generates the text abstract in the generative model of step 3) includes: the abstract generation learns document-level features through Transformer layers with a multi-layer attention mechanism. A copy mechanism, comprising copying and generating, is introduced in the decoding process of the model. For the multi-layer Transformer backbone network, given an input text sequence of length n, X = (x1, x2, …, xn), the output H^0 of the first Transformer layer is computed as H^0 = Transformer_0(X), and the output H^l after the l-th Transformer layer is computed as:

H^l = Transformer_l(H^{l-1})

where l indexes the Transformer layers. The final output H^L is:

H^L = (h^L_1, h^L_2, …, h^L_n)

where l denotes the layer number, l ∈ [1, L]; L denotes the total number of Transformer layers; and h^L_i denotes the contextual representation of the input x_i.
Preferably, in each Transformer module a multi-head attention mechanism is added to aggregate the outputs and to mark the parts of the output sequence that need attention. The self-attention A_l of the l-th Transformer layer is computed as:

A_l = softmax(Q·K^T / √d_k + M)·V_l

where A_l is the self-attention weight; softmax is the normalized exponential function; Q, K, V are obtained from the input X_i by linear transformation; V_l is the Value of the l-th layer; M denotes the Mask matrix; d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, which prevents the inner product of Q and K from becoming too large and plays a scaling role; and T denotes the transpose. In the formula,

Q = H^{l-1}·W_Q^l, K = H^{l-1}·W_K^l, V = H^{l-1}·W_V^l

are linear projections of the previous layer's output H^{l-1} onto the Queries, Keys and Values, with projection parameter matrices W_Q^l, W_K^l and W_V^l respectively. The Mask matrix M controls whether a token is allowed to be attended to; different Mask matrices M are used to control attention over different contexts, and the copy mechanism is introduced to solve the out-of-vocabulary and repeated-word problems that arise during generation.
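For illustration, a single-head Python sketch of the masked self-attention softmax(Q·K^T/√d_k + M)·V described above; the causal mask shown is just one example of a Mask matrix M.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv, M):
    """Single-head sketch of A = softmax(Q K^T / sqrt(d_k) + M) V,
    where M is the mask matrix (0 for allowed positions, large negative for blocked)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + M
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n, d = 6, 16
X = np.random.rand(n, d)
Wq, Wk, Wv = (np.random.rand(d, d) for _ in range(3))
M = np.triu(np.full((n, n), -1e9), k=1)   # causal mask: each token sees only its left context
print(masked_self_attention(X, Wq, Wk, Wv, M).shape)   # (6, 16)
```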
Preferably, generating the text abstract further includes: at decoding time t, a correlation weight is computed from the last-layer Transformer output H_t and the Decoder output O_j as:

e^t_j = H_t^T · W_c · O_j

where W_c is an initialized matrix. The attention distribution of the j-th word is then computed as:

a^t_j = exp(e^t_j) / Σ_{k=1}^{N} exp(e^t_k)

where N is the number of words in the sentence; exp is the exponential function with the natural constant e as its base; u is a hyperparameter; t is the time step; k denotes the index of the input, with value range [1, N]; and j denotes the j-th word. The attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query. An information-weighted average over the attention distribution yields the context representation vector h'_t:

h'_t = Σ_{j=1}^{N} a^t_j · H_j

where h'_t, also called the context vector, represents the information of interest obtained from the attention distribution; H_t is the output of the last Transformer layer at time t; a^t_j is the attention distribution of the j-th word; j is the index of the sequence; and N is the number of words in the sentence.
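For illustration, a sketch of computing an attention distribution over the N input words and the context vector h'_t; the exact pairing of H_t, O_j and W_c in the correlation weight is an assumption of this sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(H, o_j, W_c):
    """H: (N, d) last-layer Transformer states for the N input words.
    o_j: (d,) current decoder output. W_c: (d, d) learned matrix.
    Returns the attention distribution a (N,) and the context vector h'_t (d,)."""
    scores = H @ (W_c @ o_j)          # correlation weight of each word with o_j
    a = softmax(scores)               # attention distribution over the N words
    h_prime = a @ H                   # information-weighted average of the states
    return a, h_prime

N, d = 10, 32
H = np.random.rand(N, d)
o_j = np.random.rand(d)
W_c = np.random.rand(d, d)
a, h_prime = attention_context(H, o_j, W_c)
print(a.shape, h_prime.shape, round(float(a.sum()), 6))   # (10,) (32,) 1.0
```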
Preferably, the context vector is concatenated with the Decoder output O_j and passed through two linear layers to produce the vocabulary distribution P_vocab, computed as:

P_vocab = softmax(V'·(V·[h'_t; O_j] + b) + b')

where V', V, b and b' are learnable parameters; h'_t is the context representation vector; O_j is the output of the Decoder; and P_vocab is the probability distribution over all words in the vocabulary.
Preferably, a copy gating function g_t ∈ [0, 1] is further introduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary. g_t is computed as:

g_t = σ(W_g·[H_t; O_j] + b_g)

where W_g and b_g are learnable parameters; H_t is the output of the last Transformer layer at time t; O_j is the output of the Decoder; and σ is the sigmoid function. The formula indicates that, at time t, whether the next word is newly generated or directly copied is decided according to the attention weights of the j-th word and the other words.
Preferably, for each document, the words in the vocabulary are combined with all words appearing in the source document to form a new word bank, i.e. an extended word bank. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended word bank, so the final probability is computed as:

P_t(w) = g_t·Σ_{k: x_k = w} a^t_k + (1 − g_t)·P_vocab(w)

where P_vocab(w) denotes the probability that the current word w is generated from the given vocabulary, and Σ_{k: x_k = w} a^t_k denotes the probability of selecting a copy from the source document according to the attention distribution.
Preferably, if w is an out-of-vocabulary word, then P_vocab(w) = 0; if w does not appear in the source document, then Σ_{k: x_k = w} a^t_k = 0. The final loss function over the probability distribution of the extended word bank is:

Loss = −(1/T)·Σ_{t=1}^{T} log P_t(w*_t)

where T is the total number of decoding steps, w*_t is the target word at time step t, and P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution. The generation probability P_t(w) is computed from the vocabulary distribution and the attention distribution, and the text abstract is finally generated automatically according to the generation probability and the vocabulary distribution.
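For illustration, a sketch of the extended-vocabulary mixture described above; treating g_t as the copy probability (rather than the generation probability) is an assumption of this sketch.

```python
import numpy as np

def final_distribution(p_vocab, attention, source_ids, g_t, extended_size):
    """Pointer-generator-style mixture over the extended word bank.
    p_vocab:    (V,) probabilities over the fixed vocabulary
    attention:  (N,) attention distribution over the N source words
    source_ids: (N,) extended-vocabulary id of each source word
    g_t:        copy gate in [0, 1]; here taken as the copy probability (an assumption)."""
    p = np.zeros(extended_size)
    p[:len(p_vocab)] += (1.0 - g_t) * p_vocab          # generate from the vocabulary
    for k, word_id in enumerate(source_ids):
        p[word_id] += g_t * attention[k]               # copy from the source document
    return p

V, N = 8, 5
p_vocab = np.full(V, 1.0 / V)
attention = np.full(N, 1.0 / N)
source_ids = np.array([1, 3, 9, 9, 4])                 # 9 = an out-of-vocabulary source word
p = final_distribution(p_vocab, attention, source_ids, g_t=0.3, extended_size=V + 2)
print(round(float(p.sum()), 6))                        # 1.0
```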
Example 1
As shown in fig. 1, a two-stage hybrid automatic summarization method for judicial official documents includes the following steps:
1) calculating the similarity of key sentences in the referee document, encoding and classifying the key sentences with the extractive summarization model, and finally extracting the abstract key sentences.
2) Sentences extracted from the referee document are combined into a key sentence collection.
3) Taking the key sentence collection of step 2) as the input of a generative model, and generating a text abstract through model encoding and decoding, as sketched below.
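For illustration only, a minimal Python sketch of this two-stage flow; extract_key_sentences and generate_summary are hypothetical stand-ins for the extractive model of step 1) and the Unilm-based generative model of step 3), not part of the claimed method.

```python
# Minimal sketch of the two-stage flow described in steps 1)-3).
def two_stage_summarize(document_sentences, extract_key_sentences, generate_summary):
    # Stage 1: the extractive model scores/classifies every sentence
    # and keeps only the key sentences of the referee document.
    key_sentences = extract_key_sentences(document_sentences)

    # Stage 2: the key-sentence collection becomes the input of the
    # generative model, which encodes and decodes it into the abstract.
    key_sentence_collection = "".join(key_sentences)
    return generate_summary(key_sentence_collection)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    demo_extract = lambda sents: [s for s in sents if "判决" in s or len(s) > 10]
    demo_generate = lambda text: text[:50]
    print(two_stage_summarize(["某某法院作出如下判决。", "短句。"], demo_extract, demo_generate))
```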
Example 2
Embodiment 1 is repeated. As shown in fig. 2, in step 1.1) the referee document is split into sentences; for each manually written standard abstract sentence, the sentence with the highest similarity is then found in the original text and used as the label data set of the abstract. The similarity score between the sentences of the manual abstract and the sentences of the source document is calculated by cosine similarity, and the highest-scoring sentences in the source document are selected as the key sentences.
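For illustration, a minimal sketch of the key-sentence labelling described above, assuming sentence vectors are already available; the toy vectors are placeholders.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def label_key_sentences(source_vecs, abstract_vecs):
    """For each manual-abstract sentence vector, return the index of the most
    similar source sentence; those indices form the extractive labels."""
    labels = set()
    for a in abstract_vecs:
        scores = [cosine(a, s) for s in source_vecs]
        labels.add(int(np.argmax(scores)))
    return sorted(labels)

# Toy vectors standing in for sentence embeddings of the referee document
# and of the manual abstract.
source_vecs = [np.random.rand(8) for _ in range(5)]
abstract_vecs = [source_vecs[1] + 0.01, source_vecs[4] + 0.01]
print(label_key_sentences(source_vecs, abstract_vecs))  # e.g. [1, 4]
```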
Example 3
Embodiment 2 is repeated, except that in step 1) the highest-scoring sentences selected from the source document are vectorized. The sentences obtained after the similarity calculation are aligned with the original text of the referee document, and the source text, the label data and the manual abstract are segmented into words with jieba. During word segmentation, crawled legal terms are used to supplement the word bank, and word vectorization is then performed with a BERT model.
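The segmentation and vectorization step could look roughly as follows; the 'bert-base-chinese' checkpoint name and the legal-term dictionary file are assumptions for illustration, and running the snippet requires the jieba, torch and transformers packages.

```python
import jieba
import torch
from transformers import BertTokenizer, BertModel

# Hypothetical user dictionary of crawled legal terms, one term per line:
# jieba.load_userdict("legal_terms.txt")

text = "被告应于判决生效后七日内支付原告五千元。"
words = jieba.lcut(text)                     # word segmentation of the source text
print(words)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
token_vectors = outputs.last_hidden_state    # (1, seq_len, 768) contextual word vectors
print(token_vectors.shape)
```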
Example 4
Embodiment 3 is repeated, namely the key sentences are extracted and encoded in step 1). In the encoding layer, the word embedding adopts target word embedding vectors, and position embedding and segment embedding are added to the input on top of the word embedding.
In the sentence encoding layer, each sentence is first segmented into words to obtain word-level information for the word embedding representation, which is then converted into a sentence vector as input:
For a text of n sentences D = {S1, S2, ……, Sn}, preprocessing is done with two special tokens. First, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token is inserted at the end of each sentence to form the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundary used to segment the sentences of the text.
Position embedding: the position information of a word is encoded into a feature vector, and the position vector adopts the scheme of Attention Is All You Need:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)).
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
In the formulas, pos denotes the position of the word in the sentence, with value range [0, n]; i refers to the dimension index of the word vector; and the d_model of the BERT input is 256.
Segment embedding is used to distinguish two sentences: different sentences are preceded by A and B labels respectively, so that the input sentence sequence is represented as (EA, EB, EA, EB, ……). The word embedding, position embedding and segment embedding representations are concatenated as the BERT model input. The sentence vectors obtained after the pre-training layer of the BERT model are X = (X1, X2, ……, Xn) = BERT(sent1, sent2, sent3, ……, sentn), where senti denotes the i-th sentence of the original referee document and Xi is the BERT-encoded vector corresponding to senti, i.e. the i-th vector sequence to be processed.
Each sentence vector is represented by word embedding, position embedding and segment embedding, so that the text vectorization work is completed.
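As a worked illustration of the PE formulas above, the following sketch builds the sinusoidal position-embedding table; the sizes are arbitrary demo values.

```python
import numpy as np

def sinusoidal_position_embedding(n_positions, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pe = np.zeros((n_positions, d_model))
    positions = np.arange(n_positions)[:, None]                     # pos
    div = np.power(10000.0, 2 * (np.arange(d_model) // 2) / d_model)
    pe[:, 0::2] = np.sin(positions / div[0::2])
    pe[:, 1::2] = np.cos(positions / div[1::2])
    return pe

pe = sinusoidal_position_embedding(n_positions=128, d_model=256)
print(pe.shape)   # (128, 256)
```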
Example 5
Example 4 is repeated, with the d_model of the BERT input set to 512.
Example 6
Example 5 is repeated, with the d_model of the BERT input set to 1024.
Example 7
Embodiment 6 is repeated, except that in step 1) the encoded sentences, with their word, position and segment embeddings, are passed through BERT plus global average pooling and output to a dense layer; in the dense layer the features extracted during encoding undergo a nonlinear transformation, the associations among the features are extracted, and the features are finally mapped to the output space.
Example 8
Example 7 is repeated, except that step 1) further comprises a classification layer: after the first dense layer, a dilated residual gated convolutional neural network structure, i.e. DRGCNN, is adopted. The abstract key sentences are extracted by stacking DRGCNN layers; the number of DRGCNN layers is 6, and the dilation coefficients of the layers are 1, 2, 4, 8, 1 and 1 respectively. For the original input sequence of the convolutional network X = (X1, X2, ……, Xn) with convolution kernel W, the feature map Ci obtained by an arbitrary convolution operation is computed as:

Ci = Wc ⊗ x_{i±k}

where Wc denotes a one-dimensional convolution kernel, also called a weight coefficient, and is a learnable parameter; k denotes the distance from the input index i; n denotes the number of words in the sentence; x_{i±k} denotes the word vectors of the k words before or after the i-th word; and the resulting feature map represents the degree of association between the input Xi and its context.
Example 9
Example 8 is repeated, except that the number of layers of the DRGCNN network is stacked in step 1) to extract the key abstract sentence, the number of layers of the DRGCNN is 8, and the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively.
Example 10
The embodiment 9 is repeated, the number of layers of the DRGCNN network is stacked to extract key sentences of the summary, the number of layers of the DRGCNN is 10, and the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively.
Example 11
Example 10 is repeated, except that in step 1) the convolution width is expanded by adding a dilation coefficient α. Stacking dilated convolutional neural network layers increases the network depth, alleviates the long-distance dependence problem of text sequences and extracts globally effective information. When α = 1, the dilated convolution operation is equivalent to an ordinary full convolution operation; when α > 1, the dilated convolution can learn more distant context information, and the feature map Ci is computed as:

Ci = Wc ⊗ x_{i±αk}

where α is the dilation coefficient; Wc denotes a one-dimensional convolution kernel, also called a weight coefficient, and is a learnable parameter; k denotes the distance from the input index i; and the resulting feature map represents the degree of association between the input Xi and its context. On the basis of the feature map Ci, a gated convolutional neural network is introduced, and the output is computed as:

Y = convD1(X) ⊗ σ(convD2(X))

where convD1 and convD2 denote one-dimensional convolution functions; X denotes the sentence vector; ⊗ denotes point-wise multiplication; σ is the gating function; and convD1 and convD2 are two convolution operations whose weights are not shared.
A residual structure is introduced on the basis of the gating mechanism, and the output is computed as:

Y = X + convD1(X) ⊗ σ(convD2(X))

where convD1 and convD2 denote one-dimensional convolution functions; X denotes the sentence vector; ⊗ denotes point-wise multiplication; and σ is the gating function.
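For illustration, a sketch of one gated dilated convolution block with a residual connection; reading the residual combination as Y = X + convD1(X)·σ(convD2(X)) is an assumption of this sketch, and the layer sizes are demo values.

```python
import torch
import torch.nn as nn

class GatedDilatedResidualConv(nn.Module):
    """One DRGCNN-style block: two unshared 1D convolutions, a sigmoid gate,
    and a residual connection (Y = X + conv1(X) * sigmoid(conv2(X)))."""
    def __init__(self, d_model, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2        # keep the sequence length unchanged
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad, dilation=dilation)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                          # Conv1d expects (batch, d_model, seq_len)
        gated = self.conv1(h) * torch.sigmoid(self.conv2(h))
        return x + gated.transpose(1, 2)               # residual connection

# Stack blocks with dilation rates 1, 2, 4, 8, 1, 1 as in the description.
blocks = nn.Sequential(*[GatedDilatedResidualConv(256, dilation=a) for a in (1, 2, 4, 8, 1, 1)])
x = torch.randn(2, 40, 256)                            # (batch, words, vector dimension)
print(blocks(x).shape)                                 # torch.Size([2, 40, 256])
```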
A fully connected layer then performs the binary classification of whether a sentence is a key sentence. During training, cross entropy is selected as the loss function, expressed as:

Loss = −[ŷ_i·log(y_i) + (1 − ŷ_i)·log(1 − y_i)]

where ŷ_i represents the label data of sample i, with 1 for the positive class and 0 for the negative class; y_i represents the probability that sample i is predicted as the positive class; and Loss denotes the loss function.
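For illustration, the cross-entropy objective of the key-sentence classifier can be computed as follows; the probabilities and labels are toy values.

```python
import torch
import torch.nn as nn

# Binary cross-entropy over the key-sentence classifier outputs:
# y_true is the label (1 = key sentence, 0 = not), y_pred the predicted probability.
loss_fn = nn.BCELoss()
y_pred = torch.tensor([0.9, 0.2, 0.7])    # predicted probability of the positive class
y_true = torch.tensor([1.0, 0.0, 1.0])    # label data of each sentence
print(loss_fn(y_pred, y_true).item())     # averaged -[y*log(p) + (1-y)*log(1-p)]
```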
Example 12
Embodiment 11 is repeated. As shown in fig. 3, the sentences extracted from the referee document through the encoding and classification of the extractive summarization model in step 2) are combined into a key sentence collection, which is used as the input of the generation model.
Example 13
Embodiment 12 is repeated, except that in step 3) the key sentence collection is combined as the input of the generative model, and the text abstract is generated by encoding and decoding this input. The encoder adopts the Unilm pre-trained language model, and the model input consists of word embedding, segment embedding and position embedding. For the word embedding, a text of n sentences D = {S1, S2, ……, Sn} is preprocessed with two special tokens: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token is inserted at the end of each sentence to form the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundary used to segment the sentences of the text. The segment embedding is used to distinguish two sentences: different sentences are preceded by A and B labels respectively, so that the input sentence sequence is represented as (EA, EB, EA, EB, ……). The position embedding in the model input is a hierarchically decomposed position code. Denoting the position coding vectors trained by BERT as p1, p2, p3, …, pn, a new set of position codes q1, q2, q3, …, qm is constructed by the formula:

q_{(i-1)×n+j} = α·u_i + (1−α)·u_j

where q_{(i-1)×n+j} is the position code of the (i-1)×n+j-th position; α is a hyperparameter with a value of 0.4; i denotes the i-th word and j the j-th word; and u_i, u_j are the base vectors of the q vectors, represented in terms of the trained position vectors p. pos denotes the position of a word in the sentence, with value range [0, n]. By this formula the position (i-1)×n+j is hierarchically represented as the pair (i, j), whose corresponding position codes are u_i and u_j respectively. Since q1 = p1, q2 = p2, ……, qn = pn, u_i is computed as:

u_i = (p_i − α·p_1) / (1 − α).

The word embedding, position embedding and segment embedding are concatenated as the input of the Unilm model, and the sentence vectors obtained after the pre-training layer of the Unilm model are X = (x1, x2, …, xn) = Unilm(sent1, sent2, sent3, …, sentn).
Example 14
Embodiment 13 is repeated, except that the decoding that generates the text abstract in step 3) includes: the abstract generation learns document-level features through Transformer layers with a multi-layer attention mechanism. A copy mechanism, comprising copying and generating, is introduced in the decoding process of the model. For the multi-layer Transformer backbone network, given an input text sequence of length n, X = (x1, x2, …, xn), the output H^0 of the first Transformer layer is computed as H^0 = Transformer_0(X), and the output H^l after the l-th Transformer layer is computed as:

H^l = Transformer_l(H^{l-1})

where l indexes the Transformer layers. The final output H^L is:

H^L = (h^L_1, h^L_2, …, h^L_n)

where l denotes the layer number, l ∈ [1, L]; L denotes the total number of Transformer layers; and h^L_i denotes the contextual representation of the input x_i. In each Transformer module a multi-head attention mechanism is added to aggregate the outputs and to mark the parts of the output sequence that need attention. The self-attention A_l of the l-th Transformer layer is computed as:

A_l = softmax(Q·K^T / √d_k + M)·V_l

where A_l is the self-attention weight; softmax is the normalized exponential function; Q, K, V are obtained from the input X_i by linear transformation; V_l is the Value of the l-th layer; M denotes the Mask matrix; d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, which prevents the inner product of Q and K from becoming too large and plays a scaling role; and T denotes the transpose. In the formula,

Q = H^{l-1}·W_Q^l, K = H^{l-1}·W_K^l, V = H^{l-1}·W_V^l

are linear projections of the previous layer's output H^{l-1} onto the Queries, Keys and Values, with projection parameter matrices W_Q^l, W_K^l and W_V^l respectively. The Mask matrix M controls whether a token is allowed to be attended to; different Mask matrices M are used to control attention over different contexts, and the copy mechanism is introduced to solve the out-of-vocabulary and repeated-word problems that arise during generation.
Example 15
Example 14 is repeated, except that generating the text abstract in step 3) further includes: at decoding time t, a correlation weight is computed from the last-layer Transformer output H_t and the Decoder output O_j as:

e^t_j = H_t^T · W_c · O_j

where W_c is an initialized matrix. The attention distribution of the j-th word is then computed as:

a^t_j = exp(e^t_j) / Σ_{k=1}^{N} exp(e^t_k)

where N is the number of words in the sentence; exp is the exponential function with the natural constant e as its base; u is a hyperparameter; t is the time step; k denotes the index of the input, with value range [1, N]; and j denotes the j-th word. The attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query. An information-weighted average over the attention distribution yields the context representation vector h'_t:

h'_t = Σ_{j=1}^{N} a^t_j · H_j

where h'_t, also called the context vector, represents the information of interest obtained from the attention distribution; H_t is the output of the last Transformer layer at time t; a^t_j is the attention distribution of the j-th word; j is the index of the sequence; and N is the number of words in the sentence.

The context vector is concatenated with the Decoder output O_j and passed through two linear layers to produce the vocabulary distribution P_vocab, computed as:

P_vocab = softmax(V'·(V·[h'_t; O_j] + b) + b')

where V', V, b and b' are learnable parameters; h'_t is the context representation vector; O_j is the output of the Decoder; and P_vocab is the probability distribution over all words in the vocabulary.

A copy gating function g_t ∈ [0, 1] is reintroduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary. g_t is computed as:

g_t = σ(W_g·[H_t; O_j] + b_g)

where W_g and b_g are learnable parameters; H_t is the output of the last Transformer layer at time t; O_j is the output of the Decoder; and σ is the sigmoid function. The formula indicates that, at time t, whether the next word is newly generated or directly copied is decided according to the attention weights of the j-th word and the other words. For each document, the words in the vocabulary are combined with all words appearing in the source document to form a new word bank, i.e. an extended word bank. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended word bank, so the final probability is computed as:

P_t(w) = g_t·Σ_{k: x_k = w} a^t_k + (1 − g_t)·P_vocab(w)

where P_vocab(w) denotes the probability that the current word w is generated from the given vocabulary, and Σ_{k: x_k = w} a^t_k denotes the probability of selecting a copy from the source document according to the attention distribution. If w is an out-of-vocabulary word, then P_vocab(w) = 0; if w does not appear in the source document, then Σ_{k: x_k = w} a^t_k = 0. The final loss function over the probability distribution of the extended word bank is:

Loss = −(1/T)·Σ_{t=1}^{T} log P_t(w*_t)

where T is the total number of decoding steps, w*_t is the target word at time step t, and P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution. The text abstract is finally generated automatically according to the generation probability and the vocabulary distribution.
Example 16
Example 15 is repeated. As shown in fig. 3, for example, the input is a sentence along the lines of "pay the plaintiff five thousand yuan within seven days of the judgment", and the predicted sentence is "pay the plaintiff the payment". A [CLS] token is placed before the input sentence as the current sentence vector, and a [SEP] token is input at the end of the input sentence to mark the clause boundary used to segment the sentences of the text. The sentence vectors X = (x1, x2, …, xn) = Unilm(sent1, sent2, sent3, …, sentn) obtained after the pre-training layer of the Unilm model are input to the Transformer layers, and the final probability is computed as:

P_t(w) = g_t·Σ_{k: x_k = w} a^t_k + (1 − g_t)·P_vocab(w)

yielding the output "pay the plaintiff the payment [SEP]".

Claims (10)

1. A two-stage hybrid automatic summarization method for judicial official documents is characterized by comprising the following steps:
1) calculating the similarity of key sentences in the referee document, encoding and classifying the key sentences with the extractive summarization model, and finally extracting the abstract key sentences;
2) extracting sentences from the referee document to combine into a key sentence collection;
3) taking the key sentence collection of step 2) as the input of a generative model, and generating a text abstract through model encoding and decoding.
2. The two-stage hybrid automatic summarization method for judicial official documents according to claim 1, wherein the calculating the similarity of key sentences in step 1) comprises:
step 1.1) the referee document is divided into sentences; for each manually written standard abstract sentence, the sentence with the highest similarity is then found in the original text and used as the label data set of the abstract; the similarity score between the sentences of the manual abstract and the sentences of the source document is calculated by cosine similarity, and the highest-scoring sentences in the source document are selected, namely the key sentences.
3. The two-stage hybrid automatic summarization method for judicial official documents according to claim 1 or 2, wherein step 1) further comprises:
step 1.2) vectorizing the text, wherein the sentences obtained after the similarity calculation are aligned with the original text of the referee document, and the source text, the label data and the manual abstract are segmented into words with jieba; during word segmentation, crawled legal terms are used to supplement the word bank, and word vectorization is then performed with a BERT model.
4. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-3, wherein the encoding of the summarization model of key sentences in step 1) comprises:
extracting model coding; at the coding layer, the word embedding adopts target word embedding vectors, and for a text of n sentences D = {S1, S2, ……, Sn}, preprocessing is done with two special tokens; first, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token is inserted at the end of each sentence to form the input; the [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundary used to segment the sentences of the text; on the basis of the word embedding, input position embedding and segment embedding are also provided;
the position embedding; the position information of a word is encoded into a feature vector, and the position vector adopts the scheme of Attention Is All You Need:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model));
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model));
in the formulas, pos represents the position of the word in the sentence, with value range [0, n]; i refers to the dimension index of the word vector; the d_model of the BERT input is 128-;
the segment embedding; used for distinguishing two sentences, different sentences are preceded by A and B labels respectively, so that the input sentence sequence is represented as (EA, EB, EA, EB, ……); the word embedding, position embedding and segment embedding representations are concatenated as the BERT model input; the sentence vectors obtained after the pre-training layer of the BERT model are X = (X1, X2, ……, Xn) = BERT(sent1, sent2, sent3, ……, sentn), wherein senti represents the i-th sentence of the original referee document, and Xi is the BERT-encoded vector corresponding to senti, i.e. the i-th vector sequence to be processed.
5. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-4, wherein classifying the summarization models of key sentences in step 1) comprises:
a classification layer adopts a dilated residual gated convolutional neural network structure, i.e. DRGCNN; the abstract key sentences are extracted by stacking several DRGCNN layers, wherein the number of DRGCNN layers is 6-10, preferably 7-8, and the dilation coefficients of the layers are 1, 2, 4, 8, 1 and 1 respectively; for the original input sequence of the convolutional network X = (X1, X2, ……, Xn) with convolution kernel W, the feature map Ci of an arbitrary convolution operation is calculated as:

Ci = Wc ⊗ x_{i±k}

wherein Wc represents a one-dimensional convolution kernel, also called a weight coefficient, and is a learnable parameter; k represents the distance from the input index i; n represents the number of words in the sentence; x_{i±k} represents the word vectors of the k words before or after the i-th word; the resulting feature map represents the degree of association between the input Xi and its context;

the convolution width is expanded by adding a dilation coefficient α; when α = 1, the dilated convolution operation is equivalent to an ordinary full convolution operation; when α > 1, the dilated convolution can learn more distant context information, and the feature map Ci is calculated as:

Ci = Wc ⊗ x_{i±αk}

wherein α is the dilation coefficient; Wc represents a one-dimensional convolution kernel, also called a weight coefficient, and is a learnable parameter; k represents the distance from the input index i; the resulting feature map represents the degree of association between the input Xi and its context; on the basis of the feature map Ci, a gated convolutional neural network is introduced, and the output is calculated as:

Y = convD1(X) ⊗ σ(convD2(X))

wherein convD1 and convD2 represent one-dimensional convolution functions; X represents the sentence vector; ⊗ represents point-wise multiplication; σ is the gating function; convD1 and convD2 are two convolution operations whose weights are not shared;

a residual structure is introduced on the basis of the gating mechanism, and the output is calculated as:

Y = X + convD1(X) ⊗ σ(convD2(X))

wherein convD1 and convD2 represent one-dimensional convolution functions; X represents the sentence vector; ⊗ represents point-wise multiplication; σ is the gating function;

a fully connected layer then performs the binary classification of whether a sentence is a key sentence; during training, cross entropy is selected as the loss function, expressed as:

Loss = −[ŷ_i·log(y_i) + (1 − ŷ_i)·log(1 − y_i)]

wherein ŷ_i represents the label data of sample i, with 1 for the positive class and 0 for the negative class; y_i represents the probability that sample i is predicted as the positive class; Loss represents the loss function.
6. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 2-4, wherein in step 2) the sentences extracted from the referee document through the encoding and classification of the extractive summarization model are combined into a key sentence collection that serves as the input of the generation model.
7. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-6, wherein the generative model in step 3) comprises: combining the key sentence collection as the input of the generation model, and generating the text abstract by encoding and decoding this input; the model encoding adopts the Unilm pre-trained language model, and the model input consists of word embedding, segment embedding and position embedding;
for the word embedding, a text of n sentences D = {S1, S2, ……, Sn} is preprocessed with two special tokens; first, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token is inserted at the end of each sentence to form the input; the [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundary used to segment the sentences of the text;
segment embedding is used to distinguish two sentences: different sentences are preceded by A and B labels respectively, so that the input sentence sequence is represented as (EA, EB, EA, EB, ……).
8. The two-stage hybrid automatic summarization method for judicial official documents according to claim 7, wherein: the position embedding in the model input is a hierarchically decomposed position code; denoting the position coding vectors trained by BERT as p1, p2, p3, …, pn, a new set of position codes q1, q2, q3, …, qm is constructed by the formula:

q_{(i-1)×n+j} = α·u_i + (1−α)·u_j

wherein q_{(i-1)×n+j} is the position code of the (i-1)×n+j-th position; α is a hyperparameter with a value of 0.4; i is the i-th word; j is the j-th word; u_i, u_j are the base vectors of the q vectors, represented by the trained position vectors p; pos represents the position of the word in the sentence, with value range [0, n]; the position (i-1)×n+j is hierarchically represented as (i, j) by the formula, and the position codes corresponding to i and j are u_i and u_j respectively; since q1 = p1, q2 = p2, ……, qn = pn, u_i is calculated as:

u_i = (p_i − α·p_1) / (1 − α);

word embedding, position embedding and segment embedding are concatenated as the input of the Unilm model, and the sentence vectors obtained after the pre-training layer of the Unilm model are X = (x1, x2, …, xn) = Unilm(sent1, sent2, sent3, …, sentn).
9. The two-stage hybrid automatic summarization method for judicial official documents according to claim 7 or 8, wherein the decoding that generates the text abstract in the generation model of step 3) comprises: the abstract generation learns document-level features through Transformer layers with a multi-layer attention mechanism; a copy mechanism, comprising copying and generating, is introduced in the decoding process of the model; for the multi-layer Transformer backbone network, given an input text sequence of length n, X = (x1, x2, …, xn), the output H^0 of the first Transformer layer is calculated as H^0 = Transformer_0(X); the output H^l after the l-th Transformer layer is calculated as:

H^l = Transformer_l(H^{l-1});

wherein l indexes the Transformer layers; the final output H^L is:

H^L = (h^L_1, h^L_2, …, h^L_n);

wherein l denotes the layer number, l ∈ [1, L]; L denotes the total number of Transformer layers; h^L_i denotes the contextual representation of the input x_i;
in each Transformer module, a multi-head attention mechanism is added to aggregate the outputs and to mark the parts of the output sequence that need attention; the self-attention A_l of the l-th Transformer layer is calculated as:

A_l = softmax(Q·K^T / √d_k + M)·V_l

wherein A_l is the self-attention weight; softmax is the normalized exponential function; Q, K, V are obtained from the input X_i by linear transformation; V_l is the Value of the l-th layer; M represents the Mask matrix; d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, which prevents the inner product of Q and K from becoming too large and plays a scaling role; T denotes the transpose; in the formula,

Q = H^{l-1}·W_Q^l, K = H^{l-1}·W_K^l, V = H^{l-1}·W_V^l

are linear projections of the previous layer's output H^{l-1} onto the Queries, Keys and Values, with projection parameter matrices W_Q^l, W_K^l and W_V^l respectively; the Mask matrix M controls whether a token is allowed to be attended to, different Mask matrices M are used to control attention over different contexts, and the copy mechanism is introduced to solve the out-of-vocabulary and repeated-word problems that arise during generation.
10. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 7-9, wherein generating the text summary further comprises: at decoding time t, a correlation weight is calculated from the last-layer Transformer output H_t and the Decoder output O_j as:

e^t_j = H_t^T · W_c · O_j

wherein W_c is an initialized matrix; the attention distribution of the j-th word is calculated at the same time as:

a^t_j = exp(e^t_j) / Σ_{k=1}^{N} exp(e^t_k)

wherein N is the number of words in the sentence; exp is the exponential function with the natural constant e as its base; u is a hyperparameter; t is the time step; k denotes the index of the input, with value range [1, N]; j denotes the j-th word;
the attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query; an information-weighted average over the attention distribution yields the context representation vector h'_t:

h'_t = Σ_{j=1}^{N} a^t_j · H_j

wherein h'_t, also called the context vector, represents the information of interest obtained from the attention distribution; H_t is the output of the last Transformer layer at time t; a^t_j is the attention distribution of the j-th word; j is the index of the sequence; N is the number of words in the sentence;
the context vector is concatenated with the Decoder output O_j and passed through two linear layers to produce the vocabulary distribution P_vocab, calculated as:

P_vocab = softmax(V'·(V·[h'_t; O_j] + b) + b')

wherein V', V, b and b' are learnable parameters; h'_t is the context representation vector; O_j is the output of the Decoder; P_vocab is the probability distribution over all words in the vocabulary;
a copy gating function g_t ∈ [0, 1] is reintroduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary; g_t is calculated as:

g_t = σ(W_g·[H_t; O_j] + b_g)

wherein W_g and b_g are learnable parameters; H_t is the output of the last Transformer layer at time t; O_j is the output of the Decoder; σ is the sigmoid function; the formula indicates that, at time t, whether the next word is newly generated or directly copied is decided according to the attention weights of the j-th word and the other words;
for each document, the words in the vocabulary are combined with all words appearing in the source document to form a new word bank, i.e. an extended word bank; whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended word bank, so the final probability is calculated as:

P_t(w) = g_t·Σ_{k: x_k = w} a^t_k + (1 − g_t)·P_vocab(w)

wherein P_vocab(w) represents the probability that the current word w is generated from the given vocabulary, and Σ_{k: x_k = w} a^t_k represents the probability of selecting a copy from the source document according to the attention distribution; if w is an out-of-vocabulary word, then P_vocab(w) = 0; if w does not appear in the source document, then Σ_{k: x_k = w} a^t_k = 0; the final loss function over the probability distribution of the extended word bank is:

Loss = −(1/T)·Σ_{t=1}^{T} log P_t(w*_t)

wherein T is the total time, w*_t is the target word at time step t, and P_t(w) is the generation probability calculated from the vocabulary distribution and the attention distribution; the generation probability P_t(w) is calculated from the vocabulary distribution and the attention distribution, and the text abstract is finally generated automatically according to the generation probability and the vocabulary distribution.
CN202111494073.7A 2021-12-08 2021-12-08 Two-stage hybrid automatic summarization method for judicial official documents Pending CN114169312A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111494073.7A CN114169312A (en) 2021-12-08 2021-12-08 Two-stage hybrid automatic summarization method for judicial official documents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111494073.7A CN114169312A (en) 2021-12-08 2021-12-08 Two-stage hybrid automatic summarization method for judicial official documents

Publications (1)

Publication Number Publication Date
CN114169312A true CN114169312A (en) 2022-03-11

Family

ID=80484516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111494073.7A Pending CN114169312A (en) 2021-12-08 2021-12-08 Two-stage hybrid automatic summarization method for judicial official documents

Country Status (1)

Country Link
CN (1) CN114169312A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061862A (en) * 2019-12-16 2020-04-24 湖南大学 Method for generating abstract based on attention mechanism
CN111897949A (en) * 2020-07-28 2020-11-06 北京工业大学 Guided text abstract generation method based on Transformer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU GUOJING: "Research on Automatic Text Summarization Technology Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, pages 138 - 2817 *
WANG YIZHEN ET AL.: "Two-stage Automatic Summarization of Civil Judgment Documents", Data Analysis and Knowledge Discovery, pages 104 - 114 *
SU JIANLIN: "Hierarchically Decomposed Position Encoding, Letting BERT Handle Extremely Long Texts", pages 1 - 4, Retrieved from the Internet <URL:https://kexue.fm/archives/7947> *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691858A (en) * 2022-03-15 2022-07-01 电子科技大学 Improved UNILM abstract generation method
CN114691858B (en) * 2022-03-15 2023-10-03 电子科技大学 Improved UNILM digest generation method
CN114996442A (en) * 2022-05-27 2022-09-02 北京中科智加科技有限公司 Text abstract generation system combining abstract degree judgment and abstract optimization
CN114996442B (en) * 2022-05-27 2023-07-11 北京中科智加科技有限公司 Text abstract generation system combining abstract degree discrimination and abstract optimization
CN115809329A (en) * 2023-01-30 2023-03-17 医智生命科技(天津)有限公司 Method for generating abstract of long text
CN115982343A (en) * 2023-03-13 2023-04-18 阿里巴巴达摩院(杭州)科技有限公司 Abstract generation method, method and device for training abstract generation model
CN115982343B (en) * 2023-03-13 2023-08-22 阿里巴巴达摩院(杭州)科技有限公司 Abstract generation method, and method and device for training abstract generation model
CN117151069A (en) * 2023-10-31 2023-12-01 中国电子科技集团公司第十五研究所 Security scheme generation system
CN117151069B (en) * 2023-10-31 2024-01-02 中国电子科技集团公司第十五研究所 Security scheme generation system
CN117875268A (en) * 2024-03-13 2024-04-12 山东科技大学 Extraction type text abstract generation method based on clause coding
CN117875268B (en) * 2024-03-13 2024-05-31 山东科技大学 Extraction type text abstract generation method based on clause coding

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110083831B (en) Chinese named entity identification method based on BERT-BiGRU-CRF
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN114169312A (en) Two-stage hybrid automatic summarization method for judicial official documents
CN110008469B (en) Multilevel named entity recognition method
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN110866399B (en) Chinese short text entity recognition and disambiguation method based on enhanced character vector
CN111563375B (en) Text generation method and device
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
CN112836046A (en) Four-risk one-gold-field policy and regulation text entity identification method
CN111209749A (en) Method for applying deep learning to Chinese word segmentation
CN116151256A (en) Small sample named entity recognition method based on multitasking and prompt learning
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN110276396B (en) Image description generation method based on object saliency and cross-modal fusion features
CN110222338B (en) Organization name entity identification method
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
CN115310448A (en) Chinese named entity recognition method based on combining bert and word vector
CN115438674B (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN112966117A (en) Entity linking method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN113065349A (en) Named entity recognition method based on conditional random field
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
Gu et al. Named entity recognition in judicial field based on BERT-BiLSTM-CRF model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination