CN114169312A - Two-stage hybrid automatic summarization method for judicial official documents - Google Patents
- Publication number
- CN114169312A (application CN202111494073.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- sentence
- input
- sentences
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
A two-stage hybrid automatic summarization method for judicial official documents comprises the following steps: 1) calculating the similarity between sentences of the referee document and the reference summary, encoding and classifying the sentences with an extractive summarization model, and extracting the key summary sentences; 2) combining the extracted sentences into a key sentence collection; 3) using the key sentence collection of step 2) as the input of a generative model, which produces the text summary through model encoding and decoding. The invention condenses and refines the long texts of referee documents so that accurate, useful information is distilled into the summary. Summaries generated by the method are highly readable and coherent, and remain faithful to the source text.
Description
Technical Field
The invention belongs to the technical field of official document data processing, and in particular relates to a two-stage hybrid automatic summarization method for judicial official documents.
Background
With the rapid development of the information age, the volume of data on the internet grows exponentially. Text summarization technology condenses and summarizes textual information to extract the gist of an article; using the summary in place of the original text for indexing effectively shortens retrieval time, reduces redundant information in retrieval results, and lets users efficiently obtain the information they need from large amounts of data.
Existing intelligent systems, such as internet courts, generally serve as aids for legal workers, for example by extracting information from referee documents with techniques such as semantic analysis, or by constructing relationships between legal elements through manual processing. Referee documents follow a standard format, but their content is exhaustive and lengthy. At present, summaries are generated by extracting and concatenating the words, phrases, and sentences with the highest weights, so the resulting summaries have poor semantic coherence: legal and judicial knowledge is not effectively integrated, and the generated summaries are incoherent and inaccurate. A method for generating referee document summaries that guarantees their coherence and accuracy is therefore needed.
Judicial official documents are the final carriers of judicial activity, and existing documents are an important basis for assisting criminal decision-making and standardizing sentencing. However, roughly 120 million such documents have already been made public, and how to obtain useful information from so many documents is an urgent problem. Automatic summarization technology can condense and refine long texts, representing a long original text by a short summary, and is therefore an important means of alleviating information overload.
By generation mode, automatic text summarization divides into extractive and abstractive methods. The extractive method treats summarization as a classification task, judging whether each sentence belongs in the summary; it stays faithful to the original text, but because sentences are extracted and spliced directly from the source, the resulting summaries read poorly and lack coherence. The abstractive method is closer to how humans summarize: a deep learning model learns from large amounts of text data, encodes and decodes the text, and produces a summary of the extracted content by rephrasing and substitution. Rather than extracting sentences directly from the source document, the abstractive summary generates new sentences to replace the original ones. Although the abstractive method can produce new sentences, the generated sentences easily deviate from the original meaning, fidelity is not guaranteed, and information is easily lost on long texts. These drawbacks are especially prominent when a single extractive or abstractive method is applied to judicial official documents, which are extremely long texts. The present invention therefore provides a two-stage hybrid automatic summarization method that combines an extractive method with an abstractive method and effectively solves the above problems.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a two-stage hybrid automatic summarization method for judicial official documents. First, a key sentence collection is formed in an extractive manner; second, that collection is used as the input of an abstractive model, which generates the text summary through model encoding and decoding. The full text of the referee document is condensed and refined, reducing the length of the text to be summarized, ensuring the fidelity of the generated summary to the original meaning as well as its readability and coherence, and avoiding the labor cost and low reliability of manually written summaries.
In order to solve the problems, the following technical scheme is provided:
a two-stage hybrid automatic summarization method for judicial official documents comprises the following steps:
1) Calculating the similarity between sentences of the referee document and the reference summary, encoding and classifying the sentences with an extractive summarization model, and finally extracting the key summary sentences.
2) Sentences are extracted from the referee documents and combined into a key sentence collection.
3) Using the key sentence collection of step 2) as the input of a generative model, which produces the text summary through model encoding and decoding.
Preferably, the calculating the similarity of the key sentences in the step 1) includes:
Step 1.1) Split the referee document into sentences; for each sentence of the manually written reference summary, find the most similar sentence in the original text, which serves as the label data set for the extractive summarization. Compute the similarity score between each sentence of the manual summary and each sentence of the source document by cosine similarity, and select the highest-scoring source sentence as a key sentence.
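The labeling step above can be sketched as follows. This is a minimal illustration, not the patented implementation: bag-of-words vectors and whitespace tokenization are simplifying assumptions (the patent leaves the sentence representation for this step unspecified beyond the use of cosine similarity).

```python
# Hedged sketch of the key-sentence labeling step: for each sentence of the
# human-written summary, find the most similar source sentence by cosine
# similarity. Bag-of-words vectors are an illustrative assumption here.
from collections import Counter
import math

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over the shared vocabulary, normalized by vector norms.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def label_key_sentences(summary_sents, source_sents):
    # For every summary sentence pick the highest-scoring source sentence;
    # the selected indices become positive labels for the extractive model.
    vecs = [Counter(s.split()) for s in source_sents]
    labels = set()
    for sent in summary_sents:
        sv = Counter(sent.split())
        scores = [cosine_similarity(sv, v) for v in vecs]
        labels.add(max(range(len(scores)), key=scores.__getitem__))
    return sorted(labels)
```

The selected indices then serve as the positive-class labels for training the extractive classifier described below.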
Preferably, step 1) further comprises:
Step 1.2) Vectorize the text: each sentence obtained from the similarity calculation is aligned with the corresponding line of the original referee document, and the source text, the label data, and the manual summary are segmented into words with jieba. During word segmentation, crawled legal terms supplement the lexicon, and the words are then vectorized with a BERT model.
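The lexicon supplement can be illustrated as follows. The patent uses jieba for segmentation (jieba exposes `jieba.add_word` / `jieba.load_userdict` for custom entries); to keep the sketch self-contained, a toy forward-maximum-matching segmenter stands in, and the legal terms listed are assumed examples, not from the patent.

```python
# Minimal sketch of supplementing a segmentation lexicon with crawled legal
# terms. A forward-maximum-matching segmenter stands in for jieba so the
# effect of adding multi-character legal nouns is visible in isolation.
LEGAL_TERMS = {"裁判文书", "上诉人", "被上诉人"}  # assumed crawled legal nouns

def segment(text: str, lexicon: set) -> list:
    # Forward maximum matching: prefer the longest lexicon entry at each
    # position, falling back to single characters.
    out, i = [], 0
    max_len = max((len(w) for w in lexicon), default=1)
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in lexicon or L == 1:
                out.append(text[i:i + L])
                i += L
                break
    return out
```

With the legal terms in the lexicon, "上诉人" is kept as one token instead of being split into single characters, which is the effect the lexicon supplement is meant to achieve before BERT vectorization.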
Preferably, the encoding of the abstract model of the key sentence in step 1) includes:
and (5) extracting model coding. At the coding layer, word embedding adopts target word embedding vector, and for a text with n sentences, D ═ S1,S2,……,SnPre-treatment by two special markers. First, [ CLS ] is inserted at the beginning of each sentence]Sign, sentence end insertion [ SEP ]]The tokens constitute the input. [ CLS]The token represents the vector of the current sentence, [ SEP ]]The tokens represent clauses used to segment sentences in the text. On the basis of word embedding, input position embedding and segmentation embedding are further arranged.
Preferably, the position embedding. The position information of each word is encoded into a feature vector; the position vector follows the scheme of "Attention Is All You Need":
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
In the formulas, pos is the position of the word in the sentence, with value range [0, n]; i indexes the dimension of the word vector; and the BERT input dimension d_model is 128 to 1024, preferably 256 to 512.
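The sinusoidal position embedding above can be sketched directly from the two formulas: even dimensions use sine and odd dimensions cosine, with wavelength 10000^(2i/d_model).

```python
# Sketch of the sinusoidal position embedding described above
# (the "Attention Is All You Need" scheme).
import math

def position_encoding(pos: int, d_model: int) -> list:
    pe = []
    for dim in range(d_model):
        i = dim // 2  # the pair index shared by the sin/cos dimensions
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return pe
```

Because neighboring positions get smoothly varying angles, the model can recover relative word order from these vectors without any learned position parameters.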
Preferably, the segment embedding. To distinguish two sentences, alternating sentences are labeled A and B, so the input sequence is represented as (E_A, E_B, E_A, E_B, ...). The word, position, and segment embedding representations are concatenated as the BERT model input. After the BERT pre-training layer, the sentence vectors are X = (X_1, X_2, ..., X_n) = BERT(sent_1, sent_2, sent_3, ..., sent_n), where sent_i is the i-th sentence of the original referee document and X_i is the BERT-encoded vector corresponding to sent_i, i.e., the i-th vector sequence to be processed.
Preferably, the classifying the abstract model of the key sentence in step 1) includes:
and (4) a classification layer, wherein an expansion residual gated convolutional neural network structure is adopted, namely the expansion residual gated convolutional neural network structure is DRGCNN. The key sentence extraction of the abstract is carried out by stacking a plurality of layers of DRGCNN networks, the number of the layers of the DRGCNN is 6-10, preferably 7-8, and the expansion coefficient of each layer is 1, 2, 4, 8, 1 and 1 respectively. Original input sequence for convolutional network X ═ X (X)1,X2,……,Xn) With convolution kernel W, signature C of arbitrary convolution operationiThe calculation formula of (2) is as follows:
in the formula, WcRepresenting a one-dimensional volumeThe product-kernel, also called weight coefficient, is a learnable parameter. k denotes a distance from the input identity i. n represents the number of words in the sentence. x is the number ofi±kA word vector representing k words forward or backward from the ith word. The resulting signature graph may represent input XiThe degree of association with the context.
Preferably, the receptive field of the convolution is widened by adding a dilation coefficient α. When α = 1, the dilated convolution reduces to an ordinary full convolution; when α > 1, the dilated convolution can learn more distant context information, and the feature map C_i is computed as:

C_i = W_c · x_{i±αk}

where α is the dilation coefficient, W_c is the one-dimensional convolution kernel (weight coefficient), a learnable parameter, and k is the distance from input position i; the resulting feature map represents the degree of association between input X_i and its context. On top of the feature map C_i, a gated convolutional neural network is introduced, with output:

Y = convD1(X) ⊗ σ(convD2(X))

In the formula, convD1 and convD2 are one-dimensional convolution functions, X is the sentence vector, ⊗ denotes pointwise multiplication, and σ is the gating function. convD1 and convD2 are two convolution operations whose weights are not shared.
Preferably, a residual structure is introduced on top of the gating mechanism, with output:

Y = X + convD1(X) ⊗ σ(convD2(X))

In the formula, convD1 and convD2 are one-dimensional convolution functions, X is the sentence vector, ⊗ denotes pointwise multiplication, and σ is the gating function.
Preferably, the sentences are further classified into two classes by a fully connected layer. During training, cross entropy is used as the loss function, expressed as:

Loss = −Σ_i [ŷ_i·log(y_i) + (1 − ŷ_i)·log(1 − y_i)]

In the formula, ŷ_i is the label of sample i, with 1 for the positive class and 0 for the negative class; y_i is the predicted probability that sample i is positive; and Loss is the loss function.
Preferably, in step 2) the sentences extracted from the referee document through the encoding and classification of the extractive summarization model are combined into the key sentence collection that serves as the input of the generative model.
Preferably, the generative model of step 3) includes: using the combined key sentence collection as the input of the generative model, which produces the text summary by encoding and decoding that input. The model encoding uses the UniLM pre-trained language model, whose input consists of word embedding, segment embedding, and position embedding.
Preferably, for the word embedding, a text of n sentences D = {S_1, S_2, ..., S_n} is preprocessed with two special tokens: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the boundary used to segment sentences in the text.
Preferably, segment embedding is used to distinguish two sentences: alternating sentences are labeled A and B, so the input sequence is represented as (E_A, E_B, E_A, E_B, ...).
Preferably, the position embedding in the model input is a hierarchically decomposed position code. From the position coding vectors p_1, p_2, p_3, ..., p_n trained with BERT, a new set of position codes q_1, q_2, q_3, ..., q_m is constructed by the formula:

q_{(i−1)×n+j} = α·u_i + (1 − α)·u_j

In the formula, q_{(i−1)×n+j} is the position code of position (i−1)×n+j; α is a hyperparameter, taken as 0.4; i indexes the i-th word and j the j-th word; and u is a base vector of the q vectors, represented through the trained position vectors p. pos is the position of the word in the sentence, with value range [0, n]. The formula represents position (i−1)×n+j hierarchically as the pair (i, j), whose position codes are u_i and u_j respectively. Because q_1 = p_1, q_2 = p_2, ..., q_n = p_n, the u_i can be computed, namely u_1 = p_1 and u_j = (p_j − α·u_1)/(1 − α) for j > 1. Word embedding, position embedding, and segment embedding are concatenated as the input of the UniLM model, and after the UniLM pre-training layer the sentence vectors are X = (x_1, x_2, ..., x_n) = UniLM(sent_1, sent_2, sent_3, ..., sent_n).
Preferably, decoding the generated text summary in the generative model of step 3) includes: summary generation learns document-level features through the Transformer layers of a multi-layer attention mechanism, and a copy mechanism, comprising copying and generating, is introduced into the decoding process of the model. For a multi-layer Transformer backbone network, given an input text sequence of length n, X = (x_1, x_2, ..., x_n), the output of the first Transformer layer is computed as H^0 = Transformer_0(X). The output after l Transformer layers is computed as:

H^l = Transformer_l(H^{l−1})

where l indexes the Transformer layer, with l ∈ [1, L]. The final output is:

H^L = (h_1^L, h_2^L, ..., h_n^L)

where L is the total number of Transformer layers and h_i^L denotes the contextual representation of input x_i.
Preferably, in each Transformer module a multi-head attention mechanism is added to aggregate the output modules, marking the parts of the output sequence that require attention. The self-attention A^l of the l-th Transformer layer is computed as:

A^l = softmax(Q·K^T / √d_k + M)·V^l

In the formula, A^l is the self-attention weight; softmax is the normalized exponential function; Q, K, and V are obtained from the input X_i by linear transformation; V^l is the Value of the l-th layer; and M is the mask matrix. d_k is the number of columns of the Q and K matrices, i.e., the vector dimension; dividing by √d_k prevents the inner product of Q and K from becoming too large and thus has a regulating effect. T denotes the transpose. Q, K, and V are linear projections of the previous layer onto Queries, Keys, and Values, with projection parameters W_Q^l, W_K^l, and W_V^l respectively. The mask matrix M controls whether a token may be attended to; different mask matrices M control attention to different contexts, and the copy mechanism is introduced to solve the problems of out-of-vocabulary words and repeated words arising during generation.
Preferably, generating the text summary further comprises: at decoding time t, a correlation weight is computed from the output H_t of the last Transformer layer and the decoder output O_j as e_j^t = H_t^T·W_c·O_j, where W_c is an initialized matrix. At the same time, the attention distribution of the j-th word is computed as:

a_j^t = exp(e_j^t) / Σ_{k=1}^{N} exp(e_k^t)

In the formula, N is the number of words in the sentence; exp is the exponential function with natural base e; t is the time step; k indexes the input sequence, with value range [1, N]; and j denotes the j-th word. The attention distribution can be interpreted as the degree of attention the context query pays to the j-th word. A weighted average of the information under this attention distribution gives the context representation vector h'_t:

h'_t = Σ_{j=1}^{N} a_j^t·H_j^t

where h'_t, also called the context vector, represents the information of interest obtained from the attention distribution; H_j^t is the output of the last Transformer layer at time t for position j; a_j^t is the attention distribution of the j-th word; and N is the number of words in the sentence.
Preferably, the context vector is concatenated with the decoder output O_j, and the vocabulary distribution P_vocab is generated through two linear layers:

P_vocab = softmax(V'·(V·[h'_t ; O_j] + b) + b')

where V', V, b, and b' are learnable parameters, h'_t is the context representation vector, O_j is the output of the decoder, and P_vocab is the probability distribution over all words in the vocabulary.
Preferably, a copy gating function g_t ∈ [0, 1] is then introduced to decide whether the current output should be copied from the source document or generated as a new word from the vocabulary. g_t is computed as:

g_t = σ(W_g·[h'_t ; O_j] + b_g)

In the formula, W_g and b_g are learnable parameters, h'_t is the context vector from the last Transformer layer at time t, and O_j is the output of the decoder. The formula expresses that at time t, the attention weights of the j-th word relative to the other words determine whether the next word is newly generated or copied directly.
Preferably, for each document the words of the vocabulary are merged with all words occurring in the source document to form a new word store, i.e., an extended vocabulary. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended vocabulary, so the final probability is computed as:

P_t(w) = g_t·P_vocab(w) + (1 − g_t)·Σ_{j: w_j = w} a_j^t

where P_vocab(w) is the probability of generating the current word w from the given vocabulary, and Σ_{j: w_j = w} a_j^t is the probability of copying w from the source document according to the attention distribution.
Preferably, if w is an out-of-vocabulary word, then P_vocab(w) = 0; if w does not appear in the source document, then Σ_{j: w_j = w} a_j^t = 0. The final loss function over the extended-vocabulary probability distribution is:

Loss = −(1/T)·Σ_{t=1}^{T} log P_t(w_t*)

where T is the total number of time steps, w_t* is the target word at time t, and P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution.
The generation probability P_t(w) is computed from the vocabulary distribution and the attention distribution, and the text summary is finally generated automatically from the generation probability and the vocabulary distribution.
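The copy-mechanism mixture can be sketched as follows. The gate value, probabilities, and words are illustrative assumptions; the sketch only shows how the gate g_t blends vocabulary generation with copying over the extended vocabulary.

```python
# Hedged sketch of the final output distribution of the copy mechanism:
# P_t(w) = g_t * P_vocab(w) + (1 - g_t) * sum of attention on source
# positions where w occurs, over the extended vocabulary.
def final_distribution(g_t, p_vocab, attention, source_words):
    # p_vocab: {word: prob} over the fixed vocabulary;
    # attention: per-source-position weights summing to 1.
    extended = set(p_vocab) | set(source_words)
    p = {}
    for w in extended:
        gen = p_vocab.get(w, 0.0)              # 0 for out-of-vocabulary words
        copy = sum(a for a, sw in zip(attention, source_words) if sw == w)
        p[w] = g_t * gen + (1.0 - g_t) * copy  # 0 copy mass if w not in source
    return p
```

Because a source-only word receives all of its mass from the copy term, the decoder can emit names and legal terms absent from the fixed vocabulary, which is the stated purpose of introducing the copy mechanism.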
In the prior art, existing intelligent systems such as internet courts generally serve as aids for legal workers, for example by extracting information from referee documents with techniques such as semantic analysis, or by constructing relationships between legal elements through manual processing. Referee documents follow a standard format, but their content is exhaustive and lengthy. At present, summaries are generated by extracting and concatenating the words, phrases, and sentences with the highest weights, so the resulting summaries have poor semantic coherence: legal and judicial knowledge is not effectively integrated, and the generated summaries are incoherent and inaccurate. A method for generating referee document summaries that guarantees their coherence and accuracy is therefore needed.
In the invention, the referee document is first split into sentences; then, for each manually written reference sentence, the most similar sentence in the referee document is found and used as data for the extractive model. The similarity score between each sentence of the manual summary and each sentence of the source document (the referee document corpus) is computed by cosine similarity, and the highest-scoring source sentence is selected as a key sentence according to that score.
In the invention, text vectorization is performed: each sentence obtained from the similarity calculation is aligned with the corresponding line of the original referee document, and the source text, the label data, and the manual summary are segmented into words with jieba. During word segmentation, crawled legal terms supplement the lexicon, and word vectorization is performed with a BERT model. After the similarity calculation yields the preliminary key sentences, jieba segmentation of the source text, label data, and manual summary determines the head word (central word); legal terms crawled from the web are added to the lexicon during segmentation. Once the head word is determined, the association between the head word and the target word is added to the initial word embedding vector, yielding a fused word embedding vector that reflects the association between the target word and the head word; this fused vector is taken as the target-word embedding vector of the target word.
In the invention, the input of the extractive stage is based on word embedding, together with position embedding and segment embedding. A referee document of n sentences, D = {S_1, S_2, ..., S_n}, has a [CLS] token inserted at the beginning of each sentence and a [SEP] token at its end to form the input; the [CLS] token represents the vector of the current sentence, and the [SEP] token marks the boundary used to segment sentences in the text. The [CLS] and [SEP] tokens segment the sentences, so the semantics of each sentence are better captured and the accuracy of information extraction improves. Position embedding encodes the position information of each word into a feature vector; the position vector follows the scheme of "Attention Is All You Need":

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

In the formulas, pos is the position of the word in the sentence, with value range [0, n]; i indexes the dimension of the word vector; and the BERT input dimension d_model is 128 to 1024, preferably 256 to 512. Segment embedding distinguishes two sentences: alternating sentences are labeled A and B, so the input sequence is represented as (E_A, E_B, E_A, E_B, ...).
In the present invention, the word, position, and segment embedding representations are concatenated as the input of the BERT model, and after the BERT pre-training layer the sentence vectors are X = (X_1, X_2, ..., X_n) = BERT(sent_1, sent_2, sent_3, ..., sent_n), where sent_i is the i-th sentence of the original referee document and X_i is the BERT-encoded vector corresponding to sent_i, i.e., the i-th vector sequence to be processed. The encoding proceeds through a BERT preprocessing model: the word, position, and segment embeddings of the encoded sentences are combined through BERT plus global average pooling and output to a dense layer; the features extracted during encoding are transformed non-linearly to extract the correlations among them and finally mapped to the output space. Each character is embedded, with Chinese split into individual characters for learning, and classification by a fully connected layer and a softmax layer yields the classification result.
In the invention, the key sentences are subjected to feature learning through a Dilated Residual Gated Convolutional Neural Network (DRGCNN); compared with the more traditional convolutional neural network, the DRGCNN enhances the ability of a model to learn long-distance context semantic information; a gating mechanism (DGCNN) is introduced to control the flow direction of information, and a residual error mechanism is introduced to solve the problem of gradient disappearance and increase the multi-channel transmission of the information; extracting key abstract sentences by stacking 6-10 layers, preferably 7-8 layers of DRGCNN networks, wherein the expansion coefficients of each layer are 1, 2, 4, 8, 1 and 1 respectively; before the coding sequence is processed by adopting a self-attention mechanism, a residual error network and a gate control convolution are adopted to process data, and the coding sequence with a text relation is obtained. Original input sequence for convolutional network X ═ X (X)1,X2,……,Xn) The convolution kernel is W, and the arbitrary convolution operation obtains a feature map CiThe calculation formula of (2) is as follows:
In the formula, Wc denotes a one-dimensional convolution kernel, also called the weight coefficients, and is a learnable parameter; k denotes the distance from the input index i; n denotes the number of words in the sentence; and xi±k denotes the word vector k words forward or backward from the i-th word. The resulting feature map represents the degree of association between the input Xi and its context. The convolution width is expanded by adding a dilation coefficient α, and the network depth is increased by stacking dilated convolutional layers, which addresses the long-distance dependency problem of text sequences and the extraction of globally effective information. When α = 1, the dilated convolution is equivalent to an ordinary convolution; when α > 1, the dilated convolution can learn more distant contextual information, and the feature map Ci is calculated as:

Ci = Wc · xi±αk
where α is the dilation coefficient, Wc denotes a one-dimensional convolution kernel (the weight coefficients, a learnable parameter), and k denotes the distance from the input index i; the resulting feature map represents the degree of association between the input Xi and its context. On the basis of the feature map Ci, a gated convolutional neural network (DGCNN) is introduced, whose output is calculated as:

Y = conv1D1(X) ⊗ σ(conv1D2(X))
In the formula, conv1D1 and conv1D2 denote one-dimensional convolution functions, X denotes the sentence vectors, ⊗ denotes point-wise multiplication, and σ is the gating function; the two convolutions do not share weights. Passing one convolution through the gate and multiplying it point-wise with the other alleviates vanishing gradients in the neural network. If a plain network similar to a VGG (Visual Geometry Group) network is used, i.e. without residuals, it is empirically found that as the network depth increases, the training error first decreases and then increases (and this increase is not caused by overfitting but by the network becoming harder to train as it deepens). Greater depth is in principle better, but in practice, without residual connections, a deeper plain network is harder to optimize: as the depth increases, the training error grows, a phenomenon described as network degradation. A residual network helps to resolve vanishing gradients, exploding gradients, and network degradation, so that useful information is preserved while a deeper network is trained. Therefore, a residual structure is introduced on top of the gating mechanism, and the output is calculated as:

Y = X ⊗ (1 − σ) + conv1D1(X) ⊗ σ, with σ = σ(conv1D2(X))
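The residual gated dilated convolution described above can be sketched as follows, using scalar sequences for readability (real DRGCNN layers operate on vector channels). The kernel weights, the sigmoid gate, and the dilation schedule 1, 2, 4 (a prefix of the patent's 1, 2, 4, 8, 1, 1 scheme) are illustrative assumptions, not the specification's trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dilated_conv1d(xs, w, dilation):
    # 1-D convolution over a scalar sequence with kernel w and a dilation
    # factor, zero-padded so the output keeps the input length.
    k = len(w) // 2
    out = []
    for i in range(len(xs)):
        s = 0.0
        for j, wj in enumerate(w):
            idx = i + (j - k) * dilation
            if 0 <= idx < len(xs):
                s += wj * xs[idx]
        out.append(s)
    return out

def residual_gated_block(xs, w1, w2, dilation):
    # Y = X (1 - sigma) + conv1(X) sigma, sigma = sigmoid(conv2(X)):
    # the gate decides how much convolved signal replaces the input.
    c1 = dilated_conv1d(xs, w1, dilation)
    gate = [sigmoid(v) for v in dilated_conv1d(xs, w2, dilation)]
    return [x * (1 - g) + c * g for x, c, g in zip(xs, c1, gate)]

xs = [0.5, -0.2, 0.1, 0.4, -0.3, 0.2]
w1, w2 = [0.2, 0.5, 0.2], [0.1, 0.3, 0.1]
h = xs
for d in (1, 2, 4):          # stack blocks with growing dilation
    h = residual_gated_block(h, w1, w2, d)
print([round(v, 3) for v in h])
```

Because the residual path mixes the input back in through (1 − σ), information can flow around each convolution, which is what keeps deep stacks trainable.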
In the formula, conv1D1 and conv1D2 denote one-dimensional convolution functions, X denotes the sentence vectors, ⊗ denotes point-wise multiplication, and σ is the gating function. A fully connected layer then performs the binary classification deciding whether each sentence is a key sentence. During training, cross entropy (a measure of the difference between two probability distributions) is chosen as the loss function, expressed as:

Loss = −Σi [ŷi · log(yi) + (1 − ŷi) · log(1 − yi)]
where ŷi denotes the label of sample i, with 1 for the positive class and 0 for the negative class; yi denotes the probability that sample i is predicted as positive; and Loss denotes the loss function.
In the invention, the encoding and classification of the extractive summarization model extract a collection of key sentences from the referee document as the input of the generative model, which encodes and decodes this input to produce the text summary. Model encoding adopts the UniLM pre-trained language model: a pre-training data set is constructed through UniLM, target detection is performed on the text, and the result is treated as key text information and fed in through keyword embedding, where the model input consists of word embedding, position embedding, and segment embedding. The word embedding and segment embedding are the same as in the extractive model. The position embedding is a hierarchically decomposed position encoding: using the position encoding vectors p1, p2, p3, …, pn trained by BERT, a new set of position codes q1, q2, q3, …, qm with m > n is constructed by the formula:

q(i−1)×n+j = α · ui + (1 − α) · uj
In the formula, q(i−1)×n+j is the position code of position (i−1) × n + j; α is a hyperparameter, set to 0.4; i denotes the i-th word and j the j-th word; and u is a vector whose basis is expressed through the trained position vectors p. pos denotes the position of a word in the sentence, with range [0, n]. The formula represents position (i−1) × n + j hierarchically as the pair (i, j), whose corresponding position codes are ui and uj respectively. Because q1 = p1, q2 = p2, ……, qn = pn, the vectors ui and uj can be solved for. The word embedding, position embedding, and segment embedding are combined into the input of the UniLM model, and the sentence vectors obtained after the UniLM pre-training layer are X = (x1, x2, …, xn) = UniLM(sent1, sent2, sent3, …, sentn). The key paragraphs and key sentence information are input to the encoder of the UniLM model and encoded into a comprehensive semantic representation; decoding is finally carried out by Transformers, which helps to ensure that the summaries produced by the generative model combine diverse language organization with comprehensive coverage of the key knowledge points.
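The hierarchical extension of position codes can be sketched with scalars (real position codes are vectors; the scalar case shows the arithmetic). The recovery of u from p below follows from the constraint that the first n new codes must reproduce the original ones (q1 = p1 gives u1 = p1, then uj = (pj − α·u1)/(1 − α)); this derivation is an assumption consistent with the text, not quoted from it.

```python
ALPHA = 0.4  # the hyperparameter value given in the text

def extend_position_codes(p):
    # Build n*n codes q via q_{(i-1)*n+j} = alpha*u_i + (1-alpha)*u_j,
    # with u chosen so that q_k = p_k for k <= n.
    n = len(p)
    u = [p[0]] + [(pj - ALPHA * p[0]) / (1 - ALPHA) for pj in p[1:]]
    q = []
    for i in range(n):
        for j in range(n):
            q.append(ALPHA * u[i] + (1 - ALPHA) * u[j])
    return q

p = [0.1, 0.5, -0.2]          # toy scalar stand-ins for trained codes
q = extend_position_codes(p)  # 9 codes extended from 3
print(len(q), [round(v, 3) for v in q[:3]])
```

The first n entries of q coincide with p, so a model whose trained positions only cover length n can address positions up to n² without retraining.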
In the invention, the generative model learns document-level features through Transformer layers with a multi-layer attention mechanism, and a copy mechanism, comprising copying and generating, is introduced in the decoding process. The position vector and the word vector are added to obtain word vectors that contain word-order information; all such word vectors of each paragraph form a context-aware word vector set. For a multi-layer Transformer backbone, given an input text sequence X = (x1, x2, …, xn) of length n, the output of the first Transformer layer is H0 = Transformer0(X). The output Hl after l Transformer layers is calculated as:
Hl = Transformerl(Hl−1).
Here Transformerl is the l-th Transformer layer. The final output HL is calculated as:

HL = [h1L, h2L, …, hnL]
where l denotes the layer index, l ∈ [1, L]; L denotes the total number of Transformer layers; and hiL denotes the contextual representation of input xi.
Preferably, in each Transformer module, a multi-head attention mechanism is added to aggregate the output, marking the parts of the output sequence that need attention. The self-attention Al of the l-th Transformer layer is calculated as:

Al = softmax((Ql · KlT) / √dk + M) · Vl
In the formula, Al is the self-attention weight; softmax is the normalized exponential function; Ql, Kl, and Vl are obtained from the input of the layer by linear transformation, i.e. by projecting the previous layer onto Queries, Keys, and Values with the projection parameters WQ, WK, and WV; Vl is the Value of the l-th layer; M denotes the mask matrix; dk, the number of columns of the Q and K matrices (the vector dimension), prevents the inner product of Q and K from becoming too large and thus acts as a regulator; and T denotes the transpose. The mask matrix M controls which tokens may be attended to, and different mask matrices M control attention over different contexts; the copy mechanism is introduced to address the out-of-vocabulary and repeated words that arise during generation.
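Masked self-attention with a UniLM-style mask can be sketched as follows. The tiny 2-dimensional Q, K, V and the causal mask (each position attends only to itself and earlier positions) are illustrative assumptions; UniLM uses different masks for its bidirectional and seq2seq objectives.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def masked_self_attention(Q, K, V, M):
    # A = softmax(Q K^T / sqrt(d_k) + M) V, where M holds 0 for allowed
    # positions and a large negative number for masked-out positions.
    d_k = len(K[0])
    out = []
    for qi, mi in zip(Q, M):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) + m
                  for kj, m in zip(K, mi)]
        w = softmax(scores)
        out.append([sum(wk * vk[d] for wk, vk in zip(w, V))
                    for d in range(len(V[0]))])
    return out

NEG = -1e9  # effectively -infinity after softmax
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
M = [[0, NEG, NEG],     # position 1 sees only itself
     [0, 0, NEG],       # position 2 sees positions 1-2
     [0, 0, 0]]         # position 3 sees everything
A = masked_self_attention(Q, K, V, M)
print([[round(v, 3) for v in row] for row in A])
```

Since the first query can only attend to the first position, its output equals V's first row exactly, which is how the mask enforces left-to-right generation.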
In the invention, at decoding time t, the relevance weight between the last Transformer layer output ht and the decoder output Oj is computed as etj = htT · Wc · Oj, where Wc is an initialization matrix; the attention distribution of the j-th word is then calculated as:

atj = exp(etj) / Σk=1..N exp(etk)
In the formula, N is the number of words in the sentence; exp is the exponential function with natural base e; u is a hyperparameter; t is the time step; k is the input sequence index, with range [1, N]; and j denotes the j-th word. The attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query. An information-weighted average over the attention distribution gives the context representation vector h't:

h't = Σj=1..N atj · hj
where h't, also called the context vector, indicates the information of interest obtained from the attention distribution; ht is the output of the last Transformer layer at time t; atj is the attention distribution of the j-th word; j is the sequence index; and N is the number of words in the sentence. The context vector is concatenated with the decoder output Oj and passed through two linear layers to produce the vocabulary distribution Pvocab, calculated as:

Pvocab = softmax(V'(V · [h't; Oj] + b) + b')
where V', V, and b' are learnable parameters, h't is the context representation vector, Oj is the output of the decoder, and Pvocab is the probability distribution over all words in the vocabulary. A copy gating function gt ∈ [0, 1] is then introduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary. gt is calculated as:

gt = σ(Wg · [h't; Oj] + bg)
In the formula, Wg and bg are learnable parameters, ht is the output of the last Transformer layer at time t, and Oj is the output of the decoder. The formula expresses that at time t, whether the next word is newly generated or directly copied is decided according to the attention weights of the j-th word and the other words. For each document, the words of the vocabulary are combined with all words appearing in the source document to form a new word stock, the extended vocabulary. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended vocabulary, so the final probability is calculated as:

Pt(w) = gt · Pvocab(w) + (1 − gt) · Σi: wi = w ati
where Pvocab(w) denotes the probability that the current word w is generated from the given vocabulary, and Σi: wi = w ati denotes the probability of copying w from the source document according to the attention distribution. If w is an out-of-vocabulary word, then Pvocab(w) = 0; if w does not appear in the source document, then Σi: wi = w ati = 0. The final loss function over the extended-vocabulary probability distribution is:

Loss = −(1/T) · Σt=1..T log Pt(wt*)
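The extended-vocabulary mixture of the copy mechanism can be sketched directly from the formula above. The toy vocabulary, source tokens, attention weights, and gate value below are illustrative; in the model they would come from the softmax layers and gating function.

```python
def final_distribution(p_vocab, attn, src_words, g):
    # P(w) = g * P_vocab(w) + (1 - g) * sum of attention on the source
    # positions where w appears. Out-of-vocabulary source words enter the
    # distribution purely through the copy term (the extended vocabulary).
    p = {w: g * pv for w, pv in p_vocab.items()}
    for a, w in zip(attn, src_words):
        p[w] = p.get(w, 0.0) + (1 - g) * a
    return p

p_vocab = {"the": 0.5, "court": 0.3, "ruled": 0.2}  # toy generation dist.
src = ["the", "defendant", "appealed"]              # source document tokens
attn = [0.2, 0.5, 0.3]                              # attention distribution a_t
g = 0.7                                             # copy gate g_t
P = final_distribution(p_vocab, attn, src, g)
print({w: round(v, 3) for w, v in sorted(P.items())})
```

Note that "defendant" and "appealed" receive non-zero probability even though they are outside the vocabulary, which is exactly how the mechanism handles unregistered words; because both inputs are distributions, the mixture still sums to 1.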
where T is the total number of time steps, wt* is the target word at time t, and Pt(w) is the generation probability computed from the vocabulary distribution and the attention distribution. The text summary is finally generated automatically according to the generation probability and the vocabulary distribution. This Transformer-based summarization method can first learn the dependencies within each text and then model the relations among texts, which greatly shortens the sequence length of a single input, makes cross-text associations easy to learn, and makes summary generation fast and accurate.
By adopting the method, the model can be readily migrated through fine-tuning to various specific fields such as tourism, medicine, news, and natural science.
Compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
1. In the invention, by combining the extractive and generative approaches, the method overcomes the poor readability and coherence of summaries produced by a purely extractive approach, as well as the contradictions with and low fidelity to the original meaning found in summaries produced by a purely generative approach.
2. In the invention, the abstract of the referee document is formed in two stages: the first stage extracts sentences from the referee document and combines them into key sentences, and the second stage takes the extracted key sentences as the input of the generative model, forming the text abstract through model encoding and decoding; the two stages together ensure the accuracy and fidelity of the text abstract.
3. In the invention, key information is extracted from the source document to form key sentences, and through the generative approach the key sentences are encoded and combined into an abstract with the same meaning as the source document, greatly reducing the labor of manual text summarization.
Drawings
FIG. 1 is a schematic structural diagram of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.
FIG. 2 is a schematic diagram of an abstraction model structure of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.
FIG. 3 is a schematic diagram of a generative model structure of a two-stage hybrid automatic summarization method for judicial official documents according to the present invention.
Detailed Description
The technical solution of the present invention is illustrated below, and the claimed scope of the present invention includes, but is not limited to, the following examples.
A two-stage hybrid automatic summarization method for judicial official documents comprises the following steps:
1) The similarity of key sentences in the referee document is calculated, the key sentences are encoded and classified by the extractive summarization model, and the abstract key sentences are finally extracted.
2) Sentences are extracted from the referee documents and combined into a key sentence collection.
3) The key sentence collection of step 2) is taken as the input of the generative model, and the text abstract is generated through model encoding and decoding.
Preferably, the calculating the similarity of the key sentences in the step 1) includes:
Step 1.1) The referee document is split into sentences; the manually written reference sentences are then located, and for each, the sentence with the highest similarity is found in the original text to serve as the label data set of the extractive abstract. The similarity score between each sentence of the manual abstract and the sentences of the source document is calculated by cosine similarity, and the highest-scoring sentences in the source document are selected as the key sentences.
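The labeling step above can be sketched as follows: for each manual-summary sentence, the most cosine-similar source sentence is marked positive. The 2-dimensional vectors are toy stand-ins for real sentence embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def label_key_sentences(summary_vecs, source_vecs):
    # For each sentence of the manual abstract, mark the most similar
    # source sentence as a positive (key-sentence) label.
    labels = [0] * len(source_vecs)
    for sv in summary_vecs:
        best = max(range(len(source_vecs)),
                   key=lambda i: cosine(sv, source_vecs[i]))
        labels[best] = 1
    return labels

source = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy source sentence vectors
summary = [[0.9, 0.1]]                         # toy manual-abstract vector
print(label_key_sentences(summary, source))    # → [1, 0, 0]
```

The resulting 0/1 labels are what the binary classifier of the extractive model is trained against.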
Preferably, step 1) further comprises:
Step 1.2) The text is vectorized: the sentences obtained after similarity calculation are aligned line by line with the original text of the referee document, and the source text, label data, and manual abstract are segmented into words using jieba. During word segmentation, legal terms are crawled to supplement the lexicon, after which word vectorization is performed with the BERT model.
Preferably, the encoding of the abstract model of the key sentence in step 1) includes:
Extractive model encoding. At the encoding layer, word embedding adopts target word embedding vectors. A text with n sentences, D = {S1, S2, ……, Sn}, is pre-processed with two special markers: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to constitute the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundaries used to segment the sentences of the text. On top of the word embedding, position embedding and segment embedding are further added to the input.
Preferably, position embedding: the position information of each word is encoded into a feature vector, and the position vectors adopt the scheme of "Attention Is All You Need":
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)).
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)).
In the formula, pos denotes the position of a word in the sentence, with range [0, n]; i is the dimension index of the word vector; and dmodel, the input dimension of BERT, is 128-1024, preferably 256-512.
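The sinusoidal position encoding above can be computed directly; dmodel = 256 here follows one of the preferred values in the text, and the interleaved sin/cos layout is the standard convention.

```python
import math

def position_encoding(pos, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

pe0 = position_encoding(0, 256)   # position 0: [0, 1, 0, 1, ...]
pe5 = position_encoding(5, 256)
print(pe0[:2], [round(v, 3) for v in pe5[:2]])
```

Because the wavelengths grow geometrically across dimensions, each position receives a unique fingerprint and relative offsets correspond to fixed linear transformations of the encoding.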
Preferably, segment embedding: to distinguish two sentences, different sentences are prefixed with A and B labels respectively, so that the input is represented as (EA, EB, EA, EB, ……). The word embedding, position embedding, and segment embedding representations are combined as the BERT model input. The sentence vectors obtained after the BERT pre-training layer are X = (X1, X2, ……, Xn) = BERT(sent1, sent2, sent3, ……, sentn), where senti is the i-th sentence of the original referee document and Xi is the BERT-encoded vector corresponding to senti, i.e. the i-th vector sequence to be processed.
Preferably, the classifying the abstract model of the key sentence in step 1) includes:
Classification layer: a dilated residual gated convolutional neural network structure, i.e. DRGCNN, is adopted. Key sentences of the abstract are extracted by stacking multiple DRGCNN layers; the number of layers is 6-10, preferably 7-8, with per-layer dilation coefficients of 1, 2, 4, 8, 1, and 1. For an original input sequence X = (X1, X2, ……, Xn) of the convolutional network with convolution kernel W, the feature map Ci of a convolution operation is calculated as:

Ci = Wc · xi±k
In the formula, Wc denotes a one-dimensional convolution kernel, also called the weight coefficients, and is a learnable parameter; k denotes the distance from the input index i; n denotes the number of words in the sentence; and xi±k denotes the word vector k words forward or backward from the i-th word. The resulting feature map represents the degree of association between the input Xi and its context.
Preferably, the convolution width is expanded by adding the dilation coefficient α. When α = 1, the dilated convolution is equivalent to an ordinary convolution; when α > 1, the dilated convolution can learn more distant contextual information, and the feature map Ci is calculated as:

Ci = Wc · xi±αk
where α is the dilation coefficient, Wc denotes a one-dimensional convolution kernel (the weight coefficients, a learnable parameter), and k denotes the distance from the input index i; the resulting feature map represents the degree of association between the input Xi and its context. On the basis of the feature map Ci, a gated convolutional neural network is introduced, whose output is calculated as:

Y = conv1D1(X) ⊗ σ(conv1D2(X))
In the formula, conv1D1 and conv1D2 denote one-dimensional convolution functions, X denotes the sentence vectors, ⊗ denotes point-wise multiplication, and σ is the gating function; the two convolutions do not share weights.
Preferably, a residual structure is introduced on top of the gating mechanism, and the output is calculated as:

Y = X ⊗ (1 − σ) + conv1D1(X) ⊗ σ, with σ = σ(conv1D2(X))
In the formula, conv1D1 and conv1D2 denote one-dimensional convolution functions, X denotes the sentence vectors, ⊗ denotes point-wise multiplication, and σ is the gating function.
Preferably, each sentence is then classified into one of two categories by the fully connected layer. During training, cross entropy is chosen as the loss function, expressed as:

Loss = −Σi [ŷi · log(yi) + (1 − ŷi) · log(1 − yi)]
where ŷi denotes the label of sample i, with 1 for the positive class and 0 for the negative class; yi denotes the probability that sample i is predicted as positive; and Loss denotes the loss function.
Preferably, in step 2) the collection of key sentences is extracted from the referee document through the encoding and classification of the extractive summarization model, to serve as the input of the generative model.
Preferably, the generative model of step 3) comprises: the key sentence collection is combined as the input of the generative model, and the text abstract is generated by encoding and decoding this input. Model encoding adopts the UniLM pre-trained language model, and the model input consists of word embedding, segment embedding, and position embedding.
Preferably, word embedding: a text with n sentences, D = {S1, S2, ……, Sn}, is pre-processed with two special markers: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to constitute the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundaries used to segment the sentences of the text.
Preferably, segment embedding is used to distinguish two sentences: different sentences are prefixed with A and B labels respectively, so that the input is represented as (EA, EB, EA, EB, ……).
Preferably, the position embedding in the model input is a hierarchically decomposed position encoding. Using the position encoding vectors p1, p2, p3, …, pn trained by BERT, a new set of position codes q1, q2, q3, …, qm is constructed by the formula:

q(i−1)×n+j = α · ui + (1 − α) · uj
q(i−1)×n+j is the position code of position (i−1) × n + j; α is a hyperparameter, set to 0.4; i denotes the i-th word and j the j-th word; u is a vector whose basis is expressed through the trained position vectors p; and pos denotes the position of a word in the sentence, with range [0, n]. The formula represents position (i−1) × n + j hierarchically as the pair (i, j), whose corresponding position codes are ui and uj respectively. Because q1 = p1, q2 = p2, ……, qn = pn, the vectors ui and uj can be solved for. The word embedding, position embedding, and segment embedding are combined into the input of the UniLM model, and the sentence vectors obtained after the UniLM pre-training layer are X = (x1, x2, …, xn) = UniLM(sent1, sent2, sent3, …, sentn).
Preferably, decoding to generate the text abstract in the model of step 3) comprises: abstract generation learns document-level features through Transformer layers with a multi-layer attention mechanism, and a copy mechanism, comprising copying and generating, is introduced in the decoding process. For a multi-layer Transformer backbone, given an input text sequence X = (x1, x2, …, xn) of length n, the output of the first Transformer layer is H0 = Transformer0(X). The output Hl after l Transformer layers is calculated as:
Hl = Transformerl(Hl−1).
Here Transformerl is the l-th Transformer layer. The final output HL is calculated as:

HL = [h1L, h2L, …, hnL]
where l denotes the layer index, l ∈ [1, L]; L denotes the total number of Transformer layers; and hiL denotes the contextual representation of input xi.
Preferably, in each Transformer module, a multi-head attention mechanism is added to aggregate the output, marking the parts of the output sequence that need attention. The self-attention Al of the l-th Transformer layer is calculated as:

Al = softmax((Ql · KlT) / √dk + M) · Vl
In the formula, Al is the self-attention weight; softmax is the normalized exponential function; Ql, Kl, and Vl are obtained from the input of the layer by linear transformation, i.e. by projecting the previous layer onto Queries, Keys, and Values with the projection parameters WQ, WK, and WV; Vl is the Value of the l-th layer; M denotes the mask matrix; dk, the number of columns of the Q and K matrices (the vector dimension), prevents the inner product of Q and K from becoming too large and thus acts as a regulator; and T denotes the transpose. The mask matrix M controls which tokens may be attended to, and different mask matrices M control attention over different contexts; the copy mechanism is introduced to address the out-of-vocabulary and repeated words that arise during generation.
Preferably, generating the text abstract further comprises: at decoding time t, the relevance weight between the last Transformer layer output ht and the decoder output Oj is computed as etj = htT · Wc · Oj, where Wc is an initialization matrix; the attention distribution of the j-th word is then calculated as:

atj = exp(etj) / Σk=1..N exp(etk)
In the formula, N is the number of words in the sentence; exp is the exponential function with natural base e; u is a hyperparameter; t is the time step; k is the input sequence index, with range [1, N]; and j denotes the j-th word. The attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query, and an information-weighted average over the attention distribution gives the context representation vector h't:

h't = Σj=1..N atj · hj
h't, also called the context vector, indicates the information of interest obtained from the attention distribution; ht is the output of the last Transformer layer at time t; atj is the attention distribution of the j-th word; j is the sequence index; and N is the number of words in the sentence.
Preferably, the context vector is concatenated with the decoder output Oj and passed through two linear layers to produce the vocabulary distribution Pvocab, calculated as:

Pvocab = softmax(V'(V · [h't; Oj] + b) + b')
where V', V, and b' are learnable parameters, h't is the context representation vector, Oj is the output of the decoder, and Pvocab is the probability distribution over all words in the vocabulary.
Preferably, a copy gating function gt ∈ [0, 1] is then introduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary. gt is calculated as:

gt = σ(Wg · [h't; Oj] + bg)
In the formula, Wg and bg are learnable parameters, ht is the output of the last Transformer layer at time t, and Oj is the output of the decoder. The formula expresses that at time t, whether the next word is newly generated or directly copied is decided according to the attention weights of the j-th word and the other words.
Preferably, for each document, the words of the vocabulary are combined with all words appearing in the source document to form a new word stock, the extended vocabulary. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended vocabulary, so the final probability is calculated as:

Pt(w) = gt · Pvocab(w) + (1 − gt) · Σi: wi = w ati
where Pvocab(w) denotes the probability that the current word w is generated from the given vocabulary, and Σi: wi = w ati denotes the probability of copying w from the source document according to the attention distribution.
Preferably, if w is an out-of-vocabulary word, then Pvocab(w) = 0; if w does not appear in the source document, then Σi: wi = w ati = 0. The final loss function over the extended-vocabulary probability distribution is:

Loss = −(1/T) · Σt=1..T log Pt(wt*)
where T is the total number of time steps, wt* is the target word at time t, and Pt(w) is the generation probability computed from the vocabulary distribution and the attention distribution.
The generation probability Pt(w) is calculated from the vocabulary distribution and the attention distribution, and the text abstract is finally generated automatically according to the generation probability and the vocabulary distribution.
Example 1
As shown in fig. 1, a two-stage hybrid automatic summarization method for judicial official documents includes the following steps:
1) The similarity of key sentences in the referee document is calculated, the key sentences are encoded and classified by the extractive summarization model, and the abstract key sentences are finally extracted.
2) Sentences are extracted from the referee documents and combined into a key sentence collection.
3) The key sentence collection of step 2) is taken as the input of the generative model, and the text abstract is generated through model encoding and decoding.
Example 2
Embodiment 1 is repeated. As shown in fig. 2, in step 1.1) the referee document is split into sentences, the manually written reference sentences are located, and for each, the sentence with the highest similarity is found in the original text to serve as the label data set of the abstract. The similarity score between each sentence of the manual abstract and the sentences of the source document is calculated by cosine similarity, and the highest-scoring sentences in the source document are selected as the key sentences.
Example 3
Embodiment 2 is repeated, except that the highest-scoring sentences selected from the source documents in step 1) are vectorized. The sentences obtained after similarity calculation are aligned line by line with the original text of the referee document, and the source text, label data, and manual abstract are segmented into words using jieba. During word segmentation, legal terms are crawled to supplement the lexicon, after which word vectorization is performed with the BERT model.
Example 4
Embodiment 3 is repeated: key sentences are extracted and encoded in step 1). At the encoding layer, word embedding adopts target word embedding vectors, and on top of the word embedding, position embedding and segment embedding are added to the input.
In the sentence coding layer, firstly, the sentence is divided into words to obtain word-level information for word embedding representation, and the word-level information is converted into a sentence vector as input:
For a text with n sentences, D = {S1, S2, ……, Sn}, pre-processing uses two special markers: a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to constitute the input. The [CLS] token represents the vector of the current sentence, and the [SEP] token marks the clause boundaries used to segment the sentences of the text.
Position embedding: the position information of each word is encoded into a feature vector, and the position vectors adopt the scheme of "Attention Is All You Need":
PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)).
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)).
In the formula, pos denotes the position of a word in the sentence, with range [0, n]; i is the dimension index of the word vector; and the BERT input dimension dmodel is 256.
Segment embedding, used to distinguish two sentences: different sentences are prefixed with A and B labels respectively, so that the input is represented as (EA, EB, EA, EB, ……). The word embedding, position embedding, and segment embedding representations are combined as the BERT model input. The sentence vectors obtained after the BERT pre-training layer are X = (X1, X2, ……, Xn) = BERT(sent1, sent2, sent3, ……, sentn), where senti is the i-th sentence of the original referee document and Xi is the BERT-encoded vector corresponding to senti, i.e. the i-th vector sequence to be processed.
Each sentence vector is represented by word embedding, position embedding and segment embedding, so that the text vectorization work is completed.
Example 5
Example 4 is repeated, with the BERT input dimension dmodel set to 512.
Example 6
Example 5 is repeated, except that the BERT input dimension d_model is 1024.
Example 7
Example 6 is repeated, except that in step 1) the encoded sentences, with word, position and segment embeddings together, pass through BERT plus global average pooling and are then output to a dense layer; in the dense layer the features extracted during encoding undergo a nonlinear transformation, the associations among features are extracted, and the result is finally mapped to the output space.
Example 8
Example 7 is repeated, except that step 1) further comprises a classification layer, which adopts a dilated residual gated convolutional neural network structure, i.e. DRGCNN, after the first dense layer. Key abstract sentences are extracted by stacking DRGCNN layers; the number of DRGCNN layers is 6, and the per-layer dilation coefficients are 1, 2, 4, 8, 1 and 1 respectively. For the original input sequence of the convolutional network X = (X1, X2, ……, Xn) with convolution kernel W, the feature map C_i of an arbitrary convolution operation is calculated as follows:
In the formula, W_c denotes a one-dimensional convolution kernel, also called the weight coefficients, and is a learnable parameter; k denotes the distance from input position i; n denotes the number of words in the sentence; x_{i±k} denotes the word vector k words before or after the i-th word. The resulting feature map represents the degree of association between the input X_i and its context.
Example 9
Example 8 is repeated, except that in step 1) DRGCNN layers are stacked to extract the key abstract sentences, the number of DRGCNN layers is 8, and the per-layer dilation coefficients are 1, 2, 4, 8, 1 and 1 respectively.
Example 10
Example 9 is repeated, except that DRGCNN layers are stacked to extract the key abstract sentences, the number of DRGCNN layers is 10, and the per-layer dilation coefficients are 1, 2, 4, 8, 1 and 1 respectively.
Example 11
Example 10 is repeated, except that in step 1) the convolution receptive field is widened by adding the dilation coefficient α. Stacking dilated convolutional neural networks increases the network depth, addresses the long-distance dependence problem of text sequences, and extracts globally effective information. When α = 1, the dilated convolution operation is equivalent to an ordinary full convolution; when α > 1, the dilated convolution can learn context information at greater distances. The feature map C_i is calculated as follows:
In the formula, α is the dilation coefficient; W_c denotes a one-dimensional convolution kernel, also called the weight coefficients, and is a learnable parameter; k denotes the distance from input position i; the resulting feature map represents the degree of association between the input X_i and its context. On the basis of the feature map C_i, a gated convolutional neural network is introduced, with the output calculated as follows:
In the formula, convD1 and convD2 denote one-dimensional convolution functions; X denotes the sentence vector; ⊗ denotes point-wise multiplication; σ is the gating function. convD1 and convD2 are two convolution operations whose weights are not shared.
A residual structure is introduced on top of the gating mechanism, with the output calculated as follows:
In the formula, convD1 and convD2 denote one-dimensional convolution functions; X denotes the sentence vector; ⊗ denotes point-wise multiplication; σ is the gating function.
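The gated residual dilated convolution can be sketched as follows. The patent's output formulas are not reproduced in this text, so this sketch assumes the common residual-gated form Y = X ⊗ (1 − σ(convD2(X))) + convD1(X) ⊗ σ(convD2(X)), with scalar features per position for brevity:

```python
import math

def dilated_conv1d(x, w, alpha):
    """1-D dilated convolution with zero padding:
    c_i = sum_k w[k] * x[i + alpha*(k - K//2)]."""
    K = len(w)
    out = []
    for i in range(len(x)):
        s = 0.0
        for k in range(K):
            j = i + alpha * (k - K // 2)
            if 0 <= j < len(x):
                s += w[k] * x[j]
        out.append(s)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_residual_block(x, w1, w2, alpha):
    """Residual gated convolution: the gate sigma(convD2(x)) mixes
    the input x with convD1(x); convD1/convD2 do not share weights."""
    c1 = dilated_conv1d(x, w1, alpha)
    c2 = dilated_conv1d(x, w2, alpha)
    return [xi * (1 - sigmoid(g)) + ci * sigmoid(g)
            for xi, ci, g in zip(x, c1, c2)]
```

Stacking such blocks with dilation coefficients 1, 2, 4, 8, 1, 1 grows the receptive field exponentially while the residual path keeps gradients flowing.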
Whether a sentence belongs to the key-sentence class is judged as a binary classification through the fully connected layer. During training, cross entropy is selected as the loss function, expressed as:
Loss = −Σ_i [ŷ_i·log(y_i) + (1 − ŷ_i)·log(1 − y_i)]
In the formula, ŷ_i denotes the label of sample i, 1 for the positive class and 0 for the negative class; y_i denotes the probability that sample i is predicted as positive; Loss denotes the loss function.
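A minimal sketch of the cross-entropy loss used for the binary key-sentence classification (the function name is mine):

```python
import math

def cross_entropy_loss(labels, probs, eps=1e-12):
    """Binary cross entropy averaged over samples:
    Loss = -(1/N) * sum( y*log(p) + (1-y)*log(1-p) );
    labels are 1 (key sentence) or 0, probs are predicted
    positive-class probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

The loss is near zero for confident correct predictions and grows without bound as a prediction confidently contradicts its label.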
Example 12
Example 11 is repeated; as shown in Fig. 3, the sentences extracted from the judgment document through the encoding and classification of the extractive model in step 2) are combined into a key-sentence collection, which serves as the input of the generative model.
Example 13
Example 12 is repeated, except that in step 3) the combined key-sentence collection is used as the input of the generative model, and the text summary is generated by the model encoding and decoding the input. The model encoding adopts the UniLM pre-trained language model, whose input consists of word embedding, segment embedding and position embedding. For word embedding, a text D = {S1, S2, ……, Sn} with n sentences is preprocessed with two special tokens: first, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input; the [CLS] token represents the vector of the current sentence, and the [SEP] token marks the boundary used to segment sentences in the text. Segment embedding distinguishes two sentences: different sentences are preceded by A and B labels respectively, so the input sentence sequence is represented as (E_A, E_B, E_A, E_B, ……). The position embedding in the model input uses hierarchically decomposed position encoding. From the BERT-trained position encoding vectors p1, p2, p3, …, pn, a new set of position codes q1, q2, q3, …, qm is constructed by the formula:
q_{(i−1)×n+j} = α·u_i + (1 − α)·u_j
In the formula, q_{(i−1)×n+j} is the position code of position (i − 1)×n + j; α is a hyperparameter with value 0.4; i indexes the i-th word and j the j-th word; the u are base vectors of the q vectors, represented through the trained position vectors p; pos denotes the position of the word in the sentence, with value range [0, n]. The formula hierarchically represents position (i − 1)×n + j as the pair (i, j), whose corresponding position codes are u_i and u_j respectively. Since q1 = p1, q2 = p2, ……, qn = pn, the base vectors u_i can be computed from the p. Word embedding, position embedding and segment embedding are concatenated into the UniLM model input, and the sentence vectors X = (x1, x2, …, xn) = UniLM(sent1, sent2, sent3, …, sentn) are obtained after the UniLM pre-training layer.
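The hierarchical decomposition can be sketched with scalar position codes (the real p are vectors; the construction follows Su Jianlin's scheme cited in the non-patent references, q_{(i-1)n+j} = α·u_i + (1-α)·u_j with α = 0.4, and the constraint that the first n codes stay unchanged fixes the base vectors u):

```python
def hierarchical_position_codes(p, alpha=0.4):
    """Extend n trained position codes p to n*n codes q so that the
    first n codes are unchanged: q[(i-1)*n + j - 1] =
    alpha*u[i-1] + (1-alpha)*u[j-1], with u[0] = p[0] and
    u[k] = (p[k] - alpha*p[0]) / (1 - alpha) for k >= 1."""
    u = [p[0]] + [(pk - alpha * p[0]) / (1.0 - alpha) for pk in p[1:]]
    n = len(p)
    return [alpha * u[i] + (1.0 - alpha) * u[j]
            for i in range(n) for j in range(n)]

q = hierarchical_position_codes([1.0, 2.0, 3.0])
```

Three trained codes thus expand to nine positions while the original three are reproduced exactly, which is what lets a BERT-sized position table cover much longer inputs.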
Example 14
Example 13 is repeated, except that in step 3) the generative model decodes to produce the text summary as follows: summary generation learns document-level features through Transformer layers with a multi-layer attention mechanism. A copy mechanism, comprising copying and generating, is introduced in the decoding process of the model. For the multi-layer Transformer backbone network, a text sequence X = (x1, x2, …, xn) of input length n is given. The output H^0 of the first Transformer layer is calculated as H^0 = Transformer_0(X). The output H^l after l Transformer layers is calculated as:
H^l = Transformer_l(H^{l−1}).
Here l indexes a Transformer layer. The final output H^L is calculated as:
In the formula, l denotes the layer number, l ∈ [1, L]; L denotes the total number of Transformer layers; h_i^l denotes the contextual representation of the input x_i. In each Transformer module, a multi-head attention mechanism is added to aggregate the outputs and mark the parts of the output sequence that need attention. The self-attention A_l of the l-th Transformer layer is calculated as:
A_l = softmax(Q·K^T / √d_k + M)·V_l
In the formula, A_l is the self-attention weight; softmax is the normalized exponential function; Q, K and V are obtained from the input X_i by linear transformation; V_l is the Value of the l-th layer; M denotes the mask matrix; d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, which prevents the inner product of Q and K from becoming too large and acts as a scaling factor; T denotes the transpose. Q, K and V are linear projections of the previous layer output H^{l−1} onto Queries, Keys and Values, with projection parameters W_Q^l, W_K^l and W_V^l respectively. The mask matrix M controls whether a token is allowed to be attended to; different mask matrices M control attention to different contexts, and the copy mechanism is introduced to solve the out-of-vocabulary and repeated-word problems that arise during generation.
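The masked self-attention softmax(QK^T/√d_k + M)V can be sketched for a single head as follows (pure Python with illustrative names; a real implementation batches this with matrix libraries, and masked entries of M carry a large negative value so their softmax weight vanishes):

```python
import math

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V for one attention head;
    M[i][j] = 0 allows token i to attend to token j, while a large
    negative value forbids it, as UniLM's mask matrices do."""
    d_k = len(K[0])
    out = []
    for i in range(len(Q)):
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                  + M[i][j] for j in range(len(K))]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * V[j][d] for j, w in enumerate(weights))
                    for d in range(len(V[0]))])
    return out

# token 0 may only see itself; token 1 may see both (a UniLM-style mask)
A = masked_attention([[1.0], [1.0]], [[1.0], [1.0]],
                     [[1.0], [3.0]], [[0.0, -1e9], [0.0, 0.0]])
```

Swapping in a different M is all it takes to move between bidirectional, unidirectional and seq-to-seq attention patterns.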
Example 15
Example 14 is repeated, except that generating the text summary in step 3) further includes: at decoding time t, a correlation weight is calculated from the last Transformer layer output h_t and the decoder output O_j, where W_c is an initialization matrix; at the same time the attention distribution of the j-th word is calculated by the formula:
In the formula, N is the number of words in the sentence; exp is the exponential function with the natural constant e as base; u is a hyperparameter; t is the time step; k denotes the input sequence index, with value range [1, N]; j denotes the j-th word. The attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query; a weighted average of the information under the attention distribution, per the following formula, yields the context representation vector h′_t:
Here h′_t, also called the context vector, represents the information of interest obtained from the attention distribution; h_j^t is the output of the last Transformer layer at time t; a_j^t is the attention distribution of the j-th word; j takes values over the sequence indices; N is the number of words in the sentence.
The context vector is concatenated with the decoder output O_j, and the vocabulary distribution P_vocab is generated through two linear layers, calculated as:
P_vocab = softmax(V′·(V·[h′_t, O_j] + b) + b′)
Here V′, V, b and b′ are learnable parameters, h′_t is the context representation vector, O_j is the output of the decoder, and P_vocab is the probability distribution over all words in the vocabulary.
A copy gating function g_t ∈ [0, 1] is then introduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary. g_t is calculated as:
In the formula, W_g and b_g are learnable parameters; h_t is the output of the last Transformer layer at time t; O_j is the output of the decoder. The formula shows that at time t, the attention weights of the j-th word and the other words determine whether the next word is newly generated or directly copied. For each document, the words in the vocabulary are combined with all words appearing in the source document to form a new word bank, the extended vocabulary. Whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended vocabulary, so the final probability is calculated as:
P(w) = g_t·P_vocab(w) + (1 − g_t)·Σ_{i: w_i = w} a_i^t
In the formula, P_vocab(w) denotes the probability that the current word w is generated from the given vocabulary, and Σ_{i: w_i = w} a_i^t denotes the probability of selecting a copy from the source document according to the attention distribution. If w is an out-of-vocabulary word, P_vocab(w) = 0; if w does not appear in the source document, the copy term is 0. The final loss function over the extended-vocabulary probability distribution is:
Loss = −(1/T)·Σ_t log P_t(w_t)
Here T is the total number of time steps and P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution.
The generation probability P_t(w) is calculated from the vocabulary distribution and the attention distribution, and finally the text summary is generated automatically according to the generation probability and the vocabulary distribution.
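The extended-vocabulary mixing can be sketched as follows (assuming the standard pointer-generator form P(w) = g_t·P_vocab(w) + (1 − g_t)·Σ of attention on source positions holding w; the function and variable names are mine):

```python
def final_distribution(p_vocab, attention, source_words, g):
    """Mix generation and copy probabilities over the extended
    vocabulary: P_vocab(w) is 0 for out-of-vocabulary words, and
    the copy term is 0 for words absent from the source document."""
    extended = set(p_vocab) | set(source_words)
    dist = {}
    for w in extended:
        gen = g * p_vocab.get(w, 0.0)
        copy = (1.0 - g) * sum(a for a, sw in zip(attention, source_words)
                               if sw == w)
        dist[w] = gen + copy
    return dist

P = final_distribution({"pay": 0.7, "court": 0.3}, [0.6, 0.4],
                       ["pay", "plaintiff"], 0.5)
```

Note that "plaintiff" gets nonzero probability purely through copying even though it is outside the generation vocabulary, which is how the mechanism handles unregistered legal terms.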
Example 16
Example 15 is repeated. As shown in Fig. 3, for an example input (the defendant shall pay the plaintiff five thousand yuan within seven days), the sentence to be predicted is (the defendant pays the plaintiff). A [CLS] token is placed before the input sentence as the current sentence vector, and a [SEP] token is placed at the end of the input sentence to mark the sentence boundary. The sentence vector X = (x1, x2, …, xn) = Unilm(sent1, sent2, sent3, …, sentn) obtained after the UniLM pre-training layer is input to the Transformer layers, and by the final probability calculation formula the output (the defendant pays the plaintiff [SEP]) is derived.
Claims (10)
1. A two-stage hybrid automatic summarization method for judicial official documents is characterized by comprising the following steps:
1) calculating the similarity of key sentences in the judgment document, encoding and classifying the key sentences with the extractive summarization model, and finally extracting the abstract key sentences;
2) combining the sentences extracted from the judgment document into a key-sentence collection;
3) taking the key-sentence collection of step 2) as the input of a generative model, and generating the text summary through model encoding and decoding.
2. The two-stage hybrid automatic summarization method for judicial official documents according to claim 1, wherein calculating the similarity of key sentences in step 1) comprises:
step 1.1) splitting the judgment document into sentences, then taking each manually written reference sentence and finding the sentence with the highest similarity in the original text as the label data set of the abstract; the similarity score between each sentence of the manual abstract and the sentences of the source document is calculated by cosine similarity, and the highest-scoring sentence in the source document, i.e. the key sentence, is selected.
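The labeling step can be sketched as follows (a bag-of-words cosine similarity for illustration; the method itself vectorizes with BERT, and all names here are assumptions):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_source_sentence(summary_sentence, source_sentences):
    """Pick the source sentence most similar to a manual-abstract
    sentence, i.e. the key sentence used as label data."""
    bags = [Counter(s.split()) for s in source_sentences]
    target = Counter(summary_sentence.split())
    scores = [cosine(target, bag) for bag in bags]
    return source_sentences[scores.index(max(scores))]

key = best_source_sentence(
    "defendant pays plaintiff",
    ["the court opens the session",
     "the defendant pays the plaintiff five thousand yuan"])
```

Running this over every manual-abstract sentence yields the binary key-sentence labels that the classification layer is trained against.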
3. The two-stage hybrid automatic summarization method for judicial official documents according to claim 1 or 2, wherein step 1) further comprises:
step 1.2) vectorizing the text, wherein the sentences obtained after the similarity calculation are aligned line-by-line with the original text of the judgment document; the source text, the label data and the manual abstract are segmented with jieba; during word segmentation, crawled legal nouns supplement the word bank, after which word vectorization is performed with the BERT model.
4. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-3, wherein the encoding of the summarization model of key sentences in step 1) comprises:
extractive model encoding: in the encoding layer, word embedding adopts target word embedding vectors; for a text D = {S1, S2, ……, Sn} with n sentences, preprocessing is performed with two special tokens; first, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input; the [CLS] token represents the vector of the current sentence, and the [SEP] token marks the boundary used to segment sentences in the text; on top of the word embedding, position embedding and segment embedding of the input are also applied;
position embedding: the position information of each word is encoded into a feature vector; the position vector adopts the scheme in Attention is All You Need:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model));
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model));
in the formulas, pos denotes the position of the word in the sentence, with value range [0, n]; i denotes the dimension index of the word vector; d_model, the input dimension of BERT, is 128-;
segment embedding: used to distinguish two sentences, the different sentences being preceded by A and B labels respectively, so that the input sentence sequence is represented as (E_A, E_B, E_A, E_B, ……); the word embedding, position embedding and segment embedding representations are concatenated into the BERT model input; after the BERT pre-training layer the sentence vectors X = (X1, X2, ……, Xn) = BERT(sent1, sent2, sent3, ……, sentn) are obtained, where sent_i denotes the i-th sentence of the original judgment document, X_i is the BERT-encoded vector corresponding to sent_i, and X_i is the i-th vector sequence to be processed.
5. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-4, wherein classifying the summarization models of key sentences in step 1) comprises:
the classification layer adopts a dilated residual gated convolutional neural network structure, i.e. DRGCNN; key abstract sentences are extracted by stacking multiple DRGCNN layers, the number of DRGCNN layers being 6-10, preferably 7-8, with per-layer dilation coefficients 1, 2, 4, 8, 1 and 1 respectively; for the original input sequence of the convolutional network X = (X1, X2, ……, Xn) with convolution kernel W, the feature map C_i of an arbitrary convolution operation is calculated as follows:
in the formula, W_c denotes a one-dimensional convolution kernel, also called the weight coefficients, and is a learnable parameter; k denotes the distance from input position i; n denotes the number of words in the sentence; x_{i±k} denotes the word vector k words before or after the i-th word; the resulting feature map represents the degree of association between the input X_i and its context;
the convolution receptive field is widened by adding the dilation coefficient α; when α = 1, the dilated convolution operation is equivalent to an ordinary full convolution; when α > 1, the dilated convolution can learn context information at greater distances; the feature map C_i is calculated as follows:
in the formula, α is the dilation coefficient; W_c denotes a one-dimensional convolution kernel, also called the weight coefficients, and is a learnable parameter; k denotes the distance from input position i; the resulting feature map represents the degree of association between the input X_i and its context; on the basis of the feature map C_i, a gated convolutional neural network is introduced, with the output calculated as follows:
in the formula, convD1 and convD2 denote one-dimensional convolution functions; X denotes the sentence vector; ⊗ denotes point-wise multiplication; σ is the gating function; convD1 and convD2 are two convolution operations whose weights are not shared;
a residual structure is introduced on top of the gating mechanism, with the output calculated as follows:
in the formula, convD1 and convD2 denote one-dimensional convolution functions; X denotes the sentence vector; ⊗ denotes point-wise multiplication; σ is the gating function;
whether a sentence belongs to the key-sentence class is judged as a binary classification through the fully connected layer; during training, cross entropy is selected as the loss function, expressed as:
Loss = −Σ_i [ŷ_i·log(y_i) + (1 − ŷ_i)·log(1 − y_i)]
6. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 2-4, wherein in step 2) the sentences extracted from the judgment document through the encoding and classification of the extractive model are combined into a key-sentence collection as the input of the generative model.
7. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 1-6, wherein the generative model in step 3) comprises: combining the key-sentence collection as the input of the generative model, and generating the text summary by the model encoding and decoding the input; the model encoding adopts the UniLM pre-trained language model, whose input consists of word embedding, segment embedding and position embedding;
for word embedding, a text D = {S1, S2, ……, Sn} with n sentences is preprocessed with two special tokens; first, a [CLS] token is inserted at the beginning of each sentence and a [SEP] token at its end to form the input; the [CLS] token represents the vector of the current sentence, and the [SEP] token marks the boundary used to segment sentences in the text;
segment embedding is used to distinguish two sentences, the different sentences being preceded by A and B labels respectively, so that the input sentence sequence is represented as (E_A, E_B, E_A, E_B, ……).
8. The two-stage hybrid automatic summarization method for judicial official documents according to claim 7, wherein: the position embedding in the model input uses hierarchically decomposed position encoding; from the BERT-trained position encoding vectors p1, p2, p3, …, pn, a new set of position codes q1, q2, q3, …, qm is constructed by the formula:
q_{(i−1)×n+j} = α·u_i + (1 − α)·u_j
in the formula, q_{(i−1)×n+j} is the position code of position (i − 1)×n + j; α is a hyperparameter with value 0.4; i indexes the i-th word and j the j-th word; the u are base vectors of the q vectors, represented through the trained position vectors p; pos denotes the position of the word in the sentence, with value range [0, n]; the formula hierarchically represents position (i − 1)×n + j as the pair (i, j), whose corresponding position codes are u_i and u_j respectively; since q1 = p1, q2 = p2, ……, qn = pn, the base vectors u_i are computed from the p; word embedding, position embedding and segment embedding are concatenated into the UniLM model input, and the sentence vectors X = (x1, x2, …, xn) = Unilm(sent1, sent2, sent3, …, sentn) are obtained after the UniLM pre-training layer.
9. The two-stage hybrid automatic summarization method for judicial official documents according to claim 7 or 8, wherein decoding to generate the text summary in the generative model in step 3) comprises: summary generation learns document-level features through Transformer layers with a multi-layer attention mechanism; a copy mechanism, comprising copying and generating, is introduced in the decoding process of the model; for the multi-layer Transformer backbone network, a text sequence X = (x1, x2, …, xn) of input length n is given; the output H^0 of the first Transformer layer is calculated as H^0 = Transformer_0(X); the output H^l after l Transformer layers is calculated as:
H^l = Transformer_l(H^{l−1});
here l indexes a Transformer layer; the final output H^L is calculated as:
in the formula, l denotes the layer number, l ∈ [1, L]; L denotes the total number of Transformer layers; h_i^l denotes the contextual representation of the input x_i;
in each Transformer module, a multi-head attention mechanism is added to aggregate the outputs and mark the parts of the output sequence that need attention; the self-attention A_l of the l-th Transformer layer is calculated as:
A_l = softmax(Q·K^T / √d_k + M)·V_l
in the formula, A_l is the self-attention weight; softmax is the normalized exponential function; Q, K and V are obtained from the input X_i by linear transformation; V_l is the Value of the l-th layer; M denotes the mask matrix; d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, which prevents the inner product of Q and K from becoming too large and acts as a scaling factor; T denotes the transpose; Q, K and V are linear projections of the previous layer output H^{l−1} onto Queries, Keys and Values, with projection parameters W_Q^l, W_K^l and W_V^l respectively; the mask matrix M controls whether a token is allowed to be attended to; different mask matrices M control attention to different contexts, and the copy mechanism is introduced to solve the out-of-vocabulary and repeated-word problems that arise during generation.
10. The two-stage hybrid automatic summarization method for judicial official documents according to any of claims 7-9, wherein generating the text summary further comprises: at decoding time t, a correlation weight is calculated from the last Transformer layer output h_t and the decoder output O_j, where W_c is an initialization matrix; at the same time the attention distribution of the j-th word is calculated by the formula:
in the formula, N is the number of words in the sentence; exp is the exponential function with the natural constant e as base; u is a hyperparameter; t is the time step; k denotes the input sequence index, with value range [1, N]; j denotes the j-th word;
the attention distribution can be interpreted as the degree of attention paid to the j-th word in the context query; a weighted average of the information under the attention distribution, per the following formula, yields the context representation vector h′_t:
here h′_t, also called the context vector, represents the information of interest obtained from the attention distribution; h_j^t is the output of the last Transformer layer at time t; a_j^t is the attention distribution of the j-th word; j takes values over the sequence indices; N is the number of words in the sentence;
the context vector is concatenated with the decoder output O_j, and the vocabulary distribution P_vocab is generated through two linear layers, calculated as:
P_vocab = softmax(V′·(V·[h′_t, O_j] + b) + b′)
here V′, V, b and b′ are learnable parameters, h′_t is the context representation vector, O_j is the output of the decoder, and P_vocab is the probability distribution over all words in the vocabulary;
a copy gating function g_t ∈ [0, 1] is then introduced to decide whether the current output is copied from the source document or generated as a new word from the vocabulary; g_t is calculated as:
in the formula, W_g and b_g are learnable parameters; h_t is the output of the last Transformer layer at time t; O_j is the output of the decoder; the formula shows that at time t, the attention weights of the j-th word and the other words determine whether the next word is newly generated or directly copied;
for each document, the words in the vocabulary are combined with all words appearing in the source document to form a new word bank, the extended vocabulary; whether a word is copied from the source document or generated from the vocabulary, it is therefore drawn from the extended vocabulary, so the final probability is calculated as:
P(w) = g_t·P_vocab(w) + (1 − g_t)·Σ_{i: w_i = w} a_i^t
in the formula, P_vocab(w) denotes the probability that the current word w is generated from the given vocabulary, and Σ_{i: w_i = w} a_i^t denotes the probability of selecting a copy from the source document according to the attention distribution;
if w is an out-of-vocabulary word, P_vocab(w) = 0; if w does not appear in the source document, the copy term is 0; the final loss function over the extended-vocabulary probability distribution is:
Loss = −(1/T)·Σ_t log P_t(w_t)
here T is the total number of time steps; P_t(w) is the generation probability computed from the vocabulary distribution and the attention distribution;
the generation probability P_t(w) is calculated from the vocabulary distribution and the attention distribution, and finally the text summary is generated automatically according to the generation probability and the vocabulary distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111494073.7A CN114169312A (en) | 2021-12-08 | 2021-12-08 | Two-stage hybrid automatic summarization method for judicial official documents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114169312A true CN114169312A (en) | 2022-03-11 |
Family
ID=80484516
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111494073.7A Pending CN114169312A (en) | 2021-12-08 | 2021-12-08 | Two-stage hybrid automatic summarization method for judicial official documents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114169312A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061862A (en) * | 2019-12-16 | 2020-04-24 | 湖南大学 | Method for generating abstract based on attention mechanism |
CN111897949A (en) * | 2020-07-28 | 2020-11-06 | 北京工业大学 | Guided text abstract generation method based on Transformer |
Non-Patent Citations (3)
Title |
---|
Liu Guojing: "Research on Automatic Text Summarization Technology Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, pages 138 - 2817 *
Wang Yizhen et al.: "Two-stage Automatic Summarization of Civil Judgment Documents", Data Analysis and Knowledge Discovery, pages 104 - 114 *
Su Jianlin: "Hierarchically Decomposed Position Encoding Lets BERT Handle Extremely Long Texts", pages 1 - 4, Retrieved from the Internet <URL:https://kexue.fm/archives/7947> *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114691858A (en) * | 2022-03-15 | 2022-07-01 | 电子科技大学 | Improved UNILM abstract generation method |
CN114691858B (en) * | 2022-03-15 | 2023-10-03 | 电子科技大学 | Improved UNILM digest generation method |
CN114996442A (en) * | 2022-05-27 | 2022-09-02 | 北京中科智加科技有限公司 | Text abstract generation system combining abstract degree judgment and abstract optimization |
CN114996442B (en) * | 2022-05-27 | 2023-07-11 | 北京中科智加科技有限公司 | Text abstract generation system combining abstract degree discrimination and abstract optimization |
CN115809329A (en) * | 2023-01-30 | 2023-03-17 | 医智生命科技(天津)有限公司 | Method for generating abstract of long text |
CN115982343A (en) * | 2023-03-13 | 2023-04-18 | 阿里巴巴达摩院(杭州)科技有限公司 | Abstract generation method, method and device for training abstract generation model |
CN115982343B (en) * | 2023-03-13 | 2023-08-22 | 阿里巴巴达摩院(杭州)科技有限公司 | Abstract generation method, and method and device for training abstract generation model |
CN117151069A (en) * | 2023-10-31 | 2023-12-01 | 中国电子科技集团公司第十五研究所 | Security scheme generation system |
CN117151069B (en) * | 2023-10-31 | 2024-01-02 | 中国电子科技集团公司第十五研究所 | Security scheme generation system |
CN117875268A (en) * | 2024-03-13 | 2024-04-12 | 山东科技大学 | Extraction type text abstract generation method based on clause coding |
CN117875268B (en) * | 2024-03-13 | 2024-05-31 | 山东科技大学 | Extraction type text abstract generation method based on clause coding |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298037B (en) | Convolutional neural network matching text recognition method based on enhanced attention mechanism | |
CN110083831B (en) | Chinese named entity recognition method based on BERT-BiGRU-CRF | |
CN109657239B (en) | Chinese named entity recognition method based on attention mechanism and language model learning | |
CN114169312A (en) | Two-stage hybrid automatic summarization method for judicial official documents | |
CN110008469B (en) | Multilevel named entity recognition method | |
CN109871538A (en) | Chinese electronic health record named entity recognition method | |
CN110866399B (en) | Chinese short text entity recognition and disambiguation method based on enhanced character vectors | |
CN111563375B (en) | Text generation method and device | |
CN117076653B (en) | Knowledge base question-answering method based on thinking chain and visual lifting context learning | |
CN112836046A (en) | Entity recognition method for policy and regulation texts in the "four insurances and one fund" field | |
CN111209749A (en) | Method for applying deep learning to Chinese word segmentation | |
CN116151256A (en) | Small sample named entity recognition method based on multitasking and prompt learning | |
CN112183094A (en) | Chinese grammar error detection method and system based on multivariate text features | |
CN110276396B (en) | Image description generation method based on object saliency and cross-modal fusion features | |
CN110222338B (en) | Organization name entity identification method | |
CN113190656A (en) | Chinese named entity extraction method based on multi-label framework and fusion features | |
CN115310448A (en) | Chinese named entity recognition method based on combining BERT and word vectors | |
CN115438674B (en) | Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment | |
CN112966117A (en) | Entity linking method | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN115600597A (en) | Named entity recognition method, device, system, and storage medium based on an attention mechanism and intra-word semantic fusion | |
CN114757184B (en) | Method and system for knowledge question answering in the aviation field | |
CN113065349A (en) | Named entity recognition method based on conditional random field | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
Gu et al. | Named entity recognition in judicial field based on BERT-BiLSTM-CRF model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||