CN113868414A - Interpretable legal dispute focus summarizing method and system - Google Patents

Interpretable legal dispute focus summarizing method and system

Info

Publication number
CN113868414A
CN113868414A
Authority
CN
China
Prior art keywords
vector
bert
processing
module
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110982983.3A
Other languages
Chinese (zh)
Inventor
邓蔚
刘永聪
赵晨曦
刘新星
曹雅筠
高垒
查金豆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Weichuang Technology Co ltd
Original Assignee
Chengdu Weichuang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Weichuang Technology Co ltd filed Critical Chengdu Weichuang Technology Co ltd
Priority to CN202110982983.3A priority Critical patent/CN113868414A/en
Publication of CN113868414A publication Critical patent/CN113868414A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The invention relates to an interpretable legal dispute focus summarization method and system comprising the following steps: setting a slice sequence of a certain length and performing word segmentation and character conversion on the original text through the slice sequence; encoding the processed text information with a BERT prediction model; assigning weights to the encoded vectors with an attention mechanism to obtain a comprehensive vector; feeding the vector into a linear layer to obtain an output vector and applying Sigmoid processing to it to obtain a Probability vector; and predicting the dispute focus of each category and outputting the probability that each category is positive to obtain the focus summarization result. The method assigns weights to the BERT-encoded token vectors through an attention mechanism, and each weight indicates how important its token is to the prediction result, so that a degree of interpretability is provided while the dispute focus summarization performance is maintained.

Description

Interpretable legal dispute focus summarizing method and system
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an interpretable legal dispute focus summarization method and system.
Background
Legal intelligence has received increasing attention in recent years as a way to improve judicial efficiency and provide intelligent case-handling assistance. Current research and applications of legal intelligence are mainly based on traditional machine learning and deep learning techniques. Traditional machine learning methods construct models such as decision trees and random forests, obtain judicial-domain knowledge through information extraction, solve legal intelligence tasks, and provide a certain degree of interpretability. Deep learning techniques adopt representation learning, embedding legal knowledge into vectors for modeling and prediction on top of pre-trained language model embeddings. In legal intelligence applications interpretability is very important: not only must the prediction result of a model be correct, it must also offer people a certain degree of explanation.
At present, research and application of legal intelligence mainly focus on fields such as legal knowledge graphs, legal judgment prediction, legal named entity recognition, legal event extraction, legal information retrieval, legal question answering, dispute focus identification and similar-case matching, while few technical disclosures or reports exist on legal reasoning, interpretable dispute focus identification and related topics.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an interpretable legal dispute focus summarization method and system, solving the problem that the prior art cannot accurately summarize legal dispute focuses.
The invention is realized by the following technical scheme: an interpretable legal dispute focus summarization method, the method comprising:
S1, setting a slice sequence with a certain length, and performing word segmentation and character conversion on the original text through the slice sequence;
S2, encoding the processed text information by using a BERT prediction model;
S3, assigning weights to the encoded vectors by adopting an attention mechanism to obtain a comprehensive vector;
S4, inputting the vector into a fully connected Sigmoid layer, obtaining an output vector with dimension 1 × n through a linear layer, and applying Sigmoid processing to the output vector to obtain a Probability vector;
S5, predicting the dispute focus of each category, and outputting the probability that each category is positive to obtain the focus summarization result.
The setting of a slice sequence with a certain length and the word segmentation and character conversion of the original text through the slice sequence comprise the following steps:
S11, splitting the original text character by character in units of tokens and storing the characters into a list, adding the special character [CLS] at the beginning of the list, and replacing any character not contained in the dictionary with the character [UNK];
S12, setting the maximum length of the slice sequence to n, directly slicing any list longer than n, and appending the character [PAD] to the end of any list shorter than n until its length reaches n;
S13, converting each character in the sliced list into its sequence number in the dictionary.
The encoding of the processed text information using the BERT prediction model comprises: extracting features of the context information through a bidirectional Transformer, feeding the processed data into the BERT prediction model for encoding, and outputting for each token a vector T that encodes the token's context information, thereby realizing a vectorized representation of the token's meaning.
The assigning of weights to the encoded vectors by an attention mechanism to obtain a comprehensive vector comprises the following steps:
S31, applying nonlinear activation to the BERT-encoded output vectors T, excluding the leading special character [CLS], to obtain an activation matrix T′;
S32, multiplying a randomly initialized learnable matrix W with the activation matrix T′ to obtain a vector of length N−1, applying Softmax to this vector to obtain a weight vector A summing to 1, and computing the inner product of the weight vector A with each row of the activation matrix T′ to obtain a vector C′ that integrates the text content.
The summarization method further comprises constructing a BERT prediction model before processing the original text; the step of constructing the BERT prediction model comprises:
constructing a BERT prediction model consisting of a BERT encoding layer, an attention layer and a fully connected Sigmoid layer;
setting the network parameters of the model: L = 12, the number of Transformer layers; H = 768, the internal dimension of the Transformer; A = 12, the number of attention heads;
and pre-training the network parameters of the BERT model with all civil legal documents from China Judgments Online.
An interpretable legal dispute focus summarization system comprises a prediction model construction module, an original text processing module, a text information encoding module, a weight vector generation module, a fully connected Sigmoid layer processing module and a prediction module;
the prediction model construction module is used for constructing a BERT prediction model consisting of a BERT encoding layer, an attention layer and a fully connected Sigmoid layer;
the original text processing module is used for processing the original text and performing word segmentation and character conversion;
the text information encoding module is used for encoding the information output by the original text processing module through the BERT encoding layer of the BERT prediction model;
the weight vector generation module is used for assigning weights to the output vectors of the BERT encoding module by using an attention mechanism;
the fully connected Sigmoid layer processing module is used for passing the vector through a Linear layer to obtain an output vector and applying Sigmoid processing to it to obtain a Probability vector;
the prediction module is used for predicting the dispute focus of each category and outputting the probability that each category is positive to obtain the focus summarization result.
The invention has the following advantages: in the interpretable legal dispute focus summarization method and system, weights are assigned to the BERT-encoded token vectors through an attention mechanism so that the importance of each token to the prediction result can be observed, providing a degree of interpretability while maintaining performance. The BERT model with attention mechanism used by the invention is not only superior to other baseline models in legal dispute focus summarization, but also makes the model interpretable through the weight value the attention mechanism assigns to each character. Feeding a sample into the trained BERT model with attention mechanism yields not only the final dispute focus prediction but also a weight for each character from the attention layer in the model. The characters whose weights fall in the top 15%, ranked from high to low, are then highlighted, so that the highlighted content in the example's dispute focus summary can be observed from the marking result, achieving the goal of interpretability.
Drawings
FIG. 1 is a flow chart of the model construction of the interpretable legal dispute focus summarization method of the present invention, based on a BERT model with attention mechanism;
FIG. 2 is a schematic diagram of text encoding of the BERT pre-training model of the present invention;
FIG. 3 is a schematic diagram of the BERT model with attention mechanism of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
The present invention combines the pre-trained language model BERT with an attention mechanism to provide interpretable weights. The BERT model with attention mechanism comprises a BERT encoding layer, an attention layer and a fully connected Sigmoid layer. The attention layer assigns weights to the token vectors encoded by BERT so that the importance of each token to the prediction result can be observed. The BERT model with attention mechanism can therefore provide a degree of interpretability while maintaining performance.
As shown in fig. 1, one embodiment of the present invention relates to an interpretable legal dispute focus summarization method based on a BERT model with attention mechanism, which specifically includes the following:
S1, original text processing: setting the maximum length of the sequence and performing text word segmentation and character conversion;
Further, the specific implementation comprises the following steps:
S11, segmenting the text character by character (token by token) and storing the characters in a list. A [CLS] special character is then added at the head of the list, and [UNK] replaces any character not included in the dictionary. The maximum sequence length is set to 512: lists longer than 512 are sliced directly, and [PAD] is appended to lists shorter than 512 until their length reaches 512;
S12, converting the characters into ids. The tokens of natural language are converted to numeric ids for use by the model, where an id is the token's sequence number in the dictionary, as sketched below.
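For illustration, steps S11 and S12 can be written in a few lines of Python. This is a minimal sketch under stated assumptions: the names MAX_LEN, preprocess and vocab are ours, not the patent's, and vocab stands in for a BERT-style token-to-id dictionary that reserves entries for [CLS], [UNK] and [PAD].

```python
MAX_LEN = 512

def preprocess(text: str, vocab: dict) -> list:
    # S11: split the text character by character, prepend the [CLS] special
    # character, and replace characters missing from the dictionary with [UNK].
    tokens = ["[CLS]"] + [c if c in vocab else "[UNK]" for c in text]
    # Slice lists longer than 512; pad shorter lists with [PAD] up to 512.
    tokens = tokens[:MAX_LEN]
    tokens += ["[PAD]"] * (MAX_LEN - len(tokens))
    # S12: convert each token to its sequence number (id) in the dictionary.
    return [vocab[t] for t in tokens]
```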
S2, text information encoding: encoding the text by using a BERT pre-training model;
As shown in fig. 2, the BERT pre-training model uses a bidirectional Transformer to extract features from the context information:
The Transformer has strong feature extraction capability. The processed data are fed into the BERT network for encoding; each token has a corresponding output vector T, in which the token's context information is encoded and which is a vectorized representation of the token's meaning;
The parameters of this part of the BERT model are those of the BERT base model: L = 12, the number of Transformer layers; H = 768, the internal dimension of the Transformer; A = 12, the number of attention heads;
The pre-training parameters of the BERT model are the network parameters of the BERT model pre-trained in OpenCLaP on the 26.54 million civil legal documents of China Judgments Online. Because the pre-training data come from the corresponding domain, the pre-trained language model obtains better word-vector representations, and downstream tasks also perform better. A sketch of this encoding step follows.
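The encoding step can be sketched with the HuggingFace transformers library. Note the assumptions: the checkpoint name "bert-base-chinese" is a publicly available stand-in, not the patent's OpenCLaP civil-law weights (those would be loaded the same way from their own path), and text and vocab come from the preprocessing sketch above.

```python
import torch
from transformers import BertModel

# Stand-in checkpoint; the patent uses OpenCLaP civil-law pre-trained weights.
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

input_ids = torch.tensor([preprocess(text, vocab)])    # shape (1, 512), from the S1 sketch
attention_mask = (input_ids != vocab["[PAD]"]).long()  # ignore [PAD] positions

with torch.no_grad():
    T = bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
# T has shape (1, 512, 768): one context-encoding vector per token.
```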
S3, assigning weights to the output vectors of the BERT encoding module by adopting an attention mechanism;
As shown in fig. 3, the attention mechanism assigns weights to the output vectors of the BERT encoding module, which are then integrated into a comprehensive vector C′:
At each prediction step, the attention mechanism assigns weights to the Encoder's information, weights the Encoder information by the normalized weights to obtain a comprehensive vector value C, and finally outputs the prediction result Y through the Decoder part;
S31, nonlinear activation is applied to the BERT-encoded output vectors T, excluding the leading [CLS] special character; the activation function is tanh, and the activated matrix is T′;
S32, the matrix T′ is passed to the attention module: a randomly initialized learnable matrix W is multiplied with T′ to obtain a vector of length N−1, and Softmax is then applied to obtain a weight vector A summing to 1. This yields the attention weight values; the inner product of A with each row of T′ is then computed to obtain a vector C′ that integrates the text content, as in the sketch below.
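A minimal PyTorch sketch of steps S31 and S32, under the assumption that the learnable matrix W is realized as a 768 × 1 projection; the class and variable names are ours, not the patent's.

```python
import torch
import torch.nn as nn

class TokenAttention(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Randomly initialized learnable matrix W.
        self.W = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, T: torch.Tensor):
        # T: (batch, N, 768) BERT output; drop the leading [CLS] vector.
        T_act = torch.tanh(T[:, 1:, :])           # S31: activated matrix T'
        scores = self.W(T_act).squeeze(-1)        # (batch, N-1): W multiplied with T'
        A = torch.softmax(scores, dim=-1)         # S32: weight vector A, sums to 1
        # Inner product of A with each row of T' gives the comprehensive vector C'.
        C = torch.einsum("bn,bnh->bh", A, T_act)  # (batch, 768)
        return C, A                               # A is kept for interpretability
```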
S4, the vector is fed into the fully connected Sigmoid layer: an output vector of dimension 1 × n is obtained through the linear layer, and Sigmoid processing is applied to it to obtain a Probability vector.
The Sigmoid function is
δ(z_j) = 1 / (1 + e^(−z_j))
where δ(z_j) denotes the Sigmoid function applied to the number z_j, and z_j is a single raw output value. Sigmoid processing makes it easier to judge whether each dimension is 0 or 1: any dimension greater than 0.5 is taken as 1, and the rest as 0.
S41, because the dimension of the C′ vector is the same as that of the BERT-encoded output vector of each token, an output vector of dimension 1 × 4 must be obtained through a fully connected layer in order to predict the dispute focus of the 4 categories. In this layer the vector C′ is first passed into one Linear layer to obtain an output vector of dimension 1 × 4;
S42, Sigmoid processing is applied to the output vector to obtain the Probability vector, as sketched below.
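Continuing the sketch above (torch and nn already imported, C taken from the TokenAttention sketch), steps S41 and S42 reduce to one Linear layer followed by Sigmoid; the class name and 0.5 thresholding shown here follow the text.

```python
class DisputeFocusHead(nn.Module):
    def __init__(self, hidden_size: int = 768, num_classes: int = 4):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_classes)  # S41: 1 x 4 output vector

    def forward(self, C: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(C))               # S42: Probability vector

head = DisputeFocusHead()
prob = head(C)                  # C from the TokenAttention sketch above
pred = (prob > 0.5).long()      # dimensions above 0.5 are taken as positive (1)
```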
S5, predicting the dispute focus of each category: the probability that each category is positive is output to obtain the focus summarization result.
Another embodiment of the invention relates to an interpretable legal dispute focus summarization system based on a BERT model with attention mechanism, comprising:
a prediction model construction module: used for constructing a BERT prediction model consisting of a BERT encoding layer, an attention layer and a fully connected Sigmoid layer;
an original text processing module: used for processing the original text and performing word segmentation and character conversion;
a text information encoding module: used for encoding the information output by the original text processing module through the BERT encoding layer of the BERT prediction model;
a weight vector generation module: used for assigning weights to the output vectors of the BERT encoding module by using an attention mechanism;
a fully connected Sigmoid layer processing module: used for passing the vector through a Linear layer to obtain an output vector and applying Sigmoid processing to it to obtain a Probability vector;
a prediction module: used for predicting the dispute focus of each category and outputting the probability that each category is positive to obtain the focus summarization result.
The invention assigns weights to the BERT-encoded token vectors through an attention mechanism so that the importance of each token to the prediction result can be observed, providing a degree of interpretability while maintaining performance. The BERT model with attention mechanism used by the invention is not only superior to other baseline models in legal dispute focus summarization, but also makes the model interpretable through the weight value the attention mechanism assigns to each character. Feeding a sample into the trained BERT model with attention mechanism yields not only the final dispute focus prediction but also a weight for each character from the attention layer in the model. The characters whose weights fall in the top 15%, ranked from high to low, are then highlighted, so that the highlighted content in the example's dispute focus summary can be observed from the marking result, achieving the goal of interpretability, as sketched below.
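As a sketch of this last highlighting step, assuming tokens are the input characters and weights the per-character attention weights returned by the model; the function name and the ** markup are ours (any other markup would do).

```python
def highlight_top_tokens(tokens: list, weights: list, ratio: float = 0.15) -> str:
    # Take the characters whose attention weights fall in the top 15%.
    k = max(1, int(len(tokens) * ratio))
    top = set(sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)[:k])
    # Mark the highlighted characters so the focus-bearing text stands out.
    return "".join(f"**{t}**" if i in top else t for i, t in enumerate(tokens))
```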
The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein; various other combinations, modifications and environments may be resorted to within the scope of the concept disclosed herein, whether through the teachings above or the skill and knowledge of the relevant art. Modifications and variations effected by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (6)

1. An interpretable legal dispute focus summarization method, characterized in that the summarization method comprises the following steps:
S1, setting a slice sequence with a certain length, and performing word segmentation and character conversion on the original text through the slice sequence;
S2, encoding the processed text information by using a BERT prediction model;
S3, assigning weights to the encoded vectors by adopting an attention mechanism to obtain a comprehensive vector;
S4, inputting the vector into a fully connected Sigmoid layer, obtaining an output vector with dimension 1 × n through a linear layer, and applying Sigmoid processing to the output vector to obtain a Probability vector;
S5, predicting the dispute focus of each category, and outputting the probability that each category is positive to obtain the focus summarization result.
2. The interpretable legal dispute focus summarization method according to claim 1, characterized in that the setting of a slice sequence with a certain length and the word segmentation and character conversion of the original text through the slice sequence comprise the following steps:
S11, splitting the original text character by character in units of tokens and storing the characters into a list, adding the special character [CLS] at the beginning of the list, and replacing any character not contained in the dictionary with the character [UNK];
S12, setting the maximum length of the slice sequence to n, directly slicing any list longer than n, and appending the character [PAD] to the end of any list shorter than n until its length reaches n;
S13, converting each character in the sliced list into its sequence number in the dictionary.
3. The interpretable legal dispute focus summarization method according to claim 1, characterized in that the encoding of the processed text information using the BERT prediction model comprises: extracting features of the context information through a bidirectional Transformer, feeding the processed data into the BERT prediction model for encoding, and outputting for each token a vector T that encodes the token's context information, thereby realizing a vectorized representation of the token's meaning.
4. The interpretable legal dispute focus summarization method according to claim 1, characterized in that the assigning of weights to the encoded vectors by an attention mechanism to obtain a comprehensive vector comprises the following steps:
S31, applying nonlinear activation to the BERT-encoded output vectors T, excluding the leading special character [CLS], to obtain an activation matrix T′;
S32, multiplying a randomly initialized learnable matrix W with the activation matrix T′ to obtain a vector of length N−1, applying Softmax to this vector to obtain a weight vector A summing to 1, and computing the inner product of the weight vector A with each row of the activation matrix T′ to obtain a vector C′ that integrates the text content.
5. The interpretable legal dispute focus summarization method according to claim 1, characterized in that the summarization method further comprises constructing a BERT prediction model before processing the original text; the step of constructing the BERT prediction model comprises:
constructing a BERT prediction model consisting of a BERT encoding layer, an attention layer and a fully connected Sigmoid layer;
setting the network parameters of the model: L = 12, the number of Transformer layers; H = 768, the internal dimension of the Transformer; A = 12, the number of attention heads;
and pre-training the network parameters of the BERT model with all civil legal documents from China Judgments Online.
6. An interpretable legal dispute focus summarization system, characterized by comprising a prediction model construction module, an original text processing module, a text information encoding module, a weight vector generation module, a fully connected Sigmoid layer processing module and a prediction module;
the prediction model construction module is used for constructing a BERT prediction model consisting of a BERT encoding layer, an attention layer and a fully connected Sigmoid layer;
the original text processing module is used for processing the original text and performing word segmentation and character conversion;
the text information encoding module is used for encoding the information output by the original text processing module through the BERT encoding layer of the BERT prediction model;
the weight vector generation module is used for assigning weights to the output vectors of the BERT encoding module by using an attention mechanism;
the fully connected Sigmoid layer processing module is used for passing the vector through a Linear layer to obtain an output vector and applying Sigmoid processing to it to obtain a Probability vector;
the prediction module is used for predicting the dispute focus of each category and outputting the probability that each category is positive to obtain the focus summarization result.
CN202110982983.3A 2021-08-25 2021-08-25 Interpretable legal dispute focus summarizing method and system Pending CN113868414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982983.3A CN113868414A (en) 2021-08-25 2021-08-25 Interpretable legal dispute focus summarizing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982983.3A CN113868414A (en) 2021-08-25 2021-08-25 Interpretable legal dispute focus summarizing method and system

Publications (1)

Publication Number Publication Date
CN113868414A true CN113868414A (en) 2021-12-31

Family

ID=78988408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982983.3A Pending CN113868414A (en) 2021-08-25 2021-08-25 Interpretable legal dispute focus summarizing method and system

Country Status (1)

Country Link
CN (1) CN113868414A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457365A (en) * 2022-09-15 2022-12-09 北京百度网讯科技有限公司 Model interpretation method and device, electronic equipment and storage medium
CN115457365B (en) * 2022-09-15 2024-01-05 北京百度网讯科技有限公司 Model interpretation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111625641B (en) Dialog intention recognition method and system based on multi-dimensional semantic interaction representation model
CN110378334B (en) Natural scene text recognition method based on two-dimensional feature attention mechanism
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN111476023B (en) Method and device for identifying entity relationship
CN110134946B (en) Machine reading understanding method for complex data
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN112000801A (en) Government affair text classification and hot spot problem mining method and system based on machine learning
US20230169271A1 (en) System and methods for neural topic modeling using topic attention networks
CN111538809A (en) Voice service quality detection method, model training method and device
CN113868414A (en) Interpretable legal dispute focus summarizing method and system
CN113435208A (en) Student model training method and device and electronic equipment
CN112434512A (en) New word determining method and device in combination with context
CN112598039A (en) Method for acquiring positive sample in NLP classification field and related equipment
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN112818688B (en) Text processing method, device, equipment and storage medium
CN114707829A (en) Target person rescission risk prediction method based on structured data linear expansion
CN115169363A (en) Knowledge-fused incremental coding dialogue emotion recognition method
CN113704472A (en) Hate and offensive statement identification method and system based on topic memory network
CN113076424A (en) Data enhancement method and system for unbalanced text classified data
CN112395422A (en) Text information extraction method and device
CN114818644B (en) Text template generation method, device, equipment and storage medium
CN117113977B (en) Method, medium and system for identifying text generated by AI contained in test paper
Lee et al. A two-level recurrent neural network language model based on the continuous Bag-of-Words model for sentence classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination