CN116610770B - Judicial field case pushing method based on big data - Google Patents

Judicial field case pushing method based on big data

Info

Publication number
CN116610770B
CN116610770B (application number CN202310464853.XA)
Authority
CN
China
Prior art keywords
sample
original
judicial
judicial field
original sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310464853.XA
Other languages
Chinese (zh)
Other versions
CN116610770A (en)
Inventor
王进
王一雄
周羽
李俊莲
曾思盈
周青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huoyan Jinjing Data Services Xiongan Co ltd
Yami Technology Guangzhou Co ltd
Original Assignee
Huoyan Jinjing Data Services Xiongan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huoyan Jinjing Data Services Xiongan Co ltd
Priority to CN202310464853.XA
Publication of CN116610770A
Application granted
Publication of CN116610770B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a judicial field case pushing method based on big data. A judicial field document is uploaded to a database for matching; the document and its matching data are input into a trained similar-case similarity calculation model, which outputs the similarity between the document and each piece of matching data; all similarities are sorted in descending order, and the matching data corresponding to the top k similarities are selected for pushing. The method counters the tendency of document-text features to collapse toward similar representations in the pre-trained model, and performs data augmentation through data perturbation. It thereby avoids the high time and labor cost of constructing supervised samples for similar-case pushing in the judicial field, completes the pushing efficiently, cheaply, and automatically, and helps judicial practitioners quickly obtain information and prior judgment results related to the cases they are handling.

Description

Judicial field case pushing method based on big data
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a judicial field case pushing method based on big data.
Background
In the judicial field, the demand for similar-case pushing arises because judicial personnel need to find, quickly and accurately, cases similar to the one at hand among a large number of cases, so as to better understand the relevant legal provisions, judgment standards, and other circumstances. Most traditional case pushing methods are based on text similarity algorithms, which find cases similar to the current one by matching the cases' text information. Owing to the specificity of judicial case data, such as the complexity and variety of the cases involved and differing judgment standards, case pushing based on a plain text similarity algorithm can hardly reflect the similarity between cases accurately. The demand for automated, intelligent case pushing technology in the judicial field has therefore become increasingly urgent. With the rapid development of technology, big data has driven the digital transformation of the judicial industry, enabling automated analysis and mining of large volumes of case data, thereby better serving judicial practice and improving the scientific soundness and accuracy of judicial decisions. Similar-case pushing technology based on big data thus has broad application prospects in the market.
In recent years, with the rapid development of pre-trained language models, text similarity algorithms have performed increasingly well. The BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model excels at tasks such as text similarity: it can be pre-trained on large-scale unsupervised data to acquire rich semantic information, and then fine-tuned on small-scale supervised data for a specific task. In practical applications, however, pre-trained language models such as BERT prove prone to semantic collapse when processing long texts, i.e., mapping long texts that differ in meaning or expression into nearly identical vector representations. As a result, similarity evaluation performs poorly and can hardly reflect the true similarity between texts.
Disclosure of Invention
The invention aims to provide a judicial field case pushing method based on big data that accounts for the specificity of judicial case data. Conventional text similarity algorithms neglect word order and semantic information in texts, evaluate similarity poorly when texts share surface expressions but differ in meaning, and therefore fail to truly meet the needs of judicial personnel; their case pushing results also lack reliability and verifiability. The invention addresses these problems.
The specific scheme provided by the invention comprises the following steps: uploading the judicial field document to a database for matching; inputting the judicial field document and its matching data into a trained similar-case similarity calculation model, and outputting the similarity between the judicial field document and each piece of matching data; and sorting all the similarities in descending order and selecting the matching data corresponding to the top k similarities for pushing;
the training process of the similar-case similarity calculation model comprises the following steps:
S1, sampling from the acquired judicial field document dataset D to obtain an original sample set with batch_size N;
S2, inputting the original sample set into a text embedding layer and a data perturbation layer to obtain an enhanced sample set, in which the enhanced samples correspond one-to-one to the original samples in the original sample set;
S3, embedding the original sample set that has passed through the text embedding layer and inputting it into the BERT pre-training model to obtain text vector representations of the N original samples, and inputting the enhanced sample set into the BERT pre-training model to obtain text vector representations of the N enhanced samples;
S4, based on the data obtained in step S3, calculating the contrastive learning loss and the reward loss through the Simloss function and the Rewards loss function respectively, and back-propagating to train the parameters;
S5, repeating steps S1-S4 and performing iterative training until the model converges.
Further, in step S1, the batch_size N is calculated as:
where floor() denotes rounding down, K denotes the GPU memory size, M denotes the average GPU memory occupied by each piece of data in the judicial field document dataset D, and S denotes the total number of pieces of data in D.
Further, in step S2, each original sample in the original sample set is input into the text embedding layer and the data perturbation layer to obtain its corresponding enhanced sample, as follows:
S21, converting the original sample into a token sequence according to the BERT model vocabulary;
S22, applying scrambling operations to the token sequence to obtain a new token sequence, the scrambling operations comprising shuffling, dropout, and random substitution;
S23, embedding the new token sequence and then applying an inverse gradient attack to obtain the enhanced sample.
Further, in step S23, an inverse gradient attack is applied to the embedded new token sequence, expressed as:
where x denotes the embedded new token sequence, x_r denotes the enhanced sample, g denotes the gradient, and ε denotes the attack strength.
Further, in step S3, any enhanced sample, or any original sample that has passed through the text embedding layer and been embedded, is input into the BERT pre-training model; obtaining the corresponding text vector representation comprises:
S31, inputting the sample into the BERT pre-training model and obtaining the embedding representation output by each of the last 7 encoder layers;
S32, extracting the CLS vector from each embedding representation, converting each CLS vector into a one-dimensional value using a linear layer, and normalizing to obtain 7 weights;
S33, multiplying each weight by its corresponding CLS vector to obtain a weighted CLS vector, and summing all the weighted CLS vectors to obtain the text vector representation of the sample.
Further, the Simloss function is expressed as:
where h_i denotes the text vector representation of the i-th (i = 1, 2, …, N) original sample, h_i^r denotes the text vector representation of the enhanced sample corresponding to the i-th original sample, sim() denotes a similarity calculation function, and dct() denotes a distance calculation function.
The beneficial effects of the invention are as follows:
By adopting contrastive learning and reward learning, the invention counters the tendency of document-text features to collapse toward one another in the pre-trained model representation, and augments the data through data perturbation. It thereby avoids the long time and high labor cost of constructing supervised samples for similar-case pushing in the judicial field, completes similar-case pushing efficiently, cheaply, and automatically, and helps judicial practitioners quickly obtain information and prior judgment results related to the cases they are handling.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a training flow chart of the similarity calculation model of the present invention;
FIG. 3 is a schematic structural diagram of the similarity calculation model of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a judicial field case pushing method based on big data, as shown in FIG. 1, comprising the following steps: uploading the judicial field document to a database to be matched against other documents; inputting the judicial field document and its matching data into the trained similar-case similarity calculation model, and outputting the similarity between the judicial field document and each piece of matching data; and sorting all the similarities in descending order, selecting the matching data corresponding to the top k similarities, and pushing them in order of decreasing similarity.
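For concreteness, the following is a minimal Python sketch of this pushing step, assuming the trained model is wrapped in a scoring function; the names push_top_k and similarity are illustrative and do not come from the patent.

```python
from typing import Callable, List, Tuple

def push_top_k(query_doc: str,
               candidates: List[str],
               similarity: Callable[[str, str], float],
               k: int = 10) -> List[Tuple[str, float]]:
    # Score every candidate case against the query document, sort the
    # scores in descending order, and return the top-k matches for pushing.
    scored = [(doc, similarity(query_doc, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```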
The training process of the similar-case similarity calculation model, as shown in FIG. 2, includes:
S1, sampling from the acquired judicial field document dataset D to obtain an original sample set with batch_size N.
Specifically, in step S1, the batch_size N is calculated as:
where floor() denotes rounding down, K denotes the GPU memory size, M denotes the average GPU memory occupied by each piece of data in the judicial field document dataset D, and S denotes the total number of pieces of data in D.
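The formula image itself is not reproduced in the text above; the sketch below assumes the natural reading of these definitions, namely N = min(floor(K/M), S), i.e., as many samples as fit in GPU memory, capped by the dataset size. This is an assumption, not the patent's verbatim formula.

```python
import math

def compute_batch_size(K: float, M: float, S: int) -> int:
    # Assumed form of the batch-size rule: N = min(floor(K/M), S), where
    # K is the GPU memory budget, M the average memory per sample, and S
    # the dataset size. The exact formula image is absent from the text.
    return min(math.floor(K / M), S)
```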
S2, inputting the original sample set into a text embedding layer and a data perturbation layer to obtain an enhanced sample set, in which the enhanced samples correspond one-to-one to the original samples in the original sample set.
Specifically, in step S2, each original sample in the original sample set is input into the text embedding layer and the data perturbation layer to obtain its corresponding enhanced sample, as follows:
S21, converting the original sample according to the BERT model vocabulary to obtain a token sequence;
S22, on the premise of preserving the original token sequence, applying scrambling to the token sequence to obtain a new token sequence; the scrambling comprises three operations, namely shuffling, dropout, and random substitution, of which any one to three may be applied;
S23, embedding the new token sequence and then applying an inverse gradient attack to obtain the enhanced sample.
Specifically, in step S23, an inverse gradient attack is applied to the embedded new token sequence, expressed as:
where x denotes the embedded new token sequence, x_r denotes the enhanced sample, g denotes the gradient, and ε denotes the attack strength. The inverse gradient attack perturbs the original sample along the reverse gradient direction while keeping the similarity close, thereby yielding the enhanced sample. Unlike a conventional adversarial attack, the inverse gradient attack proposed by the invention perturbs in the direction that preserves similarity: it generates enhanced data that remains similar to the original while carrying reverse-gradient interference, and this data later serves as a positive example when the contrastive learning loss is calculated.
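A sketch of the perturbation pipeline (S21-S23) under stated assumptions: the scrambling rates are illustrative, and because the attack formula image is absent from the text, the sketch uses a reversed FGM-style step, x_r = x - ε·g/||g||, consistent with the reverse-gradient description above.

```python
import random
import torch

def scramble(tokens: list, vocab: list, p_drop: float = 0.1,
             p_swap: float = 0.1, seed: int = None) -> list:
    # S22 sketch: token dropout, random substitution, and one adjacent
    # swap for mild disorder. Rates and operation mix are assumptions.
    rng = random.Random(seed)
    out = [t for t in tokens if rng.random() >= p_drop]            # dropout
    out = [rng.choice(vocab) if rng.random() < p_swap else t
           for t in out]                                           # substitution
    if len(out) > 1:                                               # local disorder
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def inverse_gradient_attack(emb: torch.Tensor, grad: torch.Tensor,
                            epsilon: float = 1e-2) -> torch.Tensor:
    # S23 sketch: perturb the embedded sequence along the *negative*
    # gradient so the enhanced sample stays close to the original. The
    # update x_r = x - eps * g / ||g|| is an assumed instantiation.
    norm = grad.norm(p=2)
    return emb if norm == 0 else emb - epsilon * grad / norm
```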
S3, embedding the original sample set that has passed through the text embedding layer and inputting it into the BERT pre-training model to obtain text vector representations of the N original samples, and inputting the enhanced sample set into the BERT pre-training model to obtain text vector representations of the N enhanced samples.
Specifically, in step S3, any enhanced sample, or any original sample that has passed through the text embedding layer and been embedded, is input into the BERT pre-training model to obtain the corresponding text vector representation, as shown in FIG. 3; the process includes:
S31, inputting the sample into the BERT pre-training model and obtaining the embedding representation output by each of the last 7 encoder layers;
S32, extracting the CLS vector from each embedding representation, converting each CLS vector into a one-dimensional value using a linear layer, and normalizing separately to obtain 7 weights;
S33, multiplying each weight by its corresponding CLS vector to obtain a weighted CLS vector, and summing all the weighted CLS vectors to obtain the text vector representation of the sample.
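A PyTorch sketch of steps S31-S33: the [CLS] vectors of the last 7 encoder layers are each scored by a linear layer, the scores are normalised into 7 weights, and the weighted CLS vectors are summed. The layer count and pooling scheme follow the text; softmax as the normalisation and the hidden size of 768 are assumptions.

```python
import torch
import torch.nn as nn

class WeightedCLSPooling(nn.Module):
    # Weighted pooling over the [CLS] vectors of the last n_layers encoder
    # layers. hidden_states is the tuple a BERT model returns when called
    # with output_hidden_states=True; each entry is (batch, seq, hidden).
    def __init__(self, hidden_size: int = 768, n_layers: int = 7):
        super().__init__()
        self.n_layers = n_layers
        self.score = nn.Linear(hidden_size, 1)   # CLS vector -> scalar

    def forward(self, hidden_states) -> torch.Tensor:
        cls = torch.stack([h[:, 0] for h in hidden_states[-self.n_layers:]],
                          dim=1)                                       # (B, 7, H)
        weights = torch.softmax(self.score(cls).squeeze(-1), dim=1)    # (B, 7)
        return (weights.unsqueeze(-1) * cls).sum(dim=1)                # (B, H)
```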
S4, based on the data obtained in step S3, calculating the contrastive learning loss and the reward loss through the Simloss function and the Rewards loss function respectively, and back-propagating to train the parameters.
Specifically, the process of calculating the reward loss in step S4 includes:
S41, concatenating the text vector representation of the original sample x_i with the text vector representation of its corresponding enhanced sample x_ri; the concatenated result is linearly mapped by a reward learning layer to the input dimension of the BERT encoder, yielding the first reward vector of the original sample x_i;
S42, excluding the original sample x_i itself and its corresponding enhanced sample x_ri, concatenating the text vector representation of x_i with the text vector representation of any remaining original sample x_j, or with the text vector representation of any other enhanced sample x_rj; the concatenated result is linearly mapped by the reward learning layer to the input dimension of the BERT encoder, yielding the second reward vector of the original sample x_i;
S43, mapping the first reward vector and the second reward vector of the original sample x_i into one dimension through a linear layer, respectively, to obtain the first reward score and the second reward score of x_i;
S44, calculating the current reward loss from the first and second reward scores of the original sample x_i, and back-propagating to train the parameters;
S45, repeating steps S41-S44 until all original samples in the original sample set acquired in the current round have been processed or the model parameters converge.
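A sketch of the reward head in S41-S43, assuming a 768-dimensional encoder. The two linear mappings (pair concatenation to encoder input dimension, then to a scalar) follow the text; the module name and everything else are assumptions. Per S41-S42, a first score pairs x_i's vector with that of its own enhanced sample, and a second score pairs it with an unrelated sample's vector.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    # A reward learning layer maps a concatenated text-vector pair to the
    # encoder input dimension (the "reward vector"); a second linear layer
    # maps that vector to a scalar reward score.
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.reward_layer = nn.Linear(2 * hidden, hidden)
        self.to_score = nn.Linear(hidden, 1)

    def score(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        reward_vec = self.reward_layer(torch.cat([a, b], dim=-1))
        return self.to_score(reward_vec).squeeze(-1)
```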
S5, repeating steps S1-S4 and performing iterative training until the model converges.
Specifically, contrastive learning mainly compares each original sample against its enhanced sample and against the other original samples, so as to minimize the value of the Simloss function: from both the cosine and distance directions, the similarity between an original sample and its enhanced sample is maximized while the similarity among the original samples is minimized, which alleviates the semantic collapse of BERT representation vectors. The Simloss function is expressed as:
where h_i denotes the text vector representation of the i-th (i = 1, 2, …, N) original sample, h_i^r denotes the text vector representation of the enhanced sample corresponding to the i-th original sample, sim() denotes a similarity calculation function, and dct() denotes a distance calculation function.
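The Simloss formula image is not reproduced in the text above. The sketch below is therefore an assumed SimCSE-style instantiation consistent with the stated goals: a cosine-based InfoNCE term pulls each h_i toward its h_i^r and pushes the batch's other samples apart, and a Euclidean term covers the dct() distance direction.

```python
import torch
import torch.nn.functional as F

def sim_loss(h: torch.Tensor, h_r: torch.Tensor,
             tau: float = 0.05) -> torch.Tensor:
    # h: (N, d) original text vectors; h_r: (N, d) enhanced counterparts.
    # Assumed instantiation: each (h_i, h_r_i) pair is a positive; the
    # other rows of the batch act as negatives.
    h_n, hr_n = F.normalize(h, dim=-1), F.normalize(h_r, dim=-1)
    logits = h_n @ hr_n.t() / tau                   # cosine / temperature
    labels = torch.arange(h.size(0), device=h.device)
    info_nce = F.cross_entropy(logits, labels)      # sim() direction
    dist = (h - h_r).pow(2).sum(dim=-1).mean()      # dct() direction
    return info_nce + dist
```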
Specifically, the reward loss is calculated as:
where S_1 denotes the first reward score and S_2 denotes the second reward score.
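The reward-loss formula image is likewise absent from the text. A common pairwise form consistent with S44 (pushing the first reward score above the second) is -log sigmoid(S_1 - S_2), sketched below as an assumption rather than the patent's verbatim formula.

```python
import torch
import torch.nn.functional as F

def reward_loss(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    # Assumed pairwise ranking form: the first reward score (original vs.
    # its own enhanced sample) should exceed the second (original vs. an
    # unrelated sample).
    return -F.logsigmoid(s1 - s2).mean()
```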
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A judicial field case pushing method based on big data, characterized by comprising the following steps: uploading the judicial field document to a database for matching; inputting the judicial field document and its matching data into a trained similar-case similarity calculation model, and outputting the similarity between the judicial field document and each piece of matching data; and sorting all the similarities in descending order and selecting the matching data corresponding to the top k similarities for pushing;
the training process of the similar-case similarity calculation model comprising the following steps:
S1, sampling from the acquired judicial field document dataset D to obtain an original sample set with batch_size N;
S2, inputting the original sample set into a text embedding layer and a data perturbation layer to obtain an enhanced sample set, in which the enhanced samples correspond one-to-one to the original samples in the original sample set;
S3, embedding the original sample set that has passed through the text embedding layer and inputting it into the BERT pre-training model to obtain text vector representations of the N original samples, and inputting the enhanced sample set into the BERT pre-training model to obtain text vector representations of the N enhanced samples;
S4, based on the data obtained in step S3, calculating the contrastive learning loss and the reward loss through the Simloss function and the Rewards loss function respectively, and back-propagating to train the parameters;
S5, repeating steps S1-S4 and performing iterative training until the model converges.
2. The judicial field case pushing method based on big data according to claim 1, wherein in step S1 the batch_size N is calculated as:
where floor() denotes rounding down, K denotes the GPU memory size, M denotes the average GPU memory occupied by each piece of data in the judicial field document dataset D, and S denotes the total number of pieces of data in D.
3. The judicial field case pushing method based on big data according to claim 1, wherein in step S2 each original sample in the original sample set is input into the text embedding layer and the data perturbation layer to obtain its corresponding enhanced sample, comprising the steps of:
S21, converting the original sample into a token sequence according to the BERT model vocabulary;
S22, applying scrambling operations to the token sequence to obtain a new token sequence, the scrambling operations comprising shuffling, dropout, and random substitution;
S23, embedding the new token sequence and then applying an inverse gradient attack to obtain the enhanced sample.
4. The judicial field case pushing method based on big data according to claim 3, wherein in step S23 an inverse gradient attack is applied to the embedded new token sequence, expressed as:
where x denotes the embedded new token sequence, x_r denotes the enhanced sample, g denotes the gradient, and ε denotes the attack strength.
5. The judicial field case pushing method based on big data according to claim 1, wherein in step S3 any enhanced sample, or any original sample that has passed through the text embedding layer and been embedded, is input into the BERT pre-training model, and obtaining the corresponding text vector representation comprises:
S31, inputting the sample into the BERT pre-training model and obtaining the embedding representation output by each of the last 7 encoder layers;
S32, extracting the CLS vector from each embedding representation, converting each CLS vector into a one-dimensional value using a linear layer, and normalizing to obtain 7 weights;
S33, multiplying each weight by its corresponding CLS vector to obtain a weighted CLS vector, and summing all the weighted CLS vectors to obtain the text vector representation of the sample.
6. The judicial field case pushing method based on big data according to claim 1, wherein the Simloss function is expressed as:
where h_i denotes the text vector representation of the i-th (i = 1, 2, …, N) original sample, h_i^r denotes the text vector representation of the enhanced sample corresponding to the i-th original sample, sim() denotes a similarity calculation function, and dct() denotes a distance calculation function.
7. The judicial field case pushing method based on big data according to claim 1, wherein the step S4 process of calculating the reward loss comprises:
S41, concatenating the text vector representation of the original sample x_i with the text vector representation of its corresponding enhanced sample x_ri; the concatenated result is linearly mapped by a reward learning layer to the input dimension of the BERT encoder, yielding the first reward vector of the original sample x_i;
S42, excluding the original sample x_i itself and its corresponding enhanced sample x_ri, concatenating the text vector representation of x_i with the text vector representation of any remaining original sample x_j, or with the text vector representation of any other enhanced sample x_rj; the concatenated result is linearly mapped by the reward learning layer to the input dimension of the BERT encoder, yielding the second reward vector of the original sample x_i;
S43, mapping the first reward vector and the second reward vector of the original sample x_i into one dimension through a linear layer, respectively, to obtain the first reward score and the second reward score of x_i;
S44, calculating the current reward loss from the first and second reward scores of the original sample x_i, and back-propagating to train the parameters;
S45, repeating steps S41-S44 until all original samples in the original sample set acquired in the current round have been processed or the model parameters converge.
8. The judicial field case pushing method based on big data according to claim 7, wherein the reward loss is calculated as:
where S_1 denotes the first reward score and S_2 denotes the second reward score.
CN202310464853.XA 2023-04-26 2023-04-26 Judicial field case pushing method based on big data Active CN116610770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310464853.XA CN116610770B (en) 2023-04-26 2023-04-26 Judicial field case pushing method based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310464853.XA CN116610770B (en) 2023-04-26 2023-04-26 Judicial field case pushing method based on big data

Publications (2)

Publication Number Publication Date
CN116610770A CN116610770A (en) 2023-08-18
CN116610770B (en) 2024-02-27

Family

ID=87677239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310464853.XA Active CN116610770B (en) 2023-04-26 2023-04-26 Judicial field case pushing method based on big data

Country Status (1)

Country Link
CN (1) CN116610770B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291570A (en) * 2018-12-07 2020-06-16 北京国双科技有限公司 Method and device for realizing element identification in judicial documents
CN113191387A (en) * 2021-03-27 2021-07-30 西北大学 Cultural relic fragment point cloud classification method combining unsupervised learning and data self-enhancement
CN113807171A (en) * 2021-08-10 2021-12-17 三峡大学 Text classification method based on semi-supervised transfer learning
CN113705678A (en) * 2021-08-28 2021-11-26 重庆理工大学 Specific target emotion analysis method for enhancing and resisting learning by utilizing word mask data
CN113901207A (en) * 2021-09-15 2022-01-07 昆明理工大学 Adverse drug reaction detection method based on data enhancement and semi-supervised learning
CN114564587A (en) * 2022-03-08 2022-05-31 天津大学 Data enhancement method based on countermeasure training under text classification scene
CN115796141A (en) * 2022-11-24 2023-03-14 华润数字科技有限公司 Text data enhancement method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Chinese Text Classification Methods Based on Long Short-Term Memory Networks; Chen Haiou; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN116610770A (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN110929515A (en) Reading understanding method and system based on cooperative attention and adaptive adjustment
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
CN111259940A (en) Target detection method based on space attention map
CN110516240B (en) Semantic similarity calculation model DSSM (direct sequence spread spectrum) technology based on Transformer
CN113392191A (en) Text matching method and device based on multi-dimensional semantic joint learning
CN116049367A (en) Visual-language pre-training method and device based on non-supervision knowledge enhancement
CN115170874A (en) Self-distillation implementation method based on decoupling distillation loss
CN114330514A (en) Data reconstruction method and system based on depth features and gradient information
CN112035629B (en) Method for implementing question-answer model based on symbolized knowledge and neural network
CN118196472A (en) Recognition method for improving complex and diverse data distribution based on condition domain prompt learning
CN118247393A (en) AIGC-based 3D digital man driving method
CN113486174A (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN118114105A (en) Multimode emotion recognition method and system based on contrast learning and transducer structure
CN116610770B (en) Judicial field case pushing method based on big data
CN117033961A (en) Multi-mode image-text classification method for context awareness
CN113268657B (en) Deep learning recommendation method and system based on comments and item descriptions
CN113792120B (en) Graph network construction method and device, reading and understanding method and device
CN116257618A (en) Multi-source intelligent travel recommendation method based on fine granularity emotion analysis
CN114357166A (en) Text classification method based on deep learning
CN117932487B (en) Risk classification model training and risk classification method and device
CN114996424B (en) Weak supervision cross-domain question-answer pair generation method based on deep learning
CN117909441A (en) Multi-jump answer question framework based on label smoothing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240130

Address after: 071000 Room 217, Zone B, No.1 Cinema Road, Rongcheng Town, Rongcheng County, Baoding City, Hebei Province (self declared)

Applicant after: Huoyan Jinjing Data Services (Xiongan) Co.,Ltd.

Country or region after: China

Address before: Room 801, No. 85, Kefeng Road, Huangpu District, Guangzhou, Guangdong 510000 (office only)

Applicant before: Yami Technology (Guangzhou) Co.,Ltd.

Country or region before: China

Effective date of registration: 20240130

Address after: Room 801, No. 85, Kefeng Road, Huangpu District, Guangzhou, Guangdong 510000 (office only)

Applicant after: Yami Technology (Guangzhou) Co.,Ltd.

Country or region after: China

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Applicant before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

GR01 Patent grant