CN113378563B - Case feature extraction method and device based on genetic variation and semi-supervision - Google Patents

Case feature extraction method and device based on genetic variation and semi-supervision

Info

Publication number
CN113378563B
CN113378563B (application CN202110163512.XA)
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110163512.XA
Other languages
Chinese (zh)
Other versions
CN113378563A (en)
Inventor
孙晓锐
艾中良
贾高峰
刘贤艳
杨哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Judicial Big Data Research Institute Co ltd
Original Assignee
China Judicial Big Data Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Judicial Big Data Research Institute Co ltd filed Critical China Judicial Big Data Research Institute Co ltd
Priority to CN202110163512.XA
Publication of CN113378563A
Application granted
Publication of CN113378563B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/086: Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/18: Legal services; Handling legal documents

Abstract

The invention addresses the problem of case feature extraction when labeled data are scarce, and provides a case feature extraction method and device based on genetic variation, semi-supervised learning and reinforcement learning. The method uses a genetic variation algorithm to couple semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, it overcomes the tendency of conventional semi-supervised learning to overfit and achieves accurate extraction of case features from little labeled data. Specifically, the method constructs a sample data set, obtains the input word vectors required for model training using a word segmentation tool and a word vector generation model, and feeds these word vectors into a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning, which is then trained. The case features of the current case are determined by inputting the pending case documents.

Description

Case feature extraction method and device based on genetic variation and semi-supervision
Technical Field
The invention relates to a case feature extraction method, and in particular to a case feature extraction method and device based on genetic variation, semi-supervised learning and reinforcement learning, belonging to the fields of natural language processing and deep learning.
Background
Case features are the basis on which a judge examines a case. In examining a case, a judge often has to comb through its facts according to the cause of action and the circumstances, which consumes a great deal of time and energy. Helping judges quickly identify the main case features can therefore greatly shorten examination time, improve case-handling efficiency, and help relieve the pressure of heavy caseloads on limited judicial resources.
Early case feature extraction relied mainly on hand-crafted rules: when the text content satisfied certain conditions, the corresponding case feature was deemed present. Although this achieved some results, the extraction rules had to be built manually, requiring many legal experts and knowledge engineers, and the correctness and consistency of the resulting knowledge and rules were hard to guarantee. With the development of machine learning, case feature extraction methods based on supervised classification and unsupervised clustering appeared. As the two main branches of classification methods, supervised classification and unsupervised clustering each have advantages and drawbacks. Supervised classification uses a large number of class-labeled samples as supervision, so the trained classifier predicts unlabeled samples accurately; in the judicial field, however, class-labeled data are scarce, and manually labeling unlabeled data consumes considerable manpower and resources. Unsupervised clustering needs no class labels, but without the guidance of prior information its performance still needs improvement. Semi-supervised learning, which combines a small amount of labeled data with a large amount of unlabeled data, arose to address this situation. In the present invention, therefore, a genetic variation algorithm couples semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, the tendency of conventional semi-supervised learning to overfit is overcome, and case features are extracted accurately from little labeled data.
Disclosure of Invention
The invention relates to a case feature extraction method based on genetic variation, semi-supervised learning and reinforcement learning, which comprises: constructing a sample data set; obtaining the input word vectors required for model training using a word segmentation tool and a word vector generation model; and feeding the word vectors into a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning, which is then trained. Whether the current case possesses a given case feature is determined by inputting the pending case document. The method combs the factual features of the current case at the level of semantic understanding and can improve the quality and efficiency of judges' case handling.
A case feature extraction model training method based on genetic variation, semi-supervision and reinforcement learning comprises the following steps:
Step (1) acquiring an initial sample set: the initial sample set includes a number of labeled samples and a number of unlabeled samples.
Step (2) acquiring a segmented data set: perform word segmentation and stop-word removal on the labeled and unlabeled samples in the initial sample set to obtain a segmented sample data set.
Step (3) enhancing the sample data: perform text data enhancement on the labeled and unlabeled samples in the segmented sample data set to obtain a corresponding data-enhanced sample set.
Step (4) generating the model's input word vectors: according to the word vector generation model, respectively compute the word vector sets of the labeled and unlabeled samples in the data-enhanced sample set.
Step (5) model construction and training: construct and train a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning using the word vector sets.
A case feature extraction method based on genetic variation, semi-supervision and reinforcement learning comprises the following steps:
Step (6): deconstruct the target document of the pending case to obtain its crime facts.
Step (7): according to the crime facts generated in step (6), obtain the word vectors of the target document using the word vector generation model.
Step (8): input the word vectors of the target document generated in step (7) into the case feature extraction model trained in step (5) to obtain the case feature extraction result for the target document.
Further, the text data enhancement processing includes:
training synonym and near-synonym models using the word2vec method;
on the basis of a general-purpose synonym and near-synonym dictionary, constructing a judicial-domain synonym and near-synonym word list using the generated models; and
performing synonym and near-synonym replacement on the labeled and unlabeled samples to obtain n generalized samples for each sample, then pooling the original and generalized samples into the data-enhanced sample set.
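A minimal sketch of this replacement-based enhancement, assuming a toy synonym table in place of the word2vec-derived judicial word list:

```python
import random

# Toy stand-in for the judicial-domain synonym / near-synonym list that
# the text builds from a word2vec model and a general dictionary.
SYNONYMS = {"steal": ["pilfer", "thieve"], "money": ["cash", "funds"]}

def augment(tokens, n, rng=None):
    """Produce n generalized copies of a tokenized sample by swapping
    each listed token for one of its synonyms or near-synonyms."""
    rng = rng or random.Random(0)
    return [[rng.choice(SYNONYMS[t]) if t in SYNONYMS else t
             for t in tokens]
            for _ in range(n)]

# The data-enhanced set pools the original sample with its n copies.
original = ["he", "did", "steal", "money"]
enhanced = [original] + augment(original, n=2)
```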
Further, the training of the case feature extraction model based on genetic variation, semi-supervision and reinforcement learning comprises:
training a base model classifier by using all labeled samples in the data set;
predicting the unlabeled samples by using a base model classifier, and giving the probability that each sample belongs to each class to obtain a first generation of pseudo-labeled sample set;
constructing a second generation training set, a verification set and a test set according to the genetic variation principle;
using the nth generation training set and the verification set to train the case characteristic extraction model, using the test set, the verification set and the training set to check the model, and judging whether to continue the (n + 1) th generation training or to adjust the category of pseudo-labeled data in the nth generation training set;
repeating the above steps to gradually expand the training set until the number of expansion generations or the size of the training set reaches the set value.
Further, constructing the nth-generation training set of target size N_n^std using the genetic variation principle comprises the following steps:
screening the generated pseudo-labeled data by class to obtain per-class pseudo-labeled sample sets W_11, W_12, …, W_1m, where m is the number of classes;
from the (n-1)th-generation pseudo-labeled sample set W_1i of class i, selecting the (N_n^std/m) × 0.8 samples with the highest confidence and adding them to the nth-generation sample set, then randomly selecting, according to confidence, (N_n^std/m) × 0.2 samples from the (n-1)th-generation pseudo-labeled samples and adding them to the nth-generation sample set;
selecting samples of the other classes in turn from the (n-1)th-generation pseudo-labeled sample set by the same method and adding them to the nth-generation sample set, thereby constructing the nth-generation training set.
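A hedged sketch of this per-class, confidence-driven construction. The 80/20 split follows the text; the tie-breaking, the integer rounding, and the exact random-selection rule are assumptions:

```python
import random

def next_generation(pseudo_labeled, n_target, m, rng=None):
    """Build the nth-generation training set from (n-1)th-generation
    pseudo-labeled triples (sample, label, confidence): per class, take
    the top 80% of the class quota n_target/m by confidence, then fill
    the remaining 20% at random from the rest of that class."""
    rng = rng or random.Random(0)
    quota = n_target // m
    top_k = int(quota * 0.8)
    rand_k = quota - top_k
    new_set = []
    for lab in sorted({s[1] for s in pseudo_labeled}):
        cls = sorted((s for s in pseudo_labeled if s[1] == lab),
                     key=lambda s: s[2], reverse=True)
        new_set.extend(cls[:top_k])                 # highest confidence
        new_set.extend(rng.sample(cls[top_k:],      # random remainder
                                  min(rand_k, len(cls) - top_k)))
    return new_set
```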
Further, the training of the case feature extraction model by using the nth generation training set and the verification set, and the verification of the model by using the test set, the verification set and the training set, and the determination of whether to continue the training of the (n + 1) th generation or to adjust the category of the pseudo-labeled data in the nth generation training set include:
training the case feature extraction model on the nth-generation training set and the verification set, then using the model to predict on the training, verification and test sets and computing the precision, recall and F1 values;
when the precision, recall and F1 values on the training, verification and test sets are all greater than or equal to the set thresholds, the current training set and model are considered sufficiently accurate and (n+1)th-generation training proceeds; otherwise, the current training set and corresponding model are considered insufficiently accurate, and the classes of the samples in the training set need to be adjusted.
Further, the method for adjusting the class of the samples in the training set comprises:
when the precision of the training set is below its threshold, changing some low-confidence positive samples to negative; when the recall of the training set is below its threshold, changing some low-confidence negative samples to positive; when both the precision and recall of the training set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
when the precision of the verification set or test set is below its threshold, changing some low-confidence positive samples to negative; when the recall of the verification set or test set is below its threshold, changing some low-confidence negative samples to positive; when both the precision and recall of the verification set or test set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
and repeating the above until the precision, recall and F1 values of the training, verification and test sets are all greater than or equal to the corresponding set thresholds.
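These adjustment rules can be sketched as follows. The fraction of samples flipped (flip_frac) is an assumed knob, since the text only says that some low-confidence samples are changed:

```python
def adjust_labels(samples, precision, recall, p_thr, r_thr, flip_frac=0.1):
    """Trial-and-error label adjustment: when precision is below its
    threshold, flip the lowest-confidence positives to negative; when
    recall is below its threshold, flip the lowest-confidence negatives
    to positive. samples: dicts with 'label' (1/0) and 'confidence'."""
    def flip(from_label, to_label):
        pool = sorted((s for s in samples if s["label"] == from_label),
                      key=lambda s: s["confidence"])
        for s in pool[:max(1, int(len(pool) * flip_frac))]:
            s["label"] = to_label
    if precision < p_thr:
        flip(1, 0)   # low precision: demote doubtful positives
    if recall < r_thr:
        flip(0, 1)   # low recall: promote doubtful negatives
    return samples

samples = [
    {"label": 1, "confidence": 0.9}, {"label": 1, "confidence": 0.2},
    {"label": 0, "confidence": 0.95}, {"label": 0, "confidence": 0.1},
]
adjusted = adjust_labels(samples, precision=0.6, recall=0.6,
                         p_thr=0.8, r_thr=0.8)
```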
Based on the same inventive concept, the invention also provides a case characteristic extraction device based on genetic variation, semi-supervision and reinforcement learning, which comprises:
the model training module is used for training a case characteristic extraction model based on genetic variation, semi-supervision and reinforcement learning by adopting the case characteristic extraction model training method based on genetic variation, semi-supervision and reinforcement learning;
and the characteristic extraction module is used for extracting the case characteristics of the target document by adopting the case characteristic extraction method based on genetic variation, semi-supervision and reinforcement learning.
Compared with the prior art, the invention has the following beneficial effects: the proposed case feature extraction method based on genetic variation, semi-supervised learning and reinforcement learning uses a genetic variation algorithm to couple semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, it overcomes the tendency of conventional semi-supervised learning to overfit and achieves accurate case feature extraction from little labeled data.
Drawings
FIG. 1 shows the regular expression used to deconstruct a document into its crime-fact segment;
FIG. 2 shows an example crime fact;
FIG. 3 shows a crime fact after word segmentation and stop-word removal;
FIG. 4 is a schematic diagram of a case feature extraction method based on genetic variation, semi-supervised and reinforcement learning;
FIG. 5 is a schematic diagram of a neural network structure of a case feature extraction method based on genetic variation, semi-supervision and reinforcement learning.
Detailed Description
The invention aims to solve the problem of case feature extraction when labeled data are scarce, and provides a case feature extraction method based on genetic variation, semi-supervision and reinforcement learning. The case features of the current case are determined by inputting the pending case documents.
To further illustrate the technical solution of the present invention, the above steps are described in detail by the accompanying drawings and specific embodiments:
(1) acquisition of a set W of 50000 official documents already written1={w1 1,w1 2,...w1 50000And deconstructing documents in the judgment document set by adopting a regular expression (as shown in figure 1) to acquire criminal facts found by trial. Gathering the audited and ascertained crime facts of all cases to obtain a set W2={w2 1,w2 2,...w2 50000In which w2 iAs shown in fig. 2.
(2) Use the jieba word segmentation tool to segment each case's crime facts in W2 and remove stop words, obtaining the segmented crime-fact set W3 = {w3_1, w3_2, …, w3_50000}. The specific sub-steps are:
step 2.1: constructing a legal dictionary and a stop word dictionary: on the basis of jieba word segmentation, 100 ten thousand professional legal words are newly added; newly adding about 3000 special stop words for law and other words which are irrelevant to case characteristic recognition and legal recommendation on the stop word list of the Harmony work, such as: a hospital, a court, etc.
Step 2.2: construct and acquire the segmented data: obtain a preliminary segmentation result with jieba based on the self-constructed legal dictionary; then, on the preliminary result, remove stop words and tokens shorter than 2 characters using the self-constructed stop-word list, giving the final segmentation result shown in FIG. 3.
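A minimal sketch of the filtering that follows segmentation, assuming jieba has already produced the token stream (a hand-written token list stands in here, and the stop-word set is a toy stand-in for the self-constructed list):

```python
# Toy stand-in for the self-constructed stop-word list; in practice this
# would be the extended HIT list plus the law-specific additions, and the
# tokens would come from jieba after loading the legal user dictionary.
STOPWORDS = {"本院", "法院", "的", "了"}

def clean_tokens(tokens):
    """Drop stop words and tokens shorter than 2 characters."""
    return [t for t in tokens if t not in STOPWORDS and len(t) >= 2]

tokens = ["本院", "审理", "查明", "被告人", "的", "盗窃", "行为"]
cleaned = clean_tokens(tokens)
```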
(3) Apply text data enhancement to the labeled and unlabeled samples from step (2) using synonym and near-synonym replacement, obtaining the corresponding data-enhanced samples W4 = {w4_1, w4_2, …, w4_50000}. The specific sub-steps are:
step 3.1: and (3) training by using a word2 vent method according to the sample set in the step (2) to obtain synonym and near synonym models.
Step 3.2: constructing a synonym and near synonym model: and on the basis of the universal synonym and near synonym dictionary, constructing a synonym and near synonym word list of the extended judicial field by using the synonym and near synonym model generated in the step 3.1.
Step 3.3: and (3) carrying out synonym and near synonym replacement on the marked and unmarked samples in the step (2) to obtain n generalized samples corresponding to each sample.
Step 3.4: and (4) collecting the generalized samples generated in the step (3.3) and the original samples in the step (2) to obtain a data-enhanced sample data set.
(4) Generate the word vectors of the enhanced sample set, W5 = {w5_1, w5_2, …, w5_50000}, using the gensim word vector generation model. The specific sub-steps are:
Step 4.1: set the gensim word vector model parameters: min_count = 5, filtering out words with frequency below 5, and size = 300, the word vector dimension.
Step 4.2: input the enhanced sample set generated in step (3) into the gensim word vector model to obtain the word vector set W5 = {w5_1, w5_2, …, w5_50000}, and save the word vector model.
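The effect of the min_count = 5 setting can be illustrated with a small stand-alone vocabulary-pruning sketch; gensim applies the same rule internally before training, and size = 300 would then fix each surviving word's vector dimension:

```python
from collections import Counter

def vocabulary(corpus, min_count=5):
    """Mimic word2vec-style vocabulary pruning: keep only words whose
    total corpus frequency is at least min_count."""
    freq = Counter(w for doc in corpus for w in doc)
    return {w for w, c in freq.items() if c >= min_count}

# "theft" occurs 7 times and survives; "robbery" occurs 3 times and is
# pruned before any vector is trained for it.
corpus = [["theft"] * 6, ["theft", "robbery"], ["robbery"] * 2]
vocab = vocabulary(corpus, min_count=5)
```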
(5) Training a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning, wherein the principle schematic diagram of the method is shown in FIG. 4, and the specific sub-steps are as follows:
step 5.1: a base model classifier is trained using all labeled data in the data set (first generation training set). The structure of the basic model classifier is shown in fig. 5, and the model hiding layer includes 3 convolution layers and 1 full-link layer. The sizes of convolution kernels of the 3 convolution layers are 3, 4 and 5 respectively, each convolution layer is provided with 32 convolution kernels, the maximum pooling is adopted in each pooling layer, and the size is 3; the number of neurons in the full junction layer was 64.
Step 5.2: predict the unlabeled data with the base model classifier, giving the probability that each sample is a positive or negative sample, and mark it 1 or 0 accordingly, obtaining the first-generation pseudo-labeled sample set.
Step 5.3: construct the second-generation training set, verification set and test set. Randomly split all labeled data 1:1 into a verification set and a test set, and set the training set from step 5.1 to empty. The samples of the second-generation training set are constructed from the pseudo-labeled data generated in step 5.2 using the genetic variation principle, as follows:
step 5.3.1: and screening out positive samples in the pseudo-marked data generated in the step 5.2, and sorting according to the confidence degree.
Step 5.3.2: select the N_2^std/2 × 0.8 positive samples with the highest confidence from the first-generation pseudo-labeled sample set and add them to the second-generation sample set, then randomly select, according to confidence, N_2^std/2 × 0.2 positive samples from the first-generation pseudo-labeled samples and add them to the second-generation sample set.
Step 5.3.3: select the N_2^std/2 × 0.8 negative samples with the highest confidence from the first-generation pseudo-labeled sample set and add them to the second-generation sample set, then randomly select, according to confidence, N_2^std/2 × 0.2 negative samples from the first-generation pseudo-labeled samples and add them to the second-generation sample set.
Step 5.4: train the case feature extraction model using the second-generation training set and verification set generated in step 5.3, check the model using the test set, verification set and training set, and decide whether to proceed to third-generation training or to adjust the classes of the pseudo-labeled data in the second-generation training set, as follows:
Step 5.4.1: train the case feature extraction model on the second-generation training set and verification set generated in step 5.3, then use the model to predict on the training, verification and test sets and compute the precision, recall and F1 values. When these values on all three sets are greater than or equal to the set thresholds, the current training set and model are considered sufficiently accurate and third-generation training can proceed; otherwise, the current training set and corresponding model are considered insufficiently accurate, and the classes of the samples in the training set need to be adjusted, as follows:
Step 5.4.1.1: when the precision of the training set is below its threshold, change some low-confidence positive samples to negative; when the recall of the training set is below its threshold, change some low-confidence negative samples to positive; when both the precision and recall of the training set are below their thresholds, adjust the classes of low-confidence positive and negative samples simultaneously.
Step 5.4.1.2: when the precision of the verification set or test set is below its threshold, change some low-confidence positive samples to negative; when the recall of the verification set or test set is below its threshold, change some low-confidence negative samples to positive; when both the precision and recall of the verification set or test set are below their thresholds, adjust the classes of low-confidence positive and negative samples simultaneously.
Step 5.4.1.3: repeat step 5.4.1.1 or 5.4.1.2 until the precision, recall and F1 values of the training, verification and test sets are all greater than or equal to the corresponding set thresholds.
Step 5.5: repeat step 5.4 to gradually expand the training set until the number of expansion generations or the size of the training set reaches the set value.
(6) Acquire the crime facts of the current pending case using the same document deconstruction as in step (1).
(7) Obtain the segmented crime facts of the pending case using the same word segmentation and stop-word removal as in step (2).
(8) Input the segmented crime facts of the pending case into the gensim word vector model saved in step (4) to generate the word vectors of the current pending case's crime facts.
(9) Input the word vectors of the pending document into the case feature extraction model trained in step (5) to obtain the case feature extraction result for the pending document.
Based on the same inventive concept, another embodiment of the present invention provides a case feature extraction device based on genetic variation, semi-supervised and reinforcement learning, comprising:
the model training module is used for training a case characteristic extraction model based on genetic variation, semi-supervision and reinforcement learning by adopting the case characteristic extraction model training method based on genetic variation, semi-supervision and reinforcement learning;
and the characteristic extraction module is used for extracting the case characteristics of the target document by adopting the case characteristic extraction method based on genetic variation, semi-supervision and reinforcement learning.
The specific implementation process of each module is referred to the description of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
A case feature extraction method based on genetic variation, semi-supervised and reinforcement learning according to the present invention has been described in detail with reference to the embodiments and the accompanying drawings. The invention has the following advantages: the genetic algorithm couples semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, the tendency of conventional semi-supervised learning to overfit is overcome, and case features are extracted accurately from little labeled data.
It should be understood that the specific embodiments and drawings disclosed above are intended to aid understanding of the principles of the invention, which, as those skilled in the art will appreciate, may be embodied in other ways. The embodiments are exemplary rather than limiting; various alterations, modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope defined by the appended claims.

Claims (8)

1. A case feature extraction model training method based on genetic variation, semi-supervision and reinforcement learning is characterized by comprising the following steps:
constructing an initial sample set containing labeled samples and unlabeled samples;
performing word segmentation on the labeled samples and the unlabeled samples in the initial sample set, and removing stop words to obtain a sample data set after word segmentation;
performing text data enhancement processing on labeled samples and unlabeled samples in the segmented sample data set to obtain a corresponding data enhancement sample set;
respectively calculating and generating word vector sets of labeled samples and unlabeled samples in the data enhancement sample set according to the word vector generation model;
constructing and training a case characteristic extraction model based on genetic variation, semi-supervision and reinforcement learning by utilizing the word vector set;
the training case feature extraction model based on genetic variation, semi-supervision and reinforcement learning comprises the following steps:
training a base model classifier by using all labeled samples in the data set;
predicting the unlabeled samples with the base model classifier and assigning to each sample the probability of belonging to each class, to obtain a first generation pseudo-labeled sample set;
constructing a second generation training set, a verification set and a test set according to the genetic variation principle;
training a case feature extraction model by using an nth generation training set and a verification set, verifying the case feature extraction model by using a test set, the verification set and the training set, and judging whether to continue training the (n + 1) th generation or to adjust the category of pseudo-labeled data in the nth generation training set;
repeating the above steps and gradually expanding the training set until the number of expansion generations or the size of the training set reaches the set number of generations or the set size;
wherein constructing the nth generation training set of target size N_n according to the genetic variation principle comprises:
screening the generated pseudo-labeled data by class to obtain the per-class pseudo-labeled sample sets W_11, W_12, ..., W_1m, wherein m represents the number of classes;
from the i-th class pseudo-labeled sample set W_1i of the (n-1)th generation, selecting the N_n x 0.8/m samples with the highest confidence and adding them to the nth generation sample set, and randomly selecting, according to confidence, N_n x 0.2/m samples from the (n-1)th generation pseudo-labeled samples and adding them to the nth generation sample set;
and selecting samples of the remaining classes in turn from the (n-1)th generation pseudo-labeled sample set and adding them to the nth generation sample set, thereby constructing the nth generation training set.
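The per-generation training set construction in claim 1 can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the function name, the data shapes, and dividing the budget evenly across the m classes are assumptions, while the 80% highest-confidence / 20% random selection follows the claim.

```python
import random

def build_generation(pseudo_labeled, target_size, num_classes, seed=0):
    """Build the nth generation training set from (n-1)th generation
    pseudo-labeled data: per class, take the top 80% of the class budget
    by confidence and fill the remaining 20% at random.

    pseudo_labeled: dict mapping class id -> list of (sample, confidence).
    target_size: desired size N_n of the nth generation training set.
    """
    rng = random.Random(seed)
    per_class = target_size // num_classes   # N_n / m budget per class
    top_k = int(per_class * 0.8)             # highest-confidence picks
    rand_k = per_class - top_k               # random remainder
    generation = []
    for cls, samples in pseudo_labeled.items():
        ranked = sorted(samples, key=lambda s: s[1], reverse=True)
        generation.extend(ranked[:top_k])
        rest = ranked[top_k:]
        generation.extend(rng.sample(rest, min(rand_k, len(rest))))
    return generation
```

The random 20% keeps some lower-confidence diversity in each generation, which is what the genetic "variation" step relies on.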
2. The method of claim 1, wherein the text data enhancement process comprises:
training a word2vec model to obtain a synonym and near-synonym model;
constructing a synonym and near-synonym vocabulary for the judicial field from the generated synonym and near-synonym model, on the basis of a general synonym dictionary and near-synonym dictionary;
performing synonym and near-synonym replacement on the labeled samples and the unlabeled samples to obtain n generalized samples for each sample, and merging the original samples with the generalized samples to obtain the data-enhanced sample set.
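The replacement step of claim 2 can be sketched as below. In the claimed method the synonym table would be derived from a word2vec model plus the judicial-domain dictionaries; the tiny hard-coded English table here is only a stand-in, and the function name and signature are assumptions.

```python
import random

# Stand-in for the word2vec-derived judicial synonym/near-synonym vocabulary.
SYNONYMS = {
    "steal": ["pilfer", "filch"],
    "vehicle": ["car", "automobile"],
}

def augment(tokens, n, seed=0):
    """Return up to n distinct generalized copies of a tokenized sample,
    produced by randomly substituting synonyms for known tokens."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        out = [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t
               for t in tokens]
        if out != tokens and out not in variants:
            variants.append(out)
    return variants
```

The original sample plus its variants together form the data-enhanced set used downstream.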
3. The method according to claim 1, wherein the training of the case feature extraction model using the nth generation training set and the validation set, and the verification of the case feature extraction model using the test set, the validation set, and the training set, and the determination of whether to continue the training of the (n + 1) th generation or to adjust the category of the pseudo-labeled data in the nth generation training set comprise:
training the case feature extraction model on the nth generation training set and validation set, and using the model to predict on the training, validation and test sets and compute the precision, recall and F1 value for each;
when the precision, recall and F1 values on the training, validation and test sets are all greater than or equal to the set precision, recall and F1 thresholds, the current training set and model are considered sufficiently accurate and the (n + 1)th generation of training proceeds; otherwise, the current training set and the corresponding model are considered insufficiently accurate and the classes of the samples in the training set need to be adjusted.
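The three-metric gate of claim 3 reduces to standard precision/recall/F1 arithmetic plus an all-splits threshold check; a minimal sketch (function names and the metric-tuple shape are assumptions):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (standard definitions)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def passes_gate(metrics, p_thr, r_thr, f1_thr):
    """True when every split (train/validation/test) meets all three
    thresholds, i.e. training may advance to the next generation."""
    return all(p >= p_thr and r >= r_thr and f >= f1_thr
               for p, r, f in metrics)
```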
4. The method of claim 3, wherein the method of adjusting the class of the samples in the training set comprises:
when the precision on the training set is below its threshold, changing the class of some low-confidence positive samples to negative; when the recall on the training set is below its threshold, changing the class of some low-confidence negative samples to positive; when both the precision and the recall on the training set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
when the precision on the validation set or the test set is below its threshold, changing the class of some low-confidence positive samples to negative; when the recall on the validation set or the test set is below its threshold, changing the class of some low-confidence negative samples to positive; when both the precision and the recall on the validation set or the test set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
and repeating the above steps until the precision, recall and F1 values on the training, validation and test sets are all greater than or equal to the corresponding set thresholds.
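For the binary case, the adjustment rule of claim 4 can be sketched as follows; the sample dict shape, function names, and flipping exactly `flip_k` samples per pass are illustrative assumptions, while the direction of each flip follows the claim (low precision demotes positives, low recall promotes negatives).

```python
def adjust_labels(samples, precision, recall, p_thr, r_thr, flip_k=1):
    """Flip the classes of the lowest-confidence samples in place.

    samples: list of dicts {"label": 0 or 1, "conf": float}.
    """
    def flip(target_label, new_label):
        # Flip the flip_k lowest-confidence samples of target_label.
        pool = sorted((s for s in samples if s["label"] == target_label),
                      key=lambda s: s["conf"])
        for s in pool[:flip_k]:
            s["label"] = new_label

    if precision < p_thr:   # too many false positives
        flip(1, 0)
    if recall < r_thr:      # too many false negatives
        flip(0, 1)
    return samples
```

The loop in claim 4 would re-train and re-measure after each adjustment until every threshold is met.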
5. A case feature extraction method based on genetic variation, semi-supervision and reinforcement learning is characterized by comprising the following steps:
deconstructing a target document of a case under handling to obtain the crime facts of the case;
obtaining word vectors for the target document from its crime facts by using the word vector generation model;
and inputting the word vectors of the target document into a case feature extraction model trained by the method of any one of claims 1-4, to obtain the case feature extraction result for the target document.
6. A case feature extraction device based on genetic variation, semi-supervision and reinforcement learning is characterized by comprising:
a model training module for training a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning by adopting the method of any one of claims 1-4;
a feature extraction module, configured to extract case features of the target document by using the method of claim 5 and using a trained case feature extraction model based on genetic variation, semi-supervised and reinforcement learning.
7. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 5.
CN202110163512.XA 2021-02-05 2021-02-05 Case feature extraction method and device based on genetic variation and semi-supervision Active CN113378563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110163512.XA CN113378563B (en) 2021-02-05 2021-02-05 Case feature extraction method and device based on genetic variation and semi-supervision


Publications (2)

Publication Number Publication Date
CN113378563A CN113378563A (en) 2021-09-10
CN113378563B true CN113378563B (en) 2022-05-17

Family

ID=77570570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110163512.XA Active CN113378563B (en) 2021-02-05 2021-02-05 Case feature extraction method and device based on genetic variation and semi-supervision

Country Status (1)

Country Link
CN (1) CN113378563B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952290B (en) * 2023-03-09 2023-06-02 太极计算机股份有限公司 Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590195A (en) * 2017-08-14 2018-01-16 百度在线网络技术(北京)有限公司 Textual classification model training method, file classification method and its device
CN109241285A (en) * 2018-08-29 2019-01-18 东南大学 A kind of device of the judicial decision in a case of auxiliary based on machine learning
CN111126578A (en) * 2020-04-01 2020-05-08 阿尔法云计算(深圳)有限公司 Joint data processing method, device and system for model training
CN111723209A (en) * 2020-06-28 2020-09-29 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, device and medium
CN111881654A (en) * 2020-08-01 2020-11-03 牡丹江师范学院 Penalty test data amplification method based on multi-objective optimization
CN112100212A (en) * 2020-09-04 2020-12-18 中国航天科工集团第二研究院 Case scenario extraction method based on machine learning and rule matching
CN112189235A (en) * 2018-03-29 2021-01-05 伯耐沃伦人工智能科技有限公司 Ensemble model creation and selection
CN112232416A (en) * 2020-10-16 2021-01-15 浙江大学 Semi-supervised learning method based on pseudo label weighting

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294593B (en) * 2016-07-28 2019-04-09 浙江大学 In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
CN110532377B (en) * 2019-05-13 2021-09-14 南京大学 Semi-supervised text classification method based on confrontation training and confrontation learning network
US20200027021A1 (en) * 2019-09-27 2020-01-23 Kumara Sastry Reinforcement learning for multi-domain problems
CN111177382B (en) * 2019-12-23 2023-12-08 四川大学 Intelligent legal system recommendation auxiliary system based on FastText algorithm


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Genetic algorithm based semi-feature selection method; Hualong Bu et al.; 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing; 2009-12-25; pp. 521-524 *
Semi-supervised learning based method for identifying documents of cases involving minors; Yang Shenghao et al.; Journal of South China University of Technology (Natural Science Edition); January 2021; Vol. 49, No. 1; pp. 29-38, 46 *
Research on semantic feature extraction from public security case texts based on convolutional neural networks; Lin Zhihong et al.; Mathematics in Practice and Theory; September 2017; Vol. 47, No. 17; pp. 127-140 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant