CN113378563B - Case feature extraction method and device based on genetic variation and semi-supervision - Google Patents

Case feature extraction method and device based on genetic variation and semi-supervision

Info

Publication number
CN113378563B
CN113378563B (application CN202110163512.XA)
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110163512.XA
Other languages
Chinese (zh)
Other versions
CN113378563A (en)
Inventor
孙晓锐
艾中良
贾高峰
刘贤艳
杨哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Judicial Big Data Research Institute Co ltd
Original Assignee
China Judicial Big Data Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Judicial Big Data Research Institute Co ltd filed Critical China Judicial Big Data Research Institute Co ltd
Priority to CN202110163512.XA
Publication of CN113378563A
Application granted
Publication of CN113378563B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/086: Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/18: Legal services; Handling legal documents

Abstract

The invention addresses the problem of case feature extraction when labeled data are scarce, and provides a case feature extraction method and device based on genetic variation, semi-supervised learning and reinforcement learning. The method uses a genetic variation algorithm to couple semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, it overcomes the tendency of conventional semi-supervised learning to overfit and achieves accurate extraction of case features from little labeled data. Specifically, the method constructs a sample data set, obtains the input word vectors required for model training using a word segmentation tool and a word vector generation model, and feeds these word vectors into a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning, which is then trained. The case features of the current case are determined by inputting the pending case documents.

Description

Case feature extraction method and device based on genetic variation and semi-supervision
Technical Field
The invention relates to a case feature extraction method, and in particular to a case feature extraction method and device based on genetic variation, semi-supervised learning and reinforcement learning, belonging to the fields of natural language processing and deep learning.
Background
Case features are the basis on which a judge examines a case. In examining a case, a judge often has to comb through its facts according to the cause of action and the circumstances, which consumes a great deal of time and energy. Helping judges quickly identify the main case features can therefore greatly shorten examination time, improve case-handling efficiency, and help relieve the pressure of heavy caseloads on limited judicial resources.
Early case feature extraction relied mainly on hand-crafted rules: when the text content satisfied certain conditions, the corresponding case feature was deemed present. Although this achieved some results, the extraction rules had to be built manually, requiring many legal experts and knowledge engineers, and the correctness and consistency of the resulting knowledge and rules were hard to guarantee. With the development of machine learning, case feature extraction methods based on supervised classification and unsupervised clustering appeared. As the two main branches of classification methods, supervised classification and unsupervised clustering each have advantages and drawbacks. Supervised classification uses a large number of class-labeled samples as supervision, so the trained classifier predicts unlabeled samples accurately; in the judicial field, however, class-labeled data are scarce, and manually labeling unlabeled data consumes considerable manpower and resources. Unsupervised clustering needs no class labels, but without the guidance of prior information its performance still needs improvement. Semi-supervised learning, which combines a small amount of labeled data with a large amount of unlabeled data, arose to address this situation. In the present invention, therefore, a genetic variation algorithm couples semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, the tendency of conventional semi-supervised learning to overfit is overcome, and case features are extracted accurately from little labeled data.
Disclosure of Invention
The invention relates to a case feature extraction method based on genetic variation, semi-supervised learning and reinforcement learning, which comprises: constructing a sample data set; obtaining the input word vectors required for model training using a word segmentation tool and a word vector generation model; and feeding the word vectors into a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning, which is then trained. Whether the current case possesses a given case feature is determined by inputting the pending case document. The method combs the factual features of the current case at the level of semantic understanding and can improve the quality and efficiency of judges' case handling.
A case feature extraction model training method based on genetic variation, semi-supervision and reinforcement learning comprises the following steps:
Step (1) acquiring an initial sample set: the initial sample set includes a number of labeled samples and a number of unlabeled samples.
Step (2) acquiring a segmented data set: perform word segmentation and stop-word removal on the labeled and unlabeled samples in the initial sample set to obtain a segmented sample data set.
Step (3) enhancing the sample data: perform text data enhancement on the labeled and unlabeled samples in the segmented sample data set to obtain a corresponding data-enhanced sample set.
Step (4) generating the model's input word vectors: according to the word vector generation model, respectively compute the word vector sets of the labeled and unlabeled samples in the data-enhanced sample set.
Step (5) model construction and training: construct and train a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning using the word vector sets.
A case feature extraction method based on genetic variation, semi-supervision and reinforcement learning comprises the following steps:
Step (6): deconstruct the target document of the pending case to obtain its crime facts.
Step (7): according to the crime facts generated in step (6), obtain the word vectors of the target document using the word vector generation model.
Step (8): input the word vectors of the target document generated in step (7) into the case feature extraction model trained in step (5) to obtain the case feature extraction result for the target document.
Further, the text data enhancement processing includes:
training synonym and near-synonym models using the word2vec method;
on the basis of a general-purpose synonym and near-synonym dictionary, constructing a judicial-domain synonym and near-synonym word list using the generated models; and
performing synonym and near-synonym replacement on the labeled and unlabeled samples to obtain n generalized samples for each sample, then pooling the original and generalized samples into the data-enhanced sample set.
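A minimal sketch of this replacement-based enhancement, assuming a toy synonym table in place of the word2vec-derived judicial word list:

```python
import random

# Toy stand-in for the judicial-domain synonym / near-synonym list that
# the text builds from a word2vec model and a general dictionary.
SYNONYMS = {"steal": ["pilfer", "thieve"], "money": ["cash", "funds"]}

def augment(tokens, n, rng=None):
    """Produce n generalized copies of a tokenized sample by swapping
    each listed token for one of its synonyms or near-synonyms."""
    rng = rng or random.Random(0)
    return [[rng.choice(SYNONYMS[t]) if t in SYNONYMS else t
             for t in tokens]
            for _ in range(n)]

# The data-enhanced set pools the original sample with its n copies.
original = ["he", "did", "steal", "money"]
enhanced = [original] + augment(original, n=2)
```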
Further, the training of the case feature extraction model based on genetic variation, semi-supervision and reinforcement learning comprises:
training a base model classifier by using all labeled samples in the data set;
predicting the unlabeled samples by using a base model classifier, and giving the probability that each sample belongs to each class to obtain a first generation of pseudo-labeled sample set;
constructing a second generation training set, a verification set and a test set according to the genetic variation principle;
using the nth generation training set and the verification set to train the case characteristic extraction model, using the test set, the verification set and the training set to check the model, and judging whether to continue the (n + 1) th generation training or to adjust the category of pseudo-labeled data in the nth generation training set;
repeating the above steps to gradually expand the training set until the number of expansion generations or the size of the training set reaches the set value.
Further, constructing the nth-generation training set of target size N_n^std using the genetic variation principle comprises the following steps:
screening the generated pseudo-labeled data by class to obtain per-class pseudo-labeled sample sets W_11, W_12, …, W_1m, where m is the number of classes;
from the (n-1)th-generation pseudo-labeled sample set W_1i of class i, selecting the (N_n^std/m) × 0.8 samples with the highest confidence and adding them to the nth-generation sample set, then randomly selecting, according to confidence, (N_n^std/m) × 0.2 samples from the (n-1)th-generation pseudo-labeled samples and adding them to the nth-generation sample set;
selecting samples of the other classes in turn from the (n-1)th-generation pseudo-labeled sample set by the same method and adding them to the nth-generation sample set, thereby constructing the nth-generation training set.
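A hedged sketch of this per-class, confidence-driven construction. The 80/20 split follows the text; the tie-breaking, the integer rounding, and the exact random-selection rule are assumptions:

```python
import random

def next_generation(pseudo_labeled, n_target, m, rng=None):
    """Build the nth-generation training set from (n-1)th-generation
    pseudo-labeled triples (sample, label, confidence): per class, take
    the top 80% of the class quota n_target/m by confidence, then fill
    the remaining 20% at random from the rest of that class."""
    rng = rng or random.Random(0)
    quota = n_target // m
    top_k = int(quota * 0.8)
    rand_k = quota - top_k
    new_set = []
    for lab in sorted({s[1] for s in pseudo_labeled}):
        cls = sorted((s for s in pseudo_labeled if s[1] == lab),
                     key=lambda s: s[2], reverse=True)
        new_set.extend(cls[:top_k])                 # highest confidence
        new_set.extend(rng.sample(cls[top_k:],      # random remainder
                                  min(rand_k, len(cls) - top_k)))
    return new_set
```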
Further, the training of the case feature extraction model by using the nth generation training set and the verification set, and the verification of the model by using the test set, the verification set and the training set, and the determination of whether to continue the training of the (n + 1) th generation or to adjust the category of the pseudo-labeled data in the nth generation training set include:
training the case feature extraction model on the nth-generation training set and the verification set, then using the model to predict on the training, verification and test sets and computing the precision, recall and F1 values;
when the precision, recall and F1 values on the training, verification and test sets are all greater than or equal to the set thresholds, the current training set and model are considered sufficiently accurate and (n+1)th-generation training proceeds; otherwise, the current training set and corresponding model are considered insufficiently accurate, and the classes of the samples in the training set need to be adjusted.
Further, the method for adjusting the class of the samples in the training set comprises:
when the precision of the training set is below its threshold, changing some low-confidence positive samples to negative; when the recall of the training set is below its threshold, changing some low-confidence negative samples to positive; when both the precision and recall of the training set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
when the precision of the verification set or test set is below its threshold, changing some low-confidence positive samples to negative; when the recall of the verification set or test set is below its threshold, changing some low-confidence negative samples to positive; when both the precision and recall of the verification set or test set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
and repeating the above until the precision, recall and F1 values of the training, verification and test sets are all greater than or equal to the corresponding set thresholds.
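These adjustment rules can be sketched as follows. The fraction of samples flipped (flip_frac) is an assumed knob, since the text only says that some low-confidence samples are changed:

```python
def adjust_labels(samples, precision, recall, p_thr, r_thr, flip_frac=0.1):
    """Trial-and-error label adjustment: when precision is below its
    threshold, flip the lowest-confidence positives to negative; when
    recall is below its threshold, flip the lowest-confidence negatives
    to positive. samples: dicts with 'label' (1/0) and 'confidence'."""
    def flip(from_label, to_label):
        pool = sorted((s for s in samples if s["label"] == from_label),
                      key=lambda s: s["confidence"])
        for s in pool[:max(1, int(len(pool) * flip_frac))]:
            s["label"] = to_label
    if precision < p_thr:
        flip(1, 0)   # low precision: demote doubtful positives
    if recall < r_thr:
        flip(0, 1)   # low recall: promote doubtful negatives
    return samples

samples = [
    {"label": 1, "confidence": 0.9}, {"label": 1, "confidence": 0.2},
    {"label": 0, "confidence": 0.95}, {"label": 0, "confidence": 0.1},
]
adjusted = adjust_labels(samples, precision=0.6, recall=0.6,
                         p_thr=0.8, r_thr=0.8)
```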
Based on the same inventive concept, the invention also provides a case characteristic extraction device based on genetic variation, semi-supervision and reinforcement learning, which comprises:
the model training module is used for training a case characteristic extraction model based on genetic variation, semi-supervision and reinforcement learning by adopting the case characteristic extraction model training method based on genetic variation, semi-supervision and reinforcement learning;
and the characteristic extraction module is used for extracting the case characteristics of the target document by adopting the case characteristic extraction method based on genetic variation, semi-supervision and reinforcement learning.
Compared with the prior art, the invention has the following beneficial effects: the proposed case feature extraction method based on genetic variation, semi-supervised learning and reinforcement learning uses a genetic variation algorithm to couple semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, it overcomes the tendency of conventional semi-supervised learning to overfit and achieves accurate case feature extraction from little labeled data.
Drawings
FIG. 1 shows the regular expression used to deconstruct a document into its crime-fact segment;
FIG. 2 shows an example crime fact;
FIG. 3 shows a crime fact after word segmentation and stop-word removal;
FIG. 4 is a schematic diagram of a case feature extraction method based on genetic variation, semi-supervised and reinforcement learning;
FIG. 5 is a schematic diagram of a neural network structure of a case feature extraction method based on genetic variation, semi-supervision and reinforcement learning.
Detailed Description
The invention aims to solve the problem of case feature extraction when labeled data are scarce, and provides a case feature extraction method based on genetic variation, semi-supervision and reinforcement learning. The case features of the current case are determined by inputting the pending case documents.
To further illustrate the technical solution of the present invention, the above steps are described in detail by the accompanying drawings and specific embodiments:
(1) acquisition of a set W of 50000 official documents already written1={w1 1,w1 2,...w1 50000And deconstructing documents in the judgment document set by adopting a regular expression (as shown in figure 1) to acquire criminal facts found by trial. Gathering the audited and ascertained crime facts of all cases to obtain a set W2={w2 1,w2 2,...w2 50000In which w2 iAs shown in fig. 2.
(2) Use the jieba word segmentation tool to segment each case's crime facts in W2 and remove stop words, obtaining the segmented crime-fact set W3 = {w3_1, w3_2, …, w3_50000}. The specific sub-steps are:
step 2.1: constructing a legal dictionary and a stop word dictionary: on the basis of jieba word segmentation, 100 ten thousand professional legal words are newly added; newly adding about 3000 special stop words for law and other words which are irrelevant to case characteristic recognition and legal recommendation on the stop word list of the Harmony work, such as: a hospital, a court, etc.
Step 2.2: construct and acquire the segmented data: obtain a preliminary segmentation result with jieba based on the self-constructed legal dictionary; then, on the preliminary result, remove stop words and tokens shorter than 2 characters using the self-constructed stop-word list, giving the final segmentation result shown in FIG. 3.
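A minimal sketch of the filtering that follows segmentation, assuming jieba has already produced the token stream (a hand-written token list stands in here, and the stop-word set is a toy stand-in for the self-constructed list):

```python
# Toy stand-in for the self-constructed stop-word list; in practice this
# would be the extended HIT list plus the law-specific additions, and the
# tokens would come from jieba after loading the legal user dictionary.
STOPWORDS = {"本院", "法院", "的", "了"}

def clean_tokens(tokens):
    """Drop stop words and tokens shorter than 2 characters."""
    return [t for t in tokens if t not in STOPWORDS and len(t) >= 2]

tokens = ["本院", "审理", "查明", "被告人", "的", "盗窃", "行为"]
cleaned = clean_tokens(tokens)
```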
(3) Apply text data enhancement to the labeled and unlabeled samples from step (2) using synonym and near-synonym replacement, obtaining the corresponding data-enhanced samples W4 = {w4_1, w4_2, …, w4_50000}. The specific sub-steps are:
step 3.1: and (3) training by using a word2 vent method according to the sample set in the step (2) to obtain synonym and near synonym models.
Step 3.2: constructing a synonym and near synonym model: and on the basis of the universal synonym and near synonym dictionary, constructing a synonym and near synonym word list of the extended judicial field by using the synonym and near synonym model generated in the step 3.1.
Step 3.3: and (3) carrying out synonym and near synonym replacement on the marked and unmarked samples in the step (2) to obtain n generalized samples corresponding to each sample.
Step 3.4: and (4) collecting the generalized samples generated in the step (3.3) and the original samples in the step (2) to obtain a data-enhanced sample data set.
(4) Generate the word vectors of the enhanced sample set, W5 = {w5_1, w5_2, …, w5_50000}, using the gensim word vector generation model. The specific sub-steps are:
Step 4.1: set the gensim word vector model parameters: min_count = 5, filtering out words with frequency below 5, and size = 300, the word vector dimension.
Step 4.2: input the enhanced sample set generated in step (3) into the gensim word vector model to obtain the word vector set W5 = {w5_1, w5_2, …, w5_50000}, and save the word vector model.
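The effect of the min_count = 5 setting can be illustrated with a small stand-alone vocabulary-pruning sketch; gensim applies the same rule internally before training, and size = 300 would then fix each surviving word's vector dimension:

```python
from collections import Counter

def vocabulary(corpus, min_count=5):
    """Mimic word2vec-style vocabulary pruning: keep only words whose
    total corpus frequency is at least min_count."""
    freq = Counter(w for doc in corpus for w in doc)
    return {w for w, c in freq.items() if c >= min_count}

# "theft" occurs 7 times and survives; "robbery" occurs 3 times and is
# pruned before any vector is trained for it.
corpus = [["theft"] * 6, ["theft", "robbery"], ["robbery"] * 2]
vocab = vocabulary(corpus, min_count=5)
```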
(5) Training a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning, wherein the principle schematic diagram of the method is shown in FIG. 4, and the specific sub-steps are as follows:
step 5.1: a base model classifier is trained using all labeled data in the data set (first generation training set). The structure of the basic model classifier is shown in fig. 5, and the model hiding layer includes 3 convolution layers and 1 full-link layer. The sizes of convolution kernels of the 3 convolution layers are 3, 4 and 5 respectively, each convolution layer is provided with 32 convolution kernels, the maximum pooling is adopted in each pooling layer, and the size is 3; the number of neurons in the full junction layer was 64.
Step 5.2: predict the unlabeled data with the base model classifier, giving the probability that each sample is a positive or negative sample, and mark it 1 or 0 accordingly, obtaining the first-generation pseudo-labeled sample set.
Step 5.3: construct the second-generation training set, verification set and test set. Randomly split all labeled data 1:1 into a verification set and a test set, and set the training set from step 5.1 to empty. The samples of the second-generation training set are constructed from the pseudo-labeled data generated in step 5.2 using the genetic variation principle, as follows:
step 5.3.1: and screening out positive samples in the pseudo-marked data generated in the step 5.2, and sorting according to the confidence degree.
Step 5.3.2: select the N_2^std/2 × 0.8 positive samples with the highest confidence from the first-generation pseudo-labeled sample set and add them to the second-generation sample set, then randomly select, according to confidence, N_2^std/2 × 0.2 positive samples from the first-generation pseudo-labeled samples and add them to the second-generation sample set.
Step 5.3.3: select the N_2^std/2 × 0.8 negative samples with the highest confidence from the first-generation pseudo-labeled sample set and add them to the second-generation sample set, then randomly select, according to confidence, N_2^std/2 × 0.2 negative samples from the first-generation pseudo-labeled samples and add them to the second-generation sample set.
Step 5.4: train the case feature extraction model using the second-generation training set and verification set generated in step 5.3, check the model using the test set, verification set and training set, and decide whether to proceed to third-generation training or to adjust the classes of the pseudo-labeled data in the second-generation training set, as follows:
Step 5.4.1: train the case feature extraction model on the second-generation training set and verification set generated in step 5.3, then use the model to predict on the training, verification and test sets and compute the precision, recall and F1 values. When these values on all three sets are greater than or equal to the set thresholds, the current training set and model are considered sufficiently accurate and third-generation training can proceed; otherwise, the current training set and corresponding model are considered insufficiently accurate, and the classes of the samples in the training set need to be adjusted, as follows:
Step 5.4.1.1: when the precision of the training set is below its threshold, change some low-confidence positive samples to negative; when the recall of the training set is below its threshold, change some low-confidence negative samples to positive; when both the precision and recall of the training set are below their thresholds, adjust the classes of low-confidence positive and negative samples simultaneously.
Step 5.4.1.2: when the precision of the verification set or test set is below its threshold, change some low-confidence positive samples to negative; when the recall of the verification set or test set is below its threshold, change some low-confidence negative samples to positive; when both the precision and recall of the verification set or test set are below their thresholds, adjust the classes of low-confidence positive and negative samples simultaneously.
Step 5.4.1.3: repeat step 5.4.1.1 or 5.4.1.2 until the precision, recall and F1 values of the training, verification and test sets are all greater than or equal to the corresponding set thresholds.
Step 5.5: repeat step 5.4 to gradually expand the training set until the number of expansion generations or the size of the training set reaches the set value.
(6) Acquire the crime facts of the current pending case using the same document deconstruction as in step (1).
(7) Obtain the segmented crime facts of the pending case using the same word segmentation and stop-word removal as in step (2).
(8) Input the segmented crime facts of the pending case into the gensim word vector model saved in step (4) to generate the word vectors of the current pending case's crime facts.
(9) Input the word vectors of the pending document into the case feature extraction model trained in step (5) to obtain the case feature extraction result for the pending document.
Based on the same inventive concept, another embodiment of the present invention provides a case feature extraction device based on genetic variation, semi-supervised and reinforcement learning, comprising:
the model training module is used for training a case characteristic extraction model based on genetic variation, semi-supervision and reinforcement learning by adopting the case characteristic extraction model training method based on genetic variation, semi-supervision and reinforcement learning;
and the characteristic extraction module is used for extracting the case characteristics of the target document by adopting the case characteristic extraction method based on genetic variation, semi-supervision and reinforcement learning.
The specific implementation process of each module is referred to the description of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.
A case feature extraction method based on genetic variation, semi-supervised and reinforcement learning according to the present invention has been described in detail with reference to the embodiments and the accompanying drawings. The invention has the following advantages: the genetic algorithm couples semi-supervised learning with reinforcement learning; through continuous trial-and-error learning without manual guidance, the tendency of conventional semi-supervised learning to overfit is overcome, and case features are extracted accurately from little labeled data.
It should be understood that the specific embodiments and drawings disclosed above are intended to aid understanding of the principles of the invention, which, as those skilled in the art will appreciate, may be embodied in other ways. The embodiments are exemplary rather than limiting; various alterations, modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope defined by the appended claims.

Claims (8)

1. A case feature extraction model training method based on genetic variation, semi-supervision and reinforcement learning is characterized by comprising the following steps:
constructing an initial sample set containing labeled samples and unlabeled samples;
performing word segmentation on the labeled samples and the unlabeled samples in the initial sample set, and removing stop words to obtain a sample data set after word segmentation;
performing text data enhancement processing on labeled samples and unlabeled samples in the segmented sample data set to obtain a corresponding data enhancement sample set;
respectively calculating and generating word vector sets of labeled samples and unlabeled samples in the data enhancement sample set according to the word vector generation model;
constructing and training a case characteristic extraction model based on genetic variation, semi-supervision and reinforcement learning by utilizing the word vector set;
the training case feature extraction model based on genetic variation, semi-supervision and reinforcement learning comprises the following steps:
training a base model classifier by using all labeled samples in the data set;
predicting the unlabeled samples with the base model classifier and assigning to each sample the probability of belonging to each class, to obtain a first generation pseudo-labeled sample set;
constructing a second generation training set, a verification set and a test set according to the genetic variation principle;
training a case feature extraction model by using an nth generation training set and a verification set, verifying the case feature extraction model by using a test set, the verification set and the training set, and judging whether to continue training the (n + 1) th generation or to adjust the category of pseudo-labeled data in the nth generation training set;
repeating the above steps and gradually expanding the training set until the number of expansion generations or the size of the training set reaches the set number of generations or the set size;
wherein constructing the nth generation training set of target size N_n according to the genetic variation principle comprises:
screening the generated pseudo-labeled data by class to obtain the per-class pseudo-labeled sample sets W_11, W_12, ..., W_1m, wherein m represents the number of classes;
from the i-th class pseudo-labeled sample set W_1i of the (n-1)th generation, selecting the N_n x 0.8/m samples with the highest confidence and adding them to the nth generation sample set, and randomly selecting, according to confidence, N_n x 0.2/m samples from the (n-1)th generation pseudo-labeled samples and adding them to the nth generation sample set;
and selecting samples of the remaining classes in turn from the (n-1)th generation pseudo-labeled sample set and adding them to the nth generation sample set, thereby constructing the nth generation training set.
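The per-generation training set construction in claim 1 can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the function name, the data shapes, and dividing the budget evenly across the m classes are assumptions, while the 80% highest-confidence / 20% random selection follows the claim.

```python
import random

def build_generation(pseudo_labeled, target_size, num_classes, seed=0):
    """Build the nth generation training set from (n-1)th generation
    pseudo-labeled data: per class, take the top 80% of the class budget
    by confidence and fill the remaining 20% at random.

    pseudo_labeled: dict mapping class id -> list of (sample, confidence).
    target_size: desired size N_n of the nth generation training set.
    """
    rng = random.Random(seed)
    per_class = target_size // num_classes   # N_n / m budget per class
    top_k = int(per_class * 0.8)             # highest-confidence picks
    rand_k = per_class - top_k               # random remainder
    generation = []
    for cls, samples in pseudo_labeled.items():
        ranked = sorted(samples, key=lambda s: s[1], reverse=True)
        generation.extend(ranked[:top_k])
        rest = ranked[top_k:]
        generation.extend(rng.sample(rest, min(rand_k, len(rest))))
    return generation
```

The random 20% keeps some lower-confidence diversity in each generation, which is what the genetic "variation" step relies on.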
2. The method of claim 1, wherein the text data enhancement process comprises:
training a word2vec model to obtain a synonym and near-synonym model;
constructing a synonym and near-synonym vocabulary for the judicial field from the generated synonym and near-synonym model, on the basis of a general synonym dictionary and near-synonym dictionary;
performing synonym and near-synonym replacement on the labeled samples and the unlabeled samples to obtain n generalized samples for each sample, and merging the original samples with the generalized samples to obtain the data-enhanced sample set.
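The replacement step of claim 2 can be sketched as below. In the claimed method the synonym table would be derived from a word2vec model plus the judicial-domain dictionaries; the tiny hard-coded English table here is only a stand-in, and the function name and signature are assumptions.

```python
import random

# Stand-in for the word2vec-derived judicial synonym/near-synonym vocabulary.
SYNONYMS = {
    "steal": ["pilfer", "filch"],
    "vehicle": ["car", "automobile"],
}

def augment(tokens, n, seed=0):
    """Return up to n distinct generalized copies of a tokenized sample,
    produced by randomly substituting synonyms for known tokens."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        out = [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t
               for t in tokens]
        if out != tokens and out not in variants:
            variants.append(out)
    return variants
```

The original sample plus its variants together form the data-enhanced set used downstream.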
3. The method according to claim 1, wherein the training of the case feature extraction model using the nth generation training set and the validation set, and the verification of the case feature extraction model using the test set, the validation set, and the training set, and the determination of whether to continue the training of the (n + 1) th generation or to adjust the category of the pseudo-labeled data in the nth generation training set comprise:
training the case feature extraction model on the nth generation training set and validation set, and using the model to predict on the training, validation and test sets and compute the precision, recall and F1 value for each;
when the precision, recall and F1 values on the training, validation and test sets are all greater than or equal to the set precision, recall and F1 thresholds, the current training set and model are considered sufficiently accurate and the (n + 1)th generation of training proceeds; otherwise, the current training set and the corresponding model are considered insufficiently accurate and the classes of the samples in the training set need to be adjusted.
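The three-metric gate of claim 3 reduces to standard precision/recall/F1 arithmetic plus an all-splits threshold check; a minimal sketch (function names and the metric-tuple shape are assumptions):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (standard definitions)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def passes_gate(metrics, p_thr, r_thr, f1_thr):
    """True when every split (train/validation/test) meets all three
    thresholds, i.e. training may advance to the next generation."""
    return all(p >= p_thr and r >= r_thr and f >= f1_thr
               for p, r, f in metrics)
```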
4. The method of claim 3, wherein the method of adjusting the class of the samples in the training set comprises:
when the precision on the training set is below its threshold, changing the class of some low-confidence positive samples to negative; when the recall on the training set is below its threshold, changing the class of some low-confidence negative samples to positive; when both the precision and the recall on the training set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
when the precision on the validation set or the test set is below its threshold, changing the class of some low-confidence positive samples to negative; when the recall on the validation set or the test set is below its threshold, changing the class of some low-confidence negative samples to positive; when both the precision and the recall on the validation set or the test set are below their thresholds, adjusting the classes of low-confidence positive and negative samples simultaneously;
and repeating the above steps until the precision, recall and F1 values on the training, validation and test sets are all greater than or equal to the corresponding set thresholds.
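For the binary case, the adjustment rule of claim 4 can be sketched as follows; the sample dict shape, function names, and flipping exactly `flip_k` samples per pass are illustrative assumptions, while the direction of each flip follows the claim (low precision demotes positives, low recall promotes negatives).

```python
def adjust_labels(samples, precision, recall, p_thr, r_thr, flip_k=1):
    """Flip the classes of the lowest-confidence samples in place.

    samples: list of dicts {"label": 0 or 1, "conf": float}.
    """
    def flip(target_label, new_label):
        # Flip the flip_k lowest-confidence samples of target_label.
        pool = sorted((s for s in samples if s["label"] == target_label),
                      key=lambda s: s["conf"])
        for s in pool[:flip_k]:
            s["label"] = new_label

    if precision < p_thr:   # too many false positives
        flip(1, 0)
    if recall < r_thr:      # too many false negatives
        flip(0, 1)
    return samples
```

The loop in claim 4 would re-train and re-measure after each adjustment until every threshold is met.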
5. A case feature extraction method based on genetic variation, semi-supervision and reinforcement learning is characterized by comprising the following steps:
deconstructing a target document of a case under handling to obtain the crime facts of the case;
obtaining word vectors for the target document from its crime facts by using the word vector generation model;
and inputting the word vectors of the target document into a case feature extraction model trained by the method of any one of claims 1-4, to obtain the case feature extraction result for the target document.
6. A case feature extraction device based on genetic variation, semi-supervision and reinforcement learning is characterized by comprising:
a model training module for training a case feature extraction model based on genetic variation, semi-supervision and reinforcement learning by adopting the method of any one of claims 1-4;
a feature extraction module, configured to extract case features of the target document by using the method of claim 5 and using a trained case feature extraction model based on genetic variation, semi-supervised and reinforcement learning.
7. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 5.
CN202110163512.XA 2021-02-05 2021-02-05 Case feature extraction method and device based on genetic variation and semi-supervision Active CN113378563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110163512.XA CN113378563B (en) 2021-02-05 2021-02-05 Case feature extraction method and device based on genetic variation and semi-supervision


Publications (2)

Publication Number Publication Date
CN113378563A CN113378563A (en) 2021-09-10
CN113378563B true CN113378563B (en) 2022-05-17

Family

ID=77570570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110163512.XA Active CN113378563B (en) 2021-02-05 2021-02-05 Case feature extraction method and device based on genetic variation and semi-supervision

Country Status (1)

Country Link
CN (1) CN113378563B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952290B (en) * 2023-03-09 2023-06-02 太极计算机股份有限公司 Case characteristic labeling method, device and equipment based on active learning and semi-supervised learning

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590195A (en) * 2017-08-14 2018-01-16 百度在线网络技术(北京)有限公司 Textual classification model training method, file classification method and its device
CN109241285A (en) * 2018-08-29 2019-01-18 东南大学 A kind of device of the judicial decision in a case of auxiliary based on machine learning
CN111126578A (en) * 2020-04-01 2020-05-08 阿尔法云计算(深圳)有限公司 Joint data processing method, device and system for model training
CN111723209A (en) * 2020-06-28 2020-09-29 上海携旅信息技术有限公司 Semi-supervised text classification model training method, text classification method, system, device and medium
CN111881654A (en) * 2020-08-01 2020-11-03 牡丹江师范学院 Penalty test data amplification method based on multi-objective optimization
CN112100212A (en) * 2020-09-04 2020-12-18 中国航天科工集团第二研究院 Case scenario extraction method based on machine learning and rule matching
CN112189235A (en) * 2018-03-29 2021-01-05 伯耐沃伦人工智能科技有限公司 Ensemble model creation and selection
CN112232416A (en) * 2020-10-16 2021-01-15 浙江大学 Semi-supervised learning method based on pseudo label weighting

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294593B (en) * 2016-07-28 2019-04-09 浙江大学 In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
US11205103B2 (en) * 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
CN110532377B (en) * 2019-05-13 2021-09-14 南京大学 Semi-supervised text classification method based on confrontation training and confrontation learning network
US20200027021A1 (en) * 2019-09-27 2020-01-23 Kumara Sastry Reinforcement learning for multi-domain problems
CN111177382B (en) * 2019-12-23 2023-12-08 四川大学 Intelligent legal system recommendation auxiliary system based on FastText algorithm


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Genetic algorithm based semi-feature selection method; Hualong Bu et al.; 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing; 2009-12-25; pp. 521-524 *
Semi-supervised learning based method for identifying documents of cases involving minors; Yang Shenghao et al.; Journal of South China University of Technology (Natural Science Edition); January 2021; Vol. 49, No. 1; pp. 29-38, 46 *
Research on semantic feature extraction from public security case texts based on convolutional neural networks; Lin Zhihong et al.; Mathematics in Practice and Theory; September 2017; Vol. 47, No. 17; pp. 127-140 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant