CN116304033B

CN116304033B - Complaint identification method based on semi-supervision and double-layer multi-classification

Info

Publication number: CN116304033B
Application number: CN202310171687.4A
Authority: CN
Inventors: 张凡凡; 谭晓颖; 李晓智; 刘贤艳; 孙晓锐; 李娜娜; 胡亚谦
Original assignee: China Judicial Big Data Research Institute Co ltd
Current assignee: China Judicial Big Data Research Institute Co ltd
Priority date: 2023-02-27
Filing date: 2023-02-27
Publication date: 2023-11-03
Anticipated expiration: 2043-02-27
Also published as: CN116304033A

Abstract

The invention discloses a complaint (litigation request) recognition method based on semi-supervision and double-layer multi-classification, comprising the following steps: 1) acquiring a training text set comprising a labeled data set and an unlabeled data set; 2) training a teacher model with the labeled data set and obtaining an adaptive threshold for each labeled category; classifying and labeling each piece of unlabeled litigation request data in the unlabeled data set with the trained teacher model to obtain pseudo-labeled data; 3) training a student model with the pseudo-labeled data and the labeled litigation request data; 4) classifying the data in the test set with the student model and calculating the F1 score of each category from the classification results; if the F1 score of a category is lower than a set threshold, training a multi-task model on the samples under that category; 5) connecting the student model and the multi-task model in series to form a double-layer model; 6) inputting the civil complaint text to be classified into the double-layer model to obtain the litigation request category.

Description

Complaint identification method based on semi-supervision and double-layer multi-classification

Technical Field

The invention relates to the field of deep learning text classification, in particular to a semi-supervised double-layer multi-classification complaint recognition method.

Background

The complaints (litigation requests) stated in a complaint document are an important basis on which legal officers examine a case. In actual case acceptance, recognizing and extracting the complaints costs legal staff considerable time and energy, so intelligent recognition and extraction of complaints based on deep learning has become important for improving the case-handling efficiency of legal staff.

In the past, complaints were recognized and extracted mainly by manual checking or by manually formulated extraction rules. The former requires legal professionals to participate throughout, and the latter requires the rule system to be continuously refined according to the legal system and actual conditions, so both greatly reduce the efficiency of law enforcement. With the rapid development of hardware resources, deep learning has also developed rapidly; by exploiting massive data and complex network structures, it has reached a leading level in the current field of artificial intelligence. However, deep learning text classification requires a large amount of labeled data, and in some specialized fields the data must be labeled by workers with expert knowledge, which is time-consuming and labor-intensive and makes such models difficult to deploy in industrial settings. It is therefore desirable to obtain a more robust model from a small number of samples and thereby achieve accurate extraction of complaints.

Disclosure of Invention

To solve the above technical problems, the invention provides a complaint recognition method based on semi-supervision and double-layer multi-classification. The method targets the situation in which only a small amount of labeled data is available, and aims for model results that match or even exceed those of supervised learning with a large amount of labeled data.

A complaint recognition method based on semi-supervision and double-layer multi-classification comprises the following steps:

S1, acquiring a training set of civil complaint texts, comprising a labeled data set containing a small amount of labeled litigation request data and an unlabeled data set containing a large amount of unlabeled litigation request data;

S2, applying data enhancement to the labeled litigation request data in the labeled data set, inputting the enhanced data into a teacher model, and training the teacher model to obtain an adaptive threshold for each labeled category; classifying each piece of unlabeled litigation request data in the unlabeled data set with the trained teacher model, and taking the predicted category as the pseudo label of that piece of data to obtain pseudo-labeled data; then screening the pseudo-labeled data using the adaptive threshold of the corresponding category;

S3, training a student model with the pseudo-labeled data screened in S2 and the data-enhanced labeled litigation request data to obtain the student model model1;

S4, analyzing the prediction results of model1 on the test set; for each category whose single-category F1 score is lower than a fixed threshold, taking the top n categories in the probability value ranking of all misclassified samples under that category to form m different n-class classification tasks, and training a multi-task model on the m n-class classification tasks to obtain model2;

S5, connecting model1 and model2 in series to form the double-layer model model3;

S6, inputting a civil complaint text into model3 to obtain one or more litigation requests (categories).
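The pseudo-label screening in S2 can be illustrated with a minimal Python sketch. It assumes the teacher model returns one probability vector per text and that a pseudo label is kept when its probability is at least the threshold of its predicted category; the function name `screen_pseudo_labels`, the comparison `>=`, and the example values are illustrative assumptions, not details stated in the source.

```python
from typing import Dict, List, Sequence, Tuple


def screen_pseudo_labels(
    texts: Sequence[str],
    probs: Sequence[Sequence[float]],
    thresholds: Dict[int, float],
) -> List[Tuple[str, int]]:
    """Keep (text, pseudo_label) pairs whose confidence reaches the class threshold."""
    kept = []
    for text, p in zip(texts, probs):
        # The teacher's predicted category is the index of the largest probability.
        label = max(range(len(p)), key=lambda c: p[c])
        # Assumption: "screening" keeps a sample when p >= the per-class threshold.
        if p[label] >= thresholds[label]:
            kept.append((text, label))
    return kept


# Hypothetical teacher outputs for three unlabeled litigation requests.
texts = ["request A", "request B", "request C"]
probs = [[0.95, 0.05], [0.60, 0.40], [0.20, 0.80]]
thresholds = {0: 0.90, 1: 0.75}
pseudo_labeled = screen_pseudo_labels(texts, probs, thresholds)
```

Here "request B" is dropped because its top probability (0.60) falls below the threshold of category 0, while the other two samples are kept with their pseudo labels.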

Further, the method for data enhancement of the labeled data set in S2 includes:

extracting the time, place, person name and amount in the complaint data through a named entity recognition model trained in the legal-specific (or general) domain, and processing them as follows: the extracted time is partially increased or decreased; the extracted place is replaced using an existing place lexicon in the field; the extracted person name is replaced using an existing name library in the field; and the extracted amount is partially increased or decreased;

using simple rules to replace the keywords of the complaint classification labels that appear in the complaint data with their synonyms from an existing synonym library in the legal field;

generalizing the remaining part of the data, other than that handled in the two steps above, through vocabulary replacement;

cross-combining the multiple groups of data produced by the three steps above to amplify the data volume to a ratio of 1:10, and taking the amplified data as the final enhanced labeled data.
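The cross-combination step can be sketched in Python as follows. The sketch assumes that the replacement candidates from the three steps are already available per slot, and it reads the 1:10 ratio as one original sample to ten enhanced samples; the template syntax, the function name `cross_combine`, and the candidate lists are hypothetical.

```python
import itertools
import random
from typing import Dict, List


def cross_combine(
    template: str, slots: Dict[str, List[str]], ratio: int = 10, seed: int = 0
) -> List[str]:
    """Fill every slot with each combination of candidates and keep up to `ratio` variants."""
    names = list(slots)
    # Cartesian product of all candidate replacements (the "cross combination").
    combos = list(itertools.product(*(slots[name] for name in names)))
    # Shuffle deterministically so the kept variants are a random subset.
    random.Random(seed).shuffle(combos)
    return [template.format(**dict(zip(names, combo))) for combo in combos[:ratio]]


# Hypothetical candidates: entity replacements, label-keyword synonyms, perturbed values.
template = "{name} requests that the defendant {verb} {amount} yuan before {date}."
slots = {
    "name": ["Zhang San", "Li Si", "Wang Wu"],
    "verb": ["repay", "return"],
    "amount": ["9800", "10000", "10200"],
    "date": ["2022-03-01", "2022-03-04"],
}
augmented = cross_combine(template, slots)
```

With 36 possible combinations, the call keeps ten distinct variants of the one original sentence.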

Further, the adaptive threshold of each class in S2 is determined as follows: the teacher model is trained for multiple rounds on the labeled litigation request data; after each round, the teacher model classifies the samples in the validation set, and for each class the mean of the probability values of the correctly classified samples is calculated; the maximum of these per-class means over all training rounds is then taken as the adaptive threshold of the corresponding class.
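The threshold rule described above can be sketched in Python. The sketch assumes that the validation-set probabilities of every training round have been collected beforehand; the function names and the toy numbers are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Sequence


def round_class_means(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, float]:
    """Per class, the mean probability of the correctly classified validation samples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda c: p[c])
        if pred == y:  # only correctly classified samples contribute
            sums[y] += p[pred]
            counts[y] += 1
    return {c: sums[c] / counts[c] for c in sums}


def adaptive_thresholds(
    per_round_probs: Sequence[Sequence[Sequence[float]]], labels: Sequence[int]
) -> Dict[int, float]:
    """Per class, the maximum of the per-round means over all training rounds."""
    thresholds: Dict[int, float] = {}
    for probs in per_round_probs:
        for c, mean in round_class_means(probs, labels).items():
            thresholds[c] = max(thresholds.get(c, 0.0), mean)
    return thresholds


# Toy validation set with three samples and two training rounds.
labels = [0, 0, 1]
round1 = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
round2 = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]]
thresholds = adaptive_thresholds([round1, round2], labels)
```

In this toy run the class-0 threshold comes from the second round (mean of 0.9 and 0.8) and the class-1 threshold from the first round (0.8), so each class ends up with its own confidence bar.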

Further, the multi-task model in S4 is obtained by training UIE, a unified framework for information extraction.

Further, the method for obtaining the model2 in S4 includes:

wherein T denotes a subtask of the multi-task model, and the m tasks T are mutually distinct; the superscript c of t denotes a single category in complaint recognition, the subscript of t indexes the n categories of the n-class task, and the n categories t are ordered; the data of the m tasks (input data from model1) is used as the input of the multi-task model to train the multi-task model model2.
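One possible reading of the task construction is sketched below in Python. It assumes that a sample belongs to a weak category by its gold label, that a task T is the ordered tuple of the top n categories by model1 probability, and that identical tuples are merged into one task; the names `top_n` and `build_tasks` and the example data are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def top_n(p: Sequence[float], n: int) -> Tuple[int, ...]:
    """Indices of the n highest-probability categories, in descending order."""
    return tuple(sorted(range(len(p)), key=lambda c: -p[c])[:n])


def build_tasks(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    weak_classes: Set[int],
    n: int,
) -> Dict[Tuple[int, ...], List[int]]:
    """Map each distinct ordered top-n tuple (a task T) to the sample indices it covers."""
    tasks: Dict[Tuple[int, ...], List[int]] = {}
    for i, (p, y) in enumerate(zip(probs, labels)):
        ranked = top_n(p, n)
        # Keep only misclassified samples whose gold category has a low F1 score.
        if y in weak_classes and ranked[0] != y:
            tasks.setdefault(ranked, []).append(i)
    return tasks


# Toy test-set outputs of model1 over three categories; category 1 is assumed weak.
probs = [
    [0.50, 0.40, 0.10],
    [0.10, 0.60, 0.30],
    [0.45, 0.35, 0.20],
    [0.20, 0.30, 0.50],
    [0.10, 0.30, 0.60],
]
labels = [1, 1, 1, 2, 1]
tasks = build_tasks(probs, labels, weak_classes={1}, n=2)
```

In this toy case two distinct 2-class tasks arise (m = 2): one separating categories 0 and 1, and one separating categories 2 and 1.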

Further, the method of forming the model1 and the model2 into the dual-layer model3 in S5 in series includes:

the above is the inference structure of model3; whether a piece of inference data needs to enter model2 is decided by the result that the data obtains from model1; wherein layer1 denotes the categories whose single-category F1 score of model1 in the test set prediction results is greater than a fixed threshold, ck is the k-th category in layer1, and k <= L, where L is the number of categories in the multi-class complaint classification; layer2 denotes the categories whose single-category F1 score of model1 in the test set prediction results is smaller than the fixed threshold, for which the multi-task model is needed. Layering is based on whether the single-category F1 score of model1 in the test set prediction results is smaller than the fixed threshold: if the score is greater than the fixed threshold, the data passes only through the first layer (layer1); otherwise it passes through the first layer (layer1) and the second layer (layer2).

Further, in S6, inputting the civil complaint text to be classified into model3 to obtain one or more litigation requests (categories) comprises: first classifying the civil complaint text with the student model model1, and outputting the classification result if the output category belongs to layer1; otherwise, forming the m different n-class classification tasks from the top n categories in the probability value ranking of model1's classification result for the civil complaint text, and inputting the n-class classification tasks into the multi-task model model2 to obtain the litigation request category in the civil complaint text; the categories in layer1 are those whose single-category F1 score of the student model model1 in the test set classification results is greater than the set threshold.

For example, a text to be predicted is input into model3 and model1 outputs category 1; if category 1 belongs to layer1, the text does not enter model2 and the prediction ends directly; otherwise the text enters model2, and the output of model2 is taken as the final output;

on entering model2, the corresponding task is selected according to the ordered top n categories in the final probability value ranking of model1 (which correspond to a task T).
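The routing through the two layers can be sketched in Python as below. The sketch assumes model1 is a callable returning a probability vector and that model2 is represented by one callable per task, keyed by the ordered top-n tuple; the fallback to model1's prediction when no task matches is an added assumption that the source does not describe, and all names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

Model1 = Callable[[str], Sequence[float]]
TaskHead = Callable[[str], int]


def predict_two_layer(
    text: str,
    model1: Model1,
    task_heads: Dict[Tuple[int, ...], TaskHead],
    layer1: Set[int],
    n: int,
) -> int:
    """Return model1's category if it is in layer1, else defer to the matching model2 task."""
    p = model1(text)
    ranked = tuple(sorted(range(len(p)), key=lambda c: -p[c])[:n])
    if ranked[0] in layer1:
        return ranked[0]  # reliable category: prediction ends after the first layer
    head = task_heads.get(ranked)  # task T selected by the ordered top-n categories
    # Assumption: fall back to model1's prediction when no task matches.
    return head(text) if head is not None else ranked[0]


# Stub models standing in for trained networks.
def stub_model1(text: str) -> Sequence[float]:
    return [0.7, 0.2, 0.1] if "divorce" in text else [0.1, 0.5, 0.4]


task_heads = {(1, 2): lambda text: 2}
first = predict_two_layer("divorce request", stub_model1, task_heads, {0}, 2)
second = predict_two_layer("pay compensation", stub_model1, task_heads, {0}, 2)
```

The first text stops at layer1 because category 0 is considered reliable, while the second is routed to the 2-class task over categories 1 and 2.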

A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the above method.

A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the above method.

Compared with the prior art, the complaint recognition method based on semi-supervised learning and double-layer multi-classification provided by the invention has the following beneficial effects:

Advantage 1: compared with existing data enhancement techniques, the multi-directional, precise enhancement of the labeled data makes the teacher model more robust.

Advantage 2: screening the pseudo-labeled data with adaptive thresholds allows the value of the unlabeled data to be better exploited.

Advantage 3: a multi-task model layer is added after the student model to form a double-layer combined model. The subtasks of the second-layer multi-task model correspond to the categories that the student model classifies poorly, so each such category is further trained and classified directly against the n-1 categories with which it is most easily confused. This markedly improves cases in which a single model classifies some categories poorly, and achieves fast recognition and extraction of complaints with little labeled data in the current legal field.

Drawings

FIG. 1 is a flow chart of the enhancement of labeled data.

FIG. 2 is a general flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described fully below with reference to the drawings.

The overall algorithm of the invention is shown in FIG. 2. The complaint recognition method based on semi-supervision and double-layer multi-classification comprises the following steps:

S1, acquiring a training text set, comprising a small labeled data set and a large unlabeled data set;

S2, applying data enhancement to the labeled data (see FIG. 1) and inputting it into a teacher model for training to obtain the teacher model and the adaptive threshold of each class; adding pseudo labels to the unlabeled data with the teacher model, and screening the pseudo-labeled data with the adaptive thresholds;

S3, using a clearlabel tool to further remove noisy data and difficult samples from the data screened in S2;

S4, training the student model with the samples obtained from the further processing in S3 and the labeled data from S1 after data enhancement as input, to obtain the student model model1;

S5, analyzing the prediction results of model1 on the test set; for each category whose single-category F1 score is lower than a fixed threshold, taking the top n categories in the argmax() output value ranking of all misclassified samples under that category to form m different n-class classification tasks, and training a multi-task model on the m n-class classification tasks to obtain model2;

S6, connecting model1 and model2 in series to form the double-layer model model3;

S7, inputting a civil complaint text into model3 to obtain one or more litigation requests (categories) in the complaint;

the method for data enhancement of the labeled data set in S2 includes:

extracting the time, place, person name and amount in the complaint data through a named entity recognition model trained in the legal-specific (or general) domain, and processing them as follows: the extracted time is partially increased or decreased; the extracted place is replaced using an existing place lexicon in the field; the extracted person name is replaced using an existing name library in the field; and the extracted amount is partially increased or decreased;
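The addition and subtraction applied to extracted times and amounts can be illustrated with a short Python sketch. It assumes that the named entity recognition step has already returned the date and amount strings, and it picks a shift of up to 30 days and up to 10 percent purely for illustration; these bounds, the entity dictionary layout, and the function names are assumptions not given in the source.

```python
import random
from datetime import date, timedelta
from typing import Dict


def perturb_date(d: date, rng: random.Random, max_days: int = 30) -> date:
    """Shift a date forward or backward by at most `max_days` days."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


def perturb_amount(amount: float, rng: random.Random, max_ratio: float = 0.1) -> float:
    """Increase or decrease an amount by at most `max_ratio` of its value."""
    return round(amount * (1 + rng.uniform(-max_ratio, max_ratio)), 2)


def augment(text: str, entities: Dict[str, str], rng: random.Random) -> str:
    """Replace the extracted date and amount strings with perturbed values."""
    new_date = perturb_date(date.fromisoformat(entities["time"]), rng)
    new_amount = perturb_amount(float(entities["amount"]), rng)
    text = text.replace(entities["time"], new_date.isoformat())
    return text.replace(entities["amount"], f"{new_amount:.2f}")


# Hypothetical sample with entities as a legal-domain NER model might return them.
sample = "The defendant shall repay 10000.00 yuan borrowed on 2022-03-01."
entities = {"time": "2022-03-01", "amount": "10000.00"}
variant = augment(sample, entities, random.Random(0))
```

Place and person-name replacement would follow the same pattern, with a lexicon lookup in place of the arithmetic perturbation.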

The adaptive threshold of each class in S2 is determined as follows: while the teacher model is trained in S2, after each round the mean of the probability values of all correctly inferred validation-set samples is calculated for each class, and the maximum of these per-class means over the multiple training rounds is taken as the adaptive threshold of that class.

The method of obtaining model2 in S5 includes:

wherein T denotes a subtask of the multi-task model, and the m tasks T are mutually distinct;

the superscript c of t denotes a single category in complaint recognition, the subscript of t indexes the n categories of the n-class task, and the n categories t are ordered;

the data of the m tasks (input data from model1) is used as the input of the multi-task model to train the multi-task model model2.

The method of concatenating model1 and model2 into a two-layer model3 in S6 includes:

wherein layer1 denotes the categories whose single-category F1 score of model1 in the test set prediction results is greater than a fixed threshold, ck is the k-th category in layer1, and k <= L, where L is the number of categories in the multi-class complaint classification;

layer2 denotes the categories whose single-category F1 score of model1 in the test set prediction results is smaller than the fixed threshold, for which the multi-task model is needed;

layering is based on whether the single-category F1 score of model1 in the test set prediction results is smaller than the fixed threshold: if the score is greater than the fixed threshold, the data passes only through the first layer (layer1); otherwise it passes through the first layer (layer1) and the second layer (layer2);
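The layering rule can be sketched in Python as below. The sketch computes the per-category F1 score from test-set predictions and splits the categories at a threshold; placing a category whose score exactly equals the threshold in layer2 is an assumption, since the source only covers the greater-than and smaller-than cases, and the threshold value 0.8 is illustrative.

```python
from typing import List, Sequence, Set, Tuple


def per_class_f1(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> List[float]:
    """F1 score of each category, computed as 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def split_layers(f1: Sequence[float], threshold: float) -> Tuple[Set[int], Set[int]]:
    """layer1: categories with F1 above the threshold; layer2: all remaining categories."""
    layer1 = {c for c, score in enumerate(f1) if score > threshold}
    # Assumption: a score equal to the threshold is treated as layer2.
    layer2 = set(range(len(f1))) - layer1
    return layer1, layer2


# Toy test-set predictions of model1 over three categories.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 1]
f1 = per_class_f1(y_true, y_pred, num_classes=3)
layer1, layer2 = split_layers(f1, threshold=0.8)
```

Category 0 is predicted perfectly and stays in layer1, while categories 1 and 2 are confused with each other and are handed to the second layer.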

the method for inputting the civil complaint text into model3 to obtain one or more litigation requests (categories) in S7 includes:

inputting a text to be predicted into model3, where model1 outputs category 1; if category 1 belongs to layer1, the text does not enter model2 and the prediction ends directly; otherwise the text enters model2, and the output of model2 is taken as the final output;

on entering model2, the corresponding task is selected according to the ordered top n categories in the final argmax() output value ranking of model1 (which correspond to a task T).

It is to be understood that both the foregoing general description and the detailed description of the present invention are intended to provide an overview of the claimed invention. It will be appreciated by those skilled in the art that various alternatives, modifications and variations may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims

1. A complaint recognition method based on semi-supervision and double-layer multi-classification comprises the following steps:

1) Acquiring a training text set: the method comprises the steps of including a labeling data set containing a small amount of labeling litigation request data and an unlabeled data set containing a large amount of unlabeled litigation request data;

2) The labeling litigation request data in the labeling data set are input into a teacher model after data enhancement, and the teacher model is trained to obtain self-adaptive thresholds of labeling categories; classifying each piece of unlabeled litigation request data in the unlabeled data set by using the trained teacher model to obtain the category of each piece of unlabeled litigation request data, and taking the category of each piece of unlabeled litigation request data as a pseudo tag corresponding to the unlabeled litigation request data to obtain pseudo tag labeling data; then, screening the corresponding pseudo tag marking data by using the self-adaptive threshold values of each class;

3) Training a student model by using the pseudo tag labeling data obtained by screening in the step 2) and the labeling litigation request data after data enhancement to obtain a student model1;

4) Classifying and predicting the data in the test set by using the student model1, and calculating the F1 score of each category according to the classification and prediction result; if the F1 score of a category is lower than a set threshold, forming m different n-class classification tasks from the top n categories in the probability value ranking of all misclassified samples under the category, and training a multi-task model through the n-class classification tasks to obtain a multi-task model2;

5) The student model1 and the multi-task model2 are connected in series to form a double-layer model3;

6) Inputting a civil complaint text to be classified into the model3, firstly classifying the civil complaint text through the student model1, and outputting a classification result if the output category belongs to the categories in layer1; otherwise, forming m different n-class classification tasks from the top n categories in the probability value ranking of the classification result of the civil complaint text by the student model1, and inputting the n-class classification tasks into the multi-task model2 to obtain litigation request categories in the civil complaint text; the categories in layer1 are categories in which the single-category F1 score of the student model1 in the test set classification prediction result is larger than a set threshold value.

2. The method of claim 1, wherein the method of data enhancing the annotation data set comprises:

21) For each piece of labeled litigation request data i in the labeled data set, extracting the time, place, person name and amount in the labeled litigation request data i through a named entity recognition model in the legal field; then partially increasing or decreasing the extracted time to obtain the processed time, replacing the extracted place using an existing place lexicon in the legal field to obtain the processed place, replacing the extracted person name using an existing name library in the legal field to obtain the processed person name, and partially increasing or decreasing the extracted amount to obtain the processed amount;

22) Searching for synonyms in the existing synonym library in the legal field according to the tag words in the labeled litigation request data i;

23) Searching for synonyms or paraphrases of the words other than the time, place, person name, amount and tag words in the labeled litigation request data i;

24) Cross-combining the results obtained in steps 21) to 23), and replacing the corresponding information in the labeled litigation request data i to obtain the enhanced labeled data of the labeled litigation request data i.

3. The method according to claim 2, wherein the method for obtaining the adaptive threshold value of each labeling category is: training the teacher model for multiple rounds by using the labeled litigation request data, classifying samples in the verification set by using the teacher model after each round of training, and calculating the average value of probability values corresponding to the correctly classified samples in each classification; and taking the maximum value in the various average values obtained by calculation after multiple rounds of training as the self-adaptive threshold value of the corresponding class.

4. A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1 to 3.

5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 3.