CN116304033B - Complaint identification method based on semi-supervision and double-layer multi-classification - Google Patents

Info

Publication number
CN116304033B
Authority
CN
China
Prior art keywords
data
labeling
classification
model
request data
Prior art date
Legal status
Active
Application number
CN202310171687.4A
Other languages
Chinese (zh)
Other versions
CN116304033A (en)
Inventor
张凡凡
谭晓颖
李晓智
刘贤艳
孙晓锐
李娜娜
胡亚谦
Current Assignee
China Judicial Big Data Research Institute Co ltd
Original Assignee
China Judicial Big Data Research Institute Co ltd
Priority date
2023-02-27
Filing date
2023-02-27
Publication date
2023-11-03
Application filed by China Judicial Big Data Research Institute Co ltd
Priority to CN202310171687.4A
Publication of CN116304033A
Application granted
Publication of CN116304033B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a complaint (litigation request) recognition method based on semi-supervision and double-layer multi-classification, comprising the following steps: 1) acquiring a training text set comprising a labeled data set and an unlabeled data set; 2) training a teacher model on the labeled data set and obtaining an adaptive threshold for each labeled category; classifying and labeling each piece of unlabeled litigation request data in the unlabeled data set with the trained teacher model to obtain pseudo-labeled data; 3) training a student model on the pseudo-labeled data and the labeled litigation request data; 4) using the student model to classify the data in the test set and computing the F1 score of each category from the classification results; if the F1 score of a category is below a set threshold, training on the samples under that category to obtain a multi-task model; 5) connecting the student model and the multi-task model in series to form a double-layer model; 6) inputting the civil complaint text to be classified into the double-layer model to obtain its litigation request category.

Description

Complaint identification method based on semi-supervision and double-layer multi-classification
Technical Field
The invention relates to the field of deep-learning text classification, and in particular to a complaint recognition method based on semi-supervision and double-layer multi-classification.
Background
The complaint (litigation request) stated in the complaint form is an important basis on which a judge examines a case. In actual case acceptance, recognizing and extracting the complaint costs legal staff considerable time and effort, so intelligent recognition and extraction of complaints based on deep learning is important for improving the case-handling efficiency of legal staff.
In the past, complaints were recognized and extracted mainly by manual checking or by manually formulated extraction rules. The former requires legal professionals to participate throughout; the latter requires the rule system to be refined continuously in line with the legal system and actual conditions. Both greatly reduce the efficiency of legal work. With the rapid development of hardware resources, deep learning has also developed rapidly; by exploiting massive data and complex network structures, it has reached a leading level in the field of artificial intelligence. However, deep-learning text classification requires a large amount of labeled data, and in some specialized fields the labeling must be done by workers with expert knowledge, which is time-consuming and labor-intensive and makes such models difficult to deploy in industrial settings. A method is therefore needed that obtains a more robust model from a small number of samples and achieves accurate extraction of complaints.
Disclosure of Invention
To solve the above technical problems, the invention provides a complaint recognition method based on semi-supervision and double-layer multi-classification. The method targets settings in which labeled data is scarce, with the aim that the resulting model matches or even exceeds the performance of supervised learning on a large amount of labeled data.
A complaint recognition method based on semi-supervision and double-layer multi-classification comprises the following steps:
S1, acquiring a training set of civil complaint texts, comprising a labeled data set containing a small amount of labeled litigation request data and an unlabeled data set containing a large amount of unlabeled litigation request data;
S2, performing data enhancement on the labeled litigation request data in the labeled data set, inputting the enhanced data into a teacher model, and training the teacher model to obtain an adaptive threshold for each labeled category; classifying each piece of unlabeled litigation request data in the unlabeled data set with the trained teacher model, and taking the predicted category of each piece as its pseudo label to obtain pseudo-labeled data; then screening the pseudo-labeled data of each category with that category's adaptive threshold;
S3, training a student model on the pseudo-labeled data retained by the screening in step S2 and the data-enhanced labeled litigation request data to obtain student model1;
S4, analyzing the predictions of model1 on the test set: for each category whose single-category F1 score is below a fixed threshold, taking all misclassified samples under that category and selecting the top n categories in the probability ranking of each such sample, thereby forming m different n-way classification tasks; training a multi-task model on these m n-way classification tasks to obtain model2;
S5, connecting model1 and model2 in series to form a double-layer model3;
S6, inputting a civil complaint text into model3 to obtain one or more litigation requests (categories).
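The pseudo-label screening in step S2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `screen_pseudo_labels`, the toy probabilities, and the threshold values are hypothetical, and the use of `>=` (rather than `>`) for the comparison is an assumption, since the text does not say how a probability equal to the threshold is handled.

```python
from typing import Dict, List, Sequence, Tuple


def screen_pseudo_labels(
    probs: Sequence[Sequence[float]],
    thresholds: Dict[int, float],
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs that pass the per-class threshold.

    probs[i] is the teacher model's class-probability vector for unlabeled sample i.
    thresholds[c] is the adaptive threshold of class c.
    """
    kept = []
    for i, p in enumerate(probs):
        # The pseudo label is the teacher's most probable class.
        label = max(range(len(p)), key=p.__getitem__)
        # Assumption: a sample is kept when its confidence reaches the class threshold.
        if p[label] >= thresholds[label]:
            kept.append((i, label))
    return kept


# Hypothetical teacher outputs for three unlabeled texts and three classes.
teacher_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.75, 0.15],
]
class_thresholds = {0: 0.80, 1: 0.70, 2: 0.85}
print(screen_pseudo_labels(teacher_probs, class_thresholds))  # [(0, 0), (2, 1)]
```

Because each class has its own threshold, a class on which the teacher is systematically less confident is not starved of pseudo labels the way it would be under a single global cutoff.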
Further, the method for data enhancement of the labeled data set in S2 includes: extracting the times, places, person names and amounts in the complaint data through a named entity recognition model trained in the legal-specific (or general) domain. Times are processed by applying partial addition and subtraction operations to the extracted time; places are replaced with entries from an existing place lexicon in the field; person names are replaced with entries from an existing name library in the field; and amounts are processed by applying partial addition and subtraction operations to the extracted amount;
using simple rules, the words in the complaint data that correspond to the keywords of the complaint classification labels are replaced with their synonyms from an existing synonym library in the legal field;
the remaining data, not covered by the two steps above, are generalized through vocabulary replacement;
the groups of data produced by the three steps above are cross-combined to expand the data volume at a ratio of 1:10, and the expanded data are taken as the final enhanced labeled data.
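The cross-combination step above can be sketched as follows. This is only an illustration under stated assumptions: the slot names, the example sentence, and the variant lists are invented, the entity extraction and synonym lookup are assumed to have already produced the variant lists, and taking a shuffled sample of 10 combinations is one possible reading of the 1:10 ratio.

```python
import itertools
import random
from typing import Dict, List


def cross_combine(
    template: str,
    slot_variants: Dict[str, List[str]],
    limit: int = 10,
    seed: int = 0,
) -> List[str]:
    """Cross-combine per-slot replacement variants into at most `limit` texts."""
    names = sorted(slot_variants)
    # Cartesian product of all variant lists: every combination of replacements.
    combos = list(itertools.product(*(slot_variants[name] for name in names)))
    # Shuffle deterministically so the kept subset mixes all slots.
    random.Random(seed).shuffle(combos)
    return [template.format(**dict(zip(names, combo))) for combo in combos[:limit]]


# Hypothetical labeled sample with entity and keyword slots already located.
template = (
    "On {date}, {name} lent the defendant {amount} yuan in {place}; "
    "the plaintiff {verb} repayment."
)
variants = {
    "date": ["2021-03-01", "2021-03-04", "2021-02-26"],  # time shifted by a few days
    "name": ["Zhang San", "Li Si"],                      # name library replacement
    "amount": ["50000", "50500", "49800"],               # amount shifted slightly
    "place": ["Beijing", "Shanghai"],                    # place lexicon replacement
    "verb": ["requests", "demands"],                     # label keyword synonym
}
augmented = cross_combine(template, variants)
print(len(augmented))  # 10 augmented texts for 1 original
```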
Further, the adaptive threshold of each category in S2 is determined as follows: the teacher model is trained for multiple rounds on the labeled litigation request data; after each round, the teacher model classifies the samples in the validation set, and for each category the mean of the predicted probabilities of its correctly classified samples is computed; the maximum of these per-round means over all training rounds is then taken as the adaptive threshold of the corresponding category.
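A minimal sketch of this threshold computation is given below. The function names and the toy validation data are hypothetical, and the sketch assumes that the "probability value" of a correctly classified sample is the probability the teacher assigns to its predicted (and true) class.

```python
from collections import defaultdict
from typing import Dict, Sequence


def epoch_class_means(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> Dict[int, float]:
    """Per-class mean confidence over correctly classified validation samples."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)
        if pred == y:  # only correctly classified samples contribute
            sums[y] += p[pred]
            counts[y] += 1
    return {c: sums[c] / counts[c] for c in counts}


def adaptive_thresholds(
    per_epoch_probs: Sequence[Sequence[Sequence[float]]], labels: Sequence[int]
) -> Dict[int, float]:
    """Per-class maximum of the per-epoch means."""
    best: Dict[int, float] = {}
    for probs in per_epoch_probs:
        for c, mean in epoch_class_means(probs, labels).items():
            best[c] = max(best.get(c, 0.0), mean)
    return best


# Hypothetical validation set with three samples, evaluated after two rounds.
val_labels = [0, 0, 1]
round1 = [[0.6, 0.4], [0.7, 0.3], [0.45, 0.55]]  # class 0 mean 0.65, class 1 mean 0.55
round2 = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]    # class 0 mean 0.80, class 1 mean 0.70
thresholds = adaptive_thresholds([round1, round2], val_labels)
print(thresholds)
```

A class that is never classified correctly in any round gets no entry in the result; how such a class should be thresholded is not stated in the text.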
Further, the multi-task model in S4 is obtained by training the unified information extraction framework UIE.
Further, the method for obtaining model2 in S4 includes:
wherein T denotes a subtask of the multi-task model, and the m tasks T are mutually different; the superscript c of t denotes a single category in complaint recognition, the subscript of t indexes the categories of the n-way classification, and the n categories t are ordered; the data of the m tasks (input data from model1) are used as the input of the multi-task model to train multi-task model2.
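How the m ordered n-way tasks might be collected from model1's test-set predictions is sketched below. This is an interpretation, not the patent's code: it assumes that "error samples under a category" means samples whose true label is a low-F1 category and whose top prediction is wrong, and it represents each task T as the ordered tuple of the sample's top n predicted categories. All names and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Task = Tuple[int, ...]


def top_n(p: Sequence[float], n: int) -> Task:
    """Ordered tuple of the n most probable class indices."""
    return tuple(sorted(range(len(p)), key=lambda c: p[c], reverse=True)[:n])


def build_tasks(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    weak_classes: Set[int],
    n: int,
) -> Dict[Task, List[int]]:
    """Group misclassified samples of weak classes by their ordered top-n categories."""
    tasks: Dict[Task, List[int]] = {}
    for i, (p, y) in enumerate(zip(probs, labels)):
        ranked = top_n(p, n)
        # Assumption: only misclassified samples whose true class is weak are used.
        if y in weak_classes and ranked[0] != y:
            tasks.setdefault(ranked, []).append(i)
    return tasks


# Hypothetical model1 outputs on five test samples over four categories.
test_probs = [
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.35, 0.40, 0.20, 0.05],
    [0.30, 0.10, 0.50, 0.10],
    [0.20, 0.30, 0.45, 0.05],
]
test_labels = [1, 1, 0, 2, 1]
tasks = build_tasks(test_probs, test_labels, weak_classes={1}, n=2)
print(tasks)  # {(0, 1): [0], (2, 1): [4]}, so m = 2 distinct 2-way tasks
```

Keying tasks by an ordered tuple keeps (0, 1) and (1, 0) distinct, which matches the statement that the n categories of a task are ordered.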
Further, the method of connecting model1 and model2 in series to form the double-layer model3 in S5 includes:
the above is a representation of the inference structure of model3; it indicates whether a piece of inference data needs to enter model2, which is determined by the result model1 produces for that data. Here layer1 denotes the categories whose single-category F1 score for model1 on the test set is greater than a fixed threshold, and ck is the k-th category in layer1, with k <= L, where L is the number of categories in the complaint multi-classification; layer2 denotes the categories whose single-category F1 score for model1 on the test set is smaller than the fixed threshold, for which the multi-task model must be used. Layering is thus based on whether the single-category F1 score of model1 on the test set is smaller than the fixed threshold: if the score is larger than the threshold, the data pass only through the first layer (layer1); otherwise they pass through both the first layer (layer1) and the second layer (layer2).
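The layering rule can be sketched in a few lines of Python. The helper names and the toy labels are hypothetical; the F1 formula is the standard 2TP / (2TP + FP + FN). Placing a category whose score exactly equals the threshold in layer2 is an assumption, since the text only covers "greater than" and "smaller than".

```python
from typing import Dict, Sequence, Set, Tuple


def per_class_f1(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int
) -> Dict[int, float]:
    """Single-category F1 score for every class, computed one-vs-rest."""
    scores = {}
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def split_layers(f1: Dict[int, float], threshold: float) -> Tuple[Set[int], Set[int]]:
    """layer1: classes model1 handles alone; layer2: classes routed on to model2."""
    layer1 = {c for c, score in f1.items() if score > threshold}
    layer2 = set(f1) - layer1  # assumption: ties go to layer2
    return layer1, layer2


# Hypothetical test-set labels and model1 predictions for three categories.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 2, 2, 2, 2]
f1 = per_class_f1(y_true, y_pred, num_classes=3)
layer1, layer2 = split_layers(f1, threshold=0.7)
print(f1)              # class 0: 1.0, class 1: 0.5, class 2: about 0.667
print(layer1, layer2)  # {0} {1, 2}
```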
Further, the method in S6 for inputting the civil complaint text into model3 to obtain one or more litigation requests (categories) includes: inputting the civil complaint text to be classified into model3 and first classifying it with student model1; if the output category belongs to layer1, outputting that classification result; otherwise, taking the top n categories in the probability ranking of student model1's classification result for the text to determine the corresponding n-way classification task among the m different tasks, and inputting that task into multi-task model2 to obtain the litigation request category of the civil complaint text. The categories in layer1 are the categories whose single-category F1 score for student model1 in the test-set classification results is larger than a set threshold.
For example, a text to be predicted is input into model3 and model1 outputs category 1. If category 1 belongs to layer1, the text does not enter model2 and prediction ends directly; otherwise the text enters model2, and the output of model2 is taken as the final output;
on entering model2, the corresponding task is selected by the ordered top n categories in the final probability ranking of model1 (corresponding to T).
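The routing described above can be sketched with stand-in models. Everything here is hypothetical: `toy_model1` and the entries of `toy_heads` are stubs in place of the trained student model and multi-task model, and falling back to model1's prediction when no task matches the top-n tuple is an assumption the text does not address.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

Task = Tuple[int, ...]


def predict_two_layer(
    text: str,
    model1: Callable[[str], Sequence[float]],
    model2_heads: Dict[Task, Callable[[str], int]],
    layer1: Set[int],
    n: int,
) -> int:
    """Route a text through model1 and, if needed, the matching model2 task."""
    probs = model1(text)
    ranked = tuple(sorted(range(len(probs)), key=lambda c: probs[c], reverse=True))
    top1 = ranked[0]
    if top1 in layer1:
        return top1  # model1 is trusted for this category; stop here
    head = model2_heads.get(ranked[:n])  # task keyed by the ordered top-n categories
    # Assumption: keep model1's answer when no task was trained for this tuple.
    return head(text) if head is not None else top1


def toy_model1(text: str) -> Sequence[float]:
    """Stub student model over three categories."""
    return [0.7, 0.2, 0.1] if "divorce" in text else [0.1, 0.5, 0.4]


# Stub multi-task model: one 2-way task that separates categories 1 and 2.
toy_heads = {(1, 2): lambda text: 2 if "interest" in text else 1}

print(predict_two_layer("request divorce", toy_model1, toy_heads, {0}, 2))               # 0
print(predict_two_layer("repay principal and interest", toy_model1, toy_heads, {0}, 2))  # 2
print(predict_two_layer("repay the loan", toy_model1, toy_heads, {0}, 2))                # 1
```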
A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the above method.
A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the above method.
Compared with the prior art, the complaint recognition method based on semi-supervised learning and double-layer multi-classification provided by the invention has the following beneficial effects:
Advantage 1: compared with existing data enhancement techniques, the labeled data are enhanced accurately along several dimensions, which makes the teacher model more robust.
Advantage 2: the pseudo-labeled data are screened with adaptive thresholds, so that the pseudo-labeled data make better use of the value of the unlabeled data.
Advantage 3: a multi-task model layer is added after the student model to form a double-layer combined model. The subtasks of the second-layer multi-task model correspond to the categories that the student model classifies poorly, so each such category is further trained and classified directly against the n-1 categories with which it is most easily confused. This markedly improves the case where a single model classifies some categories poorly, and achieves fast recognition and extraction of complaints when labeled data are scarce, as is currently the case in the legal field.
Drawings
FIG. 1 is a flow chart of labeled-data enhancement.
FIG. 2 is a general flow chart of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention is described fully below with reference to the drawings.
The overall algorithm of the invention is shown in FIG. 2. The complaint recognition method based on semi-supervision and double-layer multi-classification comprises the following steps:
S1, acquiring a training text set comprising a small labeled data set and a large unlabeled data set;
S2, after data enhancement (see FIG. 1), inputting the labeled data into a teacher model for training to obtain the trained teacher model and an adaptive threshold for each category; adding pseudo labels to the unlabeled data with the teacher model, and screening the pseudo-labeled data with the adaptive thresholds;
S3, using a clearlabel tool to further remove noisy data and difficult samples from the data retained by the screening in S2;
S4, training a student model on the samples obtained from the further processing in S3 together with the data-enhanced labeled data from S1 to obtain student model1;
S5, analyzing the predictions of model1 on the test set: for each category whose single-category F1 score is below a fixed threshold, taking all misclassified samples under that category and selecting the top n categories in the argmax() output value ranking of each such sample, thereby forming m different n-way classification tasks; training a multi-task model on these m n-way classification tasks to obtain model2;
S6, connecting model1 and model2 in series to form a double-layer model3;
S7, inputting a civil complaint text into model3 to obtain one or more litigation requests (categories) in the complaint.
The method for data enhancement of the labeled data set in S2 includes:
extracting the times, places, person names and amounts in the complaint data through a named entity recognition model trained in the legal-specific (or general) domain. Times are processed by applying partial addition and subtraction operations to the extracted time; places are replaced with entries from an existing place lexicon in the field; person names are replaced with entries from an existing name library in the field; and amounts are processed by applying partial addition and subtraction operations to the extracted amount;
using simple rules, the words in the complaint data that correspond to the keywords of the complaint classification labels are replaced with their synonyms from an existing synonym library in the legal field;
the remaining data, not covered by the two steps above, are generalized through vocabulary replacement;
the groups of data produced by the three steps above are cross-combined to expand the data volume at a ratio of 1:10, and the expanded data are taken as the final enhanced labeled data.
The adaptive threshold of each category in S2 is determined as follows: during teacher model training in S2, after each round the mean predicted probability of all correctly inferred validation-set samples is computed for each category, and the maximum of these per-category means over the multiple training rounds is taken as the adaptive threshold of that category.
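Written as a formula, the rule above might read as follows. The notation is introduced here only for illustration and does not appear in the patent: $E$ is the number of training rounds, $V_c^{(e)}$ is the set of validation samples of category $c$ that the teacher model classifies correctly after round $e$, and $p^{(e)}(c \mid x)$ is the probability the teacher assigns to category $c$ for sample $x$ after that round.

```latex
\mu_c^{(e)} = \frac{1}{\lvert V_c^{(e)} \rvert} \sum_{x \in V_c^{(e)}} p^{(e)}(c \mid x),
\qquad
\tau_c = \max_{1 \le e \le E} \mu_c^{(e)}
```

Here $\tau_c$ is the adaptive threshold of category $c$; under this reading, a pseudo-labeled sample predicted as $c$ would be retained when its predicted probability reaches $\tau_c$.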
The method of obtaining model2 in S5 includes:
wherein T denotes a subtask of the multi-task model, and the m tasks T are mutually different;
the superscript c of t denotes a single category in complaint recognition, the subscript of t indexes the categories of the n-way classification, and the n categories t are ordered;
the data of the m tasks (input data from model1) are used as the input of the multi-task model to train multi-task model2.
The method of connecting model1 and model2 in series to form the double-layer model3 in S6 includes:
wherein layer1 denotes the categories whose single-category F1 score for model1 on the test set is greater than a fixed threshold, with k <= L, where L is the number of categories in the complaint multi-classification;
layer2 denotes the categories whose single-category F1 score for model1 on the test set is smaller than the fixed threshold, for which the multi-task model is needed;
layering is based on whether the single-category F1 score of model1 on the test set is smaller than the fixed threshold: if the score is larger than the threshold, the data pass only through the first layer (layer1); otherwise they pass through both the first layer (layer1) and the second layer (layer2).
The method in S7 for inputting the civil complaint text into model3 to obtain one or more litigation requests (categories) includes:
inputting a text to be predicted into model3, where model1 outputs category 1; if category 1 belongs to layer1, the text does not enter model2 and prediction ends directly; otherwise the text enters model2, and the output of model2 is taken as the final output;
on entering model2, the corresponding task is selected by the ordered top n categories in the final argmax() output value ranking of model1 (corresponding to T).
It is to be understood that both the foregoing general description and the detailed description of the present invention are intended to provide an overview of the invention as claimed. It will be appreciated by those skilled in the art that various alternatives, modifications and variations may be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims (5)

1. A complaint recognition method based on semi-supervision and double-layer multi-classification, comprising the following steps:
1) Acquiring a training text set, comprising a labeled data set containing a small amount of labeled litigation request data and an unlabeled data set containing a large amount of unlabeled litigation request data;
2) Performing data enhancement on the labeled litigation request data in the labeled data set, inputting the enhanced data into a teacher model, and training the teacher model to obtain an adaptive threshold for each labeled category; classifying each piece of unlabeled litigation request data in the unlabeled data set with the trained teacher model, and taking the predicted category of each piece as its pseudo label to obtain pseudo-labeled data; then screening the pseudo-labeled data of each category with that category's adaptive threshold;
3) Training a student model on the pseudo-labeled data retained by the screening in step 2) and the data-enhanced labeled litigation request data to obtain student model1;
4) Classifying the data in the test set with student model1 and computing the F1 score of each category from the classification results; if the F1 score of a category is lower than a set threshold, taking the top n categories in the probability ranking of every misclassified sample under that category to form m different n-way classification tasks, and training a multi-task model on these n-way classification tasks to obtain multi-task model2;
5) Connecting student model1 and multi-task model2 in series to form a double-layer model3;
6) Inputting a civil complaint text to be classified into model3 and first classifying it with student model1; if the output category belongs to layer1, outputting the classification result; otherwise, taking the top n categories in the probability ranking of student model1's classification result for the text to determine the corresponding n-way classification task among the m different tasks, and inputting that task into multi-task model2 to obtain the litigation request category of the civil complaint text; wherein the categories in layer1 are the categories whose single-category F1 score for student model1 in the test-set classification results is larger than a set threshold.
2. The method of claim 1, wherein the method of data enhancement of the labeled data set comprises:
21) For each piece of labeled litigation request data i in the labeled data set, extracting the time, place, person name and amount in the labeled litigation request data i through a named entity recognition model in the legal field; then applying partial addition and subtraction operations to the extracted time to obtain the processed time, replacing the extracted place using an existing place lexicon in the legal field to obtain the processed place, replacing the extracted person name using an existing name library in the legal field to obtain the processed person name, and applying partial addition and subtraction operations to the extracted amount to obtain the processed amount;
22) Searching an existing synonym library in the legal field for synonyms of the label words in the labeled litigation request data i;
23) Searching for synonyms or near-synonyms of the words in the labeled litigation request data i other than the time, place, person name, amount and label words;
24) Cross-combining the results obtained in steps 21) to 23) and replacing the corresponding information in the labeled litigation request data i to obtain the enhanced labeled data of the labeled litigation request data i.
3. The method according to claim 2, wherein the adaptive threshold of each labeled category is obtained as follows: training the teacher model for multiple rounds on the labeled litigation request data; after each round of training, classifying the samples in the validation set with the teacher model and computing, for each category, the mean of the probability values of the correctly classified samples; and taking, for each category, the maximum of the means computed over the multiple rounds of training as the adaptive threshold of that category.
4. A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1 to 3.
5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 3.
CN202310171687.4A 2023-02-27 2023-02-27 Complaint identification method based on semi-supervision and double-layer multi-classification Active CN116304033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310171687.4A CN116304033B (en) 2023-02-27 2023-02-27 Complaint identification method based on semi-supervision and double-layer multi-classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310171687.4A CN116304033B (en) 2023-02-27 2023-02-27 Complaint identification method based on semi-supervision and double-layer multi-classification

Publications (2)

Publication Number Publication Date
CN116304033A CN116304033A (en) 2023-06-23
CN116304033B CN116304033B (en) 2023-11-03

Family

ID=86817899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310171687.4A Active CN116304033B (en) 2023-02-27 2023-02-27 Complaint identification method based on semi-supervision and double-layer multi-classification

Country Status (1)

Country Link
CN (1) CN116304033B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372819B (en) * 2023-12-07 2024-02-20 神思电子技术股份有限公司 Target detection increment learning method, device and medium for limited model space

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829058A (en) * 2019-01-17 2019-05-31 西北大学 A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning
CN112966701A (en) * 2019-12-12 2021-06-15 北京沃东天骏信息技术有限公司 Method and device for classifying objects
CN114399683A (en) * 2022-01-18 2022-04-26 南京甄视智能科技有限公司 End-to-end semi-supervised target detection method based on improved yolov5

Also Published As

Publication number Publication date
CN116304033A (en) 2023-06-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant