CN112685999A - Intelligent grading labeling method - Google Patents

Intelligent grading labeling method

Info

Publication number: CN112685999A
Authority: CN (China)
Prior art keywords: label, model, data, primary, sample
Prior art date: 2021-01-20
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Application number: CN202110073101.1A
Other languages: Chinese (zh)
Inventors: 赵志航, 张睿智, 尹旭, 翟盛龙, 朱亚静
Current Assignee: Inspur Cloud Information Technology Co Ltd
Original Assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2021-04-20
Application filed by: Inspur Cloud Information Technology Co Ltd
Priority to: CN202110073101.1A
Published as: CN112685999A
Legal status: Pending (current)


Abstract

The invention relates to the technical field of artificial-intelligence text classification and specifically provides an intelligent hierarchical labeling method comprising a model training stage and a model prediction stage. In the model training stage, a labeling model is trained for each primary label, and then models are trained for the secondary labels corresponding to each primary label. In the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met. Compared with the prior art, the method performs hierarchical, multi-label intelligent labeling of structured-data fields according to a user-defined label taxonomy, and label categories, classification thresholds and sample counts can be adjusted as needed to realize intelligent hierarchical labeling.

Description

Intelligent grading labeling method
Technical Field
The invention relates to the technical field of artificial-intelligence text classification, and in particular provides an intelligent hierarchical labeling method.
Background
With the development of artificial intelligence technology, text classification, one of the most classical use scenarios in the field of natural language processing, has evolved from methods based on traditional machine learning to methods based on deep learning. The former centers on manual feature engineering combined with shallow classification models, dividing the text classification problem into a feature-engineering part and a classifier part. Work in that period focused mainly on the distribution of the data, and designing better feature models from the distribution of texts was the mainstream concern.
With the improvement of computing power, neural-network computation is no longer the limiting factor; deep learning has developed rapidly, and new algorithmic models emerge continuously. In the deep-learning era, neural networks can mine features from data automatically, freeing people from complex feature engineering to focus on innovation in algorithmic models and theoretical breakthroughs. How to classify structured-data fields automatically, however, remains an urgent problem for those skilled in the art.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an intelligent hierarchical labeling method of strong practicability.
The technical solution adopted by the invention to solve the technical problem is as follows:
an intelligent hierarchical labeling method comprises a model training stage and a model prediction stage. In the model training stage, a labeling model is trained for each primary label, and then models are trained for the secondary labels corresponding to each primary label;
in the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met.
Further, when the training set, validation set and test set are constructed for each primary label and each secondary label, data bearing the label are used as positive samples and data outside the label as negative samples, with a positive-to-negative sample ratio of 1:1; a minimal sketch of this balanced construction follows.
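The sketch below illustrates one way to build such a 1:1 balanced binary data set; the function name, seed and sampling scheme are illustrative assumptions, not specifics from the patent.

```python
import random

def balanced_binary_dataset(label_rows, other_rows, seed=42):
    """Pair the positive samples (this label's data, marked 1) with an
    equal number of negatives drawn from data outside the label (marked 0)."""
    rng = random.Random(seed)
    n = min(len(label_rows), len(other_rows))
    positives = [(x, 1) for x in rng.sample(list(label_rows), n)]
    negatives = [(x, 0) for x in rng.sample(list(other_rows), n)]
    data = positives + negatives
    rng.shuffle(data)
    return data
```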
further, a word list and hash mapping tables are constructed at single-character granularity with unigrams, bigrams and trigrams as features; each sample is mapped to the corresponding index sequences as input to a fastText model, and the model is trained with the label as target value, yielding a binary classification model for each primary label and each secondary label. A sketch of this n-gram mapping is shown below.
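The following sketch shows one way to realize the character-level n-gram mapping; the hash constants and bucket count follow a common open-source fastText-style implementation for Chinese text and are assumptions, since the patent fixes no concrete values.

```python
def build_vocab(texts):
    """Uni-gram mapping table: each distinct character gets an index."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def bigram_hash(seq, t, buckets):
    """Hash-map the bi-gram ending at position t into a fixed bucket range."""
    prev = seq[t - 1] if t >= 1 else 0
    return (prev * 14918087) % buckets

def trigram_hash(seq, t, buckets):
    """Hash-map the tri-gram ending at position t."""
    prev1 = seq[t - 1] if t >= 1 else 0
    prev2 = seq[t - 2] if t >= 2 else 0
    return (prev2 * 14918087 * 18408749 + prev1 * 14918087) % buckets

def to_index_sequences(text, vocab, buckets=250_000):
    """Map one sample to the (uni, bi, tri) index sequences fed to fastText."""
    uni = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    bi = [bigram_hash(uni, t, buckets) for t in range(len(uni))]
    tri = [trigram_hash(uni, t, buckets) for t in range(len(uni))]
    return uni, bi, tri
```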
Further, in the model prediction stage, the field data are sampled and the sampled data are input into each primary classification model; the data are labeled when the model's threshold condition is met, then input in turn into each secondary-label classification model corresponding to that primary label, and labeled when the corresponding label's threshold condition is met.
Further, in the model training stage, a primary-label classification data set is prepared: data bearing the label are positive samples marked 1, data outside the label are negative samples marked 0, and the positive-to-negative sample ratio is 1:1;
a vocabulary is built from the samples to obtain a uni-gram mapping table at single-character granularity, and bi-grams and tri-grams are hash-mapped to obtain bi-gram and tri-gram mapping tables;
each sample is processed, at character granularity, into a uni-gram index sequence, a bi-gram index sequence, a tri-gram index sequence and a label; the index-sequence length seq_len is set to 32, sequences longer than 32 keep only the first 32 indices, and shorter sequences are padded;
the data set is divided into a training set, a validation set and a test set at a ratio of 8:1:1;
and the secondary-label classification data sets are processed in the same way as the primary-label classification data set. A sketch of the padding and splitting steps follows this list.
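A minimal sketch of the truncation/padding and the 8:1:1 split described above; seq_len=32 and the split ratio come from the text, while the helper names and the seed are illustrative assumptions.

```python
import random

def pad_or_truncate(seq, seq_len=32, pad_idx=0):
    """Keep only the first 32 indices of longer sequences; right-pad shorter ones."""
    return seq[:seq_len] + [pad_idx] * max(0, seq_len - len(seq))

def split_8_1_1(examples, seed=42):
    """Shuffle, then split into training/validation/test sets at 8:1:1."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```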
Preferably, the model training stage uses the Adam optimizer together with Kaiming initialization.
Further, in the model training stage, batch_size is set to 128; every 100 batches the loss, accuracy, recall and F1 value are recorded, the validation set is used for verification, and the loss, accuracy, recall and F1 value of the model on the validation set are recorded;
if the accuracy has not improved after 1000 batches, training is stopped and the final model is saved. A sketch of this schedule is given below.
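A hedged PyTorch sketch of the training schedule just described, assuming each dataset item is a pair of (uni, bi, tri) index tensors and a label; the model interface, learning rate and file name are illustrative assumptions, and recall/F1 tracking is elided for brevity.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def accuracy(model, dataset):
    """Plain accuracy on a dataset; recall and F1 are computed analogously."""
    model.eval()
    correct = total = 0
    for inputs, labels in DataLoader(dataset, batch_size=128):
        preds = model(*inputs).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

def train(model, train_set, val_set, max_epochs=20, lr=1e-3):
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, best_state, stale, step = 0.0, None, 0, 0
    for _ in range(max_epochs):
        for inputs, labels in loader:
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(*inputs), labels)
            loss.backward()
            optimizer.step()
            step += 1
            stale += 1
            if step % 100 == 0:          # record metrics every 100 batches
                val_acc = accuracy(model, val_set)
                if val_acc > best_acc:
                    best_acc, best_state, stale = val_acc, model.state_dict(), 0
            if stale > 1000:             # early stop: no gain for 1000 batches
                torch.save(best_state, "fasttext_label_model.pt")
                return model
    torch.save(best_state or model.state_dict(), "fasttext_label_model.pt")
    return model
```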
Furthermore, in the data preprocessing of the model prediction stage, a sample count is set and the field data are sampled to obtain the samples to be predicted; the same data-set preparation as in the model training stage is then applied to obtain the model input data.
Compared with the prior art, the intelligent hierarchical labeling method has the following outstanding beneficial effects:
the invention performs hierarchical, multi-label intelligent labeling of structured-data fields according to a user-defined label taxonomy, and label categories, classification thresholds, sample counts and the like can be adjusted as needed to realize intelligent hierarchical labeling.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
in the embodiment of the intelligent hierarchical labeling method, the whole process includes a model training stage and a model prediction stage.
In the model training stage, each primary-label labeling model is trained first, and then the models for the secondary labels corresponding to each primary label are trained.
In the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met.
The concrete description is as follows: in the model training stage, models for each primary label and each secondary label are trained according to a user-defined two-level label taxonomy. When the training set, validation set and test set are constructed for each label, data bearing the label are taken as positive examples and data of other labels as negative examples, with a positive-to-negative sample ratio of 1:1. A word list and hash mapping tables are constructed at single-character granularity with unigrams, bigrams and trigrams as features, and each sample is mapped to the corresponding index sequences as input to the fastText model. The model is trained with the label as target value, yielding a binary classification model for each primary label and each secondary label.
In the model prediction stage, the field data are sampled and the sampled data are input into each primary classification model; a label is applied when its model's threshold condition is met. The data are then input in turn into each secondary-label classification model corresponding to that primary label, and a label is applied when its threshold condition is met. This yields the primary and secondary labels of the structured field.
The detailed implementation comprises the following steps:
(1) Data set preparation:
preparing a first-level label classification data set, wherein the label data is used as a positive sample, the label is marked as 1, the data outside the label is used as a negative sample, the label is marked as 0, and the quantity ratio of the positive sample to the negative sample is 1: 1; constructing a word table according to the sample to obtain a uni-gram mapping table on a single word granularity, and performing hash mapping on the bi-gram and the tri-gram respectively to obtain mapping tables of the bi-gram and the tri-gram; processing each sample into a uni-gram index sequence, a bi-gram index sequence, a tri-gram index sequence and a label on word granularity, setting the length seq _ len of each sample index sequence to be 32, only reserving the first 32 bits with the length being more than 32, and filling the samples with the length being less than 32; the data set is divided into a training set, a validation set and a test set according to the ratio of 8:1: 1. Similarly, a secondary label classification dataset is prepared.
Model construction: build the fastText model structure.
Optimizer: adopt the Adam optimizer and Kaiming initialization. A PyTorch-style sketch of the model, its initialization and the optimizer is given below.
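A minimal PyTorch rendering of such a fastText structure together with Kaiming initialization and the Adam optimizer; the embedding sizes, hidden layer, vocabulary size and learning rate are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class FastText(nn.Module):
    """fastText-style classifier: embed uni/bi/tri-gram indices, average each
    channel over the sequence, concatenate, then classify into 2 classes."""
    def __init__(self, vocab_size, n_buckets=250_000, embed_dim=300,
                 hidden_dim=256, n_classes=2, pad_idx=0):
        super().__init__()
        self.embed_uni = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.embed_bi = nn.Embedding(n_buckets, embed_dim)
        self.embed_tri = nn.Embedding(n_buckets, embed_dim)
        self.fc1 = nn.Linear(embed_dim * 3, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, uni, bi, tri):
        pooled = torch.cat([self.embed_uni(uni).mean(dim=1),
                            self.embed_bi(bi).mean(dim=1),
                            self.embed_tri(tri).mean(dim=1)], dim=1)
        return self.fc2(torch.relu(self.fc1(pooled)))

def kaiming_init(module):
    """Apply Kaiming initialization to the linear layers."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight)
        nn.init.zeros_(module.bias)

model = FastText(vocab_size=5000)
model.apply(kaiming_init)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```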
Model training: set batch_size to 128; record the loss, accuracy, recall and F1 value every 100 batches, validate with the validation set, and record the model's loss, accuracy, recall and F1 value on the validation set. If the accuracy has not improved after 1000 batches, stop training and save the final model.
Model testing: input the test-set data into the trained model to obtain the loss, accuracy, recall, F1 value and the like. A sketch of the metric computation follows.
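A minimal sketch of computing the reported metrics for one binary model from its predictions; the function name is illustrative and loss is omitted since it comes from the criterion above.

```python
def binary_metrics(preds, labels):
    """Accuracy, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    acc = sum(1 for p, y in zip(preds, labels) if p == y) / max(len(labels), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return acc, recall, f1
```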
(2) Model prediction stage:
data preprocessing:
and setting the number of samples (such as 512) to sample the field data to obtain samples to be predicted (the labels of the samples obtained through model prediction are labels of the field). And preparing a data set in the same model training stage, and processing the data to obtain model input data.
Primary-model prediction:
Input the processed data in turn into each primary-label prediction model; if the set threshold condition is met, apply that label.
Secondary-model prediction:
Input the data in turn into the secondary-label prediction models corresponding to each primary label obtained from the primary-model prediction; if a secondary label's threshold condition is met, apply that secondary label. This yields the two-level classification labels of the structured data. A hedged sketch of the full cascade is given below.
The intelligent hierarchical labeling method thus performs hierarchical, multi-label intelligent labeling of structured-data fields according to a user-defined label taxonomy; label categories, classification thresholds, sample counts and the like can be adjusted as needed to realize intelligent hierarchical labeling.
The above embodiments are only specific cases of the present invention, and the protection scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of the intelligent hierarchical labeling method of the present invention and are made by those skilled in the art should fall within the protection scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. An intelligent grading labeling method, characterized by comprising a model training stage and a model prediction stage, wherein in the model training stage a labeling model is trained for each primary label, and then models are trained for the secondary labels corresponding to each primary label;
in the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met.
2. The intelligent grading labeling method according to claim 1, wherein when the training set, validation set and test set are constructed for each primary label and each secondary label, data bearing the label are used as positive samples and data outside the label as negative samples, with a positive-to-negative sample ratio of 1:1.
3. The intelligent grading labeling method according to claim 2, wherein a word list and hash mapping tables are constructed at single-character granularity with unigrams, bigrams and trigrams as features, each sample is mapped to the corresponding index sequences as input to a fastText model, and the model is trained with the label as target value, yielding a binary classification model for each primary label and each secondary label.
4. The intelligent grading labeling method according to claim 3, wherein in the model prediction stage the field data are sampled, the sampled data are input into each primary classification model and labeled when the model's threshold condition is met, and the data are then input in turn into each secondary-label classification model corresponding to that primary label and labeled when the corresponding label's threshold condition is met.
5. The intelligent grading labeling method according to claim 4, wherein in the model training stage a primary-label classification data set is prepared, in which data bearing the label are positive samples marked 1, data outside the label are negative samples marked 0, and the positive-to-negative sample ratio is 1:1;
a vocabulary is built from the samples to obtain a uni-gram mapping table at single-character granularity, and bi-grams and tri-grams are hash-mapped to obtain bi-gram and tri-gram mapping tables;
each sample is processed, at character granularity, into a uni-gram index sequence, a bi-gram index sequence, a tri-gram index sequence and a label, with the index-sequence length seq_len set to 32, sequences longer than 32 keeping only the first 32 indices and shorter sequences being padded;
the data set is divided into a training set, a validation set and a test set at a ratio of 8:1:1;
and the secondary-label classification data sets are processed in the same way as the primary-label classification data set.
6. The intelligent grading labeling method according to claim 5, wherein the model training stage uses the Adam optimizer together with Kaiming initialization.
7. The intelligent grading labeling method according to claim 6, wherein in the model training stage batch_size is set to 128, the loss, accuracy, recall and F1 value are recorded every 100 batches, the validation set is used for verification, and the loss, accuracy, recall and F1 value of the model on the validation set are recorded;
and if the accuracy has not improved after 1000 batches, training is stopped and the final model is saved.
8. The intelligent grading labeling method according to claim 7, wherein in the data preprocessing of the model prediction stage, a sample count is set and the field data are sampled to obtain the samples to be predicted, the same data-set preparation as in the model training stage is performed, and the data are processed to obtain the model input data.
CN202110073101.1A (filed 2021-01-20, priority date 2021-01-20) — Intelligent grading labeling method — Pending — published as CN112685999A

Priority Applications (1)

CN202110073101.1A — priority date 2021-01-20, filing date 2021-01-20 — Intelligent grading labeling method

Applications Claiming Priority (1)

CN202110073101.1A — priority date 2021-01-20, filing date 2021-01-20 — Intelligent grading labeling method

Publications (1)

CN112685999A — published 2021-04-20

Family

ID=75458647

Family Applications (1)

CN202110073101.1A — Intelligent grading labeling method — Pending — published as CN112685999A

Country Status (1)

CN: CN112685999A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3144821A1 (en) * 2015-09-18 2017-03-22 Frank Sax Method for generating electronic documents
CN110490221A (en) * 2019-07-05 2019-11-22 平安科技(深圳)有限公司 Multi-tag classification method, electronic device and computer readable storage medium
CN110427461A (en) * 2019-08-06 2019-11-08 腾讯科技(深圳)有限公司 Intelligent answer information processing method, electronic equipment and computer readable storage medium
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN112000779A (en) * 2020-10-29 2020-11-27 北京值得买科技股份有限公司 Automatic review and labeling system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-04-20)