CN112685999A - Intelligent grading labeling method - Google Patents

Intelligent grading labeling method

Info

Publication number: CN112685999A
Authority: CN (China)
Prior art keywords: label, model, data, primary, sample
Prior art date: 2021-01-20
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Application number: CN202110073101.1A
Other languages: Chinese (zh)
Inventors: 赵志航, 张睿智, 尹旭, 翟盛龙, 朱亚静
Current Assignee: Inspur Cloud Information Technology Co Ltd
Original Assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2021-04-20
Application filed by: Inspur Cloud Information Technology Co Ltd
Priority to: CN202110073101.1A
Published as: CN112685999A
Legal status: Pending (current)


Abstract

The invention relates to the technical field of artificial-intelligence text classification and specifically provides an intelligent hierarchical labeling method comprising a model training stage and a model prediction stage. In the model training stage, a labeling model is trained for each primary label, and then models are trained for the secondary labels corresponding to each primary label. In the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met. Compared with the prior art, the method performs hierarchical, multi-label intelligent labeling of structured-data fields according to a user-defined label taxonomy, and label categories, classification thresholds and sample counts can be adjusted as needed to realize intelligent hierarchical labeling.

Description

Intelligent grading labeling method
Technical Field
The invention relates to the technical field of artificial-intelligence text classification, and in particular provides an intelligent hierarchical labeling method.
Background
With the development of artificial intelligence technology, text classification, one of the most classical use scenarios in the field of natural language processing, has evolved from methods based on traditional machine learning to methods based on deep learning. The former centers on manual feature engineering combined with shallow classification models, dividing the text classification problem into a feature-engineering part and a classifier part. Work in that period focused mainly on the distribution of the data, and designing better feature models from the distribution of texts was the mainstream concern.
With the improvement of computing power, neural-network computation is no longer the limiting factor; deep learning has developed rapidly, and new algorithmic models emerge continuously. In the deep-learning era, neural networks can mine features from data automatically, freeing people from complex feature engineering to focus on innovation in algorithmic models and theoretical breakthroughs. How to classify structured-data fields automatically, however, remains an urgent problem for those skilled in the art.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an intelligent hierarchical labeling method of strong practicability.
The technical solution adopted by the invention to solve the technical problem is as follows:
an intelligent hierarchical labeling method comprises a model training stage and a model prediction stage. In the model training stage, a labeling model is trained for each primary label, and then models are trained for the secondary labels corresponding to each primary label;
in the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met.
Further, when the training set, validation set and test set are constructed for each primary label and each secondary label, data bearing the label are used as positive samples and data outside the label as negative samples, with a positive-to-negative sample ratio of 1:1; a minimal sketch of this balanced construction follows.
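The sketch below illustrates one way to build such a 1:1 balanced binary data set; the function name, seed and sampling scheme are illustrative assumptions, not specifics from the patent.

```python
import random

def balanced_binary_dataset(label_rows, other_rows, seed=42):
    """Pair the positive samples (this label's data, marked 1) with an
    equal number of negatives drawn from data outside the label (marked 0)."""
    rng = random.Random(seed)
    n = min(len(label_rows), len(other_rows))
    positives = [(x, 1) for x in rng.sample(list(label_rows), n)]
    negatives = [(x, 0) for x in rng.sample(list(other_rows), n)]
    data = positives + negatives
    rng.shuffle(data)
    return data
```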
further, a word list and hash mapping tables are constructed at single-character granularity with unigrams, bigrams and trigrams as features; each sample is mapped to the corresponding index sequences as input to a fastText model, and the model is trained with the label as target value, yielding a binary classification model for each primary label and each secondary label. A sketch of this n-gram mapping is shown below.
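The following sketch shows one way to realize the character-level n-gram mapping; the hash constants and bucket count follow a common open-source fastText-style implementation for Chinese text and are assumptions, since the patent fixes no concrete values.

```python
def build_vocab(texts):
    """Uni-gram mapping table: each distinct character gets an index."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def bigram_hash(seq, t, buckets):
    """Hash-map the bi-gram ending at position t into a fixed bucket range."""
    prev = seq[t - 1] if t >= 1 else 0
    return (prev * 14918087) % buckets

def trigram_hash(seq, t, buckets):
    """Hash-map the tri-gram ending at position t."""
    prev1 = seq[t - 1] if t >= 1 else 0
    prev2 = seq[t - 2] if t >= 2 else 0
    return (prev2 * 14918087 * 18408749 + prev1 * 14918087) % buckets

def to_index_sequences(text, vocab, buckets=250_000):
    """Map one sample to the (uni, bi, tri) index sequences fed to fastText."""
    uni = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    bi = [bigram_hash(uni, t, buckets) for t in range(len(uni))]
    tri = [trigram_hash(uni, t, buckets) for t in range(len(uni))]
    return uni, bi, tri
```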
Further, in the model prediction stage, the field data are sampled and the sampled data are input into each primary classification model; the data are labeled when the model's threshold condition is met, then input in turn into each secondary-label classification model corresponding to that primary label, and labeled when the corresponding label's threshold condition is met.
Further, in the model training stage, a primary-label classification data set is prepared: data bearing the label are positive samples marked 1, data outside the label are negative samples marked 0, and the positive-to-negative sample ratio is 1:1;
a vocabulary is built from the samples to obtain a uni-gram mapping table at single-character granularity, and bi-grams and tri-grams are hash-mapped to obtain bi-gram and tri-gram mapping tables;
each sample is processed, at character granularity, into a uni-gram index sequence, a bi-gram index sequence, a tri-gram index sequence and a label; the index-sequence length seq_len is set to 32, sequences longer than 32 keep only the first 32 indices, and shorter sequences are padded;
the data set is divided into a training set, a validation set and a test set at a ratio of 8:1:1;
and the secondary-label classification data sets are processed in the same way as the primary-label classification data set. A sketch of the padding and splitting steps follows this list.
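A minimal sketch of the truncation/padding and the 8:1:1 split described above; seq_len=32 and the split ratio come from the text, while the helper names and the seed are illustrative assumptions.

```python
import random

def pad_or_truncate(seq, seq_len=32, pad_idx=0):
    """Keep only the first 32 indices of longer sequences; right-pad shorter ones."""
    return seq[:seq_len] + [pad_idx] * max(0, seq_len - len(seq))

def split_8_1_1(examples, seed=42):
    """Shuffle, then split into training/validation/test sets at 8:1:1."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```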
Preferably, the model training stage uses the Adam optimizer together with Kaiming initialization.
Further, in the model training stage, batch_size is set to 128; every 100 batches the loss, accuracy, recall and F1 value are recorded, the validation set is used for verification, and the loss, accuracy, recall and F1 value of the model on the validation set are recorded;
if the accuracy has not improved after 1000 batches, training is stopped and the final model is saved. A sketch of this schedule is given below.
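A hedged PyTorch sketch of the training schedule just described, assuming each dataset item is a pair of (uni, bi, tri) index tensors and a label; the model interface, learning rate and file name are illustrative assumptions, and recall/F1 tracking is elided for brevity.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def accuracy(model, dataset):
    """Plain accuracy on a dataset; recall and F1 are computed analogously."""
    model.eval()
    correct = total = 0
    for inputs, labels in DataLoader(dataset, batch_size=128):
        preds = model(*inputs).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

def train(model, train_set, val_set, max_epochs=20, lr=1e-3):
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, best_state, stale, step = 0.0, None, 0, 0
    for _ in range(max_epochs):
        for inputs, labels in loader:
            model.train()
            optimizer.zero_grad()
            loss = criterion(model(*inputs), labels)
            loss.backward()
            optimizer.step()
            step += 1
            stale += 1
            if step % 100 == 0:          # record metrics every 100 batches
                val_acc = accuracy(model, val_set)
                if val_acc > best_acc:
                    best_acc, best_state, stale = val_acc, model.state_dict(), 0
            if stale > 1000:             # early stop: no gain for 1000 batches
                torch.save(best_state, "fasttext_label_model.pt")
                return model
    torch.save(best_state or model.state_dict(), "fasttext_label_model.pt")
    return model
```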
Furthermore, in the data preprocessing of the model prediction stage, a sample count is set and the field data are sampled to obtain the samples to be predicted; the same data-set preparation as in the model training stage is then applied to obtain the model input data.
Compared with the prior art, the intelligent hierarchical labeling method has the following outstanding beneficial effects:
the invention performs hierarchical, multi-label intelligent labeling of structured-data fields according to a user-defined label taxonomy, and label categories, classification thresholds, sample counts and the like can be adjusted as needed to realize intelligent hierarchical labeling.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A preferred embodiment is given below:
in the embodiment of the intelligent hierarchical labeling method, the whole process includes a model training stage and a model prediction stage.
In the model training stage, each primary-label labeling model is trained first, and then the models for the secondary labels corresponding to each primary label are trained.
In the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met.
The concrete description is as follows: in the model training stage, models for each primary label and each secondary label are trained according to a user-defined two-level label taxonomy. When the training set, validation set and test set are constructed for each label, data bearing the label are taken as positive examples and data of other labels as negative examples, with a positive-to-negative sample ratio of 1:1. A word list and hash mapping tables are constructed at single-character granularity with unigrams, bigrams and trigrams as features, and each sample is mapped to the corresponding index sequences as input to the fastText model. The model is trained with the label as target value, yielding a binary classification model for each primary label and each secondary label.
In the model prediction stage, the field data are sampled and the sampled data are input into each primary classification model; a label is applied when its model's threshold condition is met. The data are then input in turn into each secondary-label classification model corresponding to that primary label, and a label is applied when its threshold condition is met. This yields the primary and secondary labels of the structured field.
The detailed implementation comprises the following steps:
(1) Data set preparation:
preparing a first-level label classification data set, wherein the label data is used as a positive sample, the label is marked as 1, the data outside the label is used as a negative sample, the label is marked as 0, and the quantity ratio of the positive sample to the negative sample is 1: 1; constructing a word table according to the sample to obtain a uni-gram mapping table on a single word granularity, and performing hash mapping on the bi-gram and the tri-gram respectively to obtain mapping tables of the bi-gram and the tri-gram; processing each sample into a uni-gram index sequence, a bi-gram index sequence, a tri-gram index sequence and a label on word granularity, setting the length seq _ len of each sample index sequence to be 32, only reserving the first 32 bits with the length being more than 32, and filling the samples with the length being less than 32; the data set is divided into a training set, a validation set and a test set according to the ratio of 8:1: 1. Similarly, a secondary label classification dataset is prepared.
Model construction: build the fastText model structure.
Optimizer: adopt the Adam optimizer and Kaiming initialization. A PyTorch-style sketch of the model, its initialization and the optimizer is given below.
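A minimal PyTorch rendering of such a fastText structure together with Kaiming initialization and the Adam optimizer; the embedding sizes, hidden layer, vocabulary size and learning rate are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class FastText(nn.Module):
    """fastText-style classifier: embed uni/bi/tri-gram indices, average each
    channel over the sequence, concatenate, then classify into 2 classes."""
    def __init__(self, vocab_size, n_buckets=250_000, embed_dim=300,
                 hidden_dim=256, n_classes=2, pad_idx=0):
        super().__init__()
        self.embed_uni = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.embed_bi = nn.Embedding(n_buckets, embed_dim)
        self.embed_tri = nn.Embedding(n_buckets, embed_dim)
        self.fc1 = nn.Linear(embed_dim * 3, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, uni, bi, tri):
        pooled = torch.cat([self.embed_uni(uni).mean(dim=1),
                            self.embed_bi(bi).mean(dim=1),
                            self.embed_tri(tri).mean(dim=1)], dim=1)
        return self.fc2(torch.relu(self.fc1(pooled)))

def kaiming_init(module):
    """Apply Kaiming initialization to the linear layers."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight)
        nn.init.zeros_(module.bias)

model = FastText(vocab_size=5000)
model.apply(kaiming_init)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```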
Model training: set batch_size to 128; record the loss, accuracy, recall and F1 value every 100 batches, validate with the validation set, and record the model's loss, accuracy, recall and F1 value on the validation set. If the accuracy has not improved after 1000 batches, stop training and save the final model.
Model testing: input the test-set data into the trained model to obtain the loss, accuracy, recall, F1 value and the like. A sketch of the metric computation follows.
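A minimal sketch of computing the reported metrics for one binary model from its predictions; the function name is illustrative and loss is omitted since it comes from the criterion above.

```python
def binary_metrics(preds, labels):
    """Accuracy, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    acc = sum(1 for p, y in zip(preds, labels) if p == y) / max(len(labels), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return acc, recall, f1
```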
(2) Model prediction stage:
data preprocessing:
and setting the number of samples (such as 512) to sample the field data to obtain samples to be predicted (the labels of the samples obtained through model prediction are labels of the field). And preparing a data set in the same model training stage, and processing the data to obtain model input data.
Primary-model prediction:
Input the processed data in turn into each primary-label prediction model; if the set threshold condition is met, apply that label.
Secondary-model prediction:
Input the data in turn into the secondary-label prediction models corresponding to each primary label obtained from the primary-model prediction; if a secondary label's threshold condition is met, apply that secondary label. This yields the two-level classification labels of the structured data. A hedged sketch of the full cascade is given below.
The intelligent hierarchical labeling method thus performs hierarchical, multi-label intelligent labeling of structured-data fields according to a user-defined label taxonomy; label categories, classification thresholds, sample counts and the like can be adjusted as needed to realize intelligent hierarchical labeling.
The above embodiments are only specific cases of the present invention, and the protection scope of the present invention includes but is not limited to the above embodiments, and any suitable changes or substitutions that are consistent with the claims of the intelligent hierarchical labeling method of the present invention and are made by those skilled in the art should fall within the protection scope of the present invention.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. An intelligent grading labeling method, characterized by comprising a model training stage and a model prediction stage, wherein in the model training stage a labeling model is trained for each primary label, and then models are trained for the secondary labels corresponding to each primary label;
in the model prediction stage, data are input into each primary-label labeling model in turn; when the threshold condition of the corresponding model is met, the primary label is applied, and the data are then input in turn into the models of the secondary labels corresponding to that label, a secondary label being applied when its threshold condition is met.
2. The intelligent grading labeling method according to claim 1, wherein when the training set, validation set and test set are constructed for each primary label and each secondary label, data bearing the label are used as positive samples and data outside the label as negative samples, with a positive-to-negative sample ratio of 1:1.
3. The intelligent grading labeling method according to claim 2, wherein a word list and hash mapping tables are constructed at single-character granularity with unigrams, bigrams and trigrams as features, each sample is mapped to the corresponding index sequences as input to a fastText model, and the model is trained with the label as target value, yielding a binary classification model for each primary label and each secondary label.
4. The intelligent grading labeling method according to claim 3, wherein in the model prediction stage the field data are sampled, the sampled data are input into each primary classification model and labeled when the model's threshold condition is met, and the data are then input in turn into each secondary-label classification model corresponding to that primary label and labeled when the corresponding label's threshold condition is met.
5. The intelligent grading labeling method according to claim 4, wherein in the model training stage a primary-label classification data set is prepared, in which data bearing the label are positive samples marked 1, data outside the label are negative samples marked 0, and the positive-to-negative sample ratio is 1:1;
a vocabulary is built from the samples to obtain a uni-gram mapping table at single-character granularity, and bi-grams and tri-grams are hash-mapped to obtain bi-gram and tri-gram mapping tables;
each sample is processed, at character granularity, into a uni-gram index sequence, a bi-gram index sequence, a tri-gram index sequence and a label, with the index-sequence length seq_len set to 32, sequences longer than 32 keeping only the first 32 indices and shorter sequences being padded;
the data set is divided into a training set, a validation set and a test set at a ratio of 8:1:1;
and the secondary-label classification data sets are processed in the same way as the primary-label classification data set.
6. The intelligent grading labeling method according to claim 5, wherein the model training stage uses the Adam optimizer together with Kaiming initialization.
7. The intelligent grading labeling method according to claim 6, wherein in the model training stage batch_size is set to 128, the loss, accuracy, recall and F1 value are recorded every 100 batches, the validation set is used for verification, and the loss, accuracy, recall and F1 value of the model on the validation set are recorded;
and if the accuracy has not improved after 1000 batches, training is stopped and the final model is saved.
8. The intelligent grading labeling method according to claim 7, wherein in the data preprocessing of the model prediction stage, a sample count is set and the field data are sampled to obtain the samples to be predicted, the same data-set preparation as in the model training stage is performed, and the data are processed to obtain the model input data.
CN202110073101.1A (filed 2021-01-20, priority date 2021-01-20) — Intelligent grading labeling method — Pending — published as CN112685999A

Priority Applications (1)

CN202110073101.1A — priority date 2021-01-20, filing date 2021-01-20 — Intelligent grading labeling method

Applications Claiming Priority (1)

CN202110073101.1A — priority date 2021-01-20, filing date 2021-01-20 — Intelligent grading labeling method

Publications (1)

CN112685999A — published 2021-04-20

Family

ID=75458647

Family Applications (1)

CN202110073101.1A — Intelligent grading labeling method — Pending — published as CN112685999A

Country Status (1)

CN: CN112685999A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3144821A1 (en) * 2015-09-18 2017-03-22 Frank Sax Method for generating electronic documents
CN110490221A (en) * 2019-07-05 2019-11-22 平安科技(深圳)有限公司 Multi-tag classification method, electronic device and computer readable storage medium
CN110427461A (en) * 2019-08-06 2019-11-08 腾讯科技(深圳)有限公司 Intelligent answer information processing method, electronic equipment and computer readable storage medium
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment
CN112000779A (en) * 2020-10-29 2020-11-27 北京值得买科技股份有限公司 Automatic review and labeling system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-04-20)