CN113590764B - Training sample construction method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN113590764B (granted publication of application CN202111132630.0A)
- Authority
- CN
- China
- Prior art keywords
- unlabeled
- sample
- text
- samples
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a training sample construction method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: classifying unlabeled texts based on a trained initial classification model to obtain a classification result for each unlabeled text; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on those classification results; and labeling the difficult samples and/or first candidate sparse samples to obtain training samples. By classifying the unlabeled texts with the trained initial classification model, screening each unlabeled text according to its classification result to obtain difficult samples and first candidate sparse samples, and labeling on the basis of the screening result to obtain training samples, the method can greatly improve the efficiency of training sample construction while effectively increasing the number of sparse samples obtained.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a training sample construction method and device, electronic equipment and a storage medium.
Background
In deep learning projects, the quality of the training set often directly influences the training effect of the model, so building a good training set plays an important role in optimizing model performance.
In practical application scenarios, however, an effective training set accumulates slowly within a specific task: each sample must be manually labeled one by one, which is time-consuming and incurs high labor cost. The problem is especially acute when some classes of training samples are sparsely distributed across the data set: acquiring such sparse samples is very expensive in both labor and time, and the number collected remains very limited. A sparse sample is a sample whose class accounts for only a small proportion of the total number of samples. For example, if data violating platform regulations makes up less than 0.1% of the whole data set, such violating data are sparse samples; to collect enough of them for a training set, annotators must accumulate data over a long period, which is time-consuming, labor-intensive, and costly.
Disclosure of Invention
The invention provides a training sample construction method and apparatus, an electronic device, and a storage medium, which are used to overcome the prior-art defects of high difficulty, low efficiency, and high labor cost in sparse sample construction.
The invention provides a training sample construction method, which comprises the following steps:
classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts;
and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
According to the training sample construction method provided by the invention, the screening of the difficult samples from the unlabeled texts based on the classification results of the unlabeled texts specifically comprises the following steps:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
and screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text.
According to the training sample construction method provided by the invention, the calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically comprises the following steps:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
According to the training sample construction method provided by the invention, based on the classification result of the unlabeled text, a first candidate sparse sample is screened from the unlabeled text, and the method specifically comprises the following steps:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
According to the training sample construction method provided by the invention, the initial classification model is obtained by performing adversarial training on the labeled text.
According to the training sample construction method provided by the invention, the adversarial training on the labeled text specifically comprises the following steps:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the marked text.
The training sample construction method provided by the invention further comprises the following steps:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
The invention also provides a training sample construction device, comprising:
the classification unit is used for classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
the sample screening unit is used for screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and the marking unit is used for marking the difficult sample and/or the first candidate sparse sample to obtain a training sample.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the training sample construction method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the training sample construction method as described in any of the above.
According to the training sample construction method and apparatus, the electronic device, and the storage medium, the trained initial classification model classifies the unlabeled texts; the unlabeled texts are then screened according to each text's classification result to obtain difficult samples and first candidate sparse samples, and labeling is performed on the basis of the screening result to obtain training samples. This can greatly improve the efficiency of training sample construction while effectively increasing the number of sparse samples obtained.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a training sample construction method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an initial classification model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an adversarial training method according to an embodiment of the present invention;
FIG. 4 is a detailed flowchart of a training sample construction method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data screening method provided by an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a training sample constructing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In the embodiment of the present application, the term "and/or" describes an association relationship of associated objects, and means that there may be three relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a training sample construction method and apparatus, an electronic device, and a storage medium, aiming to overcome the prior-art defects that, when a sample set is constructed in a data-sparse situation, each sample must be manually labeled one by one, efficiency is low, and few samples are acquired; the scheme can greatly improve the efficiency of training sample construction while effectively increasing the number of sparse samples acquired.
The method and the device are based on the same application concept, and because the principles of solving the problems of the method and the device are similar, the implementation of the device and the method can be mutually referred, and repeated parts are not repeated.
Fig. 1 is a schematic flowchart of a training sample construction method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
step 110, classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
step 120, screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts;
and step 130, labeling the difficult samples and/or the first candidate sparse samples to obtain training samples.
Specifically, the trained initial classification model is used to classify texts that have not been manually labeled and whose types are not yet known, producing a classification result for each unlabeled text. The initial classification model may be any type of text classification model, such as a combination of a text semantic extraction network (e.g., RoBERTa or BERT) and a classification network (as shown in fig. 2). The classification result of any unlabeled text may include the likelihood that the unlabeled text corresponds to each text type; for example, in an illegal-text classification scenario, the classification result may include the likelihood that the unlabeled text is illegal text and the likelihood that it is non-illegal text.
Difficult samples are then screened from the unlabeled texts according to each text's classification result. Difficult samples are samples that the current initial classification model finds hard to learn, i.e., samples it classifies into the correct type with relatively low probability. The appearance of difficult samples reflects, to a certain extent, the problem of sample imbalance: because the number of sparse samples is small, the initial classification model learns them insufficiently and classifies them poorly, so the difficult samples may contain sparse samples. Screening out the difficult samples to construct training samples therefore increases the number of sparse samples in the training set, improves the training effect of subsequent models, and improves classification accuracy on sparse samples. Even if a difficult sample is not a sparse sample, its learning difficulty is high, and including it in the training set improves the training effect of subsequent models and the classification accuracy for similar samples.
In addition, according to the classification results produced by the initial classification model, the unlabeled texts that the model itself regards as sparse samples can be directly screened out as first candidate sparse samples. For example, according to the probability that each unlabeled text corresponds to the sparse type, the texts with higher sparse-type probability can be selected as first candidate sparse samples. The sparse type is the type corresponding to the sparse samples, i.e., a type whose share of the total number of samples is below a preset threshold. For example, in the illegal-text classification scenario, illegal text accounts for a small proportion, so the illegal-text type is a sparse type. Moreover, even if a first candidate sparse sample turns out not to be a sparse sample, this indicates a classification error by the initial classification model; including it in the training set still improves the training effect of subsequent models and the classification accuracy.
Here, because the initial classification model is not yet fully trained, its classification accuracy is limited and the classification result of each unlabeled text may be inaccurate. Therefore, the true types of the screened difficult samples and first candidate sparse samples can be further labeled to obtain training samples for training the initial classification model or other text classification models.
By classifying the unlabeled texts with the trained initial classification model, the unlabeled texts can be screened according to each text's classification result, yielding difficult samples and first candidate sparse samples that are more likely to be sparse samples; labeling is then performed on the basis of the screening result. This greatly reduces the scope of manual labeling, reduces the manual labeling workload, improves the efficiency of training sample construction, enables sparse samples to be acquired quickly, increases the number of sparse samples in the training set, and thereby improves the training effect and text classification accuracy of the text classification model.
According to the method provided by the embodiment of the invention, the trained initial classification model is used for classifying the unlabeled texts, so that the unlabeled texts are screened according to the classification result of each unlabeled text to obtain the difficult sample and the first candidate sparse sample, and labeling is carried out on the basis of the screening result to obtain the training sample, so that the construction efficiency of the training sample can be greatly improved, and the number of the obtained sparse samples can be effectively increased.
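The classify–screen–label flow described above can be sketched as follows. This is an illustrative sketch only, not the patented implementation; the `classify`, `is_hard`, `is_sparse_candidate`, and `request_label` interfaces are assumed stand-ins for the initial classification model, the two screening criteria, and the manual labeling step.

```python
def build_training_samples(unlabeled_texts, classify,
                           is_hard, is_sparse_candidate, request_label):
    """Classify each unlabeled text with the initial model, screen the
    hard and/or candidate sparse ones, and send only those for labeling."""
    screened = [t for t in unlabeled_texts
                if is_hard(classify(t)) or is_sparse_candidate(classify(t))]
    # Manual labeling is restricted to the screened subset.
    return [(t, request_label(t)) for t in screened]

# Stub model: texts containing "?" get a near-uniform (confusing)
# distribution; texts containing "vpn" a high sparse-class probability.
def classify(text):
    if "?" in text:
        return [0.5, 0.5]
    if "vpn" in text:
        return [0.1, 0.9]
    return [0.95, 0.05]

screened_pairs = build_training_samples(
    ["hello", "is this ok?", "buy vpn now"],
    classify,
    is_hard=lambda d: max(d) < 0.6,            # near-uniform distribution
    is_sparse_candidate=lambda d: d[1] > 0.8,  # index 1 = sparse class
    request_label=lambda t: "sparse" if "vpn" in t else "normal",
)
```

Only the two screened texts reach the (stubbed) human labeler; the confidently classified "hello" never incurs labeling cost.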
Based on the above embodiment, in step 120, based on the classification result of the unlabeled text, screening a difficult sample from the unlabeled text specifically includes:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
and screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text.
Specifically, the confusion degree of each unlabeled text is calculated according to the probability distribution in the classification result of each unlabeled text. The more similar the probability of each type in the probability distribution of any unlabeled text is, the higher the confusion degree of the unlabeled text is. Here, the more similar the probabilities corresponding to the types in the probability distribution of any unlabeled text, the less sufficient the initial classification model learns the semantic features of the unlabeled text, so that the initial classification model cannot clearly and certainly indicate the type of the unlabeled text, and thus the higher the confusion degree of the unlabeled text with respect to the initial classification model is.
The higher the confusion degree of the unlabeled text is, the higher the learning difficulty of the unlabeled text to the initial classification model is, so that the unlabeled text with higher confusion degree can be screened out to be used as a difficult sample to construct a training sample, thereby improving the learning effect of a subsequent model and improving the Recall rate (Recall) and accuracy rate (Precision) of the model.
According to the method provided by the embodiment of the invention, the confusion degree of the unlabeled text is calculated according to the probability distribution in the classification result of the unlabeled text, and the difficult sample is screened based on the confusion degree of the unlabeled text, so that the difficult sample with high learning difficulty can be effectively obtained, the quality of the training sample is optimized, and the learning effect of a subsequent model and the recall rate and accuracy rate of the model are favorably improved.
Based on any of the above embodiments, calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically includes:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
Specifically, the entropy of the probability distribution of each unlabeled text can be computed and used as its confusion degree. For example, in a binary classification scenario, where the probability distribution of any unlabeled text comprises a probability p1 for type 1 and a probability p2 for type 2, the entropy (i.e., the confusion degree) of the unlabeled text is:
-p1*log(p1) - p2*log(p2)
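As an illustrative sketch (not the patent's implementation), the entropy-based confusion degree and a confusion-based screening step might look like this; the function names and the top-k selection rule are assumptions for illustration:

```python
import math

def confusion_degree(probs, eps=1e-12):
    """Entropy of a predicted probability distribution. Distributions whose
    class probabilities are nearly equal (the model is 'confused') yield
    higher entropy; eps guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)

# A near-uniform distribution is more 'confusing' than a confident one.
uniform = confusion_degree([0.5, 0.5])      # ~ log(2) ~ 0.693
confident = confusion_degree([0.99, 0.01])

def screen_hard_samples(texts, distributions, top_k):
    """Keep the top_k unlabeled texts with the highest confusion degree."""
    scored = sorted(zip(texts, distributions),
                    key=lambda pair: confusion_degree(pair[1]),
                    reverse=True)
    return [text for text, _ in scored[:top_k]]
```

The eps term slightly perturbs the exact entropy value but keeps the function defined when a class probability is exactly zero.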
based on any of the above embodiments, in step 120, based on the classification result of the unlabeled text, screening a first candidate sparse sample from the unlabeled text specifically includes:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
Specifically, each unlabeled text may be divided into a plurality of probability segments according to the probability corresponding to the sparse type in the classification result of each unlabeled text. Here, it is considered that the unlabeled text with higher probability corresponding to the sparse type is likely to be a sparse sample, and therefore, the unlabeled text with higher probability corresponding to the sparse type can be selected and divided into corresponding probability segments. For example, the unlabeled texts can be divided into five probability segments, namely 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, 0.9-1 and the like, according to the probability corresponding to the sparse type in the classification result of each unlabeled text.
And then, selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as first candidate sparse samples. Here, a plurality of unlabeled texts are respectively extracted from each probability segment meeting the condition, so that data are prevented from being sampled in the same probability segment, and the diversity of subsequently constructed training samples can be ensured.
According to the method provided by the embodiment of the invention, the unlabeled texts are divided into a plurality of probability segments, and one or more unlabeled texts are respectively selected from each probability segment higher than the preset threshold value as the first candidate sparse samples, so that the diversity of subsequently constructed training samples is ensured.
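The probability-segment sampling described above can be sketched as follows, under assumed parameter choices (0.1-wide segments starting at a 0.5 threshold, a few texts drawn per segment); the function name and defaults are illustrative, not from the patent:

```python
import random

def sample_sparse_candidates(texts, sparse_probs, threshold=0.5,
                             bin_width=0.1, per_bin=2, seed=0):
    """Bucket unlabeled texts by their predicted sparse-type probability
    and draw a few texts from every bucket above the threshold, so the
    candidates are not all sampled from one probability segment."""
    bins = {}
    for text, p in zip(texts, sparse_probs):
        if p < threshold:
            continue
        lo = min(int(p / bin_width) * bin_width, 1.0 - bin_width)
        bins.setdefault(round(lo, 10), []).append(text)
    rng = random.Random(seed)
    candidates = []
    for lo in sorted(bins):
        bucket = bins[lo]
        candidates.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return candidates
```

Drawing from every qualifying segment, rather than taking only the highest-probability texts, is what preserves the diversity of the resulting candidate set.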
Based on any of the above embodiments, the initial classification model is obtained by performing adversarial training on the labeled text.
Specifically, the type of part of the unlabeled text can be labeled manually by a labeling person to obtain a labeled text, and the initial classification model is trained by using the labeled text. Because the data labeling cost is high and the data collection is difficult, only a small amount of texts can be labeled to train an initial classification model, so that the labor cost is saved and the efficiency is improved.
Since the initial classification model has little training data at the training stage and is prone to overfitting, adversarial training can be introduced during model training to mitigate the overfitting risk. Adversarial training of the initial classification model means adding perturbed samples during training and combining them with the original training data to form new training data, thereby achieving data augmentation and avoiding model overfitting.
According to the method provided by the embodiment of the invention, the initial classification model is obtained by performing adversarial training on the labeled text, which can avoid model overfitting and improve the text classification capability of the initial classification model.
Based on any one of the above embodiments, performing adversarial training on the labeled text specifically includes:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the marked text.
Specifically, as shown in fig. 3, at the initial moment the labeled text is input into the initial classification model to obtain the gradient corresponding to the labeled text at that moment, and the model parameters of the initial classification model are updated according to this gradient. Then, a perturbation is generated from the input sample at the previous moment (i.e., the labeled text x) and the gradient at the previous moment, forming the input sample at the current moment (i.e., the perturbed data x1). The input sample at the current moment is fed into the initial classification model to obtain the gradient at the current moment, and the model parameters are updated again accordingly. These operations are repeated; after a preset number of perturbed data have been generated, the next labeled text is input into the initial classification model and the process repeats until training is finished. For each labeled text, multiple perturbed data are generated during model training to further update the model parameters, effectively alleviating the overfitting problem caused by scarce training data.
Here, the perturbation may be generated by methods such as FGSM (Fast Gradient Sign Method), FGM (Fast Gradient Method), or PGD (Projected Gradient Descent). For example, the perturbation may be generated as follows:
x_{t+1} = Project(x_t + α · g(x_t) / ||g(x_t)||),  where  g(x_t) = ∇_x L(θ, x_t, y)
where x_t is the sample obtained by the t-th perturbation; x_{t+1} is the sample obtained by perturbing x_t (the sample at the initial moment being the labeled text); g(x_t) is the gradient at the current moment; Project(·) is the perturbation constraint, i.e., the projection function: if the perturbation grows too large and leaves the constraint space of Project(·), it is mapped back to the boundary of that space to ensure the perturbation is not excessive; α is the perturbation coefficient, a constant; ∇_x denotes the gradient with respect to the input; L is the loss function; θ denotes the parameters of the initial classification model; and y is the sample label.
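The projected-gradient-style update can be sketched on a toy differentiable loss as follows. This is an assumed numerical illustration, not the patent's implementation: the gradient function, step size, and L2-ball projection are all stand-ins for the model gradient g(x_t), coefficient α, and Project(·) described above.

```python
import numpy as np

def pgd_perturb(x0, grad_fn, alpha=0.3, epsilon=1.0, steps=3):
    """PGD-style perturbation: ascend the (normalized) loss gradient from
    the original input x0, projecting the accumulated perturbation back
    onto the epsilon-ball boundary whenever it grows too large."""
    x = x0.copy()
    perturbed = []
    for _ in range(steps):
        g = grad_fn(x)
        x = x + alpha * g / (np.linalg.norm(g) + 1e-12)
        delta = x - x0
        norm = np.linalg.norm(delta)
        if norm > epsilon:                    # Project(): clip to the boundary
            x = x0 + delta * (epsilon / norm)
        perturbed.append(x.copy())
    return perturbed

# Toy loss L(x) = ||x||^2 has gradient 2x, so each step pushes x away
# from the origin until the epsilon constraint stops it.
x0 = np.array([1.0, 0.0])
samples = pgd_perturb(x0, grad_fn=lambda x: 2 * x)
```

Each perturbed sample would then be fed back through the model to compute the next gradient, as in the training loop described above.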
Based on any of the above embodiments, the method further comprises:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
Specifically, the number of sparse samples obtained in the above manner may still be limited. Therefore, to further expand the number of sparse samples and improve the text classification model's accuracy in identifying them, more sparse samples can be obtained from the unlabeled text by keyword matching. Second candidate sparse samples, i.e., texts that may be sparse samples, are first screened from the unlabeled text according to each text's classification result. Here, the unlabeled samples with the highest probability of corresponding to the sparse type may be selected as the second candidate sparse samples; for example, after sorting the unlabeled texts in descending order of their sparse-type probability, the front portion is selected as the second candidate sparse samples.
Subsequently, each second candidate sparse sample is matched against preset keywords (for example, by regular-expression matching). The preset keywords can be obtained by analyzing common characters, words, or phrases in existing manually labeled sparse samples, and can be adjusted for the specific service scenario. For example, in a violation-text classification scenario, if analysis of existing sparse samples determines that "vpn" is a high-frequency violation word, "vpn" may be stored as a preset keyword. If a second candidate sparse sample successfully matches any preset keyword, that is, the preset keyword appears in the sample, the sample is likely a sparse sample and can therefore be added to the training set as a training sample. In addition, if multiple second candidate sparse samples match the same preset keyword, an appropriate number of them can be selected as training samples according to the actual application scenario, avoiding both the risk of model overfitting caused by too many training samples matching the same keyword and the risk of insufficient training caused by too few.
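The matching and capping logic above can be sketched as follows. The keyword list, the cap value, and the function names are illustrative assumptions; the patent only specifies regular matching against preset keywords and limiting samples per keyword.

```python
import re
from collections import defaultdict

# Hypothetical preset keywords; in practice these come from common-word
# analysis of manually labeled sparse samples (e.g. "vpn" in a
# violation-text scenario).
PRESET_KEYWORDS = ["vpn", "proxy"]

def match_candidates(candidates, keywords=PRESET_KEYWORDS, cap_per_keyword=2):
    # Keep a second candidate sparse sample as a training sample if any
    # preset keyword occurs in it (case-insensitive regex match), while
    # capping how many samples the same keyword may contribute, to avoid
    # overfitting on one keyword.
    selected, per_keyword = [], defaultdict(int)
    for text in candidates:
        for kw in keywords:
            if re.search(re.escape(kw), text, re.IGNORECASE):
                if per_keyword[kw] < cap_per_keyword:
                    per_keyword[kw] += 1
                    selected.append(text)
                break  # one keyword match is enough to decide this candidate
    return selected
```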
According to the method provided by the embodiment of the invention, the second candidate sparse samples preliminarily screened from the unlabeled text are matched against preset keywords, and the successfully matched samples are added to the training set as training samples, expanding the number of sparse samples and improving the model's classification accuracy on them.
Based on any of the above embodiments, fig. 4 is a detailed flowchart of a training sample construction method provided by an embodiment of the present invention. As shown in fig. 4, the method includes:
Step 410: at the initial stage, manually label a small amount of text.
Step 420: using the labeled text from step 410 as the training set, fine-tune an initial classification model composed of a pre-trained RoBERTa model plus a classification network. Because training data at this stage is scarce and the model is prone to overfitting, adversarial training with the PGD method is introduced during training to mitigate the overfitting risk.
Step 430: use the initial classification model obtained in step 420 to predict on the unlabeled text (i.e., the test set), obtaining the classification result of each unlabeled text, namely its probability distribution over the types.
Step 440: screen the unlabeled text using an active-learning strategy, compensating for the insufficient samples of the previous step and filtering out unlabeled sparse data to improve the labeling yield. As shown in fig. 5, the screening in this step combines two strategies:
1) Compute the entropy of the probability distribution of each unlabeled text as its confusion degree. Sort the unlabeled texts by confusion degree and take those with high confusion (for example, the top 30% by entropy) as difficult samples, placing them in the candidate labeling set for subsequent labeling.
2) For the sparse type, divide the unlabeled texts output by the model into several probability segments according to the probability of the sparse type, such as 0.5-0.6, 0.6-0.7, and so on. Then randomly extract several unlabeled texts from each probability segment as first candidate sparse samples and place them in the candidate labeling set for subsequent labeling. This avoids sampling only from a single probability segment and thus ensures data diversity.
The screening methods 1) and 2) can be carried out simultaneously, and a deduplication operation is applied to the screened data.
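The two screening strategies and the final deduplication can be sketched as follows. The parameter values (top_frac, threshold, bin_width, per_bin) and function names are illustrative assumptions drawn from the examples above, not fixed by the patent.

```python
import random
from collections import defaultdict

import numpy as np

def confusion_degree(prob_dist, eps=1e-12):
    # Strategy 1: entropy of the predicted probability distribution.
    # The more uniform the probabilities, the higher the entropy, i.e.
    # the more "confused" the model is about the text.
    p = np.asarray(prob_dist, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def select_hard_samples(texts, prob_dists, top_frac=0.3):
    # Rank unlabeled texts by confusion degree and keep the top
    # fraction (e.g. 30%) as difficult samples.
    order = np.argsort([confusion_degree(p) for p in prob_dists])[::-1]
    k = max(1, int(len(texts) * top_frac))
    return [texts[i] for i in order[:k]]

def select_sparse_candidates(texts, sparse_probs, threshold=0.5,
                             bin_width=0.1, per_bin=2, seed=0):
    # Strategy 2: bucket texts by the probability of the sparse type
    # (0.5-0.6, 0.6-0.7, ...) and randomly draw a few from every bucket
    # above the threshold, so candidates span several probability
    # segments rather than one.
    bins = defaultdict(list)
    for text, p in zip(texts, sparse_probs):
        if p >= threshold:
            bins[int(p / bin_width)].append(text)
    rng = random.Random(seed)
    picked = []
    for key in sorted(bins):
        bucket = bins[key]
        picked.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return picked

def candidate_labeling_set(hard, sparse):
    # The two strategies run independently; deduplicate the union
    # while preserving order.
    return list(dict.fromkeys(hard + sparse))
```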
Step 450: label the difficult samples and/or first candidate sparse samples screened in step 440 to obtain training samples.
Steps 420-450 are looped until enough training samples are accumulated. Constructing the training set through these steps is efficient while guaranteeing data quality and diversity.
Steps 420-450 speed up training-sample acquisition, but the problem of sample scarcity may remain. Therefore, sample expansion can be performed using the following steps:
In addition, since the accuracy of some of the expanded samples may not be high, the loss weight of those samples may be appropriately reduced during training, with the weight determined by the estimated accuracy of the expanded samples.
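The down-weighting of expanded samples can be sketched as a weighted cross-entropy, where manually labeled samples keep weight 1.0 and keyword-expanded samples get a lower weight. The weight values and function name are illustrative assumptions; the patent only states that the weight is determined by the accuracy of the expanded samples.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, weights):
    # Per-sample cross-entropy scaled by a confidence weight: manually
    # labeled samples get weight 1.0, automatically expanded samples a
    # lower weight reflecting their estimated labeling accuracy.
    probs = np.asarray(probs, dtype=float)
    losses = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(np.asarray(weights) * losses))
```

Lowering a sample's weight shrinks its contribution to the mean loss, so noisy expanded samples influence the model less than trusted manual labels.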
In this way, training samples can be accumulated quickly, the pressure and cost of manual labeling are reduced, and sample diversity and quantity are guaranteed.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a training sample constructing apparatus provided in an embodiment of the present invention, and as shown in fig. 6, the apparatus includes: a classification unit 610, a sample screening unit 620 and a labeling unit 630.
The classification unit 610 is configured to classify an unlabeled text based on a trained initial classification model to obtain a classification result of the unlabeled text;
the sample screening unit 620 is used for screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
the labeling unit 630 is configured to label the difficult sample and/or the first candidate sparse sample to obtain a training sample.
According to the device provided by the embodiment of the invention, the trained initial classification model classifies the unlabeled texts; the texts are then screened according to their classification results to obtain difficult samples and first candidate sparse samples, and the screening results are labeled to obtain training samples. This greatly improves the efficiency of training-sample construction and effectively increases the number of sparse samples obtained.
Based on any of the above embodiments, the sample screening unit 620 is specifically configured to:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
and screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text.
According to the device provided by the embodiment of the invention, the confusion degree of each unlabeled text is calculated from the probability distribution in its classification result, and difficult samples are screened based on that confusion degree. In this way, difficult samples with high learning difficulty are obtained effectively, optimizing training-sample quality and helping improve the subsequent model's learning effect, recall, and accuracy.
Based on any of the above embodiments, calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically includes:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
Based on any of the above embodiments, the sample screening unit 620 is specifically configured to:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
According to the device provided by the embodiment of the invention, the unlabeled texts are divided into probability segments, and one or more unlabeled texts are selected from each segment above a preset threshold as first candidate sparse samples, ensuring the diversity of subsequently constructed training samples.
Based on any of the above embodiments, the initial classification model is obtained by performing adversarial training on the labeled text.
According to the device provided by the embodiment of the invention, the initial classification model is obtained by adversarial training on the labeled text, which avoids model overfitting and improves the initial classification model's text classification capability.
Based on any one of the above embodiments, performing adversarial training on the labeled text specifically includes:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the marked text.
Based on any of the above embodiments, the apparatus further comprises a sample expansion unit, wherein the sample expansion unit is configured to:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
According to the device provided by the embodiment of the invention, the second candidate sparse samples preliminarily screened from the unlabeled text are matched against preset keywords, and the successfully matched samples are added to the training set as training samples, expanding the number of sparse samples and improving the model's classification accuracy on them.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a training sample construction method comprising: classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts; and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the training sample construction method provided by the above methods, and the method includes: classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts; and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a training sample construction method provided by the above methods, the method including: classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts; and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (8)
1. A training sample construction method is characterized by comprising the following steps:
classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts;
labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample;
the screening of the difficult samples from the unlabeled texts based on the classification result of the unlabeled texts specifically comprises:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text;
screening a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text, which specifically comprises:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
2. The method for constructing training samples according to claim 1, wherein the calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically comprises:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
3. The method of claim 1, wherein the initial classification model is obtained by performing adversarial training on labeled text.
4. The method for constructing training samples according to claim 3, wherein the performing adversarial training on the labeled text specifically comprises:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the marked text.
5. The training sample construction method according to any one of claims 1 to 4, characterized by further comprising:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
6. A training sample construction apparatus, comprising:
the classification unit is used for classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
the sample screening unit is used for screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
the labeling unit is used for labeling the difficult samples and/or the first candidate sparse samples to obtain training samples;
the screening of the difficult samples from the unlabeled texts based on the classification result of the unlabeled texts specifically comprises:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text;
screening a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text, which specifically comprises:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the training sample construction method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
8. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the training sample construction method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111132630.0A CN113590764B (en) | 2021-09-27 | 2021-09-27 | Training sample construction method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111132630.0A CN113590764B (en) | 2021-09-27 | 2021-09-27 | Training sample construction method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113590764A CN113590764A (en) | 2021-11-02 |
CN113590764B true CN113590764B (en) | 2021-12-21 |
Family
ID=78242330
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111132630.0A Active CN113590764B (en) | 2021-09-27 | 2021-09-27 | Training sample construction method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113590764B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114091595A (en) * | 2021-11-15 | 2022-02-25 | 南京中兴新软件有限责任公司 | Sample processing method, apparatus and computer-readable storage medium |
CN114219046B (en) * | 2022-01-26 | 2023-07-28 | 北京百度网讯科技有限公司 | Model training method, matching method, device, system, electronic equipment and medium |
CN114648980A (en) * | 2022-03-03 | 2022-06-21 | 科大讯飞股份有限公司 | Data classification and voice recognition method and device, electronic equipment and storage medium |
CN115640808B (en) * | 2022-12-05 | 2023-03-21 | 苏州浪潮智能科技有限公司 | Text labeling method and device, electronic equipment and readable storage medium |
CN117574146B (en) * | 2023-11-15 | 2024-05-28 | 广州方舟信息科技有限公司 | Text classification labeling method, device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310846A (en) * | 2020-02-28 | 2020-06-19 | 平安科技(深圳)有限公司 | Method, device, storage medium and server for selecting sample image |
CN112256823A (en) * | 2020-10-29 | 2021-01-22 | 山东众阳健康科技集团有限公司 | Corpus data sampling method and system based on adjacency density |
CN112308144A (en) * | 2020-10-30 | 2021-02-02 | 江苏云从曦和人工智能有限公司 | Method, system, equipment and medium for screening samples |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130156300A1 (en) * | 2011-12-20 | 2013-06-20 | Fatih Porikli | Multi-Class Classification Method |
US10614379B2 (en) * | 2016-09-27 | 2020-04-07 | Disney Enterprises, Inc. | Robust classification by pre-conditioned lasso and transductive diffusion component analysis |
CN111104510B (en) * | 2019-11-15 | 2023-05-09 | 南京中新赛克科技有限责任公司 | Text classification training sample expansion method based on word embedding |
-
2021
- 2021-09-27 CN CN202111132630.0A patent/CN113590764B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111310846A (en) * | 2020-02-28 | 2020-06-19 | 平安科技(深圳)有限公司 | Method, device, storage medium and server for selecting sample image |
CN112256823A (en) * | 2020-10-29 | 2021-01-22 | 山东众阳健康科技集团有限公司 | Corpus data sampling method and system based on adjacency density |
CN112308144A (en) * | 2020-10-30 | 2021-02-02 | 江苏云从曦和人工智能有限公司 | Method, system, equipment and medium for screening samples |
Also Published As
Publication number | Publication date |
---|---|
CN113590764A (en) | 2021-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113590764B (en) | Training sample construction method and device, electronic equipment and storage medium | |
CN102411563B (en) | Method, device and system for identifying target words | |
CN108573047A (en) | A kind of training method and device of Module of Automatic Chinese Documents Classification | |
US20180357302A1 (en) | Method and device for processing a topic | |
KR20200127020A (en) | Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags | |
CN107545038B (en) | Text classification method and equipment | |
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method | |
CN112070138A (en) | Multi-label mixed classification model construction method, news classification method and system | |
CN113780007A (en) | Corpus screening method, intention recognition model optimization method, equipment and storage medium | |
CN111984792A (en) | Website classification method and device, computer equipment and storage medium | |
CN104850617A (en) | Short text processing method and apparatus | |
CN110928981A (en) | Method, system and storage medium for establishing and perfecting iteration of text label system | |
CN109918648B (en) | Rumor depth detection method based on dynamic sliding window feature score | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN110910175A (en) | Tourist ticket product portrait generation method | |
CN107341142B (en) | Enterprise relation calculation method and system based on keyword extraction and analysis | |
CN111368534A (en) | Application log noise reduction method and device | |
CN114691525A (en) | Test case selection method and device | |
CN115758183A (en) | Training method and device for log anomaly detection model | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN115062621A (en) | Label extraction method and device, electronic equipment and storage medium | |
CN114020904A (en) | Test question file screening method, model training method, device, equipment and medium | |
CN113095723A (en) | Coupon recommendation method and device | |
CN109947932B (en) | Push information classification method and system | |
CN108717637B (en) | Automatic mining method and system for E-commerce safety related entities |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |