CN113590764B - Training sample construction method and device, electronic equipment and storage medium - Google Patents

Training sample construction method and device, electronic equipment and storage medium

Info

Publication number
CN113590764B
CN113590764B
Authority
CN
China
Prior art keywords
unlabeled
sample
text
samples
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111132630.0A
Other languages
Chinese (zh)
Other versions
CN113590764A (en)
Inventor
吴杨龙
刘兆来
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd filed Critical Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202111132630.0A priority Critical patent/CN113590764B/en
Publication of CN113590764A publication Critical patent/CN113590764A/en
Application granted granted Critical
Publication of CN113590764B publication Critical patent/CN113590764B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a training sample construction method and device, electronic equipment and a storage medium. The method comprises the following steps: classifying unlabeled texts based on a trained initial classification model to obtain a classification result for each unlabeled text; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification results; and labeling the difficult samples and/or the first candidate sparse samples to obtain training samples. By using the trained initial classification model to classify the unlabeled texts, screening the unlabeled texts according to each classification result to obtain difficult samples and first candidate sparse samples, and labeling on the basis of the screening result to obtain training samples, the method, device, electronic equipment and storage medium greatly improve the construction efficiency of training samples while effectively increasing the number of sparse samples obtained.

Description

Training sample construction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a training sample construction method and device, electronic equipment and a storage medium.
Background
In the deep learning project, the quality of the training set often directly influences the training effect of the model, so that the establishment of a good training set has an important role in optimizing the model effect.
However, in practical application scenarios, constructing an effective training set requires slow accumulation within a specific task, which is time-consuming, and every sample must be manually labeled one by one, so the labor cost is high. This is especially true when some classes of training samples are sparsely distributed throughout the data set: the labor and time costs of acquiring such sparse samples are very high, and the number of sparse samples collected is very limited. A sparse sample is a sample whose type accounts for only a small proportion of the total number of samples. For example, if data violating platform regulations accounts for less than 0.1% of the whole data set, the violating data are sparse samples; to collect enough sparse samples for a training set, annotators need to accumulate data over a long period, which is time-consuming, labor-intensive and costly.
Disclosure of Invention
The invention provides a training sample construction method and device, electronic equipment and a storage medium, which are used for overcoming the defects of the prior art that sparse sample construction is difficult, inefficient and labor-intensive.
The invention provides a training sample construction method, which comprises the following steps:
classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts;
and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
According to the training sample construction method provided by the invention, the screening of the difficult samples from the unlabeled texts based on the classification results of the unlabeled texts specifically comprises the following steps:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
and screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text.
According to the training sample construction method provided by the invention, the calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically comprises the following steps:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
According to the training sample construction method provided by the invention, based on the classification result of the unlabeled text, a first candidate sparse sample is screened from the unlabeled text, and the method specifically comprises the following steps:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
According to the training sample construction method provided by the invention, the initial classification model is obtained by performing adversarial training on the labeled text.
According to the training sample construction method provided by the invention, the adversarial training on the labeled text specifically comprises the following steps:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the labeled text.
The training sample construction method provided by the invention further comprises the following steps:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
The invention also provides a training sample construction device, comprising:
the classification unit is used for classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
the sample screening unit is used for screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and the marking unit is used for marking the difficult sample and/or the first candidate sparse sample to obtain a training sample.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the training sample construction method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the training sample construction method as described in any of the above.
According to the training sample construction method and device, the electronic equipment and the storage medium, the trained initial classification model is used to classify the unlabeled texts; the unlabeled texts are then screened according to each classification result to obtain difficult samples and first candidate sparse samples, and labeling is carried out on the basis of the screening result to obtain training samples. This greatly improves the construction efficiency of training samples while effectively increasing the number of sparse samples obtained.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a training sample construction method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an initial classification model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of an adversarial training method according to an embodiment of the present invention;
FIG. 4 is a detailed flowchart of a training sample construction method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data screening method provided by an embodiment of the invention;
FIG. 6 is a schematic structural diagram of a training sample constructing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In the embodiment of the present application, the term "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the embodiments of the present application, the term "plurality" means two or more, and other terms are similar thereto.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a training sample construction method and device, electronic equipment and a storage medium, aiming to overcome the defects in the prior art that, when a sample set is constructed under data-sparse conditions, each sample needs to be manually labeled one by one, the efficiency is low and few samples are acquired. The method can greatly improve the construction efficiency of training samples while effectively increasing the number of sparse samples acquired.
The method and the device are based on the same application concept, and because the principles of solving the problems of the method and the device are similar, the implementation of the device and the method can be mutually referred, and repeated parts are not repeated.
Fig. 1 is a schematic flowchart of a training sample construction method provided in an embodiment of the present invention, and as shown in fig. 1, the method includes:
step 110, classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
step 120, screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and step 130, labeling the difficult samples and/or the first candidate sparse samples to obtain training samples.
Specifically, the trained initial classification model is used to classify texts that have not been manually labeled and whose types are not yet known, yielding a classification result for each unlabeled text. The initial classification model may be any type of text classification model, for example a combination of a text semantic extraction network (e.g., RoBERTa or BERT) and a classification network (as shown in fig. 2). The classification result of any unlabeled text may include the probability that the unlabeled text belongs to each text type; for example, in a violating-text classification scenario, the classification result may include the probability that the unlabeled text is a violating text and the probability that it is a non-violating text.
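As one possible illustration of such a model, the following sketch builds an initial classification model from a pre-trained semantic extraction network plus a classification network; the use of the Hugging Face transformers library, the model name and the two-type setting are assumptions made for illustration only and are not fixed by the embodiment.

    import torch
    import torch.nn as nn
    from transformers import AutoModel  # assumed dependency for loading RoBERTa/BERT weights

    class InitialClassifier(nn.Module):
        """Text semantic extraction network + classification network (cf. fig. 2)."""

        def __init__(self, pretrained_name="hfl/chinese-roberta-wwm-ext", num_types=2):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(pretrained_name)                    # semantic extraction network
            self.classifier = nn.Linear(self.encoder.config.hidden_size, num_types)      # classification network

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
            cls_vec = hidden[:, 0]                                    # [CLS] vector as the text representation
            return torch.softmax(self.classifier(cls_vec), dim=-1)   # probability of each text type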
Difficult samples are then screened out from the unlabeled texts according to each text's classification result. A difficult sample is a sample that is hard for the current initial classification model to learn, i.e., one that the model classifies into the correct type with relatively low probability. The appearance of difficult samples reflects, to some extent, the problem of sample imbalance: because the number of sparse samples is small, the initial classification model learns them insufficiently and classifies them poorly, so the difficult samples may contain sparse samples. Screening out the difficult samples to construct training samples therefore increases the number of sparse samples among the training samples, improves the training effect of subsequent models, and improves the classification accuracy on sparse samples. Even if a difficult sample is not a sparse sample, its learning difficulty is high, and including it in the training set still improves the training effect of subsequent models and the classification accuracy on similar samples.
In addition, according to the classification results produced by the initial classification model, the unlabeled texts that the initial classification model considers to be sparse samples are directly screened out as first candidate sparse samples. For example, according to the probability that any unlabeled text corresponds to the sparse type, the unlabeled texts with a higher probability for the sparse type can be screened out as first candidate sparse samples. The sparse type is the type corresponding to sparse samples, i.e., the type whose number of samples accounts for a proportion of the total number of samples that is less than a preset threshold. For example, in the violating-text classification scenario, the proportion of violating texts is small, so the violating-text type is a sparse type. Moreover, even if a first candidate sparse sample turns out not to be a sparse sample, which indicates a classification error of the initial classification model, including it in the training set still improves the training effect of subsequent models and the classification accuracy.
Here, because the initial classification model is not yet sufficiently trained, its classification accuracy is limited and the classification result of each unlabeled text may be inaccurate. The true types of the screened difficult samples and first candidate sparse samples can therefore be further labeled, yielding training samples for training the initial classification model or other text classification models.
By classifying the unlabeled texts with the trained initial classification model, screening them according to each classification result to obtain difficult samples and first candidate sparse samples that are more likely to be sparse samples, and labeling on the basis of the screening result, the range of manual labeling is greatly narrowed, the workload of manual labeling is reduced, and the construction efficiency of training samples is improved. Sparse samples can thus be acquired quickly, the number of sparse samples in the training set is increased, and the training effect and text classification accuracy of the text classification model are improved.
According to the method provided by the embodiment of the invention, the trained initial classification model is used for classifying the unlabeled texts, so that the unlabeled texts are screened according to the classification result of each unlabeled text to obtain the difficult sample and the first candidate sparse sample, and labeling is carried out on the basis of the screening result to obtain the training sample, so that the construction efficiency of the training sample can be greatly improved, and the number of the obtained sparse samples can be effectively increased.
Based on the above embodiment, in step 120, based on the classification result of the unlabeled text, screening a difficult sample from the unlabeled text specifically includes:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
and screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text.
Specifically, the confusion degree of each unlabeled text is calculated according to the probability distribution in the classification result of each unlabeled text. The more similar the probability of each type in the probability distribution of any unlabeled text is, the higher the confusion degree of the unlabeled text is. Here, the more similar the probabilities corresponding to the types in the probability distribution of any unlabeled text, the less sufficient the initial classification model learns the semantic features of the unlabeled text, so that the initial classification model cannot clearly and certainly indicate the type of the unlabeled text, and thus the higher the confusion degree of the unlabeled text with respect to the initial classification model is.
The higher the confusion degree of an unlabeled text, the harder it is for the initial classification model to learn, so the unlabeled texts with higher confusion degree can be screened out as difficult samples for constructing training samples, thereby improving the learning effect of subsequent models and increasing the recall (Recall) and precision (Precision) of the model.
According to the method provided by the embodiment of the invention, the confusion degree of the unlabeled text is calculated from the probability distribution in its classification result, and difficult samples are screened based on the confusion degree, so that difficult samples with high learning difficulty can be obtained effectively, the quality of the training samples is optimized, and the learning effect of subsequent models as well as their recall and precision are improved.
Based on any of the above embodiments, calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically includes:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
Specifically, the entropy of the probability distribution of each unlabeled text is calculated and used as its confusion degree. For example, in a binary classification scenario, the probability distribution of any unlabeled text includes a probability p0 corresponding to type 0 and a probability p1 corresponding to type 1, and the entropy (i.e., the confusion degree) of the unlabeled text can be calculated as follows:
-p0*log(p0) - p1*log(p1)
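A minimal sketch of this confusion-degree computation is given below; the function name and the use of the natural logarithm are illustrative assumptions.

    import math

    def confusion_degree(prob_dist):
        """Entropy of an unlabeled text's predicted probability distribution; higher means more confused."""
        return -sum(p * math.log(p) for p in prob_dist if p > 0)

    # The more similar the type probabilities, the higher the confusion degree:
    print(confusion_degree([0.5, 0.5]))    # ~0.693 (maximum for two types)
    print(confusion_degree([0.99, 0.01]))  # ~0.056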
based on any of the above embodiments, in step 120, based on the classification result of the unlabeled text, screening a first candidate sparse sample from the unlabeled text specifically includes:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
Specifically, each unlabeled text may be assigned to one of a plurality of probability segments according to the probability corresponding to the sparse type in its classification result. Considering that unlabeled texts with a higher probability for the sparse type are more likely to be sparse samples, the unlabeled texts with a higher probability for the sparse type can be selected and divided into the corresponding probability segments. For example, according to the probability corresponding to the sparse type in each classification result, the unlabeled texts can be divided into five probability segments: 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9 and 0.9-1.
Then, one or more unlabeled texts are selected from each probability segment above a preset threshold as first candidate sparse samples. Extracting several unlabeled texts from each qualifying probability segment avoids drawing all the data from a single probability segment, which ensures the diversity of the subsequently constructed training samples.
According to the method provided by the embodiment of the invention, the unlabeled texts are divided into a plurality of probability segments, and one or more unlabeled texts are respectively selected from each probability segment higher than the preset threshold value as the first candidate sparse samples, so that the diversity of subsequently constructed training samples is ensured.
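The following is a minimal sketch of this segment-wise selection; the segment width, the number of texts drawn per segment and the function name are assumptions for illustration and can be adapted to the actual scenario.

    import random
    from collections import defaultdict

    def select_first_candidates(texts, sparse_probs, threshold=0.5, width=0.1, per_segment=100):
        """Bucket unlabeled texts by their sparse-type probability and draw from every
        segment above the threshold, so no single probability segment dominates."""
        segments = defaultdict(list)
        for text, p in zip(texts, sparse_probs):
            if p >= threshold:
                lower = round(int(p / width) * width, 1)   # e.g. 0.63 falls into the 0.6-0.7 segment
                segments[lower].append(text)
        candidates = []
        for lower in sorted(segments):
            members = segments[lower]
            candidates.extend(random.sample(members, min(per_segment, len(members))))
        return candidates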
Based on any of the above embodiments, the initial classification model is obtained by performing adversarial training on the labeled text.
Specifically, the type of part of the unlabeled text can be labeled manually by a labeling person to obtain a labeled text, and the initial classification model is trained by using the labeled text. Because the data labeling cost is high and the data collection is difficult, only a small amount of texts can be labeled to train an initial classification model, so that the labor cost is saved and the efficiency is improved.
Since the initial classification model has little training data in this training stage and is prone to overfitting, an adversarial training scheme can be introduced during training to mitigate the overfitting risk. Performing adversarial training on the initial classification model means adding perturbed samples during model training and combining them with the original training data to form new training data, which achieves data augmentation and avoids model overfitting.
According to the method provided by the embodiment of the invention, the initial classification model is obtained by performing adversarial training on the labeled text, which avoids model overfitting and improves the text classification capability of the initial classification model.
Based on any one of the above embodiments, performing adversarial training on the labeled text specifically includes:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the labeled text.
Specifically, as shown in fig. 3, at the initial moment the labeled text is input into the initial classification model to obtain the gradient corresponding to the labeled text at the initial moment, and the model parameters of the initial classification model are updated according to this gradient. A perturbation is then generated from the input sample at the previous moment (i.e., the labeled text x) and the gradient at the previous moment, forming the input sample at the current moment (i.e., the perturbed data x1). The input sample at the current moment is fed into the initial classification model to obtain the gradient at the current moment, and the model parameters are updated again according to this gradient. These operations are repeated; after a preset number of perturbed samples have been generated, the next labeled text is input into the initial classification model and the operations are repeated until training is finished. For each labeled text, several perturbed samples are thus generated during model training to further update the model parameters, which effectively alleviates the model overfitting caused by scarce training data.
Here, the perturbation may be generated by methods such as FGSM (Fast Gradient Sign Method), FGM (Fast Gradient Method) or PGD (Projected Gradient Descent). For example, the perturbation may be generated as follows:
x_{t+1} = Project(x_t + α · g(x_t) / ||g(x_t)||_2)

g(x_t) = ∂L(x_t, y; θ) / ∂x_t

where x_t is the sample obtained by the t-th perturbation (at the initial moment, the labeled sample itself), x_{t+1} is the sample obtained by perturbing x_t, g(x_t) is the gradient at the current moment, Project(·) is the perturbation constraint, i.e., the perturbation function: if the perturbation is too large and exceeds the constraint space of Project(·), it is mapped back to the boundary of the constraint space to ensure that the perturbation is not too large; α is the perturbation coefficient, a constant; ∂ denotes the partial differential; L is the loss function; θ is the parameter of the initial classification model; and y is the sample label.
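A condensed sketch of one way to realize this perturbation loop on the input embeddings is shown below; the number of steps, the use of a simple clamp for the Project(·) constraint and the assumption that the model accepts an inputs_embeds argument (as Hugging Face transformer models do) are illustrative choices rather than details fixed by the embodiment.

    import torch

    def pgd_perturb(model, embeddings, labels, loss_fn, alpha=0.3, epsilon=1.0, steps=3):
        """Repeatedly perturb the input embeddings along the loss gradient and map the
        accumulated perturbation back into an epsilon-ball (the Project() constraint above)."""
        x0 = embeddings.detach()
        x = x0.clone().requires_grad_(True)
        for _ in range(steps):
            loss = loss_fn(model(inputs_embeds=x), labels)
            grad, = torch.autograd.grad(loss, x)                 # gradient at the current moment
            with torch.no_grad():
                x = x + alpha * grad / (grad.norm() + 1e-12)     # step along the normalized gradient
                x = x0 + torch.clamp(x - x0, -epsilon, epsilon)  # map back into the constraint space
            x.requires_grad_(True)
        return x  # perturbed sample; feeding it back through the model yields the gradient used to update the parameters

In practice each perturbed sample returned by such a routine is passed through the model so that the resulting gradient further updates the model parameters, as described above.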
Based on any of the above embodiments, the method further comprises:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
Specifically, the number of sparse samples obtained in the above manner may still be limited. To further expand the number of sparse samples and improve the identification accuracy of the text classification model on sparse samples, more sparse samples can be obtained from the unlabeled text by keyword matching. Specifically, second candidate sparse samples that may be sparse samples can be screened from the unlabeled texts according to each text's classification result. Here, according to the probability corresponding to the sparse type in each classification result, a number of unlabeled texts with the highest probability may be selected as second candidate sparse samples; for example, after sorting the unlabeled texts by the probability corresponding to the sparse type from high to low, the top portion is selected as second candidate sparse samples.
Subsequently, keyword matching (for example, regular-expression matching) is performed between each second candidate sparse sample and the preset keywords. The preset keywords can be obtained by analyzing frequently occurring characters, words or phrases in existing manually labeled sparse samples, and can be adjusted according to the specific service scenario. For example, in a violating-text classification scenario, if analysis of existing sparse samples shows that "vpn" is a high-frequency word in violating texts, "vpn" can be stored as a preset keyword. If any second candidate sparse sample is successfully matched with any preset keyword, i.e., the preset keyword appears in the second candidate sparse sample, the second candidate sparse sample is likely a sparse sample and can be added to the training set as a training sample. In addition, if multiple second candidate sparse samples match the same preset keyword, an appropriate number of them can be selected as training samples according to the actual application scenario, which avoids both the risk of model overfitting caused by too many training samples matching the same preset keyword and the risk of insufficient training caused by too few.
According to the method provided by the embodiment of the invention, the second candidate sparse samples preliminarily screened from the unlabeled texts are matched against the preset keywords, and the successfully matched second candidate sparse samples are added to the training set as training samples, which expands the number of sparse samples and improves the classification accuracy of the model on sparse samples.
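A short sketch of this keyword-expansion step follows; the keyword list, the per-keyword cap and the use of regular-expression matching are assumptions for illustration.

    import re
    from collections import defaultdict

    PRESET_KEYWORDS = ["vpn"]   # e.g. high-frequency words mined from existing sparse samples
    MAX_PER_KEYWORD = 200       # cap per keyword to limit the overfitting risk mentioned above

    def expand_by_keywords(second_candidates):
        """Keep second candidate sparse samples containing a preset keyword, limiting
        how many samples any single keyword contributes."""
        kept, per_keyword = [], defaultdict(int)
        for text in second_candidates:
            for kw in PRESET_KEYWORDS:
                if re.search(re.escape(kw), text, flags=re.IGNORECASE):
                    if per_keyword[kw] < MAX_PER_KEYWORD:
                        per_keyword[kw] += 1
                        kept.append(text)
                    break   # one matched keyword is enough for this text
        return kept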
Based on any of the above embodiments, fig. 4 is a detailed flowchart schematic diagram of a training sample construction method provided by an embodiment of the present invention, as shown in fig. 4, the method includes:
at an initial stage, a small amount of labeled text is manually marked, step 410.
Step 420: using the labeled text from step 410 as the training set, fine-tune an initial classification model formed by the pre-trained model RoBERTa plus a classification network. Since training data at this stage are scarce and the model is prone to overfitting, the PGD adversarial training method is introduced during training to mitigate the overfitting risk.
Step 430: use the initial classification model obtained in step 420 to predict the unlabeled text (i.e., the test set), obtaining the classification result of each unlabeled text, namely its probability distribution over the types.
Step 440: screen the unlabeled text with an active learning strategy to compensate for the sample shortage of the previous step and to filter candidate sparse data out of the unlabeled text, improving the hit rate of labeling. As shown in fig. 5, the screening in this step comprises two strategies:
1) Calculate the entropy of the probability distribution of each unlabeled text as its confusion degree. Sort the unlabeled texts by confusion degree and select those with high confusion degree (for example, the top 30% when sorted by entropy) as difficult samples, placing them in the candidate labeling set for subsequent labeling.
2) For the sparse type, divide the unlabeled texts output by the model into a plurality of probability segments according to their probability for the sparse type, such as 0.5-0.6, 0.6-0.7, ..., 0.9-1. Then randomly extract several unlabeled texts from each probability segment and place them, as first candidate sparse samples, in the candidate labeling set for subsequent labeling. This avoids drawing all the data from the same probability segment and thereby ensures data diversity.
The screening strategies 1) and 2) can be applied simultaneously, and the data obtained after screening are deduplicated.
Step 450: label the difficult samples and/or first candidate sparse samples screened in step 440 to obtain training samples.
Steps 420 to 450 are looped until enough training samples are accumulated. Constructing the training data set through these steps is efficient while ensuring data quality and diversity.
Steps 420 to 450 speed up the acquisition of training samples, but the problem of sample shortage may remain. Therefore, sample expansion can be performed with the following step:
step 460, according to the probability distribution of each unlabeled text, a second candidate sparse sample which may be a sparse sample (for example, the unlabeled text corresponding to the sparse type and having a higher probability) is screened out. And matching keywords with the second candidate sparse sample, and labeling the successfully matched second candidate sparse sample to be used as a training sample.
In addition, since the accuracy of these expanded samples may not be high, the loss weights of these samples may be appropriately reduced during training; the weight can be determined by the accuracy of the expanded samples.
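One possible realization of this down-weighting is a per-sample weighted loss, sketched below; the weight value and the idea of tying it directly to the estimated accuracy of the expanded samples are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def weighted_loss(logits, labels, is_expanded, expanded_accuracy=0.8):
        """Cross-entropy in which keyword-expanded samples contribute less, in proportion
        to their estimated labeling accuracy. is_expanded is a boolean tensor marking them."""
        per_sample = F.cross_entropy(logits, labels, reduction="none")
        weights = torch.where(is_expanded,
                              torch.full_like(per_sample, expanded_accuracy),
                              torch.ones_like(per_sample))
        return (weights * per_sample).mean()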
Through the mode, the training samples can be quickly accumulated, the manual labeling pressure and cost are reduced, and the diversity and the number of the samples can be guaranteed.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a training sample constructing apparatus provided in an embodiment of the present invention, and as shown in fig. 6, the apparatus includes: a classification unit 610, a sample screening unit 620 and a labeling unit 630.
The classification unit 610 is configured to classify an unlabeled text based on a trained initial classification model to obtain a classification result of the unlabeled text;
the sample screening unit 620 is used for screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
the labeling unit 630 is configured to label the difficult sample and/or the first candidate sparse sample to obtain a training sample.
According to the device provided by the embodiment of the invention, the trained initial classification model is used for classifying the unlabeled texts, so that the unlabeled texts are screened according to the classification result of each unlabeled text to obtain the difficult sample and the first candidate sparse sample, and labeling is carried out on the basis of the screening result to obtain the training sample, so that the construction efficiency of the training sample can be greatly improved, and the number of the obtained sparse samples can be effectively increased.
Based on any of the above embodiments, the sample screening unit 620 is specifically configured to:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
and screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text.
According to the device provided by the embodiment of the invention, the confusion degree of the unlabeled text is calculated from the probability distribution in its classification result, and difficult samples are screened based on the confusion degree, so that difficult samples with high learning difficulty can be obtained effectively, the quality of the training samples is optimized, and the learning effect of subsequent models as well as their recall and precision are improved.
Based on any of the above embodiments, calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically includes:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
Based on any of the above embodiments, the sample screening unit 620 is specifically configured to:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
According to the device provided by the embodiment of the invention, the unlabeled texts are divided into the probability sections, and one or more unlabeled texts are respectively selected from each probability section higher than the preset threshold value as the first candidate sparse samples, so that the diversity of subsequently constructed training samples is ensured.
Based on any of the above embodiments, the initial classification model is obtained by performing adversarial training on the labeled text.
According to the device provided by the embodiment of the invention, the initial classification model is obtained by performing adversarial training on the labeled text, which avoids model overfitting and improves the text classification capability of the initial classification model.
Based on any one of the above embodiments, performing adversarial training on the labeled text specifically includes:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the labeled text.
Based on any of the above embodiments, the apparatus further comprises a sample expansion unit, wherein the sample expansion unit is configured to:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
According to the device provided by the embodiment of the invention, the second candidate sparse samples preliminarily screened from the unlabeled texts are matched against the preset keywords, and the successfully matched second candidate sparse samples are added to the training set as training samples, which expands the number of sparse samples and improves the classification accuracy of the model on sparse samples.
Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a training sample construction method comprising: classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts; and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the training sample construction method provided by the above methods, and the method includes: classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts; and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a training sample construction method provided by the above methods, the method including: classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts; screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts; and labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A training sample construction method is characterized by comprising the following steps:
classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
screening difficult samples and/or first candidate sparse samples from the unlabeled texts based on the classification result of the unlabeled texts;
labeling the difficult sample and/or the first candidate sparse sample to obtain a training sample;
the screening of the difficult samples from the unlabeled texts based on the classification result of the unlabeled texts specifically comprises:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text;
screening a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text, which specifically comprises:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
2. The method for constructing training samples according to claim 1, wherein the calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text specifically comprises:
and calculating the entropy of the probability distribution of the unlabeled text as the confusion degree of the unlabeled text.
3. The method of claim 1, wherein the initial classification model is obtained by performing adversarial training on labeled text.
4. The method for constructing training samples according to claim 3, wherein the performing adversarial training on the labeled text specifically comprises:
generating disturbance based on the input sample at the previous moment and the gradient of the previous moment to obtain the input sample at the current moment;
determining a gradient of the current time based on the input samples of the current time;
updating the initial classification model based on the gradient of the current moment;
wherein the input sample at the initial moment is the labeled text.
5. The training sample construction method according to any one of claims 1 to 4, characterized by further comprising:
screening a second candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
and performing keyword matching on the second candidate sparse sample and a preset keyword, and taking the successfully matched second candidate sparse sample as a training sample.
6. A training sample construction apparatus, comprising:
the classification unit is used for classifying the unlabeled texts based on the trained initial classification model to obtain the classification result of the unlabeled texts;
the sample screening unit is used for screening a difficult sample and/or a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text;
the labeling unit is used for labeling the difficult samples and/or the first candidate sparse samples to obtain training samples;
the screening of the difficult samples from the unlabeled texts based on the classification result of the unlabeled texts specifically comprises:
calculating the confusion degree of the unlabeled text based on the probability distribution in the classification result of the unlabeled text; the more similar the probability of each type in the probability distribution of the unlabeled text is, the higher the confusion degree of the unlabeled text is;
screening the difficult sample from the unlabeled text based on the confusion degree of the unlabeled text;
screening a first candidate sparse sample from the unlabeled text based on the classification result of the unlabeled text, which specifically comprises:
dividing the unlabeled text into a plurality of probability sections based on the probability corresponding to the sparse type in the classification result of the unlabeled text;
and selecting one or more unlabeled texts from each probability segment higher than a preset threshold value as the first candidate sparse samples.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the training sample construction method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
8. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the training sample construction method according to any one of claims 1 to 5.
CN202111132630.0A 2021-09-27 2021-09-27 Training sample construction method and device, electronic equipment and storage medium Active CN113590764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111132630.0A CN113590764B (en) 2021-09-27 2021-09-27 Training sample construction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111132630.0A CN113590764B (en) 2021-09-27 2021-09-27 Training sample construction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113590764A CN113590764A (en) 2021-11-02
CN113590764B true CN113590764B (en) 2021-12-21

Family

ID=78242330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111132630.0A Active CN113590764B (en) 2021-09-27 2021-09-27 Training sample construction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113590764B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091595A (en) * 2021-11-15 2022-02-25 南京中兴新软件有限责任公司 Sample processing method, apparatus and computer-readable storage medium
CN114219046B (en) * 2022-01-26 2023-07-28 北京百度网讯科技有限公司 Model training method, matching method, device, system, electronic equipment and medium
CN114648980A (en) * 2022-03-03 2022-06-21 科大讯飞股份有限公司 Data classification and voice recognition method and device, electronic equipment and storage medium
CN115640808B (en) * 2022-12-05 2023-03-21 苏州浪潮智能科技有限公司 Text labeling method and device, electronic equipment and readable storage medium
CN117574146B (en) * 2023-11-15 2024-05-28 广州方舟信息科技有限公司 Text classification labeling method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310846A (en) * 2020-02-28 2020-06-19 平安科技(深圳)有限公司 Method, device, storage medium and server for selecting sample image
CN112256823A (en) * 2020-10-29 2021-01-22 山东众阳健康科技集团有限公司 Corpus data sampling method and system based on adjacency density
CN112308144A (en) * 2020-10-30 2021-02-02 江苏云从曦和人工智能有限公司 Method, system, equipment and medium for screening samples

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130156300A1 (en) * 2011-12-20 2013-06-20 Fatih Porikli Multi-Class Classification Method
US10614379B2 (en) * 2016-09-27 2020-04-07 Disney Enterprises, Inc. Robust classification by pre-conditioned lasso and transductive diffusion component analysis
CN111104510B (en) * 2019-11-15 2023-05-09 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310846A (en) * 2020-02-28 2020-06-19 平安科技(深圳)有限公司 Method, device, storage medium and server for selecting sample image
CN112256823A (en) * 2020-10-29 2021-01-22 山东众阳健康科技集团有限公司 Corpus data sampling method and system based on adjacency density
CN112308144A (en) * 2020-10-30 2021-02-02 江苏云从曦和人工智能有限公司 Method, system, equipment and medium for screening samples

Also Published As

Publication number Publication date
CN113590764A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN113590764B (en) Training sample construction method and device, electronic equipment and storage medium
CN102411563B (en) Method, device and system for identifying target words
CN108573047A (en) A kind of training method and device of Module of Automatic Chinese Documents Classification
US20180357302A1 (en) Method and device for processing a topic
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN107545038B (en) Text classification method and equipment
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN112070138A (en) Multi-label mixed classification model construction method, news classification method and system
CN113780007A (en) Corpus screening method, intention recognition model optimization method, equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN104850617A (en) Short text processing method and apparatus
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
CN109918648B (en) Rumor depth detection method based on dynamic sliding window feature score
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN110910175A (en) Tourist ticket product portrait generation method
CN107341142B (en) Enterprise relation calculation method and system based on keyword extraction and analysis
CN111368534A (en) Application log noise reduction method and device
CN114691525A (en) Test case selection method and device
CN115758183A (en) Training method and device for log anomaly detection model
CN115146062A (en) Intelligent event analysis method and system fusing expert recommendation and text clustering
CN115062621A (en) Label extraction method and device, electronic equipment and storage medium
CN114020904A (en) Test question file screening method, model training method, device, equipment and medium
CN113095723A (en) Coupon recommendation method and device
CN109947932B (en) Push information classification method and system
CN108717637B (en) Automatic mining method and system for E-commerce safety related entities

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant