CN115840884A - Sample selection method, device, equipment and medium - Google Patents

Sample selection method, device, equipment and medium

Info

Publication number
CN115840884A
Authority
CN
China
Prior art keywords
sample
samples
confidence
enhanced
clean
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211606921.3A
Other languages
Chinese (zh)
Inventor
蒋盛益
林晓钿
林楠铠
付颖雯
杨子渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies filed Critical Guangdong University of Foreign Studies
Priority to CN202211606921.3A priority Critical patent/CN115840884A/en
Publication of CN115840884A publication Critical patent/CN115840884A/en
Pending legal-status Critical Current

Abstract

The invention discloses a sample selection method, device, equipment and medium. The method classifies the clean samples and noise samples generated by a data enhancement strategy, screens the high-confidence clean samples as high-quality samples, and then re-selects from the low-reliability samples (the noise samples and the low-confidence clean samples) to recall high-quality low-confidence samples that supplement the high-confidence clean samples, thereby completing the screening of high-quality samples among the enhanced samples. The method not only effectively screens the high-quality samples generated by data enhancement, but also increases the diversity of the data-enhanced samples, so that the model can learn more patterns; this improves the performance of the model and further improves its generalization. Correspondingly, the invention also provides a sample selection device, equipment and medium.

Description

Sample selection method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a sample selection method, a sample selection device, sample selection equipment and a sample selection medium.
Background
Text classification is the basis of many Natural Language Processing (NLP) tasks and is widely applied in fields such as sentiment analysis and intelligent question answering. Generally, training a classifier with strong generalization capability requires a large amount of labeled data, but building a large corpus demands high manual labeling cost and a large amount of time and effort, which are often difficult to bear. To solve this problem, Data Augmentation strategies have been proposed: data augmentation can greatly increase the data volume and alleviate data scarcity, thereby improving the generalization capability of the model. However, in the field of natural language processing, data enhancement faces great challenges. Besides the discontinuity of text data, a major reason is that language itself has weak anti-interference capability: randomly modifying language data is likely to destroy its semantics and produce low-quality samples, which greatly affect the judgment of a classifier and thus have a negative feedback effect on the model.
Therefore, how to select high quality samples from the data enhancement samples is important.
Disclosure of Invention
Various aspects of embodiments of the present invention provide a sample selection method, apparatus, device, and medium, which can effectively screen out the high-quality samples generated by a data enhancement strategy.
A first aspect of an embodiment of the present invention provides a sample selection method, including:
acquiring an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of the label of the original labeled sample corresponding to the enhanced sample and the pseudo label of the enhanced sample;
introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters to obtain confidence coefficient of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence coefficient;
confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
taking the high confidence sample and the recall sample as final selected samples.
A second aspect of the embodiments of the present invention provides a sample selection apparatus, including:
the pseudo label obtaining module is used for obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
the first classification module is used for classifying the enhanced samples into clean samples and noise samples according to the comparison result of the labels of the original marked samples corresponding to the enhanced samples and the pseudo labels of the enhanced samples;
the second classification module is used for introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence coefficient of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence coefficient;
the recall module is used for confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
a selection module to take the high confidence sample and the recall sample as final selected samples.
A third aspect of the embodiments of the present invention provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor, when executing the computer program, implements the sample selection method provided in the foregoing embodiments.
A fourth aspect of the embodiments of the present invention provides a storage medium, where the storage medium includes a stored computer program, and when the computer program runs, a device on which the storage medium is located is controlled to execute the sample selection method provided in the foregoing embodiments.
Compared with the prior art, the sample selection method provided by the embodiment of the invention classifies the clean samples and noise samples generated by the data enhancement strategy, screens the high-confidence samples from the clean samples as high-quality samples, and then re-selects from the low-reliability samples (the noise samples and the low-confidence clean samples) to recall high-quality low-confidence samples that supplement the high-confidence clean samples, thereby completing the screening of high-quality samples among the enhanced samples. The embodiment can effectively screen out the high-quality samples generated by the data enhancement strategy, and recalling the low-reliability samples increases the diversity of the data-enhanced samples, so that the model can learn more patterns, the performance of the model is improved, and the generalization of the model is further improved. Correspondingly, the embodiment of the invention also provides a sample selection device, equipment and medium.
Drawings
FIG. 1 is a block diagram of a sample selection framework provided by an embodiment of the present invention;
fig. 2 is a schematic flow chart of a sample selection method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data enhancement strategy is very popular in the research field of solving the problem of data shortage, and refers to a process of generating new data by training data through certain transformation operation, so that limited data can generate a value equivalent to more data, and the generalization capability and robustness of the model are improved. Data enhancement strategies are also widely used in the field of natural language processing. However, in the field of natural language processing, data enhancement faces a great challenge, and besides the discontinuity of text data, the great reason is that the language itself has weak anti-interference capability, and the random modification of language data is likely to destroy the semantics thereof, generate low-quality samples, greatly affect the judgment of a classifier, and thus generate negative feedback effect on a model.
Although data enhancement can alleviate the data shortage problem of model training to some extent, the enhanced samples with low quality may have negative feedback effect on the model. Therefore, it is important to filter the noise of the enhanced samples, i.e., sample selection. Currently, there are related studies for evaluating the quality of a sample, such as data evaluation based on classifier discrimination and data evaluation based on text similarity.
The data evaluation method based on classifier discrimination trains a text classification model with labeled data and then uses the text classifier to classify or predict the unlabeled data. However, using the classifier alone to evaluate the generated data causes the distribution of the screened data to fit the classifier, i.e. to maintain the original data distribution; the diversity of the screened samples is therefore not high and the performance of the classifier cannot be improved.
The selection method based on text similarity is not technically difficult to realize; text similarity is generally judged by calculating a text distance. Natural language processing often involves the problem of how to measure the similarity of two texts, and measuring the similarity between sentences or phrases is particularly important in tasks such as dialogue systems and information retrieval. In this approach, the samples generated by the model are screened through the text coverage between the original samples and the generated samples. However, this strategy only considers the similarity between the generated sample and the original sample at the lexical level of the text and lacks information at the semantic level.
The above sample evaluation methods evaluate from only a single dimension and do not screen the data-enhanced samples from multiple dimensions, so the overall quality of the selected samples is not high.
In order to solve the problem of the negative gain that harmful samples of a data enhancement strategy bring to a model, the embodiments of the present application provide a sample selection method, apparatus, device and medium. The clean samples and noise samples generated by the data enhancement strategy are classified, high-confidence samples are screened from the clean samples as high-quality samples, and the low-reliability samples (the noise samples and the low-confidence clean samples) are re-selected to recall high-quality low-confidence samples that supplement the high-confidence clean samples. This completes the screening of high-quality samples among the enhanced samples and increases the diversity of the data-enhanced samples, so that the model can learn more patterns, the performance of the model is improved, and the generalization of the model is further improved.
For example, referring to fig. 1, fig. 1 is a block diagram of a sample selection framework employed in a sample selection method, apparatus, device, and medium according to an embodiment of the present application.
As shown in fig. 1, the framework is composed of four modules: (1) a pre-training module; (2) a semi-supervised training module; (3) a sample selection module; and (4) a sample recall module. In the semi-supervised training module, the original labeled samples D_l(X, Y) are used to train a pre-training model (the figure shows an example in which the pre-training module selects RoBERTa). In the sample selection module, the trained pre-training model is used to predict the class probability distribution of the enhanced samples D_g(X, Y), the pseudo labels of the enhanced samples D_g(X, Y) are obtained from the class probability distribution, and the pseudo labels are compared with the real labels to classify the enhanced samples into clean samples D_clean and noise samples D_noisy. The prediction uncertainty of the clean samples D_clean in the pre-training model is then calculated through MC dropout, and an entropy-based strategy is adopted to select the samples the model is uncertain about for self-training, obtaining the high-confidence samples D_easy and the low-confidence samples D_hard. Finally, in the sample recall module, some high-quality, credible samples are automatically recalled from the low-reliability samples (D_hard and D_noisy) along the two dimensions of vocabulary similarity and semantic fluency.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flow chart of a sample selection method according to an embodiment of the present invention. The sample selection method of the present embodiment includes steps S11 to S15:
s11, obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; and the enhanced sample is a label-free sample generated after the original label sample data is enhanced.
For example, one original labeled sample may generate one or more enhanced samples through the data enhancement strategy; an enhanced sample may, for instance, drop some words of the original labeled sentence, or insert and reorder words within it.
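A minimal sketch of how such enhanced variants might be produced; the patent does not fix a particular enhancement strategy, so the EDA-style operations and the eda_variants helper below are purely illustrative assumptions:

```python
import random

def eda_variants(tokens, n_variants=2, p=0.1):
    """Hypothetical data-enhancement sketch: produce unlabeled variants of one
    labeled sentence by random word deletion and a random swap (EDA-style)."""
    variants = []
    for _ in range(n_variants):
        out = [w for w in tokens if random.random() > p] or tokens[:]  # random deletion
        if len(out) > 1:                                               # random swap
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
        variants.append(out)
    return variants

# one original labeled sample -> one or more unlabeled enhanced samples
print(eda_variants("i want to ask you a question".split()))
```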
S12, classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of the label of the original marked sample corresponding to the enhanced sample and the pseudo label of the enhanced sample.
Exemplarily, an enhanced sample whose pseudo label is the same as the label of its corresponding original labeled sample is taken as a clean sample, and an enhanced sample whose pseudo label differs from the label of its corresponding original labeled sample is taken as a noise sample; the clean samples are put into the clean sample set D_clean and the noise samples into the noise sample set D_noisy.
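A sketch of steps S11–S12, assuming a trained classifier that returns class logits and an encode() helper that produces its input; the function and variable names are illustrative, not taken from the patent:

```python
import torch

def split_clean_noisy(model, encode, enhanced, orig_labels):
    """For each enhanced sample: predict class probabilities with the trained
    pre-training model, take the argmax as the pseudo label, and compare it with
    the label of the corresponding original labeled sample."""
    d_clean, d_noisy = [], []
    model.eval()
    with torch.no_grad():
        for x_g, y_l in zip(enhanced, orig_labels):
            probs = torch.softmax(model(encode(x_g)), dim=-1)   # class probability distribution
            pseudo = int(probs.argmax(dim=-1))                  # pseudo label
            (d_clean if pseudo == y_l else d_noisy).append(x_g)
    return d_clean, d_noisy
```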
S13, introducing class probability distribution of the clean sample predicted by the pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence.
Specifically, this step quantifies the reliability of the samples generated by data enhancement, thereby solving the problem of dividing the clean samples into high-confidence and low-confidence samples during sample selection.
S14, confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the to-be-recalled samples comprise low confidence samples and noise samples.
Illustratively, because noise data exists in the data generated by data enhancement, the model gradually drifts under the influence of the noise data during the training of the pre-training model, which degrades its performance; the low-confidence samples likewise bring great uncertainty to the prediction results of the model. The low-confidence samples and the noise samples are therefore grouped together as low-reliability samples.
However, among the low-reliability samples there may still be high-quality samples: although the prediction confidence of such samples is low during the training of the pre-training model, they can in fact improve the training and performance of the pre-training model. Therefore, in the embodiment of the invention, the low-reliability samples are recalled from multiple dimensions, and whether a low-reliability sample is a high-quality sample is evaluated from the two angles of vocabulary similarity and semantic fluency.
And S15, taking the high-confidence sample and the recall sample as finally selected samples.
According to the technical scheme provided by the embodiment of the invention, an enhanced sample is obtained, the class probability distribution of the enhanced sample is predicted based on a trained pre-training model, and the pseudo label of the enhanced sample is obtained according to the class probability distribution of the enhanced sample; classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of a label of an original labeled sample corresponding to the enhanced sample and a pseudo label of the enhanced sample, introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence coefficient of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence coefficient; confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; and finally, taking the high confidence sample and the recall sample as finally selected samples. The embodiment can effectively screen out high-quality samples generated by the data enhancement strategy, and also can increase the diversity of the data enhancement samples by recalling the low-reliability samples, so that the model can learn more modes, the performance of the model is improved, and the generalization of the model is further improved.
In an optional embodiment, before predicting the class probability distribution of the enhanced sample based on the trained pre-trained model, the method further includes:
and acquiring the original marking sample, and training the pre-training model by adopting a semi-supervised method based on the original marking sample to obtain the trained pre-training model.
In this embodiment, a semi-supervised approach is used to train the pre-trained model to predict and filter the data enhancement generated samples. Semi-supervised methods may augment a labeled data set by using an unlabeled data set. In the implementation process, an original labeled sample with a label is used for training a pre-training model, the trained pre-training model is used for predicting an unlabeled enhanced sample, the enhanced sample with the highest prediction confidence coefficient is taken and a pseudo label is marked on the enhanced sample, and then the enhanced sample with the pseudo label is put into the current training sample for continuous training until the pre-training model is converged.
Illustratively, the semi-supervised approach employs self-training. In implementation, a self-training method is used to train the pre-training model to predict and filter the samples generated by data enhancement. Suppose D_l = {x_l, y_l} denotes the original labeled samples, where y_l is one of C class labels of the original labeled sample x_l, and x_l is a sequence consisting of n tokens; D_g = {x_g, y_g} denotes the enhanced samples generated by data enhancement. The framework is implemented as follows (a minimal sketch of this loop is given after the steps):
s1, using original labeled sample D l The pre-trained model is trained as a teacher network and cross-entropy is used as its loss function L.
S2, utilizing the teacher network to predict and enhance a sample D g Probability of the class label to which it belongs.
S3, screening the enhanced samples by using a plurality of screening strategies based on class label probability predicted by the teacher network to obtain a subset S of the enhanced samples g
S4, screening the sample S g With the original annotated sample D l Together for training a student network.
And S5, enabling the teacher network and the student network to be alternately trained continuously, namely regarding the current student network as the teacher network, and returning to the step S2 until the pre-training model is converged.
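A minimal sketch of the S1–S5 self-training loop under the above notation; train_one_model, predict_probs and screen are stand-in functions, and the fixed number of rounds is a simplification of "until the pre-training model converges":

```python
def self_training(d_l, d_g, rounds=3):
    """Teacher-student self-training sketch: the teacher scores the enhanced
    samples D_g, the student is trained on D_l plus the screened subset S_g,
    then the student becomes the next teacher (steps S1-S5)."""
    teacher = train_one_model(d_l)                            # S1: train teacher on labeled data
    for _ in range(rounds):                                   # stand-in for "until convergence"
        probs = [predict_probs(teacher, x) for x, _ in d_g]   # S2: predict class probabilities
        s_g = screen(d_g, probs)                              # S3: screening strategies -> subset S_g
        student = train_one_model(d_l + s_g)                  # S4: train student on D_l plus S_g
        teacher = student                                     # S5: student becomes the new teacher
    return teacher
```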
In an alternative embodiment, the pre-trained model is a pre-trained language model using a Bert structure.
Illustratively, the pre-training model employs RoBERTa. RoBERTa is a modified version of BERT that, by improving the training task and data generation, can use a larger vocabulary and train on large data sets with longer sequences and larger batches. The BERT model adopts two pre-training tasks, Next Sentence Prediction (NSP) and Masked Language Modeling (MLM); RoBERTa deletes the NSP task and keeps only the MLM task for pre-training.
For the classification problem, the first position of the input sentence is the starting symbol [CLS], and the final hidden state h_i corresponding to this token is usually used as the aggregate sequence representation for classification tasks. For each token in a given sentence, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
Based on this, through experimental comparison, the embodiment of the present invention uses RoBERTa as the pre-training model and as the encoder of language features: the first token ([CLS]) of the final hidden state is taken as the sentence code S of the sample to obtain its sentence vector representation. The output S_i of the pre-training model RoBERTa for each sample is:
S_i = RoBERTa(a_i, b_i, c_i)
where a_i, b_i, c_i are the token, segment, and position embeddings, respectively.
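A sketch of obtaining the sentence vector S_i with the Hugging Face transformers library; this is one possible realization, not the patent's own code, and RoBERTa's start token <s> plays the role of [CLS] while the tokenizer builds the token and position inputs itself:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def sentence_vector(text):
    """Return the hidden state of the first ([CLS]-like) token as the sentence code S_i."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape (1, hidden_size)

s_i = sentence_vector("a sample sentence to encode")
```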
In an optional implementation manner, the step S13 of "introducing class probability distributions of the clean sample predicted by a pre-training model after monte carlo sampling training under different model parameters, so as to obtain the confidence of the clean sample according to the class probability distributions under different model parameters" specifically includes:
and introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, and calculating the confidence coefficient of the clean sample by adopting an information entropy formula according to the class probability distribution under different model parameters.
In this embodiment, Monte Carlo dropout is used to sample different model parameters and obtain different outputs of the same clean sample from the same pre-training model. In the t-th dropout pass, the probability p_i^{t,c} that the pre-training model predicts the clean sample as class c is expressed as:
p_i^{t,c} = softmax(W_t · x + b)
where W_t denotes the parameter matrix at the t-th dropout pass, x is the sentence vector representation of the clean sample, T is the total number of dropout passes, and b is the bias.
According to the class probability distributions of the dropout passes, the credibility of the clean sample is quantified with information entropy, giving the confidence H(p_i) of the clean sample, i.e. how easily the pre-training model discriminates the clean sample:
p̄_i^c = (1/T) Σ_{t=1}^{T} p_i^{t,c}
H(p_i) = -Σ_{c=1}^{C} p̄_i^c · log p̄_i^c
where C represents the number of class labels; a small entropy indicates that the i-th clean sample is easy for the model to discriminate, and a large entropy indicates that it is difficult to discriminate.
After the confidences of all the clean samples are determined, the clean samples whose confidence is larger than a preset confidence threshold are classified into the low-confidence sample set D_hard, and the clean samples whose confidence is smaller than or equal to the threshold are classified into the high-confidence sample set D_easy. The confidence threshold can be determined experimentally; in the experiments, thresholds between 0.1 and 0.3 were tried, the results showed that 0.25 was optimal, and the confidence threshold is therefore set to 0.25.
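A minimal sketch of the MC-dropout confidence estimate and the D_easy/D_hard split, assuming a classifier whose dropout layers stay active in train() mode; the 0.25 threshold follows the experiment described above, while the names and T=10 are illustrative assumptions:

```python
import torch

def mc_dropout_entropy(classifier, x, T=10):
    """Run T stochastic forward passes (dropout active), average the class
    probabilities, and return the information entropy H(p_i)."""
    classifier.train()                        # keep dropout layers active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(classifier(x), dim=-1) for _ in range(T)])
    p_mean = probs.mean(dim=0)                # averaged class probability distribution
    return -(p_mean * torch.log(p_mean + 1e-12)).sum(dim=-1)

def split_by_confidence(clean_samples, classifier, encode, threshold=0.25):
    """Entropy above the threshold -> D_hard (low confidence), otherwise D_easy."""
    d_easy, d_hard = [], []
    for x in clean_samples:
        h = float(mc_dropout_entropy(classifier, encode(x)))
        (d_hard if h > threshold else d_easy).append(x)
    return d_easy, d_hard
```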
In an optional implementation manner, the predicting the class probability distribution of the enhanced sample based on the trained pre-training model, and obtaining the pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample specifically includes:
predicting the class probability distribution p(y | x_g; θ) of the enhanced sample based on the trained pre-training model and obtaining the pseudo label ŷ_g, where the class probability distribution and the pseudo label are calculated by the following formulas:
p(y | x_g; θ) = softmax(W · x_g + b)
ŷ_g = argmax_c p(y = c | x_g; θ)
where x_g is the sentence vector representation of the enhanced sample, θ denotes the parameters of the pre-training model, softmax() is the softmax function in the pre-training model, W is the parameter matrix, b is the bias, and argmax() is the argmax function in the pre-training model.
It can be seen that, in this alternative embodiment, the pseudo label of an enhanced sample is obtained through the prediction of the trained pre-training model and compared with the label of the corresponding original labeled sample; this comparison introduces a binary decision that filters out the enhanced samples whose predictions are wrong.
In an alternative embodiment, the vocabulary similarity threshold is obtained by:
calculating the vocabulary similarity between each high-confidence sample and the corresponding original labeling sample, and obtaining a vocabulary similarity threshold according to the vocabulary similarity between all the high-confidence samples and the corresponding original labeling samples;
and the vocabulary similarity is calculated by the following formula:
J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|
where J(x) represents the vocabulary similarity, and x_g and x_l represent the word sets of the enhanced sample and of its corresponding original labeled sample, respectively.
Specifically, the average value J_avg of the vocabulary similarities between all the high-confidence samples and their corresponding original labeled samples is calculated, and J_avg is taken as the vocabulary similarity threshold.
It can be seen that, in this alternative embodiment, the vocabulary similarity threshold is obtained by calculating the vocabulary similarities between all the high-confidence samples and their corresponding original labeled samples, and this threshold serves as one of the indicators for judging the quality of the low-reliability samples (which, as noted above, include the noise samples and the low-confidence samples). In addition, because the vocabulary similarity threshold is calculated from the high-confidence samples, no model normalization or probability modeling is needed, which reduces the corresponding workload.
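A sketch of the word-overlap similarity and the J_avg threshold; treating each sample as a set of whitespace-separated words is an assumption consistent with the formula above, and the function names are illustrative:

```python
def vocab_similarity(x_g, x_l):
    """Word-set overlap between an enhanced sample and its original labeled sample."""
    a, b = set(x_g.split()), set(x_l.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def vocab_similarity_threshold(high_conf_pairs):
    """J_avg: mean similarity over all (high-confidence sample, original sample) pairs."""
    sims = [vocab_similarity(x_g, x_l) for x_g, x_l in high_conf_pairs]
    return sum(sims) / len(sims)
```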
In an alternative embodiment, the semantic fluency threshold is obtained by:
calculating the difference between the confusion degree of the original labeling sample corresponding to the high-confidence sample and the confusion degree of the high-confidence sample to obtain the semantic fluency of the high-confidence sample;
and obtaining the semantic fluency threshold according to the semantic fluency of all the high-confidence-level samples.
Specifically, the semantic fluency can be obtained through a Statistical Language Model (SLM), which is very important in the grammatical error correction task. Therefore, the present embodiment proposes a metric to evaluate the semantic fluency of the enhanced sample. Specifically, assume that an original labeled sample x_l contains n words and that one of its enhanced samples x_g contains m words. A language model ResLSTM trained on the One-Billion corpus is first used to calculate the confusion degree (perplexity) PPL_g of the high-confidence sample and the confusion degree PPL_l of the original labeled sample corresponding to the high-confidence sample:
PPL_g = exp( -(1/m) Σ_{j=1}^{m} log p(w_{ij}^g | w_{i1}^g, ..., w_{i(j-1)}^g) )
PPL_l = exp( -(1/n) Σ_{j=1}^{n} log p(w_{ij}^l | w_{i1}^l, ..., w_{i(j-1)}^l) )
where w_{ij}^l represents the j-th word of the i-th sentence in the original labeled samples and w_{ij}^g represents the j-th word of the i-th sentence in the enhanced samples.
A difference operation between the confusion degree PPL_l of the original labeled sample corresponding to the high-confidence sample and the confusion degree PPL_g of the high-confidence sample gives the semantic fluency F_ik(x) of the high-confidence sample:
F_ik(x) = PPL_l - PPL_g
The average value F_avg of the semantic fluency F_ik(x) over all the high-confidence samples is calculated, and F_avg is taken as the semantic fluency threshold.
It can be seen that, in this alternative embodiment, the semantic fluency threshold is obtained by calculating the semantic fluency of all the high-confidence samples, and it serves as another indicator for evaluating the quality of the low-reliability samples (which, as noted above, include the noise samples and the low-confidence samples). In addition, because the semantic fluency threshold is calculated from the high-confidence samples, no model normalization or probability modeling is needed, which reduces the corresponding workload.
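A sketch of the fluency score as a perplexity difference and of the F_avg threshold; lm_perplexity stands in for the ResLSTM language-model scorer, which the patent names but whose interface is not given, so its signature here is an assumption:

```python
def semantic_fluency(x_g, x_l, lm_perplexity):
    """F(x) = PPL(original sample) - PPL(enhanced sample): larger values mean the
    enhanced sample reads at least as fluently as its original sample."""
    return lm_perplexity(x_l) - lm_perplexity(x_g)

def fluency_threshold(high_conf_pairs, lm_perplexity):
    """F_avg: mean fluency over all (high-confidence sample, original sample) pairs."""
    scores = [semantic_fluency(x_g, x_l, lm_perplexity) for x_g, x_l in high_conf_pairs]
    return sum(scores) / len(scores)
```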
In an optional embodiment, step S14 of confirming the recall sample according to the comparison results between the vocabulary similarity of the sample to be recalled and the set vocabulary similarity threshold and between the semantic fluency of the sample to be recalled and the set semantic fluency threshold specifically includes:
and selecting a to-be-recalled sample with the vocabulary similarity larger than the vocabulary similarity threshold and the semantic fluency larger than the semantic fluency threshold as the recall sample.
In the embodiment, the to-be-recalled sample with the vocabulary similarity larger than the vocabulary similarity threshold and the semantic fluency larger than the semantic fluency threshold is selected as the high-quality sample to be recalled, so that the high-quality sample with low credibility is supplemented on the basis of the preliminarily screened high-quality sample (high-confidence sample), and the diversity of the data enhancement sample is enhanced.
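Putting the two thresholds together, a sketch of the recall step S14 and the final selection S15; the function names reuse the sketches above and are illustrative, not the patent's own API:

```python
def recall_samples(to_recall_pairs, j_thr, f_thr, lm_perplexity):
    """Keep a low-reliability sample only if it clears both thresholds."""
    return [x_g for x_g, x_l in to_recall_pairs
            if vocab_similarity(x_g, x_l) > j_thr
            and semantic_fluency(x_g, x_l, lm_perplexity) > f_thr]

# final selected samples = high-confidence samples + recalled samples
# selected = d_easy + recall_samples(d_hard_and_noisy_pairs, j_avg, f_avg, lm_perplexity)
```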
Correspondingly, an embodiment of the present invention further provides a sample selection apparatus, including:
the pseudo label obtaining module is used for obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
the first classification module is used for classifying the enhanced samples into clean samples and noise samples according to the comparison result of the labels and the pseudo labels of the original marked samples corresponding to the enhanced samples;
the second classification module is used for introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence;
the recall module is used for confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
a selection module to take the high confidence sample and the recall sample as final selected samples.
In an optional embodiment, before predicting the class probability distribution of the enhanced sample based on the trained pre-trained model, the method further includes:
and acquiring the original marking sample, and training the pre-training model by adopting a semi-supervised method based on the original marking sample to obtain the trained pre-training model.
In an alternative embodiment, the vocabulary similarity threshold is obtained by:
calculating the vocabulary similarity between each high-confidence sample and the corresponding original labeling sample, and obtaining a vocabulary similarity threshold according to the vocabulary similarity between all the high-confidence samples and the corresponding original labeling samples;
and the vocabulary similarity is calculated by the following formula:
J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|
where J(x) represents the vocabulary similarity, and x_g and x_l represent the word sets of the enhanced sample and of its corresponding original labeled sample, respectively.
In an alternative embodiment, the semantic fluency threshold is obtained by:
calculating the difference between the confusion degree of the original labeling sample corresponding to the high-confidence sample and the confusion degree of the high-confidence sample to obtain the semantic fluency of the high-confidence sample;
and obtaining the semantic fluency threshold according to the semantic fluency of all the high-confidence-degree samples.
In an optional implementation manner, the introducing a class probability distribution of the clean sample predicted by a pre-training model after monte carlo sampling training under different model parameters to obtain a confidence of the clean sample according to the class probability distribution under different model parameters specifically includes:
class probability distribution of the clean sample under different model parameters is predicted by a pre-training model after Monte Carlo sampling training, and the confidence coefficient of the clean sample is calculated by adopting an information entropy formula according to the class probability distribution under different model parameters.
In an optional implementation manner, the confirming the recall sample according to the comparison result between the vocabulary similarity of the to-be-recalled sample and the set vocabulary similarity threshold and between the semantic fluency of the to-be-recalled sample and the set semantic fluency threshold specifically includes:
and selecting a to-be-recalled sample with the vocabulary similarity larger than the vocabulary similarity threshold and the semantic fluency larger than the semantic fluency threshold as the recall sample.
In an alternative embodiment, the pre-trained model is a pre-trained language model using a Bert structure.
It should be noted that the sample selection apparatus provided in the embodiment of the present invention is used for executing all the steps and processes of the sample selection method provided in the above embodiment, and the working principles and beneficial effects of the two are in one-to-one correspondence, and redundant description is not repeated here.
Accordingly, an embodiment of the present invention further provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the sample selection method provided in the foregoing embodiment, for example, S11 to S15 in fig. 2.
Accordingly, an embodiment of the present invention further provides a storage medium, where the storage medium includes a stored computer program, where when the computer program runs, the apparatus on which the storage medium is located is controlled to execute the sample selection method provided in the foregoing embodiment, for example, S11 to S15 in fig. 2.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method of sample selection, comprising:
acquiring an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of the label of the original labeled sample corresponding to the enhanced sample and the pseudo label of the enhanced sample;
introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters to obtain confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence;
confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
taking the high confidence sample and the recall sample as final selected samples.
2. The sample selection method of claim 1, further comprising, prior to the predicting the class probability distribution of the enhanced sample based on the trained pre-trained model:
and acquiring the original marking sample, and training the pre-training model by adopting a semi-supervised method based on the original marking sample to obtain the trained pre-training model.
3. The sample selection method of claim 1, wherein the lexical similarity threshold is obtained by:
calculating the vocabulary similarity between each high-confidence sample and the corresponding original labeling sample, and obtaining a vocabulary similarity threshold according to the vocabulary similarity between all the high-confidence samples and the corresponding original labeling samples;
and the vocabulary similarity is calculated by the following formula:
J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|
where J(x) represents the vocabulary similarity, and x_g and x_l represent the word sets of the enhanced sample and of its corresponding original labeled sample, respectively.
4. The sample selection method of claim 1, wherein the semantic fluency threshold is obtained by:
calculating the difference between the confusion degree of the original labeling sample corresponding to the high-confidence sample and the confusion degree of the high-confidence sample to obtain the semantic fluency of the high-confidence sample;
and obtaining the semantic fluency threshold according to the semantic fluency of all the high-confidence-degree samples.
5. The sample selection method according to claim 1, wherein the introducing of the class probability distribution of the clean sample predicted by the pre-training model after the monte carlo sampling training under different model parameters to obtain the confidence of the clean sample according to the class probability distribution under different model parameters specifically comprises:
class probability distribution of the clean sample under different model parameters is predicted by a pre-training model after Monte Carlo sampling training, and the confidence coefficient of the clean sample is calculated by adopting an information entropy formula according to the class probability distribution under different model parameters.
6. The sample selection method according to claim 1, wherein the identifying the recall sample according to the comparison result between the vocabulary similarity of the sample to be recalled and the set vocabulary similarity threshold and between the semantic fluency of the sample to be recalled and the set semantic fluency threshold comprises:
and selecting a sample to be recalled as the recall sample, wherein the vocabulary similarity is greater than the vocabulary similarity threshold and the semantic fluency is greater than the semantic fluency threshold.
7. The sample selection method as recited in claim 1, wherein the pre-trained model is a pre-trained language model employing a Bert structure.
8. A sample selection device, comprising:
the pseudo label obtaining module is used for obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
the first classification module is used for classifying the enhanced samples into clean samples and noise samples according to the comparison result of the labels of the original marked samples corresponding to the enhanced samples and the pseudo labels of the enhanced samples;
the second classification module is used for introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence;
the recall module is used for confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
a selection module to take the high confidence sample and the recall sample as final selected samples.
9. A terminal device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor when executing the computer program implementing the sample selection method of any of claims 1 to 7.
10. A storage medium comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the storage medium is located to perform the sample selection method of any one of claims 1 to 7.
CN202211606921.3A 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium Pending CN115840884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211606921.3A CN115840884A (en) 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211606921.3A CN115840884A (en) 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115840884A true CN115840884A (en) 2023-03-24

Family

ID=85578613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211606921.3A Pending CN115840884A (en) 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115840884A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341561A (en) * 2023-03-27 2023-06-27 京东科技信息技术有限公司 Voice sample data generation method, device, equipment and storage medium
CN116341561B (en) * 2023-03-27 2024-02-02 京东科技信息技术有限公司 Voice sample data generation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination