CN115840884A - Sample selection method, device, equipment and medium - Google Patents

Sample selection method, device, equipment and medium

Info

Publication number
CN115840884A
Authority
CN
China
Prior art keywords
sample
samples
confidence
enhanced
clean
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211606921.3A
Other languages
Chinese (zh)
Inventor
蒋盛益
林晓钿
林楠铠
付颖雯
杨子渝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies filed Critical Guangdong University of Foreign Studies
Priority to CN202211606921.3A priority Critical patent/CN115840884A/en
Publication of CN115840884A publication Critical patent/CN115840884A/en
Pending legal-status Critical Current

Abstract

The invention discloses a sample selection method, device, equipment and medium. The method classifies the clean samples and noise samples generated by a data enhancement strategy, screens the high-confidence clean samples as high-quality samples, and then re-selects from the low-reliability samples (the noise samples and the low-confidence clean samples) to recall high-quality low-confidence samples that supplement the high-confidence clean samples, thereby completing the screening of high-quality samples among the enhanced samples. The method not only effectively screens the high-quality samples generated by data enhancement, but also increases the diversity of the data-enhanced samples, so that the model can learn more patterns; this improves the performance of the model and further improves its generalization. Correspondingly, the invention also provides a sample selection device, equipment and medium.

Description

Sample selection method, device, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a sample selection method, a sample selection device, sample selection equipment and a sample selection medium.
Background
Text classification is the basis of many Natural Language Processing (NLP) tasks and is widely applied in fields such as sentiment analysis and intelligent question answering. Generally, training a classifier with strong generalization capability requires a large amount of labeled data, but building a large corpus demands high manual labeling cost and a large amount of time and effort, which are often difficult to bear. To solve this problem, Data Augmentation strategies have been proposed: data augmentation can greatly increase the data volume and alleviate data scarcity, thereby improving the generalization capability of the model. However, in the field of natural language processing, data enhancement faces great challenges. Besides the discontinuity of text data, a major reason is that language itself has weak anti-interference capability: randomly modifying language data is likely to destroy its semantics and produce low-quality samples, which greatly affect the judgment of a classifier and thus have a negative feedback effect on the model.
Therefore, how to select high quality samples from the data enhancement samples is important.
Disclosure of Invention
Various aspects of embodiments of the present invention provide a sample selection method, apparatus, device, and medium, which can effectively screen out the high-quality samples generated by a data enhancement strategy.
A first aspect of an embodiment of the present invention provides a sample selection method, including:
acquiring an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of the label of the original labeled sample corresponding to the enhanced sample and the pseudo label of the enhanced sample;
introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters to obtain confidence coefficient of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence coefficient;
confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
taking the high confidence sample and the recall sample as final selected samples.
A second aspect of the embodiments of the present invention provides a sample selection apparatus, including:
the pseudo label obtaining module is used for obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
the first classification module is used for classifying the enhanced samples into clean samples and noise samples according to the comparison result of the labels of the original marked samples corresponding to the enhanced samples and the pseudo labels of the enhanced samples;
the second classification module is used for introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence coefficient of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence coefficient;
the recall module is used for confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
a selection module to take the high confidence sample and the recall sample as final selected samples.
A third aspect of the embodiments of the present invention provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor, when executing the computer program, implements the sample selection method provided in the foregoing embodiments.
A fourth aspect of the embodiments of the present invention provides a storage medium, where the storage medium includes a stored computer program, and when the computer program runs, a device on which the storage medium is located is controlled to execute the sample selection method provided in the foregoing embodiments.
Compared with the prior art, the sample selection method provided by the embodiment of the invention classifies the clean samples and noise samples generated by the data enhancement strategy, screens the high-confidence samples from the clean samples as high-quality samples, and then re-selects from the low-reliability samples (the noise samples and the low-confidence clean samples) to recall high-quality low-confidence samples that supplement the high-confidence clean samples, thereby completing the screening of high-quality samples among the enhanced samples. The embodiment can effectively screen out the high-quality samples generated by the data enhancement strategy, and recalling the low-reliability samples increases the diversity of the data-enhanced samples, so that the model can learn more patterns, the performance of the model is improved, and the generalization of the model is further improved. Correspondingly, the embodiment of the invention also provides a sample selection device, equipment and medium.
Drawings
FIG. 1 is a block diagram of a sample selection framework provided by an embodiment of the present invention;
fig. 2 is a schematic flow chart of a sample selection method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data enhancement strategy is very popular in the research field of solving the problem of data shortage, and refers to a process of generating new data by training data through certain transformation operation, so that limited data can generate a value equivalent to more data, and the generalization capability and robustness of the model are improved. Data enhancement strategies are also widely used in the field of natural language processing. However, in the field of natural language processing, data enhancement faces a great challenge, and besides the discontinuity of text data, the great reason is that the language itself has weak anti-interference capability, and the random modification of language data is likely to destroy the semantics thereof, generate low-quality samples, greatly affect the judgment of a classifier, and thus generate negative feedback effect on a model.
Although data enhancement can alleviate the data shortage problem of model training to some extent, the enhanced samples with low quality may have negative feedback effect on the model. Therefore, it is important to filter the noise of the enhanced samples, i.e., sample selection. Currently, there are related studies for evaluating the quality of a sample, such as data evaluation based on classifier discrimination and data evaluation based on text similarity.
The data evaluation method based on classifier discrimination trains a text classification model with labeled data and then uses the text classifier to classify or predict the unlabeled data. However, using the classifier alone to evaluate the generated data causes the distribution of the screened data to fit the classifier, i.e. to maintain the original data distribution; the diversity of the screened samples is therefore not high and the performance of the classifier cannot be improved.
The selection method based on text similarity is not technically difficult to realize; text similarity is generally judged by calculating a text distance. Natural language processing often involves the problem of how to measure the similarity of two texts, and measuring the similarity between sentences or phrases is particularly important in tasks such as dialogue systems and information retrieval. In this approach, the samples generated by the model are screened through the text coverage between the original samples and the generated samples. However, this strategy only considers the similarity between the generated sample and the original sample at the lexical level of the text and lacks information at the semantic level.
The above sample evaluation methods evaluate from only a single dimension and do not screen the data-enhanced samples from multiple dimensions, so the overall quality of the selected samples is not high.
In order to solve the problem of the negative gain that harmful samples of a data enhancement strategy bring to a model, the embodiments of the present application provide a sample selection method, apparatus, device and medium. The clean samples and noise samples generated by the data enhancement strategy are classified, high-confidence samples are screened from the clean samples as high-quality samples, and the low-reliability samples (the noise samples and the low-confidence clean samples) are re-selected to recall high-quality low-confidence samples that supplement the high-confidence clean samples. This completes the screening of high-quality samples among the enhanced samples and increases the diversity of the data-enhanced samples, so that the model can learn more patterns, the performance of the model is improved, and the generalization of the model is further improved.
For example, referring to fig. 1, fig. 1 is a block diagram of a sample selection framework employed in a sample selection method, apparatus, device, and medium according to an embodiment of the present application.
As shown in fig. 1, the framework is composed of four modules: (1) a pre-training module; (2) a semi-supervised training module; (3) a sample selection module; and (4) a sample recall module. In the semi-supervised training module, the original labeled samples D_l(X, Y) are used to train a pre-training model (the figure shows an example in which the pre-training module selects RoBERTa). In the sample selection module, the trained pre-training model is used to predict the class probability distribution of the enhanced samples D_g(X, Y), the pseudo labels of the enhanced samples D_g(X, Y) are obtained from the class probability distribution, and the pseudo labels are compared with the real labels to classify the enhanced samples into clean samples D_clean and noise samples D_noisy. The prediction uncertainty of the clean samples D_clean in the pre-training model is then calculated through MC dropout, and an entropy-based strategy is adopted to select the samples the model is uncertain about for self-training, obtaining the high-confidence samples D_easy and the low-confidence samples D_hard. Finally, in the sample recall module, some high-quality, credible samples are automatically recalled from the low-reliability samples (D_hard and D_noisy) along the two dimensions of vocabulary similarity and semantic fluency.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flow chart of a sample selection method according to an embodiment of the present invention. The sample selection method of the present embodiment includes steps S11 to S15:
s11, obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; and the enhanced sample is a label-free sample generated after the original label sample data is enhanced.
For example, one original labeled sample may generate one or more enhanced samples through the data enhancement strategy; an enhanced sample may, for instance, drop some words of the original labeled sentence, or insert and reorder words within it.
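A minimal sketch of how such enhanced variants might be produced; the patent does not fix a particular enhancement strategy, so the EDA-style operations and the eda_variants helper below are purely illustrative assumptions:

```python
import random

def eda_variants(tokens, n_variants=2, p=0.1):
    """Hypothetical data-enhancement sketch: produce unlabeled variants of one
    labeled sentence by random word deletion and a random swap (EDA-style)."""
    variants = []
    for _ in range(n_variants):
        out = [w for w in tokens if random.random() > p] or tokens[:]  # random deletion
        if len(out) > 1:                                               # random swap
            i, j = random.sample(range(len(out)), 2)
            out[i], out[j] = out[j], out[i]
        variants.append(out)
    return variants

# one original labeled sample -> one or more unlabeled enhanced samples
print(eda_variants("i want to ask you a question".split()))
```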
S12, classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of the label of the original marked sample corresponding to the enhanced sample and the pseudo label of the enhanced sample.
Exemplarily, an enhanced sample whose pseudo label is the same as the label of its corresponding original labeled sample is taken as a clean sample, and an enhanced sample whose pseudo label differs from the label of its corresponding original labeled sample is taken as a noise sample; the clean samples are put into the clean sample set D_clean and the noise samples into the noise sample set D_noisy.
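A sketch of steps S11–S12, assuming a trained classifier that returns class logits and an encode() helper that produces its input; the function and variable names are illustrative, not taken from the patent:

```python
import torch

def split_clean_noisy(model, encode, enhanced, orig_labels):
    """For each enhanced sample: predict class probabilities with the trained
    pre-training model, take the argmax as the pseudo label, and compare it with
    the label of the corresponding original labeled sample."""
    d_clean, d_noisy = [], []
    model.eval()
    with torch.no_grad():
        for x_g, y_l in zip(enhanced, orig_labels):
            probs = torch.softmax(model(encode(x_g)), dim=-1)   # class probability distribution
            pseudo = int(probs.argmax(dim=-1))                  # pseudo label
            (d_clean if pseudo == y_l else d_noisy).append(x_g)
    return d_clean, d_noisy
```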
S13, introducing class probability distribution of the clean sample predicted by the pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence.
Specifically, this step quantifies the reliability of the samples generated by data enhancement, thereby solving the problem of dividing the clean samples into high-confidence and low-confidence samples during sample selection.
S14, confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the to-be-recalled samples comprise low confidence samples and noise samples.
Illustratively, because noise data exists in the data generated by data enhancement, the model gradually drifts under the influence of the noise data during the training of the pre-training model, which degrades its performance; the low-confidence samples likewise bring great uncertainty to the prediction results of the model. The low-confidence samples and the noise samples are therefore grouped together as low-reliability samples.
However, among the low-reliability samples there may still be high-quality samples: although the prediction confidence of such samples is low during the training of the pre-training model, they can in fact improve the training and performance of the pre-training model. Therefore, in the embodiment of the invention, the low-reliability samples are recalled from multiple dimensions, and whether a low-reliability sample is a high-quality sample is evaluated from the two angles of vocabulary similarity and semantic fluency.
And S15, taking the high-confidence sample and the recall sample as finally selected samples.
According to the technical scheme provided by the embodiment of the invention, an enhanced sample is obtained, the class probability distribution of the enhanced sample is predicted based on a trained pre-training model, and the pseudo label of the enhanced sample is obtained according to the class probability distribution of the enhanced sample; classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of a label of an original labeled sample corresponding to the enhanced sample and a pseudo label of the enhanced sample, introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence coefficient of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence coefficient; confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; and finally, taking the high confidence sample and the recall sample as finally selected samples. The embodiment can effectively screen out high-quality samples generated by the data enhancement strategy, and also can increase the diversity of the data enhancement samples by recalling the low-reliability samples, so that the model can learn more modes, the performance of the model is improved, and the generalization of the model is further improved.
In an optional embodiment, before predicting the class probability distribution of the enhanced sample based on the trained pre-trained model, the method further includes:
and acquiring the original marking sample, and training the pre-training model by adopting a semi-supervised method based on the original marking sample to obtain the trained pre-training model.
In this embodiment, a semi-supervised approach is used to train the pre-trained model to predict and filter the data enhancement generated samples. Semi-supervised methods may augment a labeled data set by using an unlabeled data set. In the implementation process, an original labeled sample with a label is used for training a pre-training model, the trained pre-training model is used for predicting an unlabeled enhanced sample, the enhanced sample with the highest prediction confidence coefficient is taken and a pseudo label is marked on the enhanced sample, and then the enhanced sample with the pseudo label is put into the current training sample for continuous training until the pre-training model is converged.
Illustratively, the semi-supervised approach employs self-training. In implementation, a self-training method is used to train the pre-training model to predict and filter the samples generated by data enhancement. Suppose D_l = {x_l, y_l} denotes the original labeled samples, where y_l is one of C class labels of the original labeled sample x_l, and x_l is a sequence consisting of n tokens; D_g = {x_g, y_g} denotes the enhanced samples generated by data enhancement. The framework is implemented as follows (a minimal sketch of this loop is given after the steps):
s1, using original labeled sample D l The pre-trained model is trained as a teacher network and cross-entropy is used as its loss function L.
S2, utilizing the teacher network to predict and enhance a sample D g Probability of the class label to which it belongs.
S3, screening the enhanced samples by using a plurality of screening strategies based on class label probability predicted by the teacher network to obtain a subset S of the enhanced samples g
S4, screening the sample S g With the original annotated sample D l Together for training a student network.
And S5, enabling the teacher network and the student network to be alternately trained continuously, namely regarding the current student network as the teacher network, and returning to the step S2 until the pre-training model is converged.
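A minimal sketch of the S1–S5 self-training loop under the above notation; train_one_model, predict_probs and screen are stand-in functions, and the fixed number of rounds is a simplification of "until the pre-training model converges":

```python
def self_training(d_l, d_g, rounds=3):
    """Teacher-student self-training sketch: the teacher scores the enhanced
    samples D_g, the student is trained on D_l plus the screened subset S_g,
    then the student becomes the next teacher (steps S1-S5)."""
    teacher = train_one_model(d_l)                            # S1: train teacher on labeled data
    for _ in range(rounds):                                   # stand-in for "until convergence"
        probs = [predict_probs(teacher, x) for x, _ in d_g]   # S2: predict class probabilities
        s_g = screen(d_g, probs)                              # S3: screening strategies -> subset S_g
        student = train_one_model(d_l + s_g)                  # S4: train student on D_l plus S_g
        teacher = student                                     # S5: student becomes the new teacher
    return teacher
```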
In an alternative embodiment, the pre-trained model is a pre-trained language model using a Bert structure.
Illustratively, the pre-training model employs RoBERTa. RoBERTa is a modified version of BERT that, by improving the training task and data generation, can use a larger vocabulary and train on large data sets with longer sequences and larger batches. The BERT model adopts two pre-training tasks, Next Sentence Prediction (NSP) and Masked Language Modeling (MLM); RoBERTa deletes the NSP task and keeps only the MLM task for pre-training.
For the classification problem, the first position of the input sentence is the starting symbol [CLS], and the final hidden state h_i corresponding to this token is usually used as the aggregate sequence representation for classification tasks. For each token in a given sentence, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
Based on this, through experimental comparison, the embodiment of the present invention uses RoBERTa as the pre-training model and as the encoder of language features: the first token ([CLS]) of the final hidden state is taken as the sentence code S of the sample to obtain its sentence vector representation. The output S_i of the pre-training model RoBERTa for each sample is:
S_i = RoBERTa(a_i, b_i, c_i)
where a_i, b_i, c_i are the token, segment, and position embeddings, respectively.
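A sketch of obtaining the sentence vector S_i with the Hugging Face transformers library; this is one possible realization, not the patent's own code, and RoBERTa's start token <s> plays the role of [CLS] while the tokenizer builds the token and position inputs itself:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def sentence_vector(text):
    """Return the hidden state of the first ([CLS]-like) token as the sentence code S_i."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # shape (1, hidden_size)

s_i = sentence_vector("a sample sentence to encode")
```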
In an optional implementation manner, the step S13 of "introducing class probability distributions of the clean sample predicted by a pre-training model after monte carlo sampling training under different model parameters, so as to obtain the confidence of the clean sample according to the class probability distributions under different model parameters" specifically includes:
and introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, and calculating the confidence coefficient of the clean sample by adopting an information entropy formula according to the class probability distribution under different model parameters.
In this embodiment, Monte Carlo dropout is used to sample different model parameters and obtain different outputs of the same clean sample from the same pre-training model. In the t-th dropout pass, the probability p_i^{t,c} that the pre-training model predicts the clean sample as class c is expressed as:
p_i^{t,c} = softmax(W_t · x + b)
where W_t denotes the parameter matrix at the t-th dropout pass, x is the sentence vector representation of the clean sample, T is the total number of dropout passes, and b is the bias.
According to the class probability distributions of the dropout passes, the credibility of the clean sample is quantified with information entropy, giving the confidence H(p_i) of the clean sample, i.e. how easily the pre-training model discriminates the clean sample:
p̄_i^c = (1/T) Σ_{t=1}^{T} p_i^{t,c}
H(p_i) = -Σ_{c=1}^{C} p̄_i^c · log p̄_i^c
where C represents the number of class labels; a small entropy indicates that the i-th clean sample is easy for the model to discriminate, and a large entropy indicates that it is difficult to discriminate.
After the confidences of all the clean samples are determined, the clean samples whose confidence is larger than a preset confidence threshold are classified into the low-confidence sample set D_hard, and the clean samples whose confidence is smaller than or equal to the threshold are classified into the high-confidence sample set D_easy. The confidence threshold can be determined experimentally; in the experiments, thresholds between 0.1 and 0.3 were tried, the results showed that 0.25 was optimal, and the confidence threshold is therefore set to 0.25.
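A minimal sketch of the MC-dropout confidence estimate and the D_easy/D_hard split, assuming a classifier whose dropout layers stay active in train() mode; the 0.25 threshold follows the experiment described above, while the names and T=10 are illustrative assumptions:

```python
import torch

def mc_dropout_entropy(classifier, x, T=10):
    """Run T stochastic forward passes (dropout active), average the class
    probabilities, and return the information entropy H(p_i)."""
    classifier.train()                        # keep dropout layers active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(classifier(x), dim=-1) for _ in range(T)])
    p_mean = probs.mean(dim=0)                # averaged class probability distribution
    return -(p_mean * torch.log(p_mean + 1e-12)).sum(dim=-1)

def split_by_confidence(clean_samples, classifier, encode, threshold=0.25):
    """Entropy above the threshold -> D_hard (low confidence), otherwise D_easy."""
    d_easy, d_hard = [], []
    for x in clean_samples:
        h = float(mc_dropout_entropy(classifier, encode(x)))
        (d_hard if h > threshold else d_easy).append(x)
    return d_easy, d_hard
```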
In an optional implementation manner, the predicting the class probability distribution of the enhanced sample based on the trained pre-training model, and obtaining the pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample specifically includes:
predicting the class probability distribution p(y | x_g; θ) of the enhanced sample based on the trained pre-training model and obtaining the pseudo label ŷ_g, where the class probability distribution and the pseudo label are calculated by the following formulas:
p(y | x_g; θ) = softmax(W · x_g + b)
ŷ_g = argmax_c p(y = c | x_g; θ)
where x_g is the sentence vector representation of the enhanced sample, θ denotes the parameters of the pre-training model, softmax() is the softmax function in the pre-training model, W is the parameter matrix, b is the bias, and argmax() is the argmax function in the pre-training model.
It can be seen that, in this alternative embodiment, the pseudo label of an enhanced sample is obtained through the prediction of the trained pre-training model and compared with the label of the corresponding original labeled sample; this comparison introduces a binary decision that filters out the enhanced samples whose predictions are wrong.
In an alternative embodiment, the vocabulary similarity threshold is obtained by:
calculating the vocabulary similarity between each high-confidence sample and the corresponding original labeling sample, and obtaining a vocabulary similarity threshold according to the vocabulary similarity between all the high-confidence samples and the corresponding original labeling samples;
and the vocabulary similarity is calculated by the following formula:
J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|
where J(x) represents the vocabulary similarity, and x_g and x_l represent the word sets of the enhanced sample and of its corresponding original labeled sample, respectively.
Specifically, the average value J_avg of the vocabulary similarities between all the high-confidence samples and their corresponding original labeled samples is calculated, and J_avg is taken as the vocabulary similarity threshold.
It can be seen that, in this alternative embodiment, the vocabulary similarity threshold is obtained by calculating the vocabulary similarities between all the high-confidence samples and their corresponding original labeled samples, and this threshold serves as one of the indicators for judging the quality of the low-reliability samples (which, as noted above, include the noise samples and the low-confidence samples). In addition, because the vocabulary similarity threshold is calculated from the high-confidence samples, no model normalization or probability modeling is needed, which reduces the corresponding workload.
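A sketch of the word-overlap similarity and the J_avg threshold; treating each sample as a set of whitespace-separated words is an assumption consistent with the formula above, and the function names are illustrative:

```python
def vocab_similarity(x_g, x_l):
    """Word-set overlap between an enhanced sample and its original labeled sample."""
    a, b = set(x_g.split()), set(x_l.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def vocab_similarity_threshold(high_conf_pairs):
    """J_avg: mean similarity over all (high-confidence sample, original sample) pairs."""
    sims = [vocab_similarity(x_g, x_l) for x_g, x_l in high_conf_pairs]
    return sum(sims) / len(sims)
```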
In an alternative embodiment, the semantic fluency threshold is obtained by:
calculating the difference between the confusion degree of the original labeling sample corresponding to the high-confidence sample and the confusion degree of the high-confidence sample to obtain the semantic fluency of the high-confidence sample;
and obtaining the semantic fluency threshold according to the semantic fluency of all the high-confidence-level samples.
Specifically, the semantic fluency can be obtained through a Statistical Language Model (SLM), which is very important in the grammatical error correction task. Therefore, the present embodiment proposes a metric to evaluate the semantic fluency of the enhanced sample. Specifically, assume that an original labeled sample x_l contains n words and that one of its enhanced samples x_g contains m words. A language model ResLSTM trained on the One-Billion corpus is first used to calculate the confusion degree (perplexity) PPL_g of the high-confidence sample and the confusion degree PPL_l of the original labeled sample corresponding to the high-confidence sample:
PPL_g = exp( -(1/m) Σ_{j=1}^{m} log p(w_{ij}^g | w_{i1}^g, ..., w_{i(j-1)}^g) )
PPL_l = exp( -(1/n) Σ_{j=1}^{n} log p(w_{ij}^l | w_{i1}^l, ..., w_{i(j-1)}^l) )
where w_{ij}^l represents the j-th word of the i-th sentence in the original labeled samples and w_{ij}^g represents the j-th word of the i-th sentence in the enhanced samples.
A difference operation between the confusion degree PPL_l of the original labeled sample corresponding to the high-confidence sample and the confusion degree PPL_g of the high-confidence sample gives the semantic fluency F_ik(x) of the high-confidence sample:
F_ik(x) = PPL_l - PPL_g
The average value F_avg of the semantic fluency F_ik(x) over all the high-confidence samples is calculated, and F_avg is taken as the semantic fluency threshold.
It can be seen that, in this alternative embodiment, the semantic fluency threshold is obtained by calculating the semantic fluency of all the high-confidence samples, and it serves as another indicator for evaluating the quality of the low-reliability samples (which, as noted above, include the noise samples and the low-confidence samples). In addition, because the semantic fluency threshold is calculated from the high-confidence samples, no model normalization or probability modeling is needed, which reduces the corresponding workload.
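A sketch of the fluency score as a perplexity difference and of the F_avg threshold; lm_perplexity stands in for the ResLSTM language-model scorer, which the patent names but whose interface is not given, so its signature here is an assumption:

```python
def semantic_fluency(x_g, x_l, lm_perplexity):
    """F(x) = PPL(original sample) - PPL(enhanced sample): larger values mean the
    enhanced sample reads at least as fluently as its original sample."""
    return lm_perplexity(x_l) - lm_perplexity(x_g)

def fluency_threshold(high_conf_pairs, lm_perplexity):
    """F_avg: mean fluency over all (high-confidence sample, original sample) pairs."""
    scores = [semantic_fluency(x_g, x_l, lm_perplexity) for x_g, x_l in high_conf_pairs]
    return sum(scores) / len(scores)
```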
In an optional embodiment, step S14 of confirming the recall sample according to the comparison results between the vocabulary similarity of the sample to be recalled and the set vocabulary similarity threshold and between the semantic fluency of the sample to be recalled and the set semantic fluency threshold specifically includes:
and selecting a to-be-recalled sample with the vocabulary similarity larger than the vocabulary similarity threshold and the semantic fluency larger than the semantic fluency threshold as the recall sample.
In the embodiment, the to-be-recalled sample with the vocabulary similarity larger than the vocabulary similarity threshold and the semantic fluency larger than the semantic fluency threshold is selected as the high-quality sample to be recalled, so that the high-quality sample with low credibility is supplemented on the basis of the preliminarily screened high-quality sample (high-confidence sample), and the diversity of the data enhancement sample is enhanced.
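Putting the two thresholds together, a sketch of the recall step S14 and the final selection S15; the function names reuse the sketches above and are illustrative, not the patent's own API:

```python
def recall_samples(to_recall_pairs, j_thr, f_thr, lm_perplexity):
    """Keep a low-reliability sample only if it clears both thresholds."""
    return [x_g for x_g, x_l in to_recall_pairs
            if vocab_similarity(x_g, x_l) > j_thr
            and semantic_fluency(x_g, x_l, lm_perplexity) > f_thr]

# final selected samples = high-confidence samples + recalled samples
# selected = d_easy + recall_samples(d_hard_and_noisy_pairs, j_avg, f_avg, lm_perplexity)
```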
Correspondingly, an embodiment of the present invention further provides a sample selection apparatus, including:
the pseudo label obtaining module is used for obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
the first classification module is used for classifying the enhanced samples into clean samples and noise samples according to the comparison result of the labels and the pseudo labels of the original marked samples corresponding to the enhanced samples;
the second classification module is used for introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence;
the recall module is used for confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
a selection module to take the high confidence sample and the recall sample as final selected samples.
In an optional embodiment, before predicting the class probability distribution of the enhanced sample based on the trained pre-trained model, the method further includes:
and acquiring the original marking sample, and training the pre-training model by adopting a semi-supervised method based on the original marking sample to obtain the trained pre-training model.
In an alternative embodiment, the vocabulary similarity threshold is obtained by:
calculating the vocabulary similarity between each high-confidence sample and the corresponding original labeling sample, and obtaining a vocabulary similarity threshold according to the vocabulary similarity between all the high-confidence samples and the corresponding original labeling samples;
and the vocabulary similarity is calculated by the following formula:
J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|
where J(x) represents the vocabulary similarity, and x_g and x_l represent the word sets of the enhanced sample and of its corresponding original labeled sample, respectively.
In an alternative embodiment, the semantic fluency threshold is obtained by:
calculating the difference between the confusion degree of the original labeling sample corresponding to the high-confidence sample and the confusion degree of the high-confidence sample to obtain the semantic fluency of the high-confidence sample;
and obtaining the semantic fluency threshold according to the semantic fluency of all the high-confidence-degree samples.
In an optional implementation manner, the introducing a class probability distribution of the clean sample predicted by a pre-training model after monte carlo sampling training under different model parameters to obtain a confidence of the clean sample according to the class probability distribution under different model parameters specifically includes:
class probability distribution of the clean sample under different model parameters is predicted by a pre-training model after Monte Carlo sampling training, and the confidence coefficient of the clean sample is calculated by adopting an information entropy formula according to the class probability distribution under different model parameters.
In an optional implementation manner, the confirming the recall sample according to the comparison result between the vocabulary similarity of the to-be-recalled sample and the set vocabulary similarity threshold and between the semantic fluency of the to-be-recalled sample and the set semantic fluency threshold specifically includes:
and selecting a to-be-recalled sample with the vocabulary similarity larger than the vocabulary similarity threshold and the semantic fluency larger than the semantic fluency threshold as the recall sample.
In an alternative embodiment, the pre-trained model is a pre-trained language model using a Bert structure.
It should be noted that the sample selection apparatus provided in the embodiment of the present invention is used for executing all the steps and processes of the sample selection method provided in the above embodiment, and the working principles and beneficial effects of the two are in one-to-one correspondence, and redundant description is not repeated here.
Accordingly, an embodiment of the present invention further provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the sample selection method provided in the foregoing embodiment, for example, S11 to S15 in fig. 2.
Accordingly, an embodiment of the present invention further provides a storage medium, where the storage medium includes a stored computer program, where when the computer program runs, the apparatus on which the storage medium is located is controlled to execute the sample selection method provided in the foregoing embodiment, for example, S11 to S15 in fig. 2.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method of sample selection, comprising:
acquiring an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
classifying the enhanced sample into a clean sample and a noise sample according to a comparison result of the label of the original labeled sample corresponding to the enhanced sample and the pseudo label of the enhanced sample;
introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters to obtain confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence;
confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
taking the high confidence sample and the recall sample as final selected samples.
2. The sample selection method of claim 1, further comprising, prior to the predicting the class probability distribution of the enhanced sample based on the trained pre-trained model:
and acquiring the original marking sample, and training the pre-training model by adopting a semi-supervised method based on the original marking sample to obtain the trained pre-training model.
3. The sample selection method of claim 1, wherein the lexical similarity threshold is obtained by:
calculating the vocabulary similarity between each high-confidence sample and the corresponding original labeling sample, and obtaining a vocabulary similarity threshold according to the vocabulary similarity between all the high-confidence samples and the corresponding original labeling samples;
and the vocabulary similarity is calculated by the following formula:
J(x) = |x_g ∩ x_l| / |x_g ∪ x_l|
where J(x) represents the vocabulary similarity, and x_g and x_l represent the word sets of the enhanced sample and of its corresponding original labeled sample, respectively.
4. The sample selection method of claim 1, wherein the semantic fluency threshold is obtained by:
calculating the difference between the confusion degree of the original labeling sample corresponding to the high-confidence sample and the confusion degree of the high-confidence sample to obtain the semantic fluency of the high-confidence sample;
and obtaining the semantic fluency threshold according to the semantic fluency of all the high-confidence-degree samples.
5. The sample selection method according to claim 1, wherein the introducing of the class probability distribution of the clean sample predicted by the pre-training model after the monte carlo sampling training under different model parameters to obtain the confidence of the clean sample according to the class probability distribution under different model parameters specifically comprises:
class probability distribution of the clean sample under different model parameters is predicted by a pre-training model after Monte Carlo sampling training, and the confidence coefficient of the clean sample is calculated by adopting an information entropy formula according to the class probability distribution under different model parameters.
6. The sample selection method according to claim 1, wherein the identifying the recall sample according to the comparison result between the vocabulary similarity of the sample to be recalled and the set vocabulary similarity threshold and between the semantic fluency of the sample to be recalled and the set semantic fluency threshold comprises:
and selecting a sample to be recalled as the recall sample, wherein the vocabulary similarity is greater than the vocabulary similarity threshold and the semantic fluency is greater than the semantic fluency threshold.
7. The sample selection method as recited in claim 1, wherein the pre-trained model is a pre-trained language model employing a Bert structure.
8. A sample selection device, comprising:
the pseudo label obtaining module is used for obtaining an enhanced sample, predicting class probability distribution of the enhanced sample based on a trained pre-training model, and obtaining a pseudo label of the enhanced sample according to the class probability distribution of the enhanced sample; the enhanced sample is an unmarked sample generated after the original marked sample data is enhanced;
the first classification module is used for classifying the enhanced samples into clean samples and noise samples according to the comparison result of the labels of the original marked samples corresponding to the enhanced samples and the pseudo labels of the enhanced samples;
the second classification module is used for introducing class probability distribution of the clean sample predicted by a pre-training model after Monte Carlo sampling training under different model parameters, obtaining confidence of the clean sample according to the class probability distribution under different model parameters, and classifying the clean sample into a high-confidence sample and a low-confidence sample according to the confidence;
the recall module is used for confirming the recall sample according to the comparison result of the vocabulary similarity of the sample to be recalled and a set vocabulary similarity threshold value and the semantic fluency of the sample to be recalled and the set semantic fluency threshold value; wherein the samples to be recalled comprise low confidence samples and noise samples;
a selection module to take the high confidence sample and the recall sample as final selected samples.
9. A terminal device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor when executing the computer program implementing the sample selection method of any of claims 1 to 7.
10. A storage medium comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the storage medium is located to perform the sample selection method of any one of claims 1 to 7.
CN202211606921.3A 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium Pending CN115840884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211606921.3A CN115840884A (en) 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211606921.3A CN115840884A (en) 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN115840884A true CN115840884A (en) 2023-03-24

Family

ID=85578613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211606921.3A Pending CN115840884A (en) 2022-12-14 2022-12-14 Sample selection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115840884A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341561A (en) * 2023-03-27 2023-06-27 京东科技信息技术有限公司 Voice sample data generation method, device, equipment and storage medium
CN116341561B (en) * 2023-03-27 2024-02-02 京东科技信息技术有限公司 Voice sample data generation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination