CN114091595A - Sample processing method, apparatus and computer-readable storage medium - Google Patents

Sample processing method, apparatus and computer-readable storage medium

Info

Publication number
CN114091595A
CN114091595A
Authority
CN
China
Prior art keywords
sample
training
data
classification prediction
probability distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111348688.9A
Other languages
Chinese (zh)
Inventor
孙康康
高洪
周祥生
屠要峰
董修岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing ZTE New Software Co Ltd
Original Assignee
Nanjing ZTE New Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing ZTE New Software Co Ltd filed Critical Nanjing ZTE New Software Co Ltd
Priority to CN202111348688.9A
Publication of CN114091595A
Priority to PCT/CN2022/130616 (WO2023083176A1)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a sample processing method, a sample processing apparatus and a computer-readable storage medium. The method comprises the steps of determining an unlabeled target sample; inputting the unlabeled target sample into a classification prediction model to obtain classification prediction probability distribution data; calculating stability data according to the probability distribution data; obtaining a pre-labeled training sample according to the stability data; and training the classification prediction model by using the pre-labeled training sample and a preset training set until the classification prediction model meets a preset training stopping condition. The invention can effectively reduce the cost of sample labeling.

Description

Sample processing method, apparatus and computer-readable storage medium
Technical Field
Embodiments of the present invention relate to, but are not limited to, the field of data processing technologies, and in particular to a sample processing method and apparatus, and a computer-readable storage medium.
Background
In the current society of information explosion, the amount of unlabeled data is usually very large, and acquiring labeled data is difficult, time-consuming and costly. With an active learning method, unlabeled data can be selected effectively for labeling and training, so that a model with good performance is obtained. In real life, data classification is widely applied, and it also requires a large amount of training data to achieve a good classification effect.
In data classification of the related art, an initial classification prediction model is usually trained with labeled samples, margin sampling is performed on the unlabeled samples with an active learning method and the sampled samples are then labeled manually, and the classification prediction model is then trained with the manually labeled samples to obtain a classification prediction model that meets expectations.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a sample processing method, a sample processing apparatus and a computer-readable storage medium, which can effectively reduce the labeling cost of samples.
In a first aspect, an embodiment of the present invention provides a sample processing method, including:
determining an unlabeled target sample;
inputting the unlabeled target sample into a classification prediction model to obtain classification prediction probability distribution data;
calculating to obtain stability data according to the probability distribution data;
obtaining a pre-labeled training sample according to the stability data;
and training the classification prediction model by using the pre-labeled training sample and a preset training set until the classification prediction model meets a preset training stopping condition.
In a second aspect, an embodiment of the present invention further provides a sample processing apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing the sample processing method as described in the first aspect above.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions for performing the sample processing method according to the first aspect.
The embodiment of the invention comprises the following steps: determining an unlabeled target sample, inputting the unlabeled target sample into a classification prediction model to obtain probability distribution data of classification prediction, then calculating to obtain stability data according to the probability distribution data, obtaining a pre-labeled training sample according to the stability data, and training the classification prediction model by using the pre-labeled training sample and a preset training set until the classification prediction model meets a preset training stopping condition. Compared with the related art, the pre-labeled training sample obtained by the embodiment of the invention has stronger stability and higher pertinence, and can effectively reduce the labeling cost of the sample.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and together with the description serve to explain the principles of the invention, without limiting the invention.
FIG. 1 is a schematic flow diagram of a sample processing method provided by one embodiment of the present invention;
FIG. 2 is a schematic flow chart of determining an unlabeled target sample according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of probability distribution data provided by one embodiment of the present invention;
FIG. 4 is a schematic flow chart of stability data provided by one embodiment of the present invention;
FIG. 5 is a schematic flow chart of determining an unlabeled target sample according to another embodiment of the present invention;
FIG. 6 is a schematic flow chart of probability distribution data provided by another embodiment of the present invention;
FIG. 7 is a schematic flow chart of stability data provided by another embodiment of the present invention;
FIG. 8 is a flow chart of a pre-labeled training sample according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart of training a classification prediction model according to an embodiment of the present invention;
FIG. 10 is a flow diagram of determining a classification prediction model according to an embodiment of the present invention;
FIG. 11 is a schematic flowchart of a process of labeling a target sample to be labeled according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In the current society of information explosion, the amount of unlabeled data is usually very large, while acquiring labeled data is difficult, time-consuming and costly. The active learning method is used to effectively select unlabeled data for labeling and training, so that a model with good performance can be obtained while the labeling cost is reduced. In real life, data classification is widely applied, and it also requires a large amount of training data to achieve a good classification effect.
In data classification of the related art, an initial classification prediction model is usually trained with a small number of labeled samples, samples are screened from the unlabeled samples with an active learning method, human experts label the screened (sampled) samples, and the labeled samples are then added to the original labeled training set to retrain the classification prediction model; samples are then re-screened with the active learning method, and these steps are repeated until a classification prediction model that meets expectations is obtained. However, in the above scheme, since the active learning method usually screens samples according to diversity and uncertainty, the stability of the sampled samples with respect to model training is not considered, and the labeling cost of the samples is therefore high.
Based on this, embodiments of the present invention provide a sample processing method, a sample processing apparatus, and a computer-readable storage medium, which can effectively reduce sample labeling cost.
It is to be appreciated that embodiments of the present invention relate specifically to data classification, such as text classification, and that text classification includes, but is not limited to, news classification, sentiment analysis, text review, and other application scenarios.
The embodiments of the present invention will be further explained with reference to the drawings.
An embodiment of the first aspect of the present invention provides a sample processing method, as shown in fig. 1, which is a schematic flow chart of the sample processing method according to an embodiment of the present invention. The sample processing method of the embodiment of the invention includes, but is not limited to, the following steps:
step S100, determining an unlabeled target sample;
step S200, inputting the unlabeled target sample into a classification prediction model to obtain classification prediction probability distribution data;
step S300, calculating to obtain stability data according to the probability distribution data;
step S400, obtaining a pre-labeled training sample according to the stability data;
step S500, training the classification prediction model by using the pre-labeled training sample and a preset training set.
Steps S100 to S500 are repeatedly executed until the classification prediction model meets the preset training stopping condition.
It can be understood that, in the embodiment of the present invention, before the unlabeled target sample is determined in step S100, original sample data is first obtained, and the original sample data is then subjected to initialization processing to obtain initialization data. The initialization data is then divided into labeled samples and unlabeled samples.
It will be appreciated that the predetermined training set is derived from the above-described labeled samples.
In some embodiments, the labeled samples may be divided into training sets and test sets.
A plurality of initialization processing methods may be set for the above initialization processing. For example, a random sampling method is used to obtain a part of the original sample data as original target sample data, for example 10% of all original sample data. In other embodiments, the sampling ratio for the original target sample data may also be set between 10% and 30%, or adaptively adjusted according to the quantity of original sample data, which is not specifically limited in the embodiment of the present invention. The original target sample data is then labeled by an expert to generate the labeled samples, i.e. the initialization data comprises labeled samples and unlabeled samples, and the labeled samples are then divided into a training set and a test set. Specifically, assuming there are 1000 original sample data, acquiring 10% of them by random sampling yields 100 original sample data, which are used as the original target sample data.
For another example, a clustering method is used to classify original sample data to obtain classified sample data, then a part of original target sample data is obtained from the classified sample data according to a preset proportion, and then an expert labels the original target sample data to generate a labeled sample, so as to divide the labeled sample into a training set and a test set. Clustering methods include, but are not limited to, k-means clustering methods (k-means clustering algorithms), hierarchical clustering methods, etc., where the distance metric may be in the form of word2vec word vectors or edit distances, etc. Specifically, assuming that the original sample data is classified by using a clustering method to obtain three classes of classified sample data, and the sample numbers corresponding to the three classes of classified sample data are respectively 500, 300 and 200, the original target sample data is respectively obtained from the three classes of classified sample data according to a proportion of 10%, that is, the sample numbers corresponding to the three classes of original target sample data are respectively 500 × 10%, 300 × 10% and 200 × 10%, that is, it is known that 50, 30 and 20 original target sample data are respectively selected from the three classes of classified sample data. It will be appreciated that if the scaled number of samples results in a non-integer number, the non-integer number is rounded, for example by rounding.
It is to be understood that other initialization processing methods may also be adopted in the embodiments of the present invention to perform initialization processing on original sample data, which is not limited to the above embodiments and is not described herein again.
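As an illustration of the initialization processing described above, the following sketch covers both the random-sampling variant (10% of all original sample data) and the cluster-proportional variant (10% from every class of clustered data). The 10% ratio and the rounding rule follow the examples in the text; the function names and the use of precomputed cluster labels are illustrative assumptions.

```python
import random
from collections import defaultdict

def random_initial_selection(raw_samples, ratio=0.10):
    """Randomly draw `ratio` of the original sample data as original target sample data."""
    n = round(len(raw_samples) * ratio)                 # e.g. 1000 * 10% = 100 samples
    return random.sample(raw_samples, n)

def cluster_proportional_selection(raw_samples, cluster_labels, ratio=0.10):
    """Draw `ratio` of the samples from every class produced by e.g. k-means clustering."""
    groups = defaultdict(list)
    for sample, label in zip(raw_samples, cluster_labels):
        groups[label].append(sample)
    selected = []
    for members in groups.values():
        n = round(len(members) * ratio)                 # e.g. 500/300/200 -> 50/30/20
        selected.extend(random.sample(members, n))
    return selected
```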
It will be appreciated that the specific application of the test set may be as described with reference to fig. 10, which is a schematic flow chart of determining a classification prediction model according to an embodiment of the present invention. That is, step S500 includes, but is not limited to, the following steps:
step S510, training a classification prediction model by using a pre-labeled training sample and a preset training set to obtain a candidate prediction model;
step S520, inputting a preset test set into the candidate prediction model to obtain test data;
step S530, when the test data meets the expected test result, determining that the classification prediction model meets the preset training stopping condition.
It can be understood that, by inputting the test set to the candidate prediction model, it is determined whether the test data output by the candidate prediction model meets an expected test result, and if the test data output by the candidate prediction model meets the expected test result, it is determined that the current candidate prediction model, i.e., the classification prediction model, meets a preset training stopping condition.
Specifically, before determining the unlabeled target sample in step S100, the labeled sample may be divided into a training set and a test set, and the training set in the labeled sample is used to train the classification prediction model. At this time, the test set in the labeled sample can be directly input into the classification prediction model to obtain first test data, and at this time, if the first test data meets an expected test result, it can be directly determined that the classification prediction model meets a preset training stopping condition.
In some embodiments, an initial model such as XLNet or textcnn may be trained based on a training set in labeled samples to obtain a classification prediction model, e.g., the classification prediction model may be a text classification model. It is to be understood that the present invention is not limited to the type of the trained classification prediction model. Because the classification prediction model is continuously updated in an iterative manner, after each round of training of the classification prediction model is completed, the test set can be input into the classification prediction model to obtain second test data, and when the second test data accords with an expected test result, the current classification prediction model is determined to accord with a preset training stopping condition.
It will be appreciated that accuracy, recall, the F1 value (i.e. the harmonic mean of precision and recall) and the like may be used to characterize the expected test result. Taking accuracy as the metric representing the expected test result as an example, assume the accuracy of the obtained second test data is 82%: if the expected test result is set to 85%, the current classification prediction model does not meet the preset training stopping condition, and training of the classification prediction model with the sample processing method of the embodiment of the present invention needs to continue; if, instead, the expected test result is set to 80%, the current second test data meets the expected test result, and it is determined that the current classification prediction model meets the preset training stopping condition. It can be understood that the expected test result can be set according to the actual application scenario and is not limited to the above examples, which are not repeated here.
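A minimal sketch of the stop-condition check in steps S510 to S530, using accuracy as the metric that represents the expected test result (recall or the F1 value could be substituted). The 0.85 threshold mirrors the example above; the function itself is an illustration, not prescribed by the embodiment.

```python
from sklearn.metrics import accuracy_score

def meets_stop_condition(y_true, y_pred, expected_accuracy=0.85):
    """Steps S520/S530: compare the test data against the expected test result."""
    accuracy = accuracy_score(y_true, y_pred)
    # Recall or the F1 value (harmonic mean of precision and recall) could be used instead.
    return accuracy >= expected_accuracy

# Example: 80% accuracy does not meet an expected test result of 85%.
print(meets_stop_condition([1, 0, 1, 1, 0], [1, 0, 0, 1, 0], expected_accuracy=0.85))
```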
Referring to fig. 11, it can be understood that, after determining that the classification prediction model meets the preset training stopping condition, the sample processing method of the embodiment of the present invention further includes:
step S600, obtaining a target sample to be labeled;
and step S700, performing labeling processing on the target sample to be labeled according to the classification prediction model meeting the training stopping condition.
After determining that the classification prediction model meets the preset training stopping condition, the embodiment of the invention can directly execute step S600 and step S700. The target sample to be labeled is a sample to be labeled actually, the target sample to be labeled is input into a classification prediction model (namely, the classification prediction model after training) meeting the condition of stopping training, probability distribution data to be labeled of classification prediction can be obtained, then, labeling attribute data corresponding to the target sample to be labeled is obtained according to the probability distribution data to be labeled, and labeling data corresponding to the target sample to be labeled is determined according to the labeling attribute data.
It is understood that steps S600 and S700 of the embodiment of the present invention may be disposed after step S500, or may be disposed after step S530.
It should be noted that, after the classification prediction model meets the preset training stopping condition, the obtaining of the pre-labeled training sample is stopped, at this time, the trained classification prediction model is adopted to label the target sample to be labeled, and the expert reviews the labeled data corresponding to the target sample to be labeled. It can be understood that the target sample to be labeled may be an unlabeled sample other than the training sample to be labeled; alternatively, the target sample to be labeled may also be other actually required samples to be labeled, and is not specifically limited herein.
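A minimal sketch of steps S600 and S700, assuming the trained classification prediction model exposes a predict_proba-style interface (an assumption made only for illustration). The labeling data is taken as the category with the highest predicted probability and would then be reviewed by an expert.

```python
import numpy as np

def label_target_samples(model, target_samples, class_names):
    """Steps S600/S700: predict a probability distribution for each target sample to be
    labeled and assign the highest-probability category as its labeling data."""
    prob_dists = model.predict_proba(target_samples)        # assumed model interface
    return [(sample, class_names[int(np.argmax(dist))])
            for sample, dist in zip(target_samples, prob_dists)]
```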
Referring to fig. 2, it can be understood that step S100, includes but is not limited to the following steps:
step S101, performing data perturbation processing on a preset unlabeled sample to obtain a perturbation sample;
and step S102, determining the perturbation sample and the unlabeled sample as the unlabeled target samples.
According to the embodiment of the present invention, data perturbation processing is carried out on the preset unlabeled sample A to obtain a perturbation sample. For example, unlabeled sample data may be selected from the preset unlabeled sample A (the unlabeled sample A may be a set of unlabeled sample data), and data perturbation processing may be performed on each unlabeled sample data to obtain a perturbation sample corresponding to each unlabeled sample data. The perturbation sample and the unlabeled sample A are then determined as the unlabeled target sample.
It is understood that the data perturbation process includes, but is not limited to, the following methods:
1. Synonym perturbation processing: a synonym table is initialized, one unlabeled sample data is selected from the preset unlabeled sample A (the unlabeled sample A may be a set of unlabeled sample data), word segmentation processing is performed on the unlabeled sample data to obtain a plurality of unlabeled word data, and one unlabeled word is randomly selected from the unlabeled word data for synonym replacement processing to obtain synonym data. That is, a synonym is looked up in the synonym table; when a synonym corresponding to the unlabeled word is found, the unlabeled word is replaced with the synonym to obtain the synonym data, which can be used as one of the perturbation samples corresponding to the unlabeled sample data. When no synonym can be found, another unlabeled word is randomly selected from the remaining unlabeled word data for synonym replacement processing. When no synonym can be found for any of the unlabeled word data in the unlabeled sample data, generating a perturbation sample with the synonym perturbation processing method is abandoned.
For example, the unlabeled sample data is 'how to fast learn to sing a song'; word segmentation processing is performed on it to obtain a plurality of unlabeled word data, namely 'how / to / fast / learn / to / sing / a / song'. One unlabeled word is randomly selected from the unlabeled word data, for example 'sing'; no synonym of 'sing' can be found in the synonym table, so another unlabeled word is randomly selected from the remaining unlabeled word data, for example 'fast', whose synonyms in the synonym table include 'quickly' and the like. A synonym, for example 'quickly', is then randomly selected from the synonym table and synonym replacement processing is performed on the unlabeled word 'fast' to obtain the synonym data, i.e. the perturbation sample corresponding to this unlabeled sample data, specifically 'how to quickly learn to sing a song'.
2. Translation perturbation processing: one unlabeled sample data is selected from the preset unlabeled sample A (the unlabeled sample A may be a set of unlabeled sample data), language translation processing is performed on the unlabeled sample data with a translation tool to obtain translation data, and language translation processing is performed on the translation data again to obtain one of the perturbation samples corresponding to the unlabeled sample data, where the language type corresponding to the perturbation sample is the same as the language type corresponding to the unlabeled sample data. That is, the unlabeled sample data is translated into another language and then translated back into the source language. It is to be understood that the unlabeled sample data may be text data; for example, one sentence may be used as one unlabeled sample data.
For example, if the language type corresponding to the unlabeled sample A is Chinese, the Chinese unlabeled sample data may be translated into English translation data, and the translation data then translated back into Chinese, so as to obtain a Chinese perturbation sample. It can be understood that the perturbation sample obtained this way may be identical to the unlabeled sample data before translation, so a Beam search method may be adopted to ensure that the perturbation sample differs from the unlabeled sample data before translation; the Beam search method is a well-known technique in the field of machine translation and is not described in detail here. It can be understood that the unlabeled sample data can be subjected to language translation processing via several different language types, so that several perturbation samples can be generated. For example, the translation perturbation processing may be embodied as: one perturbation sample is generated via Chinese-English-Chinese, and another perturbation sample is generated via Chinese-Italian-Chinese, with the Beam search method again employed to ensure that the generated perturbation samples are not identical.
3. Pre-training language model perturbation processing: the perturbation sample is constructed with a pre-training language model, that is, the preset unlabeled sample A is input into a preset pre-training language model to obtain the perturbation sample. Specifically, the pre-training language model may be trained with the MASK masking method through a model such as BERT or ELECTRA. Taking a BERT model as an example, one unlabeled sample data is selected from the preset unlabeled sample A (the unlabeled sample A may be a set of unlabeled sample data), part of the unlabeled word data in the unlabeled sample data is randomly set to MASK to obtain unlabeled mask data, the unlabeled mask data is input into the BERT model, the BERT model predicts the masked positions, and one of the perturbation samples corresponding to the unlabeled sample data is output. Since the MASK result predicted by the BERT model, i.e. the perturbation sample, may be identical to the unlabeled sample data, the unlabeled word data in the unlabeled sample data may be combined into different unlabeled mask data, and the different unlabeled mask data input into the BERT model. Up to 10 attempts may be set; when no suitable perturbation sample has been generated after 10 attempts, generating a perturbation sample with this method is abandoned.
For example, the unlabeled sample data is 'how to fast learn to sing a song', and BERT word segmentation is used to segment the unlabeled sample data to obtain a plurality of unlabeled word data, i.e. 'how / to / fast / learn / to / sing / a / song'. Part of the unlabeled word data in the unlabeled sample data is randomly set to MASK to obtain the unlabeled mask data, with at most 20% of the unlabeled word data set to MASK at random; for example, the unlabeled mask data is 'how MASK to quickly master a song', and the unlabeled mask data is input into the BERT model to obtain a perturbation sample. When the perturbation sample differs from the unlabeled sample data, it is determined as a corresponding perturbation sample; when the perturbation sample is identical to the unlabeled sample data, the unlabeled mask data is acquired again. The number of repetitions may be set to at most 10; in other embodiments other repetition counts may be set, which is not specifically limited here. Sketches of the three perturbation processing methods are given below.
It can be understood that other data perturbation processing methods may also be used instead of the above data perturbation processing method, which do not affect the training of the classification prediction model in the embodiments of the present invention, and are all within the protection scope of the present application, and are not described herein again.
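The three data perturbation processing methods above can be sketched as follows. These are minimal, hedged illustrations: the small synonym table, the translate() callable and the Hugging Face fill-mask pipeline with the model name bert-base-chinese are assumptions made for demonstration and are not required by the embodiment.

```python
import random

# 1. Synonym perturbation (tiny illustrative synonym table; a real system loads a full lexicon).
SYNONYMS = {"fast": ["quickly", "rapidly"], "learn": ["master"]}

def synonym_perturbation(words, synonyms=SYNONYMS):
    """Try words in random order until one has a synonym; replace it to form a perturbed
    sample, or return None when no word can be replaced (the method is then abandoned)."""
    for idx in random.sample(range(len(words)), len(words)):
        options = synonyms.get(words[idx])
        if options:
            perturbed = list(words)
            perturbed[idx] = random.choice(options)
            return " ".join(perturbed)
    return None

print(synonym_perturbation("how to fast learn to sing a song".split()))
```

```python
# 2. Translation perturbation (back-translation); translate(text, src, tgt) is a placeholder
#    for any external translation tool, and retrying with a second pivot language stands in
#    for the Beam search safeguard mentioned above.
def back_translation_perturbation(text, translate, source="zh", pivots=("en", "it")):
    for pivot in pivots:                            # e.g. zh -> en -> zh, then zh -> it -> zh
        round_trip = translate(translate(text, source, pivot), pivot, source)
        if round_trip != text:                      # keep only genuinely perturbed samples
            return round_trip
    return None
```

```python
# 3. Pre-training language model perturbation, assuming the Hugging Face `transformers`
#    fill-mask pipeline; masking a single randomly chosen word per attempt is a simplification.
import random
from transformers import pipeline

def mlm_perturbation(words, model_name="bert-base-chinese", max_tries=10):
    fill_mask = pipeline("fill-mask", model=model_name)
    original = "".join(words)
    for _ in range(max_tries):                      # at most 10 attempts, as in the text
        idx = random.randrange(len(words))
        masked = "".join(w if i != idx else fill_mask.tokenizer.mask_token
                         for i, w in enumerate(words))
        top = fill_mask(masked)[0]                  # highest-scoring prediction for the mask
        candidate_words = list(words)
        candidate_words[idx] = top["token_str"]
        candidate = "".join(candidate_words)
        if candidate != original:
            return candidate                        # usable perturbation sample
    return None                                     # abandon this perturbation method
```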
Thereafter, referring to fig. 3, it can be understood that step S200, includes but is not limited to the following steps:
step S201, inputting the perturbation sample and the unlabeled sample into a classification prediction model to obtain perturbation probability distribution data and unlabeled probability distribution data.
In the embodiment of the invention, after the perturbation sample and the unlabeled sample are determined as the unlabeled target samples, the perturbation sample and the unlabeled sample A are input into the classification prediction model to obtain perturbation probability distribution data and unlabeled probability distribution data, where the perturbation probability distribution data is the probability distribution data corresponding to the classification prediction of the perturbation sample, and the unlabeled probability distribution data is the probability distribution data corresponding to the classification prediction of the unlabeled sample A.
In another embodiment, each unlabeled sample data can be input into the classification prediction model separately to obtain the unlabeled probability distribution data corresponding to each unlabeled sample data; and the plurality of perturbation samples corresponding to each unlabeled sample data can be input into the classification prediction model separately to obtain the perturbation probability distribution data corresponding to each perturbation sample.
Referring to fig. 4, it can be understood that step S300, includes but is not limited to the following steps:
step S301, stability data is obtained through calculation according to the first stability algorithm, the perturbation probability distribution data and the unlabeled probability distribution data.
That is, through step S101, step S102, step S201 and step S301 of the embodiment of the present invention, the stability data corresponding to the unlabeled sample A is obtained.
Specifically, the probability distribution data of the classification prediction of the perturbation sample, i.e. the perturbation probability distribution data, and the probability distribution data of the classification prediction of the unlabeled sample A, i.e. the unlabeled probability distribution data, are predicted by the classification prediction model; the stability data corresponding to the unlabeled sample A is then calculated based on the first stability algorithm, the perturbation probability distribution data and the unlabeled probability distribution data. It can be understood that each unlabeled sample data in the unlabeled sample A corresponds to its own stability data.
It can be understood that, assuming a total of N perturbation samples, denoted R_1, R_2, ..., R_N, are generated from one unlabeled sample data in the unlabeled sample A, then, adding the original unlabeled sample data, a total of N+1 samples are obtained, for example N+1 sentences. For an m-class classification problem, where m is the number of classification prediction categories output by the classification prediction model, the unlabeled probability distribution data P^0 = (P^0_1, P^0_2, ..., P^0_m) is a vector of length m, where j = 1, 2, ..., m indexes the categories; within the unlabeled probability distribution data, P^0_1, for example, can be understood as the probability data that the unlabeled sample data in the unlabeled sample A belongs to class 1.
The perturbation probability distribution data P^{R_i} = (P^{R_i}_1, ..., P^{R_i}_m) is recorded as the probability distribution data corresponding to the classification prediction of the i-th perturbation sample, and the unlabeled probability distribution data P^0 is the classification prediction probability distribution data corresponding to the unlabeled sample A. It is understood that each unlabeled sample data in the unlabeled sample A corresponds to its own unlabeled probability distribution data.
A first stability algorithm is defined, and a first stability parameter of the unlabeled sample A after the data perturbation processing is calculated with it.
Specifically, the first stability parameter TDS is calculated from the perturbation probability distribution data P^{R_i}_j and the unlabeled probability distribution data P^0_j, averaged with a 1/N coefficient over the N perturbation samples (i = 1, 2, ..., N) and accumulated over the m classification prediction categories (j = 1, 2, ..., m); the exact equation is given as an image in the publication and is not reproduced here. Here TDS is the first stability parameter used for representing stability data, N is the number of perturbation samples, and m is the number of classification prediction categories output by the classification prediction model.
It can be understood that the larger the TDS, i.e. the first stability parameter, the more stable the corresponding unlabeled sample data in the unlabeled sample A is.
It should be noted that the synonym perturbation processing and the pre-training language model perturbation processing may fail to generate effective perturbation samples, whereas the translation perturbation processing can generate perturbation samples reliably; for example, two different language types can be selected to generate two perturbation samples by translation perturbation processing. With the data perturbation processing methods above, the number N of generated perturbation samples can therefore take the values 2, 3 or 4. To deal with the fact that different unlabeled sample data may have different numbers of perturbation samples, the 1/N coefficient in the first stability algorithm can be adjusted to ensure the accuracy of the data.
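The exact TDS equation is presented as an image in the original publication, so the sketch below uses one plausible instantiation: one minus the average total-variation distance between the perturbed and the original prediction distributions. This matches the described behaviour (a larger TDS means the sample reacts less to perturbation, and the 1/N coefficient can be adjusted per sample), but it is an assumption rather than the patented formula.

```python
import numpy as np

def tds(p_original, p_perturbed):
    """First stability parameter (assumed form): 1 minus the average L1 distance / 2
    between the unlabeled sample's distribution p_original (length m) and its N
    perturbation distributions p_perturbed (shape N x m). Larger means more stable."""
    p_original = np.asarray(p_original, dtype=float)        # shape (m,)
    p_perturbed = np.asarray(p_perturbed, dtype=float)      # shape (N, m)
    n = len(p_perturbed)                                    # the 1/N coefficient; N may be 2, 3 or 4
    tv = np.abs(p_perturbed - p_original).sum(axis=1) / 2   # distance per perturbation sample
    return 1.0 - tv.sum() / n

# Example: three perturbations of a 3-class prediction; the third disagrees strongly.
print(tds([0.7, 0.2, 0.1], [[0.65, 0.25, 0.10],
                            [0.70, 0.20, 0.10],
                            [0.20, 0.60, 0.20]]))
```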
It is understood that for each unlabeled sample data in the unlabeled sample A, its corresponding stability data can be calculated. Specifically, referring to fig. 8, it can be understood that step S400, includes but is not limited to the following steps:
s410, sequencing unlabeled samples according to the stability data;
step S420, screening the sorted unlabeled samples to obtain training samples to be labeled;
and step S430, performing labeling processing on the training sample to be labeled according to the classification prediction model to obtain a pre-labeled training sample.
It is to be understood that the embodiment of the present invention includes labeled samples and unlabeled samples. Because the number of unlabeled samples is usually large, labeling all of them manually would be costly. Therefore, in the embodiment of the present invention, through steps S100 to S500, the unlabeled samples are labeled by the classification prediction model during the training process to obtain the pre-labeled training samples. It can be understood that, during labeling, an expert is still required to review and confirm, i.e. the expert checks for the correct labeling data, to obtain the pre-labeled training samples.
Specifically, the embodiment of the invention performs data disturbance processing on a preset unlabeled sample to obtain a disturbed sample, and then inputs the disturbed sample and the unlabeled sample into a classification prediction model to obtain disturbed probability distribution data and unlabeled probability distribution data; and then calculating to obtain stability data corresponding to each unlabeled sample data in the unlabeled samples according to the first stability algorithm, the disturbance probability distribution data and the unlabeled probability distribution data.
Specifically, according to the embodiment of the invention, the unlabeled sample data with poor stability is selected from the unlabeled samples, so that the expert can label and confirm the unlabeled sample data.
The unlabeled samples are sorted by the stability data, for example in ascending or descending order. It can be understood that the embodiment of the present invention needs to obtain the unlabeled sample data with poor stability to train the classification prediction model, so the top-n unlabeled sample data with the smallest TDS (i.e. the smallest first stability parameter, representing the worst stability) are selected. It is understood that n can be adjusted according to actual conditions, for example 5% of the total data volume of the unlabeled samples. For example, if the total data size of the unlabeled samples is 10000, the sorted unlabeled samples are screened to obtain the training samples to be labeled, that is, the 10000 × 5% = 500 unlabeled sample data with the smallest first stability parameter TDS are selected, giving 500 training samples to be labeled. Then, the training samples to be labeled are labeled according to the classification prediction model to obtain the pre-labeled training samples, which can afterwards be reviewed by experts to save workload.
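A small sketch of steps S410 and S420 under the assumption that each unlabeled sample already has a stability score (TDS here, LKS in the later embodiment): the samples are sorted in ascending order and the least stable fraction, 5% in the example above, is kept for pre-labeling and expert review.

```python
import numpy as np

def select_least_stable(samples, stability_scores, fraction=0.05):
    """Steps S410/S420: return the `fraction` of samples with the smallest stability
    data (smallest TDS or LKS), i.e. the least stable, most valuable ones."""
    scores = np.asarray(stability_scores, dtype=float)
    n_select = max(1, round(len(samples) * fraction))        # e.g. 10000 * 5% = 500
    order = np.argsort(scores)                                # ascending: least stable first
    return [samples[i] for i in order[:n_select]]
```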
It can be understood that the stability of each unlabeled sample data in the unlabeled sample A to data disturbance is fully considered in the embodiment of the present invention, so that the collected pre-labeled training sample is more valuable. And the subsequent experts only need to confirm or modify the acquired high-value pre-labeled training sample, for example, the experts inquire correct labeled data, so that the labeling cost is effectively reduced.
Referring to fig. 5, it can be understood that step S100, includes but is not limited to the following steps:
step S110, when the current training round for training the classification prediction model is greater than the preset round threshold, determining a preset unlabeled sample corresponding to the current training round as an unlabeled target sample of the current training round, and determining a preset unlabeled sample corresponding to a next training round after the current training round as an unlabeled target sample of the next training round.
In the embodiment of the present invention, the stability data is calculated from the last k rounds of training of the classification prediction model.
In the last k rounds, each round predicts a preset unlabeled sample corresponding to the current training round, specifically, referring to fig. 6, it can be understood that step S200 includes, but is not limited to, the following steps:
step S210, for the current training round, inputting the corresponding unlabeled target samples into the classification prediction model to obtain the current round probability distribution data of the current training round;
step S220, for the next training round, inputting the corresponding unlabeled target samples into the classification prediction model to obtain the next round probability distribution data of the next training round;
and step S230, performing multiple rounds of training processing on the classification prediction model in the same manner, and obtaining a plurality of current round probability distribution data and a plurality of next round probability distribution data.
It is to be understood that the current round probability distribution data is probability distribution data corresponding to the class prediction of the current training round, and the next round probability distribution data is probability distribution data corresponding to the class prediction of the next training round.
Referring to fig. 7, it can be understood that step S300, includes but is not limited to the following steps:
and step S310, calculating to obtain stability data according to the second stability algorithm, the current round probability distribution data and the next round probability distribution data.
Specifically, according to the embodiment of the invention, the unlabeled sample data with poor stability is selected from the unlabeled target samples, so that the expert can confirm the labeling. For example, the step S410 to the step S430 may be adopted to select the unlabeled target sample, and the specific implementation steps and effects thereof are the same as those described above, and are not described herein again.
In the last k rounds of training of the classification prediction model, each round predicts the unlabeled target sample corresponding to the current training round to obtain probability distribution data of classification prediction of the current training round, namely the current round probability distribution data, and probability distribution data of classification prediction of the next training round, namely the next round probability distribution data.
It is understood that one pass of training the classification prediction model over all the training samples is referred to as one training round, and the training rounds are counted on this basis.
For example, assume that the classification prediction model eventually requires 20 rounds of training.
Setting a preset turn threshold k as 10, when a current training turn x is 11, indicating that the current training turn for training the classification prediction model is greater than the preset turn threshold 10, and starting from the xth turn, namely the 11 th turn, predicting the corresponding unlabelled target sample by using the classification prediction model to obtain the probability distribution data of the corresponding classification prediction.
It is to be understood that the unlabeled target samples corresponding thereto may be the same for the current training round and the next training round.
It can be understood that the embodiment of the present invention requires multiple rounds of training processing on the classification prediction model, so as to obtain multiple current round probability distribution data and multiple next round probability distribution data. For the current training turn, inputting an unlabeled target sample corresponding to the current training turn into a classification prediction model to obtain current turn probability distribution data of the current training turn; and for the next training turn, inputting the unlabeled target sample corresponding to the next training turn into the classification prediction model to obtain the probability distribution data of the next turn of the next training turn.
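As a small illustration of steps S110 and S210 to S230, the sketch below records, for every training round beyond the preset round threshold, the classification prediction probability distributions of the unlabeled target samples so that consecutive rounds can later be compared. The dictionary-based bookkeeping and the predict_proba interface are illustrative assumptions.

```python
def record_round_distributions(model, unlabeled_targets, round_index,
                               round_threshold, history):
    """Steps S110/S210-S230: once the current training round exceeds the preset round
    threshold, store this round's predicted probability distributions for the
    unlabeled target samples; `history` maps round index -> distributions."""
    if round_index > round_threshold:                  # e.g. threshold 10 -> rounds 11..20
        history[round_index] = model.predict_proba(unlabeled_targets)   # assumed interface
    return history
```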
More specifically, inputting the corresponding unlabeled target samples into the classification prediction model trained for 11 rounds gives the current round probability distribution data of the 11th round, which can be denoted Q_11; all the unlabeled target samples are then input into the classification prediction model and one more round of training is performed, so the next training round after round 11 is round 12. The next round probability distribution data corresponding to the unlabeled target samples is predicted again in round 12, denoted Q_12, and so on until Q_20 is obtained.
For the last k rounds of an m-class classification problem, Q_x (x = r-k+1, r-k+2, ..., r), i.e. (Q_{r-k+1}, Q_{r-k+2}, ..., Q_r), is recorded as the current round probability distribution data, where Q_x = (Q_{x,1}, Q_{x,2}, ..., Q_{x,m}); the next round probability distribution data is correspondingly expressed as Q_{x+1} = (Q_{x+1,1}, ..., Q_{x+1,m}). For example, for the classification prediction model trained for 11 rounds, Q_{11,j} corresponding to one unlabeled sample data in the unlabeled target samples can be understood as the probability data that this unlabeled sample data belongs to the j-th class under the classification prediction model trained in the 11th round.
It can be seen that, for the second stability algorithm with a preset round threshold of 10 (i.e. for the last 10 rounds), in the 11th round Q_{11,j} and Q_{12,j} need to be obtained; in the 12th round, Q_{12,j} and Q_{13,j} are acquired; and so on, until in the 19th round Q_{19,j} and Q_{20,j} need to be obtained.
A second stability algorithm is defined, and the second stability parameter of the unlabeled sample data in the unlabeled target samples over the last k rounds is calculated with it.
Specifically, the second stability parameter LKS is calculated from the current round probability distribution data Q_{x,j} and the next round probability distribution data Q_{x+1,j}, accumulated over the last k training rounds (x = r-k+1, r-k+2, ..., r) and over the m classification prediction categories (j = 1, 2, ..., m); the exact equation is given as an image in the publication and is not reproduced here. In this formula, LKS is the second stability parameter used for representing stability data, r is the total training round for training the classification prediction model, x is the current training round, k is the preset round threshold, and m is the number of classification prediction categories output by the classification prediction model.
It can be understood that the larger the LKS, i.e. the second stability parameter, is, the more stable the corresponding unlabeled sample data in the unlabeled target sample is.
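As with TDS, the exact LKS equation appears only as an image in the original publication. The sketch below therefore uses an assumed form: one minus the average total-variation distance between the prediction distributions of consecutive rounds over the last k rounds, which is consistent with the description (a larger LKS means the prediction changes less between rounds) but is not guaranteed to be the patented formula.

```python
import numpy as np

def lks(round_distributions):
    """Second stability parameter (assumed form): 1 minus the average L1 distance / 2
    between consecutive rounds' prediction distributions Q_x and Q_{x+1} for one
    unlabeled sample (round_distributions has shape k x m). Larger means more stable."""
    q = np.asarray(round_distributions, dtype=float)    # rows: rounds r-k+1 .. r
    diffs = np.abs(q[1:] - q[:-1]).sum(axis=1) / 2      # compare Q_x with Q_{x+1}
    return 1.0 - diffs.mean()

# Example: one sample's predictions over the last 4 rounds of a 3-class model.
print(lks([[0.60, 0.30, 0.10],
           [0.55, 0.35, 0.10],
           [0.20, 0.70, 0.10],
           [0.25, 0.65, 0.10]]))
```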
It should be noted that, for each unlabeled sample data in the unlabeled target sample corresponding to each round, the corresponding stability data may be calculated.
The unlabeled samples (i.e. the unlabeled target samples in this embodiment) are sorted by the stability data, for example in ascending or descending order. It can be understood that the embodiment of the present invention needs to obtain the unlabeled sample data with poor stability in order to train the classification prediction model, so the top-n unlabeled sample data with the smallest LKS (i.e. the smallest second stability parameter, representing the worst stability) are selected. It is understood that n can be adjusted according to actual conditions, for example 5% of the total data volume of the unlabeled target samples. For example, if the total data size of the unlabeled target samples is 10000, the sorted unlabeled target samples are screened to obtain the training samples to be labeled, that is, the 10000 × 5% = 500 unlabeled sample data with the smallest second stability parameter LKS are selected, giving 500 training samples to be labeled. Then, the training samples to be labeled are labeled according to the classification prediction model to obtain the pre-labeled training samples, which can afterwards be reviewed by experts to save workload.
It can be understood that the embodiment of the present invention fully considers the stability of each unlabeled sample across the training rounds, so that the collected pre-labeled training samples are more valuable and targeted. Subsequent experts only need to confirm or modify the acquired high-value samples, i.e. the pre-labeled training samples of this embodiment, for example by checking the correct labeling data, so the labeling cost is effectively reduced.
It can be understood that the total training rounds corresponding to the classification prediction model differ between specific classification tasks. In the embodiment of the present invention, the third test data of y consecutive rounds, for example 5 consecutive rounds, can be acquired to judge whether the third test data meets the expected test result; when none of the third test data of the y consecutive rounds exceeds the test data of round r, the value of r can be determined. For example, assuming the accuracy obtained by the classification prediction model trained in the 22nd round is 95%, a new high, and the accuracy of the following 23rd to 27th rounds does not exceed 95%, the total training round r is determined to be 22. The preset round threshold can be adjusted according to actual conditions; for example, it is proposed to take r/2, rounded if r/2 is not an integer, so when the total training round is 22, the preset round threshold may be 11.
Referring to fig. 9, it can be understood that step S500 further includes, but is not limited to, the following steps:
step S501, inputting a preset pre-labeled sample into a classification prediction model to obtain probability distribution data corresponding to the pre-labeled sample;
step S502, obtaining a first pre-labeled training sample according to probability distribution data corresponding to the pre-labeled sample, wherein the confidence coefficient corresponding to the first pre-labeled training sample is greater than or equal to a preset confidence coefficient;
step S503, training the classification prediction model by using the first pre-labeled training sample, a training weight value corresponding to the first pre-labeled training sample and a training set, wherein the training weight value is the product of a confidence corresponding to the first pre-labeled training sample and a preset hyper-parameter.
If only the labeled samples and the pre-labeled training samples are used to train the classification prediction model, the remaining unlabeled samples that have not been manually reviewed and labeled cannot be effectively utilized. Therefore, the remaining unlabeled samples that have not been manually reviewed and labeled are defined as the preset pre-labeled samples.
It can be understood that the classification prediction model of the embodiment of the present invention not only uses labeled samples, but also uses pre-labeled samples. And inputting the preset pre-labeled samples into the classification prediction model to obtain probability distribution data corresponding to the pre-labeled samples, and selecting the high-confidence samples according to the probability distribution data corresponding to the pre-labeled samples. Specifically, a first pre-labeled training sample with a confidence degree greater than or equal to a preset confidence degree is obtained according to probability distribution data corresponding to the pre-labeled sample, that is, the first pre-labeled training sample is a high-confidence-degree sample.
It should be noted that, for the pre-labeled sample, the labeling data corresponding to the first pre-labeled training sample obtained further by the pre-labeled sample is a pseudo label given by the classification prediction model. The embodiment of the invention can further screen the pre-labeled samples according to the probability distribution data corresponding to the pre-labeled samples. Specifically, a first pre-labeled training sample is screened out, wherein a confidence corresponding to the first pre-labeled training sample is greater than or equal to a preset confidence. Namely, the first pre-labeled training sample corresponding to less than the preset confidence level is selected to be abandoned. And then, acquiring a training weight value corresponding to the first pre-labeled training sample, wherein the training weight value is the product of a confidence coefficient corresponding to the first pre-labeled training sample and a preset hyper-parameter, and can be represented as w × σ. The confidence coefficient sigma is expressed as the maximum probability data in the probability distribution data corresponding to the pre-labeled sample, namely the probability data corresponding to the pseudo label (the first pre-labeled training sample); w is a preset hyper-parameter which can be set between 0 and 1, and a specific value can be preset. It can be understood that the first pre-labeled training sample obtained in the embodiment of the present invention does not require an expert to perform an audit verification, so as to save the workload.
For example, assume the preset confidence is 0.7 and the preset hyper-parameter is 0.5, and assume the embodiment is an emotion three-classification problem, that is, the number m of classification prediction categories output by the classification prediction model is 3: positive, negative and neutral. For one pre-labeled sample, the classification prediction model predicts the corresponding probability distribution data as (0.1, 0.6, 0.3), i.e. the positive probability data is 0.1, the negative probability data is 0.6, and the neutral probability data is 0.3. It can be understood that the order of the output probability distribution data over the classification prediction categories corresponds to the order used in the training and prediction processes of the model. Since the confidence σ, i.e. the largest probability data in the probability distribution data, is 0.6, which is lower than the preset confidence 0.7, this pre-labeled sample is discarded. The probability distribution data corresponding to another pre-labeled sample is (0.02, 0.9, 0.08), i.e. the positive probability data is 0.02, the negative probability data is 0.9, and the neutral probability data is 0.08; the pseudo label (that is, the label of the determined first pre-labeled training sample) is then the classification prediction category corresponding to the maximum probability data 0.9, namely negative. It is to be understood that labeling data as positive, negative and neutral is merely one embodiment of the present invention. In this way, the first pre-labeled training samples are screened out of the pre-labeled samples according to the probability distribution data corresponding to the pre-labeled samples, and the training weight value corresponding to this first pre-labeled training sample is calculated as 0.5 × 0.9 = 0.45. Finally, the classification prediction model is trained with the first pre-labeled training samples, their corresponding training weight values and the training set based on the loss function; specifically, the loss function corresponding to the first pre-labeled training sample is multiplied by the corresponding training weight value of 0.45.
In the embodiment of the invention, in order to provide a lower training weight value for the first pre-labeled training sample with low confidence coefficient, w is adopted to distinguish the labeled sample from the first pre-labeled training sample corresponding to the pseudo label. For example, w takes 0.5, the training weight of the first pre-labeled training sample is always below 0.5, which specifies the upper bound for the pseudo-label weight. Specifically, the preset confidence level of the present embodiment is 0.7, and the preset hyper-parameter is 0.5.
The embodiment of the invention uses a dynamic weight method to exploit the first pre-labeled training samples, so that the pre-labeled samples are fully utilized. When the classification prediction model is updated iteratively, the labeled samples and the pre-labeled samples are used together. For the pre-labeled samples, the classification prediction model assigns pseudo labels; that is, first pre-labeled training samples with high confidence are selected, and a dynamic weight derived from each sample's confidence is applied to its loss term. A sketch of such a mixed training step follows.
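The following sketch illustrates how a single training step might combine manually labeled samples (weight 1) with pseudo-labeled samples weighted by w × σ. The arrangement and the helper name mixed_batch_loss are assumptions for illustration, not the patent's definitive training procedure.

```python
import numpy as np

def mixed_batch_loss(labeled_probs, labels, pseudo_probs, pseudo_labels, w=0.5):
    """One-step loss combining labeled samples (weight 1) with
    pseudo-labeled samples weighted by w * sigma (dynamic weight)."""
    eps = 1e-12
    labeled_probs = np.asarray(labeled_probs, dtype=float)
    pseudo_probs = np.asarray(pseudo_probs, dtype=float)

    # Standard cross-entropy for manually labeled samples (weight 1).
    ce_labeled = -np.log(labeled_probs[np.arange(len(labels)), labels] + eps)

    # Dynamic weight w * sigma for each pseudo-labeled sample.
    sigma = pseudo_probs.max(axis=1)
    ce_pseudo = -np.log(pseudo_probs[np.arange(len(pseudo_labels)), pseudo_labels] + eps)

    return float(ce_labeled.sum() + (w * sigma * ce_pseudo).sum())

# One labeled sample (true class 0) plus one high-confidence pseudo-labeled sample.
print(mixed_batch_loss([[0.8, 0.1, 0.1]], [0], [[0.02, 0.9, 0.08]], [1]))
```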
It can be understood that the embodiment of the invention can be applied to text classification tasks such as news classification, sentiment analysis, and text review, and can effectively reduce the amount of manual labeling. In addition, the embodiment of the invention can be combined with other existing screening methods: each screening method produces a score, the scores are weighted, and the candidates are re-ranked so as to select pre-labeled training samples with high overall value. A sketch of such score-weighted re-ranking follows.
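The score-weighted re-ranking could look like the following sketch; the combination scheme, the weights, and the names of the screening methods are hypothetical assumptions, not part of the patent.

```python
def rerank_candidates(scores_per_method, method_weights):
    """Combine per-method scores with weights and re-rank candidate samples.

    scores_per_method: dict mapping method name -> list of scores, one per candidate.
    method_weights: dict mapping method name -> weight of that method.
    Returns candidate indices sorted by combined score, highest first.
    """
    num_candidates = len(next(iter(scores_per_method.values())))
    combined = [
        sum(method_weights[m] * scores_per_method[m][i] for m in scores_per_method)
        for i in range(num_candidates)
    ]
    return sorted(range(num_candidates), key=lambda i: combined[i], reverse=True)

# Hypothetical example: stability-based scores combined with another
# screening method (e.g. an uncertainty score), weighted 0.6 / 0.4.
order = rerank_candidates(
    {"stability": [0.2, 0.9, 0.5], "uncertainty": [0.7, 0.1, 0.6]},
    {"stability": 0.6, "uncertainty": 0.4},
)
print(order)  # candidates ordered by comprehensive value: [1, 2, 0]
```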
It can be understood that, although the related art includes ways of training a model with unlabeled samples, it does not distinguish between manually labeled real labels and pseudo labels predicted by an untrained classification prediction model. The first pre-labeled training samples obtained by the embodiment of the invention are of higher value, more targeted, and representative. The embodiment of the invention also uses the classification prediction model to label the first pre-labeled training samples automatically, thereby effectively reducing the labeling cost.
In addition, an embodiment of the second aspect of the present invention provides a sample processing device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the sample processing method of the first aspect embodiment described above are stored in a memory, and when executed by a processor, perform the sample processing method of the embodiment described above, e.g., perform the method steps S100 to S500 in fig. 1, the method steps S101 to S102 in fig. 2, the method step S201 in fig. 3, the method step S301 in fig. 4, the method step S110 in fig. 5, the method steps S210 to S230 in fig. 6, the method step S310 in fig. 7, the method steps S410 to S430 in fig. 8, the method steps S501 to S503 in fig. 9, the method steps S510 to S530 in fig. 10, and the method steps S600 to S700 in fig. 11 described above.
The above-described device embodiments are merely illustrative; units illustrated as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or controller, for example by a processor in the above-mentioned device embodiment, cause the processor to execute the sample processing method in the above-mentioned embodiment, for example the above-mentioned method steps S100 to S500 in fig. 1, method steps S101 to S102 in fig. 2, method step S201 in fig. 3, method step S301 in fig. 4, method step S110 in fig. 5, method steps S210 to S230 in fig. 6, method step S310 in fig. 7, method steps S410 to S430 in fig. 8, method steps S501 to S503 in fig. 9, method steps S510 to S530 in fig. 10, and method steps S600 to S700 in fig. 11.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media as known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (15)

1. A method of sample processing, comprising:
determining an unlabeled target sample;
inputting the unlabeled target sample into a classification prediction model to obtain classification prediction probability distribution data;
calculating to obtain stability data according to the probability distribution data;
obtaining a pre-labeled training sample according to the stability data;
and training the classification prediction model by using the pre-labeled training sample and a preset training set until the classification prediction model meets a preset training stopping condition.
2. The method of claim 1, wherein the determining unlabeled target samples comprises:
performing data disturbance processing on a preset unmarked sample to obtain a disturbed sample;
and determining the disturbance sample and the unlabeled sample as unlabeled target samples.
3. The method of claim 2, wherein inputting the unlabeled target sample to a classification prediction model to obtain probability distribution data of classification prediction comprises:
inputting the disturbance sample and the unlabeled sample into the classification prediction model to obtain disturbance probability distribution data and unlabeled probability distribution data;
the disturbance probability distribution data is probability distribution data corresponding to classification prediction of the disturbance samples, and the unlabeled probability distribution data is probability distribution data corresponding to classification prediction of the unlabeled samples.
4. The method of claim 3, wherein calculating stability data from the probability distribution data comprises:
and calculating to obtain the stability data according to a first stability algorithm, the disturbance probability distribution data and the unmarked probability distribution data.
5. The method of claim 4, wherein the first stability algorithm is calculated by the formula:
[formula image FDA0003354901960000011]
wherein the TDS is a first stability parameter used for characterizing the stability data, N is the number of the disturbance samples, m is the number of the classification prediction categories output by the classification prediction model, [symbol image FDA0003354901960000013] is the disturbance probability distribution data, [symbol image FDA0003354901960000012] is the unlabeled probability distribution data, i = 1, 2, ..., N, and j = 1, 2, ..., m.
6. The method of claim 1, wherein the determining unlabeled target samples comprises:
when the current training turn for training the classification prediction model is larger than a preset turn threshold, determining a preset unlabeled sample corresponding to the current training turn as an unlabeled target sample of the current training turn, and determining a preset unlabeled sample corresponding to a next training turn after the current training turn as an unlabeled target sample of the next training turn.
7. The method of claim 6, wherein inputting the unlabeled target sample to a classification prediction model to obtain probability distribution data of classification prediction comprises:
for the current training turn, inputting the corresponding unlabeled target sample into the classification prediction model to obtain current turn probability distribution data of the current training turn;
for the next training round, inputting the corresponding unlabeled target sample into the classification prediction model to obtain the next round probability distribution data of the next training round;
and by analogy, performing multiple rounds of training processing on the classification prediction model to obtain multiple current round probability distribution data and multiple next round probability distribution data.
8. The method of claim 7, wherein calculating stability data from the probability distribution data comprises:
and calculating to obtain the stability data according to a second stability algorithm, the current round probability distribution data and the next round probability distribution data.
9. The method of claim 8, wherein the second stability algorithm is calculated by the formula:
[formula image FDA0003354901960000021]
wherein the LKS is a second stability parameter used for characterizing the stability data, r is the total number of training rounds for training the classification prediction model, x is the current training round, k is the preset turn threshold, m is the number of classification prediction categories output by the classification prediction model, [symbol image FDA0003354901960000023] is the current round probability distribution data, [symbol image FDA0003354901960000022] is the next round probability distribution data, and x = r-k+1, r-k+2, ..., r.
10. The method according to any one of claims 2 to 9, wherein the obtaining a pre-labeled training sample according to the stability data comprises:
sequencing the unlabeled samples according to the stability data;
screening the sorted unlabeled samples to obtain training samples to be labeled;
and labeling the training samples to be labeled according to the classification prediction model to obtain pre-labeled training samples.
11. The method according to any one of claims 1 to 9, wherein the training the classification prediction model using the pre-labeled training samples and a preset training set further comprises:
inputting a preset pre-labeled sample into the classification prediction model to obtain probability distribution data corresponding to the pre-labeled sample;
obtaining a first pre-labeled training sample according to the probability distribution data corresponding to the pre-labeled sample, wherein the confidence coefficient corresponding to the first pre-labeled training sample is greater than or equal to a preset confidence coefficient;
and training the classification prediction model by using the first pre-labeled training sample, a training weight value corresponding to the first pre-labeled training sample and the training set, wherein the training weight value is the product of a confidence coefficient corresponding to the first pre-labeled training sample and a preset hyper-parameter.
12. The method according to any one of claims 1 to 9, wherein the training the classification prediction model using the pre-labeled training samples and a preset training set until the classification prediction model meets a preset training stopping condition comprises:
training the classification prediction model by using the pre-labeled training sample and a preset training set to obtain a candidate prediction model;
inputting a preset test set into the candidate prediction model to obtain test data;
and when the test data accords with an expected test result, determining that the classification prediction model accords with a preset training stopping condition.
13. The method according to any one of claims 1 to 9, further comprising:
obtaining a target sample to be marked;
and performing labeling processing on the target sample to be labeled according to the classification prediction model meeting the training stopping condition.
14. A sample processing apparatus, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the sample processing method of any one of claims 1 to 13 when executing the computer program.
15. A computer-readable storage medium storing computer-executable instructions for performing the sample processing method of any one of claims 1 to 13.
CN202111348688.9A 2021-11-15 2021-11-15 Sample processing method, apparatus and computer-readable storage medium Pending CN114091595A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111348688.9A CN114091595A (en) 2021-11-15 2021-11-15 Sample processing method, apparatus and computer-readable storage medium
PCT/CN2022/130616 WO2023083176A1 (en) 2021-11-15 2022-11-08 Sample processing method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111348688.9A CN114091595A (en) 2021-11-15 2021-11-15 Sample processing method, apparatus and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114091595A true CN114091595A (en) 2022-02-25

Family

ID=80300847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111348688.9A Pending CN114091595A (en) 2021-11-15 2021-11-15 Sample processing method, apparatus and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114091595A (en)
WO (1) WO2023083176A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023083176A1 (en) * 2021-11-15 2023-05-19 中兴通讯股份有限公司 Sample processing method and device and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476256A (en) * 2019-01-24 2020-07-31 北京京东尚科信息技术有限公司 Model training method and device based on semi-supervised learning and electronic equipment
US11636389B2 (en) * 2020-02-19 2023-04-25 Microsoft Technology Licensing, Llc System and method for improving machine learning models by detecting and removing inaccurate training data
CN112308144A (en) * 2020-10-30 2021-02-02 江苏云从曦和人工智能有限公司 Method, system, equipment and medium for screening samples
CN113590764B (en) * 2021-09-27 2021-12-21 智者四海(北京)技术有限公司 Training sample construction method and device, electronic equipment and storage medium
CN114091595A (en) * 2021-11-15 2022-02-25 南京中兴新软件有限责任公司 Sample processing method, apparatus and computer-readable storage medium


Also Published As

Publication number Publication date
WO2023083176A1 (en) 2023-05-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination