WO2023083176A1 - Sample processing method, device, and computer-readable storage medium - Google Patents

Sample processing method, device, and computer-readable storage medium

Info

Publication number
WO2023083176A1
WO2023083176A1 PCT/CN2022/130616 CN2022130616W WO2023083176A1 WO 2023083176 A1 WO2023083176 A1 WO 2023083176A1 CN 2022130616 W CN2022130616 W CN 2022130616W WO 2023083176 A1 WO2023083176 A1 WO 2023083176A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
samples
data
unlabeled
sample
Prior art date
Application number
PCT/CN2022/130616
Other languages
English (en)
French (fr)
Inventor
孙康康
高洪
周祥生
屠要峰
董修岗
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2023083176A1 publication Critical patent/WO2023083176A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Definitions

  • the embodiments of the present application relate to but are not limited to the technical field of data processing, and in particular relate to a sample processing method, device, and computer-readable storage medium.
  • In data classification in some cases, labeled samples are usually used to train an initial classification prediction model, and active learning methods are used to perform margin sampling on the unlabeled samples, so that the sampled samples can be further labeled manually; the manually labeled samples are then used to train the above classification prediction model to obtain the expected classification prediction model.
  • However, because the above sampling methods are usually based on diversity and uncertainty, the cost of labeling samples is high.
  • Embodiments of the present application provide a sample processing method, device, and computer-readable storage medium.
  • In a first aspect, an embodiment of the present application provides a sample processing method, including: determining unlabeled target samples; inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of classification prediction; calculating stability data according to the probability distribution data; obtaining pre-labeled training samples according to the stability data; and training the classification prediction model using the pre-labeled training samples and a preset training set until the classification prediction model meets a preset condition for stopping training.
  • In a second aspect, an embodiment of the present application also provides a sample processing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the sample processing method described in the first aspect when executing the computer program.
  • In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the sample processing method described in the first aspect.
  • Fig. 1 is a schematic flowchart of a sample processing method provided by an embodiment of the present application
  • Fig. 2 is a schematic flow chart of determining an unmarked target sample provided by an embodiment of the present application
  • FIG. 3 is a schematic flow diagram of probability distribution data provided by an embodiment of the present application.
  • Fig. 4 is a schematic flow chart of the stability data provided by one embodiment of the present application.
  • Fig. 5 is a schematic flow chart of determining an unmarked target sample provided by another embodiment of the present application.
  • FIG. 6 is a schematic flow diagram of probability distribution data provided by another embodiment of the present application.
  • Fig. 7 is a schematic flow chart of stability data provided by another embodiment of the present application.
  • FIG. 8 is a schematic flow chart of pre-marked training samples provided by an embodiment of the present application.
  • FIG. 9 is a schematic flow diagram of training a classification prediction model provided by an embodiment of the present application.
  • FIG. 10 is a schematic flow diagram of determining a classification prediction model provided by an embodiment of the present application.
  • Fig. 11 is a schematic flowchart of labeling processing for target samples to be labeled provided by an embodiment of the present application.
  • the active learning method adopted is designed to effectively select unlabeled data for labeling and training, and to obtain a model with good performance while reducing labeling costs.
  • data classification is also widely used, and data classification also needs to use a large amount of training data to obtain better classification results.
  • In data classification in some situations, a small number of labeled samples are usually used to train an initial classification prediction model, an active learning method is used to screen samples from the unlabeled samples, human experts then label the screened/sampled samples, and the labeled samples are added to the original labeled training set to retrain the above classification prediction model; the active learning method is then used to screen samples again, and so on, until a classification prediction model that meets expectations is obtained.
  • However, in the above scheme, because the active learning method usually screens samples based on diversity and uncertainty and does not consider the stability of the sampled samples themselves for model training, the cost of labeling samples is high.
  • embodiments of the present application provide a sample processing method, device, and computer-readable storage medium, which can effectively reduce sample labeling costs.
  • It can be understood that the embodiments of the present application relate to data classification, such as text classification, and text classification includes, but is not limited to, application scenarios such as news classification, sentiment analysis, and text review.
  • the embodiment of the first aspect of the present application provides a sample processing method, as shown in FIG. 1 , which is a schematic flowchart of the sample processing method provided by an embodiment of the present application.
  • the sample processing method of the embodiment of the present application includes but not limited to the following steps:
  • Step S100 determining an unmarked target sample
  • Step S200 inputting unmarked target samples into the classification prediction model to obtain probability distribution data of classification prediction
  • Step S300 calculating and obtaining stability data according to the probability distribution data
  • Step S400 obtaining pre-marked training samples according to the stability data
  • Step S500 using the pre-marked training samples and the preset training set to train the classification prediction model
  • Repeat step S100 to step S500 until the classification prediction model meets the preset condition for stopping training (a minimal sketch of this loop is given below).
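  • For illustration only, the following minimal Python sketch shows one possible way to organize the loop of steps S100 to S500 described above; the helper functions (determine_unlabeled_targets, compute_stability, query_expert_labels, evaluate) are hypothetical placeholders and are not part of the published application.
    def active_training_loop(model, train_set, test_set, unlabeled_pool,
                             expected_metric=0.85, select_ratio=0.05):
        """Hypothetical sketch of steps S100-S500: repeat until the stop condition holds."""
        while True:
            targets = determine_unlabeled_targets(unlabeled_pool)          # S100
            prob_dists = {x: model.predict_proba(x) for x in targets}      # S200
            stability = compute_stability(targets, prob_dists)             # S300
            # S400: pick the least stable samples, pre-label them, let experts confirm
            n = max(1, int(select_ratio * len(unlabeled_pool)))
            to_label = sorted(unlabeled_pool, key=lambda x: stability[x])[:n]
            pre_labeled = [(x, query_expert_labels(x, model.predict(x))) for x in to_label]
            for x, _ in pre_labeled:
                unlabeled_pool.remove(x)
            train_set.extend(pre_labeled)                                  # S500
            model.fit(train_set)
            if evaluate(model, test_set) >= expected_metric:               # stop condition
                break
        return model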
  • the original sample data will be obtained, and then the original sample data will be initialized to obtain the initialization data. Afterwards, the initialization data is divided into labeled samples and unlabeled samples.
  • the preset training set comes from the above-mentioned labeled samples.
  • the labeled samples can be divided into training set and test set.
  • multiple initialization processing methods can be correspondingly set.
  • a part of the original target sample data is obtained from all the original sample data by using a random sampling method, for example, the obtained original target sample data corresponds to 10% of the original sample data.
  • Afterwards, experts label the original target sample data to generate labeled samples; that is, the initialization data includes labeled samples and unlabeled samples, and the labeled samples are then divided into a training set and a test set. Assuming that there are 1000 original sample data in total, the random sampling method is used to obtain 10% of the original sample data, i.e. 100 original sample data are obtained, and these 100 original sample data serve as the original target sample data.
  • Another example is to use a clustering method to classify the original sample data to obtain classified sample data, obtain a portion of the original target sample data from the classified sample data according to a preset ratio, and then have experts label the original target sample data to generate labeled samples, so that the labeled samples can be divided into a training set and a test set.
  • Clustering methods include but are not limited to k-means clustering algorithm, hierarchical clustering method, etc., wherein the distance measure can use word2vec word vector or edit distance, etc.
  • Assuming that the clustering method divides the original sample data into three categories containing 500, 300, and 200 samples respectively, and that the preset ratio is 10%, the original target sample data are obtained from the classified sample data of each category: the numbers of original target samples for the three categories are 500*10%, 300*10%, and 200*10%, so 50, 30, and 20 original target sample data are selected from the three categories respectively. It can be understood that if the number of samples obtained by the proportional calculation is not an integer, it is rounded, for example by conventional rounding.
  • the embodiment of the present application may also adopt other initialization processing methods to perform initialization processing on the original sample data, which is not limited to the above embodiment, and details are not repeated here.
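  • A minimal sketch of the two initialization options mentioned above (plain random sampling, and proportional sampling per cluster), assuming the original sample data is available as a Python list; the 10% ratio and the clustering function are placeholders, with k-means or hierarchical clustering over word2vec or edit-distance features being possible choices for cluster_fn.
    import random

    def init_random(original_samples, ratio=0.10):
        """Randomly draw e.g. 10% of the original sample data for expert labeling."""
        k = round(len(original_samples) * ratio)
        return random.sample(original_samples, k)

    def init_by_cluster(original_samples, cluster_fn, ratio=0.10):
        """Draw the same ratio from each cluster, rounding the per-cluster counts.
        cluster_fn is a hypothetical stand-in for a clustering method."""
        selected = []
        for cluster in cluster_fn(original_samples):   # e.g. cluster sizes 500 / 300 / 200
            k = round(len(cluster) * ratio)            # -> 50 / 30 / 20
            selected.extend(random.sample(cluster, k))
        return selected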
  • The application of the configured test set can refer to Fig. 10, which is a schematic flowchart of determining a classification prediction model provided by an embodiment of the present application; that is, step S500 includes, but is not limited to, the following steps:
  • Step S510 using the pre-marked training samples and the preset training set to train the classification prediction model to obtain a candidate prediction model
  • Step S520 inputting the preset test set into the candidate prediction model to obtain test data
  • Step S530 when the test data conforms to the expected test result, it is determined that the classification prediction model meets the preset training stop condition.
  • By inputting the test set into the candidate prediction model, it can be judged whether the test data output by the candidate prediction model meets the expected test result; if it does, it can be determined that the current candidate prediction model, that is, the classification prediction model, meets the preset condition for stopping training.
  • the labeled samples can be divided into a training set and a test set, and the training set in the labeled samples can be used to train the classification prediction model.
  • the test set in the labeled sample can be directly input into the classification prediction model to obtain the first test data.
  • If the first test data meets the expected test result, it can be directly determined that the classification prediction model meets the preset condition for stopping training.
  • an initial model such as XLNet or textcnn may be trained based on a training set in labeled samples to obtain a classification prediction model
  • the classification prediction model may be a text classification model. It can be understood that, the present application does not specifically limit the type of the classification prediction model to be trained. Since the classification prediction model is updated iteratively, after each round of training of the classification prediction model, the test set can be input into the above classification prediction model to obtain the second test data. When the second test data meets the expected test result, it is determined that the current classification prediction model meets the preset stop training condition.
  • data such as precision rate, recall rate, and F1 value (that is, the harmonic mean of precision rate and recall rate) can be used to characterize expected test results.
  • Taking precision as an example to characterize the expected test result: if the obtained second test data is a precision of 82% and the expected test result is set to 85%, the current classification prediction model does not meet the preset condition for stopping training, and the sample processing method of the embodiment of the present application must continue to be used to train the classification prediction model; if the expected test result is instead set to 80%, the current second test data meets the expected test result, and it is determined that the current classification prediction model meets the preset condition for stopping training.
  • the expected test result can be set according to the actual application scenario, and is not limited to the above-mentioned embodiment, and details are not repeated here.
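  • For illustration, a small sketch of the stop-condition check using precision (the same pattern applies to recall or F1); the 85% threshold is only the example value from the preceding paragraph, and the label lists are placeholders.
    def meets_stop_condition(y_true, y_pred, positive_label, expected_precision=0.85):
        """Compute precision on the test set and compare it with the expected test result."""
        tp = sum(1 for t, p in zip(y_true, y_pred)
                 if p == positive_label and t == positive_label)
        predicted_pos = sum(1 for p in y_pred if p == positive_label)
        precision = tp / predicted_pos if predicted_pos else 0.0
        return precision >= expected_precision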
  • the sample processing method in the embodiment of the present application further includes:
  • Step S600 obtaining target samples to be marked
  • step S700 the target sample to be marked is marked according to the classification prediction model meeting the training stop condition.
  • step S600 and step S700 may be directly executed.
  • The target samples to be labeled are the samples that actually need to be labeled; the target samples to be labeled are input into the classification prediction model that meets the condition for stopping training (that is, into the classification prediction model that has completed training) to obtain probability distribution data to be labeled for classification prediction; labeling attribute data corresponding to the target samples to be labeled is then obtained according to the probability distribution data to be labeled, and the label data corresponding to the target samples to be labeled is determined according to the labeling attribute data.
  • step S600 and step S700 in the embodiment of the present application may be set after step S500, or may be set after step S530.
  • When the classification prediction model meets the preset condition for stopping training, the acquisition of pre-labeled training samples stops; at this time, the trained classification prediction model is used to label the target samples to be labeled, and experts review the label data corresponding to the target samples to be labeled. It can be understood that the target samples to be labeled may be unlabeled samples other than the training samples to be labeled, or may be other samples that actually need to be labeled, which is not specifically limited here.
  • step S100 includes but not limited to the following steps:
  • Step S101 performing data perturbation processing on preset unlabeled samples to obtain perturbed samples
  • Step S102 determining the disturbed samples and unlabeled samples as unlabeled target samples.
  • data perturbation processing is performed on a preset unlabeled sample A to obtain a perturbed sample.
  • For example, unlabeled sample data can be selected from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), and data perturbation processing is performed on each unlabeled sample datum to obtain the perturbed sample corresponding to each unlabeled sample datum. The above perturbed samples and unlabeled sample A are determined as the unlabeled target samples.
  • data disturbance processing includes but not limited to the following methods:
  • 1. Synonym perturbation: initialize a synonym vocabulary, select an unlabeled sample datum from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), perform word segmentation on the unlabeled sample datum to obtain several unlabeled word tokens, randomly select one unlabeled word from them, and perform synonym replacement on the selected unlabeled word to obtain synonym data. That is, search the synonym vocabulary; when a synonym corresponding to the unlabeled word is found, replace the unlabeled word with the synonym to obtain synonym data, which can serve as one of the perturbed samples corresponding to the unlabeled sample datum. When no synonym can be found, another unlabeled word is randomly selected from the remaining tokens for replacement.
  • For example, the unlabeled sample datum is "怎么快速学会唱一首歌" ("how to quickly learn to sing a song"); after word segmentation, several unlabeled word tokens are obtained, and one unlabeled word is randomly selected from them, such as "唱" ("sing"). If no synonym of "唱" can be found in the synonym vocabulary, another unlabeled word is randomly selected from the remaining tokens, such as "怎么" ("how"); the synonyms of "怎么" found in the vocabulary include "如何" and "怎样", so one synonym is randomly selected, for example "如何", to replace the unlabeled word "怎么", yielding the synonym data; that is, the synonym data represents the perturbed sample corresponding to the unlabeled sample datum, namely "如何快速学会唱一首歌" (see the sketch below).
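  • A minimal sketch of the synonym perturbation described above, assuming a pre-built synonym vocabulary (a dict mapping a word to its synonym list) and a word segmentation function; both are hypothetical placeholders, and any Chinese word segmenter could fill the segmentation role.
    import random

    def synonym_perturb(sentence, segment, synonym_vocab):
        """Replace one randomly chosen word that has a synonym; return None if none has one."""
        words = segment(sentence)                      # e.g. ["怎么", "快速", "学会", "唱", "一首", "歌"]
        candidates = list(range(len(words)))
        random.shuffle(candidates)
        for i in candidates:                           # try other words when no synonym is found
            synonyms = synonym_vocab.get(words[i], [])
            if synonyms:
                perturbed = words.copy()
                perturbed[i] = random.choice(synonyms) # e.g. "怎么" -> "如何"
                return "".join(perturbed)
        return None                                    # give up: no word has a synonym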
  • 2. Translation perturbation: select an unlabeled sample datum from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), use a translation tool to translate the unlabeled sample datum into another language to obtain translated data, and then translate the translated data again to obtain one of the perturbed samples corresponding to the unlabeled sample datum, where the language of the perturbed sample is the same as the language of the unlabeled sample datum. That is, the unlabeled sample datum is first translated into another language and then translated back into the source language.
  • the unlabeled sample data can be text data, for example, a sentence can be used as an unlabeled sample data.
  • For example, if the language of unlabeled sample A is Chinese, the Chinese unlabeled sample datum can first be translated into English translated data, and the translated data then translated back into Chinese to obtain a Chinese perturbed sample. It can be understood that the perturbed sample in this embodiment may be exactly the same as the unlabeled sample datum before translation, so a beam search method can be used to ensure that the perturbed sample differs from the unlabeled sample datum before translation.
  • Beam search (beam search) method is an open technology in the field of machine translation, so it will not be repeated here. It can be understood that language translation processing can be performed on unlabeled sample data with the help of multiple different language types, thereby generating multiple perturbed samples.
  • translation perturbation processing can be embodied as: generating a perturbation sample through Chinese-English-Chinese; generating another perturbation sample through Chinese-Italian-Chinese.
  • the Beam search (beam search) method is also used to ensure that the generated disturbance samples are not the same.
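  • A minimal sketch of the translation (back-translation) perturbation, assuming a translate(text, src, dst) callable is available; that callable and the language codes are hypothetical placeholders, and the simple equality check stands in for the beam-search-based guarantee that the perturbed samples differ from the original and from each other.
    def back_translate(sentence, translate, pivots=("en", "it"), src="zh"):
        """Generate one perturbed sample per pivot language, e.g. zh-en-zh and zh-it-zh."""
        perturbed = []
        for pivot in pivots:
            intermediate = translate(sentence, src=src, dst=pivot)
            restored = translate(intermediate, src=pivot, dst=src)
            # keep only candidates that differ from the original and from each other
            if restored != sentence and restored not in perturbed:
                perturbed.append(restored)
        return perturbed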
  • 3. Pre-trained language model perturbation: a pre-trained language model is used to construct perturbed samples; that is, the perturbed samples are obtained by inputting the preset unlabeled sample A into a preset pre-trained language model. The pre-trained language model can be trained with the MASK masking method using models such as BERT and ELECTRA.
  • Taking the BERT model as an example, select an unlabeled sample datum from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), randomly set some of its unlabeled word tokens to MASK to obtain unlabeled mask data, input the unlabeled mask data into the BERT model, and use the BERT model to predict the masked positions, outputting one of the perturbed samples corresponding to the unlabeled sample datum.
  • Since the MASK result predicted by the BERT model, i.e. the perturbed sample, may be identical to the unlabeled sample datum, different combinations of the unlabeled word tokens in the unlabeled sample datum can be set as different unlabeled mask data and input into the BERT model. A maximum of 10 attempts is set; if no suitable perturbed sample has been generated after 10 attempts, this method is abandoned for generating a perturbed sample.
  • For example, the unlabeled sample datum is "怎么快速学会唱一首歌" ("how to quickly learn to sing a song"). BERT tokenization is used to segment the unlabeled sample datum into several unlabeled word tokens. Some of the tokens are then randomly set to MASK to obtain unlabeled mask data, for example no more than 20% of the tokens, such as "怎么MASK速学MASK唱一首歌"; the unlabeled mask data is input into the BERT model to obtain a perturbed sample. If the perturbed sample differs from the unlabeled sample datum, it is determined to be the corresponding perturbed sample; if it is identical, the unlabeled mask data is regenerated. This can be set to repeat at most 10 times; in other embodiments, other numbers of repetitions can also be set, which is not specifically limited here.
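  • A minimal sketch of the masked-language-model perturbation described above, assuming tokenize and fill_mask callables backed by a BERT-style masked language model are available; both callables, the 20% mask ratio, and the 10-attempt limit follow the example above but are otherwise hypothetical placeholders.
    import math
    import random

    def mlm_perturb(sentence, tokenize, fill_mask, mask_ratio=0.2, max_tries=10):
        """Mask up to 20% of the tokens, let the masked LM refill them, retry up to 10 times."""
        tokens = tokenize(sentence)
        n_mask = max(1, math.floor(len(tokens) * mask_ratio))
        for _ in range(max_tries):
            masked = tokens.copy()
            for i in random.sample(range(len(tokens)), n_mask):
                masked[i] = "[MASK]"                   # e.g. "怎么MASK速学MASK唱一首歌"
            candidate = fill_mask(masked)              # model predicts the masked positions
            if candidate != sentence:
                return candidate
        return None                                    # give up after 10 attempts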
  • step S200 includes but not limited to the following steps:
  • Step S201 input the disturbed samples and unlabeled samples into the classification prediction model to obtain disturbed probability distribution data and unlabeled probability distribution data.
  • After the perturbed samples and the unlabeled samples are determined as the unlabeled target samples, the perturbed samples and unlabeled sample A are input into the classification prediction model to obtain perturbation probability distribution data and unlabeled probability distribution data, where the perturbation probability distribution data is the probability distribution data of the classification prediction corresponding to the perturbed samples, and the unlabeled probability distribution data is the probability distribution data of the classification prediction corresponding to unlabeled sample A.
  • In another embodiment, each unlabeled sample datum may be input into the classification prediction model separately to obtain the unlabeled probability distribution data corresponding to each unlabeled sample datum, and the multiple perturbed samples corresponding to each unlabeled sample datum may be input into the classification prediction model separately to obtain the perturbation probability distribution data corresponding to each perturbed sample.
  • step S300 includes but not limited to the following steps:
  • Step S301 calculate and obtain stability data according to the first stability algorithm, disturbance probability distribution data and unmarked probability distribution data.
  • That is, through step S101, step S102, step S201, and step S301 of the embodiment of the present application, the stability data corresponding to unlabeled sample A is obtained. The classification prediction model is used to predict the probability distribution data of the classification prediction for the perturbed samples, i.e. the perturbation probability distribution data, and the probability distribution data of the classification prediction for unlabeled sample A, i.e. the unlabeled probability distribution data; based on the first stability algorithm, the perturbation probability distribution data and the unlabeled probability distribution data are used to calculate the stability data corresponding to unlabeled sample A. It can be understood that each unlabeled sample datum in unlabeled sample A corresponds to its own stability data and its own unlabeled probability distribution data.
  • A first stability algorithm is defined, with which the first stability parameter of unlabeled sample A after data perturbation processing can be calculated. The calculation formula of the first stability algorithm (published only as an image) takes the perturbation probability distribution data and the unlabeled probability distribution data as inputs, where TDS is the first stability parameter, the first stability parameter is used to represent the stability data, N is the number of perturbed samples, and m is the number of classification prediction categories output by the classification prediction model; the larger the TDS, the more stable the corresponding unlabeled sample datum.
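  • The exact TDS formula is published only as an image, so the following is a hedged reconstruction rather than the patent's formula: one plausible reading, consistent with the parameters listed above (N perturbed samples, m classes, a 1/N normalization, and "larger TDS means more stable"), is to average the per-class distances between each perturbed distribution and the unperturbed distribution and negate the result.
    def tds(unlabeled_dist, perturbed_dists):
        """Plausible first-stability-parameter sketch: unlabeled_dist is (q_1..q_m),
        perturbed_dists is a list of N distributions (p_i1..p_im).
        Larger (less negative) values mean the sample is more stable under perturbation."""
        n = len(perturbed_dists)
        if n == 0:
            return float("-inf")    # no valid perturbed sample could be generated
        total = sum(abs(p_ij - q_j)
                    for p_i in perturbed_dists
                    for p_ij, q_j in zip(p_i, unlabeled_dist))
        return -total / n           # the 1/N factor compensates for N varying between samples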
  • step S400 includes but not limited to the following steps:
  • Step S410 sorting the unlabeled samples according to the stability data
  • Step S420 screening the sorted unlabeled samples to obtain training samples to be labeled
  • Step S430 Perform labeling processing on the training samples to be labeled according to the classification prediction model to obtain pre-labeled training samples.
  • this embodiment of the present application includes labeled samples and unlabeled samples. Since the number of unlabeled samples is usually large, if all unlabeled samples are manually labeled directly, the labeling cost will be high. Therefore, in the embodiment of the present application, step S100 to step S500 are used to mark unlabeled samples through the classification prediction model in the training process, so as to obtain pre-labeled training samples. It can be understood that, in the labeling process, it needs to be checked and confirmed by experts, that is, the correct labeling data is queried from experts to obtain pre-labeled training samples.
  • The perturbed samples are obtained by performing data perturbation processing on the preset unlabeled samples, and the perturbed samples and the unlabeled samples are then input into the classification prediction model to obtain the perturbation probability distribution data and the unlabeled probability distribution data; the stability data corresponding to each unlabeled sample datum in the unlabeled samples is then calculated according to the first stability algorithm, the perturbation probability distribution data, and the unlabeled probability distribution data.
  • In the embodiment of the present application, the unlabeled sample data with poor stability are selected from the unlabeled samples and then submitted to experts for labeling and confirmation.
  • The unlabeled samples are sorted by their stability data, for example in ascending or descending order of the stability data. It can be understood that the embodiment of the present application needs to obtain the unlabeled sample data with poor stability in order to train the classification prediction model, so it is necessary to select the top-n unlabeled sample data with the smallest TDS, i.e. the smallest first stability parameter (indicating the worst stability). The value of n can be adjusted according to the actual situation, for example 5% of the total data volume of the unlabeled samples.
  • For example, if the total data volume of the unlabeled samples is 10000, the sorted unlabeled samples are screened to obtain the training samples to be labeled by selecting the 10000*5% unlabeled sample data with the smallest TDS, i.e. the smallest first stability parameter, yielding 500 training samples to be labeled. Afterwards, the training samples to be labeled are labeled according to the classification prediction model to obtain pre-labeled training samples, which experts can then review, saving workload.
  • the embodiment of the present application fully considers the stability of each unlabeled sample data in the unlabeled sample A against data disturbance, thereby making the collected pre-labeled training samples more valuable. And subsequent experts only need to confirm or modify the obtained high-value pre-labeled training samples, such as querying the correct labeled data through experts, thereby effectively reducing the labeling cost.
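  • A minimal sketch of the sorting and screening step, assuming stability is a dict mapping each unlabeled sample datum to its TDS (or LKS) value; the 5% ratio is only the example value used above.
    def select_least_stable(stability, ratio=0.05):
        """Sort unlabeled samples by stability and keep the top-n least stable ones,
        e.g. 10000 unlabeled samples * 5% -> 500 training samples to be labeled."""
        n = max(1, int(len(stability) * ratio))
        ranked = sorted(stability, key=stability.get)   # ascending: smallest TDS/LKS first
        return ranked[:n]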
  • step S100 includes but not limited to the following steps:
  • Step S110 when the current training round for training the classification prediction model is greater than the preset round threshold, the preset unlabeled sample corresponding to the current training round is determined as the unlabeled target sample of the current training round, and the current training round The preset unlabeled sample corresponding to the next training round after the round is determined as the unlabeled target sample of the next training round.
  • the stability data is calculated by obtaining the last k rounds of training of the classification prediction model.
  • step S200 includes but is not limited to the following steps:
  • Step S210 for the current training round, input the corresponding unlabeled target samples into the classification prediction model to obtain the current round probability distribution data of the current training round;
  • Step S220 for the next training round, input the corresponding unlabeled target samples into the classification prediction model to obtain the next round probability distribution data of the next training round;
  • step S230 by analogy, multiple rounds of training processing are performed on the classification prediction model to obtain a plurality of current round probability distribution data and a plurality of next round probability distribution data.
  • the probability distribution data of the current round is the probability distribution data of the classification prediction corresponding to the current training round
  • the probability distribution data of the next round is the probability distribution data of the classification prediction corresponding to the next training round.
  • step S300 includes but not limited to the following steps:
  • Step S310 calculating and obtaining stability data according to the second stability algorithm, the probability distribution data of the current round and the probability distribution data of the next round.
  • the unlabeled sample data with poor stability is selected from the unlabeled target samples, and then let the experts label and confirm.
  • steps S410 to S430 may be used to select unlabeled target samples, and the implementation steps and effects thereof are the same as those described above, and will not be repeated here.
  • In the last k rounds of training of the classification prediction model, each round predicts the unlabeled target samples corresponding to the current training round, obtaining the probability distribution data of the classification prediction of the current training round, i.e. the current-round probability distribution data, and the probability distribution data of the classification prediction of the next training round, i.e. the next-round probability distribution data.
  • the corresponding unlabeled target samples may be the same.
  • the embodiment of the present application needs to perform multiple rounds of training processing on the classification prediction model, so as to obtain multiple current round probability distribution data and multiple next round probability distribution data. That is, for the current training round, the unlabeled target samples corresponding to the current training round are input into the classification prediction model to obtain the current round probability distribution data of the current training round; for the next training round, the next training round The corresponding unlabeled target samples are input to the classification prediction model to obtain the next round probability distribution data of the next training round.
  • For example, assuming the classification prediction model is ultimately trained for 20 rounds and the preset round threshold k is 10, then by inputting the corresponding unlabeled target samples into the classification prediction model that has been trained for 11 rounds, the current-round probability distribution data corresponding to the 11th round can be obtained, which can be denoted Q11; all unlabeled target samples are then input into the classification prediction model, which is trained for another round, the next training round after the 11th being the 12th, and the next-round probability distribution data corresponding to the unlabeled target samples in the 12th round is re-predicted, which can be denoted Q12, and so on, finally obtaining Q20.
  • The current-round probability distribution data can be written as a length-m vector Qx, and the next-round probability distribution data as Q(x+1); for example, for the classification prediction model that has been trained for 11 rounds, the j-th component of Q11 for one unlabeled sample datum among the unlabeled target samples can be understood as the probability that this unlabeled sample datum belongs to the j-th class according to the classification prediction model trained for 11 rounds.
  • A second stability algorithm is defined, with which the second stability parameter of the unlabeled sample data in the unlabeled target samples over the last k rounds can be calculated. The calculation formula of the second stability algorithm (published only as an image) takes the current-round probability distribution data and the next-round probability distribution data as inputs, where LKS is the second stability parameter, the second stability parameter is used to represent the stability data, r is the total number of training rounds of the classification prediction model, x is the current training round, k is the preset round threshold, m is the number of classification prediction categories output by the classification prediction model, and x = r-k+1, r-k+2, ..., r; the larger the LKS, the more stable the corresponding unlabeled sample datum.
  • the corresponding stability data can be calculated for each unlabeled sample data in the unlabeled target samples corresponding to each round.
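  • As with TDS, the exact LKS formula is published only as an image; the following is a hedged reconstruction consistent with the listed parameters (last k of r training rounds, m classes, "larger LKS means more stable"): it penalizes how much a sample's predicted distribution moves between consecutive rounds among the last k rounds.
    def lks(round_dists):
        """Plausible second-stability-parameter sketch: round_dists is the list
        [Q_{r-k+1}, ..., Q_r] of length-m distributions predicted for one unlabeled
        sample datum over the last k training rounds (e.g. Q11..Q20 when r=20, k=10).
        Larger (less negative) values mean more stable predictions across rounds."""
        k = len(round_dists)
        if k < 2:
            return float("-inf")
        total = sum(abs(a - b)
                    for cur, nxt in zip(round_dists, round_dists[1:])
                    for a, b in zip(cur, nxt))
        return -total / (k - 1)     # average over the k-1 consecutive-round comparisons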
  • The unlabeled samples (that is, the unlabeled target samples in this embodiment) are sorted by their stability data, for example in ascending or descending order. It can be understood that the embodiment of the present application needs to obtain the unlabeled sample data with poor stability in order to train the classification prediction model, so it is necessary to select the top-n unlabeled sample data with the smallest LKS, i.e. the smallest second stability parameter (indicating the worst stability). The value of n can be adjusted according to the actual situation, for example 5% of the total data volume of the unlabeled target samples.
  • For example, if the total data volume of the unlabeled target samples is 10000, the sorted unlabeled target samples are screened to obtain the training samples to be labeled by selecting the 10000*5% unlabeled sample data with the smallest LKS, i.e. the smallest second stability parameter, yielding 500 training samples to be labeled. Afterwards, the training samples to be labeled are labeled according to the classification prediction model to obtain pre-labeled training samples, which experts can then review, saving workload.
  • The embodiment of the present application fully considers stability over the training duration, making the collected pre-labeled training samples more valuable and more targeted. Subsequent experts only need to confirm or modify the obtained high-value samples, i.e. the pre-labeled training samples of this embodiment, for example by querying experts for the correct label data, thereby effectively reducing the labeling cost.
  • It can be understood that the total number of training rounds corresponding to the classification prediction model differs for each classification task. The embodiment of the present application can obtain the third test data of y consecutive rounds, for example 5 consecutive rounds, to judge whether the third test data meets the expected test result; for example, when the third test data of y consecutive rounds never exceed the test data of the selected round r, the value of r can be determined. Assuming that the precision of the classification prediction model at the 22nd round of training is 95%, a new high, and the precision of the subsequent 23rd to 27th rounds never exceeds 95%, the total number of training rounds r is determined to be 22. The aforementioned preset round threshold can be adjusted according to the actual situation; for example, it is recommended to take r/2, rounded if r/2 is not an integer, so when the total number of training rounds is 22, the preset round threshold can be 11.
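  • A small sketch of how the total number of training rounds r and the round threshold k could be derived from per-round test data, following the example above (a value at round 22 that is not exceeded in the next y=5 rounds gives r=22 and k=r/2=11); the metric history list and the patience value are placeholders.
    def choose_total_rounds(metric_per_round, patience=5):
        """Return (r, k): r is the 1-based round whose metric is never exceeded in the
        next `patience` rounds; k is the recommended threshold, r/2 rounded."""
        for idx, value in enumerate(metric_per_round):
            later = metric_per_round[idx + 1: idx + 1 + patience]
            if len(later) == patience and all(v <= value for v in later):
                r = idx + 1                     # e.g. 95% at round 22, rounds 23-27 never exceed it
                return r, round(r / 2)          # e.g. r=22 -> k=11
        return None                             # no plateau yet; keep training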
  • step S500 also includes but not limited to the following steps:
  • Step S501 inputting the preset pre-labeled samples into the classification prediction model to obtain the probability distribution data corresponding to the pre-labeled samples;
  • Step S502 according to the probability distribution data corresponding to the pre-labeled samples, the first pre-labeled training samples are obtained, wherein the confidence corresponding to the first pre-labeled training samples is greater than or equal to the preset reliability;
  • Step S503: train the classification prediction model using the first pre-labeled training samples, the training weight values corresponding to the first pre-labeled training samples, and the training set, where the training weight value is the product of the confidence corresponding to the first pre-labeled training sample and a preset hyperparameter.
  • If only the labeled samples and the pre-labeled training samples are used to train the classification prediction model, the remaining unlabeled samples that have not been manually reviewed and labeled cannot be used effectively; therefore, the remaining unlabeled samples that have not been manually reviewed and labeled are defined as preset pre-labeled samples.
  • the classification prediction model in the embodiment of the present application not only uses labeled samples, but also uses pre-labeled samples.
  • According to the probability distribution data corresponding to the pre-labeled samples, the first pre-labeled training samples whose confidence is greater than or equal to the preset confidence are obtained; that is, the first pre-labeled training samples are high-confidence samples.
  • the further obtained labeled data corresponding to the first pre-labeled training samples are pseudo-labels given by the classification prediction model.
  • the pre-labeled samples may be further screened according to the probability distribution data corresponding to the pre-labeled samples.
  • The first pre-labeled training samples are screened out, where the confidence corresponding to the first pre-labeled training samples is greater than or equal to the preset confidence; that is, pre-labeled samples whose confidence is lower than the preset confidence are discarded.
  • the training weight value corresponding to the first pre-labeled training sample is obtained, wherein the training weight value is the product of the confidence degree corresponding to the first pre-labeled training sample and the preset hyperparameter, which can be expressed as w* ⁇ .
  • the confidence degree ⁇ represents the largest probability data in the probability distribution data corresponding to the pre-labeled sample, that is, the probability data corresponding to the pseudo-label (the first pre-labeled training sample);
  • w is a preset hyperparameter, which can be set between 0- 1, the specific value can be preset. It can be understood that the first pre-marked training samples obtained in the embodiment of the present application do not need to be reviewed and confirmed by experts, so as to save workload.
  • For example, take the preset confidence as 0.7 and the preset hyperparameter as 0.5, and assume a three-class sentiment problem, i.e. the number m of classification prediction categories output by the classification prediction model is 3: positive, negative, and neutral. For one pre-labeled sample, the classification prediction model predicts the corresponding probability distribution data as (0.1, 0.6, 0.3), i.e. the probability of positive is 0.1, of negative 0.6, and of neutral 0.3. It can be understood that the order of the probability distribution data corresponding to the output classification prediction categories is consistent during training and prediction. Since the largest probability in this distribution, i.e. the confidence σ, is 0.6, which is below the preset confidence of 0.7, this pre-labeled sample is discarded.
  • The probability distribution data corresponding to another pre-labeled sample is (0.02, 0.9, 0.08), i.e. the probability of positive is 0.02, of negative 0.9, and of neutral 0.08, so the pseudo-label (i.e. the determined first pre-labeled training sample) is the classification prediction category corresponding to the maximum probability of 0.9, namely negative. It can be understood that positive, negative, and neutral are the label data of the embodiment of the present application. In this way, the first pre-labeled training samples are screened out from the pre-labeled samples according to their probability distribution data; the training weight value corresponding to this first pre-labeled training sample is calculated as 0.5*0.9=0.45, and, based on the loss function, the classification prediction model is trained using the first pre-labeled training samples, their corresponding training weight values, and the training set, with the loss function term of this first pre-labeled training sample multiplied by the corresponding training weight value of 0.45.
  • To give lower-confidence first pre-labeled training samples lower training weight values, w is used to distinguish the labeled samples from the first pre-labeled training samples corresponding to pseudo-labels; for example, if w is 0.5, the training weight of a first pre-labeled training sample is always below 0.5, which sets an upper limit on the pseudo-label weight. In this embodiment, the preset confidence is 0.7 and the preset hyperparameter is 0.5.
  • The embodiment of the present application uses the first pre-labeled training samples through a dynamic weight method so as to make full use of the pre-labeled samples. When iteratively updating the classification prediction model, both the labeled samples and the pre-labeled samples are used: for the pre-labeled samples, the classification prediction model gives pseudo-labels, i.e. the high-confidence first pre-labeled training samples are selected, and, based on the confidence corresponding to each first pre-labeled training sample, a dynamic weight is applied to the loss function term of that first pre-labeled training sample.
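  • A minimal PyTorch-style sketch of the dynamic-weight idea described above: pseudo-labeled (first pre-labeled) samples whose confidence σ is at least 0.7 contribute to the loss with weight w*σ (w=0.5 here), while labeled samples keep weight 1; the model, batching, and threshold values follow the example numbers above and are otherwise hypothetical placeholders.
    import torch
    import torch.nn.functional as F

    def mixed_batch_loss(model, labeled_x, labeled_y, pseudo_x,
                         conf_threshold=0.7, w=0.5):
        """Labeled samples get weight 1; kept pseudo-labeled samples get weight w * sigma."""
        loss_labeled = F.cross_entropy(model(labeled_x), labeled_y)

        with torch.no_grad():
            probs = torch.softmax(model(pseudo_x), dim=-1)
            sigma, pseudo_y = probs.max(dim=-1)        # confidence and pseudo-label per sample
            keep = sigma >= conf_threshold             # discard low-confidence pre-labeled samples

        if keep.any():
            per_sample = F.cross_entropy(model(pseudo_x[keep]), pseudo_y[keep], reduction="none")
            loss_pseudo = (w * sigma[keep] * per_sample).mean()   # e.g. weight 0.5 * 0.9 = 0.45
        else:
            loss_pseudo = torch.tensor(0.0)
        return loss_labeled + loss_pseudo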
  • embodiments of the present application can be applied to text classification and other related classifications, such as news classification, sentiment analysis, text review, etc., which can effectively save the amount of manual labeling.
  • embodiments of the present application can also be used in conjunction with other existing screening methods, and the scores of each screening method are weighted and then reranked to select pre-marked training samples with high comprehensive value.
  • the classification prediction model is also used to automatically label the first pre-labeled training sample, which can effectively reduce the labeling cost.
  • the embodiment of the second aspect of the present application also provides a sample processing device, which includes: a memory, a processor, and a computer program stored in the memory and operable on the processor.
  • the processor and memory can be connected by a bus or other means.
  • memory can be used to store non-transitory software programs and non-transitory computer-executable programs.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices.
  • the memory may include memory located remotely from the processor, which remote memory may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The non-transitory software programs and instructions required to implement the sample processing method of the embodiment of the first aspect above are stored in the memory and, when executed by the processor, perform the sample processing method of the above embodiment, for example the method steps described above with reference to the drawings.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or a controller, for example by a processor in the above device embodiment, can cause the processor to execute the sample processing method of the above embodiment, for example to execute the method steps S100 to S500 in Fig. 1 and the method steps S101 to S102 in Fig. 2 described above.
  • To sum up, the embodiment of the present application includes: determining unlabeled target samples; inputting the unlabeled target samples into the classification prediction model to obtain the probability distribution data of classification prediction; calculating the stability data according to the probability distribution data; obtaining pre-labeled training samples according to the stability data; and training the classification prediction model using the pre-labeled training samples and the preset training set until the classification prediction model meets the preset condition for stopping training.
  • the pre-labeled training samples obtained in the embodiment of the present application are more stable and more targeted, and can effectively reduce the cost of labeling samples.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, tape, magnetic disk storage or other magnetic storage devices, or can Any other medium used to store desired information and which can be accessed by a computer.
  • communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media .

Abstract

The present application provides a sample processing method, a device, and a computer-readable storage medium. The method includes: determining unlabeled target samples (S100); inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of classification prediction (S200); calculating stability data according to the probability distribution data (S300); obtaining pre-labeled training samples according to the stability data (S400); and training the classification prediction model using the pre-labeled training samples and a preset training set (S500) until the classification prediction model meets a preset condition for stopping training.

Description

Sample processing method, device, and computer-readable storage medium
Cross-reference to related applications
This application is filed on the basis of the Chinese patent application with application number 202111348688.9 and a filing date of November 15, 2021, and claims priority to that Chinese patent application, the entire contents of which are incorporated herein by reference.
Technical field
The embodiments of the present application relate to, but are not limited to, the technical field of data processing, and in particular to a sample processing method, a device, and a computer-readable storage medium.
Background
In today's society of information explosion, the amount of unlabeled data is usually enormous, while obtaining labeled data is difficult, time-consuming, and costly. Active learning methods can effectively select unlabeled data for labeling and training so as to obtain models with good performance. In real life, data classification is also widely applied, and data classification likewise requires a large amount of training data to obtain good classification results.
In data classification in some situations, labeled samples are usually used to train an initial classification prediction model, and an active learning method is used to perform margin sampling on the unlabeled samples so that the sampled samples can be further labeled manually; the manually labeled samples are then used to train the above classification prediction model to obtain a classification prediction model that meets expectations. However, because the above sampling method is usually based on diversity and uncertainty, the cost of labeling samples is high.
Summary
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the protection scope of the claims.
The embodiments of the present application provide a sample processing method, a device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present application provides a sample processing method, including: determining unlabeled target samples; inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of classification prediction; calculating stability data according to the probability distribution data; obtaining pre-labeled training samples according to the stability data; and training the classification prediction model using the pre-labeled training samples and a preset training set until the classification prediction model meets a preset condition for stopping training.
In a second aspect, an embodiment of the present application further provides a sample processing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the sample processing method described in the first aspect when executing the computer program.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, where the computer-executable instructions are used to execute the sample processing method described in the first aspect.
Other features and advantages of the present application will be set forth in the following description and will in part become apparent from the description or be understood by implementing the present application. The objectives and other advantages of the present application can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings are used to provide a further understanding of the technical solutions of the present application and constitute a part of the description; together with the embodiments of the present application, they serve to explain the technical solutions of the present application and do not constitute a limitation on them.
Fig. 1 is a schematic flowchart of a sample processing method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of determining unlabeled target samples provided by an embodiment of the present application;
Fig. 3 is a schematic flowchart of obtaining probability distribution data provided by an embodiment of the present application;
Fig. 4 is a schematic flowchart of obtaining stability data provided by an embodiment of the present application;
Fig. 5 is a schematic flowchart of determining unlabeled target samples provided by another embodiment of the present application;
Fig. 6 is a schematic flowchart of obtaining probability distribution data provided by another embodiment of the present application;
Fig. 7 is a schematic flowchart of obtaining stability data provided by another embodiment of the present application;
Fig. 8 is a schematic flowchart of obtaining pre-labeled training samples provided by an embodiment of the present application;
Fig. 9 is a schematic flowchart of training a classification prediction model provided by an embodiment of the present application;
Fig. 10 is a schematic flowchart of determining a classification prediction model provided by an embodiment of the present application;
Fig. 11 is a schematic flowchart of labeling target samples to be labeled provided by an embodiment of the present application.
Detailed description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the device schematic diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a module division different from that in the device or in an order different from that in the flowcharts. The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
In today's society of information explosion, the amount of unlabeled data is usually enormous, while obtaining labeled data is difficult, time-consuming, and costly. The active learning method adopted is intended to effectively select unlabeled data for labeling and training so as to obtain a model with good performance while reducing labeling costs. In real life, data classification is also widely applied, and data classification likewise requires a large amount of training data to obtain good classification results.
In data classification in some situations, a small number of labeled samples are usually used to train an initial classification prediction model, an active learning method is used to screen samples from the unlabeled samples, human experts then label the screened/sampled samples, and the labeled samples are added to the original labeled training set to retrain the above classification prediction model; the active learning method is then used to screen samples again, and so on, until a classification prediction model that meets expectations is obtained. However, in the above scheme, because the active learning method usually screens samples based on diversity and uncertainty and does not consider the stability of the sampled samples themselves for model training, the cost of labeling samples is high.
On this basis, the embodiments of the present application provide a sample processing method, a device, and a computer-readable storage medium, which can effectively reduce the cost of labeling samples.
It can be understood that the embodiments of the present application relate to data classification, such as text classification, and text classification includes, but is not limited to, application scenarios such as news classification, sentiment analysis, and text review.
The embodiments of the present application are further described below with reference to the accompanying drawings.
An embodiment of the first aspect of the present application provides a sample processing method, as shown in Fig. 1, which is a schematic flowchart of the sample processing method provided by an embodiment of the present application. The sample processing method of the embodiment of the present application includes, but is not limited to, the following steps:
Step S100: determine unlabeled target samples;
Step S200: input the unlabeled target samples into a classification prediction model to obtain probability distribution data of classification prediction;
Step S300: calculate stability data according to the probability distribution data;
Step S400: obtain pre-labeled training samples according to the stability data;
Step S500: train the classification prediction model using the pre-labeled training samples and a preset training set;
Repeat step S100 to step S500 until the classification prediction model meets a preset condition for stopping training.
It can be understood that, before determining the unlabeled target samples in step S100, the embodiment of the present application also obtains original sample data and then initializes the original sample data to obtain initialization data; the initialization data is then divided into labeled samples and unlabeled samples.
It can be understood that the preset training set comes from the above labeled samples.
In some embodiments, the labeled samples can be divided into a training set and a test set.
For the above initialization processing, multiple initialization processing methods can be set correspondingly. For example, a random sampling method is used to obtain a portion of original target sample data from all the original sample data; for example, the number of obtained original target sample data corresponds to 10% of the original sample data. In other embodiments, the proportion of randomly sampled original target sample data can also be set between 10% and 30%, or it can be adjusted adaptively according to the amount of original sample data, which is not specifically limited in the embodiments of the present application. Afterwards, experts label the original target sample data to generate labeled samples; that is, the initialization data includes labeled samples and unlabeled samples, and the labeled samples are then divided into a training set and a test set. Assuming that there are 1000 original sample data in total, the random sampling method is used to obtain 10% of the original sample data, that is, 100 original sample data are obtained, and these 100 original sample data serve as the original target sample data.
As another example, a clustering method is used to classify the original sample data to obtain classified sample data, a portion of original target sample data is then obtained from the classified sample data according to a preset ratio, and experts then label the original target sample data to generate labeled samples so that the labeled samples can be divided into a training set and a test set. Clustering methods include, but are not limited to, the k-means clustering algorithm, hierarchical clustering methods, and the like, where the distance measure can use word2vec word vectors or edit distance, etc. Assuming that, after classifying the original sample data with the clustering method, classified sample data of three categories are obtained and the numbers of samples in the three categories are 500, 300, and 200 respectively, original target sample data are obtained from the classified sample data of the three categories at a ratio of 10%; that is, the numbers of original target sample data corresponding to the three categories are 500*10%, 300*10%, and 200*10%, so 50, 30, and 20 original target sample data are selected from the classified sample data of the three categories respectively. It can be understood that if the number of samples obtained by the proportional calculation is not an integer, the non-integer is rounded, for example by conventional rounding.
It can be understood that the embodiment of the present application can also adopt other initialization processing methods to initialize the original sample data, which is not limited to the above embodiment and will not be described in detail here.
It can be understood that the application of the configured test set can refer to Fig. 10, which is a schematic flowchart of determining a classification prediction model provided by an embodiment of the present application. That is, step S500 includes, but is not limited to, the following steps:
Step S510: train the classification prediction model using the pre-labeled training samples and the preset training set to obtain a candidate prediction model;
Step S520: input the preset test set into the candidate prediction model to obtain test data;
Step S530: when the test data meets the expected test result, determine that the classification prediction model meets the preset condition for stopping training.
It can be understood that, by inputting the test set into the candidate prediction model, it can be judged whether the test data output by the candidate prediction model meets the expected test result; if it does, it can be determined that the current candidate prediction model, that is, the classification prediction model, meets the preset condition for stopping training.
Before determining the unlabeled target samples in step S100, the labeled samples can be divided into a training set and a test set, and the training set in the labeled samples is used to train the classification prediction model. At this time, the test set in the labeled samples can be input directly into the classification prediction model to obtain first test data; if the first test data meets the expected test result, it can be directly determined that the classification prediction model meets the preset condition for stopping training.
In some embodiments, an initial model such as XLNet or TextCNN can be trained based on the training set in the labeled samples to obtain the classification prediction model; for example, the classification prediction model can be a text classification model. It can be understood that the present application does not specifically limit the type of classification prediction model to be trained. Since the classification prediction model is iteratively updated, after each round of training of the classification prediction model the test set can be input into the classification prediction model to obtain second test data; when the second test data meets the expected test result, it is determined that the current classification prediction model meets the preset condition for stopping training.
It can be understood that data such as precision, recall, and F1 score (i.e. the harmonic mean of precision and recall) can be used to characterize the expected test result. Taking precision as an example: if the obtained second test data, i.e. the precision, is 82% and the expected test result is set to 85%, the current classification prediction model does not meet the preset condition for stopping training, and the sample processing method of the embodiment of the present application must continue to be used to train the classification prediction model; alternatively, if the expected test result is set to 80%, the current second test data meets the expected test result, and it is determined that the current classification prediction model meets the preset condition for stopping training. It can be understood that the expected test result can be set according to the actual application scenario and is not limited to the above embodiment, which will not be described in detail here.
Referring to Fig. 11, it can be understood that, after it is determined that the classification prediction model meets the preset condition for stopping training, the sample processing method of the embodiment of the present application further includes:
Step S600: obtain target samples to be labeled;
Step S700: label the target samples to be labeled according to the classification prediction model that meets the condition for stopping training.
In the embodiment of the present application, after it is determined that the classification prediction model meets the preset condition for stopping training, step S600 and step S700 can be executed directly. The target samples to be labeled are the samples that actually need to be labeled; the target samples to be labeled are input into the classification prediction model that meets the condition for stopping training (that is, into the classification prediction model that has completed training) to obtain probability distribution data to be labeled for classification prediction; labeling attribute data corresponding to the target samples to be labeled is then obtained according to the probability distribution data to be labeled, and the label data corresponding to the target samples to be labeled is determined according to the labeling attribute data.
It can be understood that step S600 and step S700 of the embodiment of the present application can be arranged after step S500 or after step S530.
It should be noted that when the classification prediction model meets the preset condition for stopping training, the acquisition of pre-labeled training samples stops; at this time, the trained classification prediction model is used to label the target samples to be labeled, and experts review the label data corresponding to the target samples to be labeled. It can be understood that the target samples to be labeled may be unlabeled samples other than the training samples to be labeled, or may be other samples that actually need to be labeled, which is not specifically limited here.
Referring to Fig. 2, it can be understood that step S100 includes, but is not limited to, the following steps:
Step S101: perform data perturbation processing on preset unlabeled samples to obtain perturbed samples;
Step S102: determine the perturbed samples and the unlabeled samples as the unlabeled target samples.
In the embodiment of the present application, data perturbation processing is performed on a preset unlabeled sample A to obtain perturbed samples. For example, unlabeled sample data can be selected from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), and data perturbation processing is performed on each unlabeled sample datum to obtain the perturbed sample corresponding to each unlabeled sample datum. The above perturbed samples and unlabeled sample A are determined as the unlabeled target samples.
It can be understood that data perturbation processing includes, but is not limited to, the following methods:
1. Synonym perturbation: initialize a synonym vocabulary, select an unlabeled sample datum from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), perform word segmentation on the unlabeled sample datum to obtain several unlabeled word tokens, randomly select one unlabeled word from the unlabeled word tokens, and perform synonym replacement on the selected unlabeled word to obtain synonym data. That is, search the synonym vocabulary for a synonym; when a synonym corresponding to the unlabeled word is found, replace the unlabeled word with the synonym to obtain synonym data, which can serve as one of the perturbed samples corresponding to the unlabeled sample datum. When no synonym can be found, another unlabeled word is randomly selected from the remaining unlabeled word tokens for synonym replacement. When no synonym can be found for any of the unlabeled word tokens in the unlabeled sample datum, this synonym perturbation method is abandoned for generating a perturbed sample.
For example, the unlabeled sample datum is "怎么快速学会唱一首歌"; after word segmentation, several unlabeled word tokens are obtained, and one unlabeled word is randomly selected from them, such as "唱". If no synonym of "唱" can be found in the synonym vocabulary, another unlabeled word is randomly selected from the remaining unlabeled word tokens, such as "怎么"; synonyms of "怎么" found in the synonym vocabulary include "如何" and "怎样", so one synonym is randomly selected from the vocabulary, for example "如何", to replace the unlabeled word "怎么", yielding the synonym data; that is, the synonym data represents the perturbed sample corresponding to the unlabeled sample datum, namely "如何快速学会唱一首歌".
2. Translation perturbation: select an unlabeled sample datum from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), use a translation tool to translate the unlabeled sample datum into another language to obtain translated data, and then translate the translated data again to obtain one of the perturbed samples corresponding to the unlabeled sample datum, where the language of the perturbed sample is the same as the language of the unlabeled sample datum. That is, the unlabeled sample datum is first translated into another language and then translated back into the source language. It can be understood that the unlabeled sample data can be text data; for example, a sentence can serve as one unlabeled sample datum.
For example, if the language of unlabeled sample A is Chinese, the Chinese unlabeled sample datum can first be translated into English translated data, and the translated data then translated back into Chinese to obtain a Chinese perturbed sample. It can be understood that the perturbed sample in this embodiment may be exactly the same as the unlabeled sample datum before translation, so a beam search method can be used to ensure that the perturbed sample differs from the unlabeled sample datum before translation. The beam search method is a publicly known technique in the field of machine translation and will not be described in detail here. It can be understood that language translation processing can be performed on the unlabeled sample datum with the help of multiple different languages, thereby generating multiple perturbed samples. For example, translation perturbation can be embodied as generating one perturbed sample via Chinese-English-Chinese and another perturbed sample via Chinese-Italian-Chinese; the beam search method is likewise used to ensure that the generated perturbed samples differ from one another.
3. Pre-trained language model perturbation: a pre-trained language model is used to construct perturbed samples; that is, the perturbed samples are obtained by inputting the preset unlabeled sample A into a preset pre-trained language model. The pre-trained language model can be trained with the MASK masking method using models such as BERT and ELECTRA. Taking the BERT model as an example, select an unlabeled sample datum from the preset unlabeled sample A (unlabeled sample A can be a collection of unlabeled sample data), randomly set some of the unlabeled word tokens in the unlabeled sample datum to MASK to obtain unlabeled mask data, input the unlabeled mask data into the BERT model, and use the BERT model to predict the unlabeled mask data, outputting one of the perturbed samples corresponding to the unlabeled sample datum. Since the MASK result predicted by the BERT model, i.e. the perturbed sample, may be identical to the unlabeled sample datum, different combinations of the unlabeled word tokens in the unlabeled sample datum can be set as different unlabeled mask data and input into the BERT model. A maximum of 10 attempts is set; if no suitable perturbed sample has been generated after 10 attempts, this method is abandoned for generating a perturbed sample.
For example, the unlabeled sample datum is "怎么快速学会唱一首歌"; BERT tokenization is used to segment the unlabeled sample datum into several unlabeled word tokens. Some of the unlabeled word tokens in the unlabeled sample datum are randomly set to MASK to obtain unlabeled mask data; for example, no more than 20% of the unlabeled word tokens are randomly set to MASK, such as the unlabeled mask data "怎么MASK速学MASK唱一首歌", which is input into the BERT model to obtain a perturbed sample. When the perturbed sample differs from the unlabeled sample datum, it is determined to be the corresponding perturbed sample; when the perturbed sample is the same as the unlabeled sample datum, the unlabeled mask data is regenerated. This can be set to repeat at most 10 times; in other embodiments, other numbers of repetitions can also be set, which is not specifically limited here.
It can be understood that other data perturbation methods can also be used instead of the above data perturbation methods without affecting the training of the classification prediction model in the embodiments of the present application; all of them fall within the protection scope of the present application and will not be described in detail here.
Then, referring to Fig. 3, it can be understood that step S200 includes, but is not limited to, the following step:
Step S201: input the perturbed samples and the unlabeled samples into the classification prediction model to obtain perturbation probability distribution data and unlabeled probability distribution data.
In the embodiment of the present application, after the perturbed samples and the unlabeled samples are determined as the unlabeled target samples, the perturbed samples and unlabeled sample A are input into the classification prediction model to obtain perturbation probability distribution data and unlabeled probability distribution data, where the perturbation probability distribution data is the probability distribution data of the classification prediction corresponding to the perturbed samples, and the unlabeled probability distribution data is the probability distribution data of the classification prediction corresponding to unlabeled sample A.
In another embodiment, each unlabeled sample datum can also be input into the classification prediction model separately to obtain the unlabeled probability distribution data corresponding to each unlabeled sample datum, and the multiple perturbed samples corresponding to each unlabeled sample datum can be input into the classification prediction model separately to obtain the perturbation probability distribution data corresponding to each perturbed sample.
Referring to Fig. 4, it can be understood that step S300 includes, but is not limited to, the following step:
Step S301: calculate the stability data according to a first stability algorithm, the perturbation probability distribution data, and the unlabeled probability distribution data.
That is, the stability data corresponding to unlabeled sample A is obtained according to step S101, step S102, step S201, and step S301 of the embodiment of the present application.
The classification prediction model is used to predict the probability distribution data of the classification prediction for the perturbed samples, i.e. the perturbation probability distribution data, and to predict the probability distribution data of the classification prediction for unlabeled sample A, i.e. the unlabeled probability distribution data; based on the first stability algorithm, the perturbation probability distribution data, and the unlabeled probability distribution data, the stability data corresponding to unlabeled sample A is calculated. It can be understood that each unlabeled sample datum in unlabeled sample A corresponds to its own stability data.
It can be understood that, assuming one unlabeled sample datum in unlabeled sample A generates N perturbed samples in total, denoted R1, R2, ..., RN, then together with the original unlabeled sample datum there are N+1 samples, for example N+1 sentences. Assuming an m-class problem, where m is the number of classification prediction categories output by the classification prediction model, the unlabeled probability distribution data is a vector of length m, written here as (q1, q2, ..., qm) with component index j = 1, 2, ..., m; in the unlabeled probability distribution data, q1, for example, can be understood as the probability that this unlabeled sample datum in unlabeled sample A belongs to class 1.
Denote the perturbation probability distribution data, written here as Pi = (pi,1, pi,2, ..., pi,m) for the i-th perturbed sample, as the probability distribution data of the classification prediction corresponding to the perturbed samples, and the unlabeled probability distribution data, written here as Q = (q1, q2, ..., qm), as the probability distribution data of the classification prediction corresponding to unlabeled sample A. It can be understood that each unlabeled sample datum in unlabeled sample A can correspond to its own unlabeled probability distribution data.
A first stability algorithm is defined, with which the first stability parameter of unlabeled sample A after data perturbation processing can be calculated.
The calculation formula of the first stability algorithm is given in the published application as an image (PCTCN2022130616-appb-000007); it is computed from the perturbation probability distribution data pi,j and the unlabeled probability distribution data qj, where TDS is the first stability parameter, the first stability parameter is used to represent the stability data, N is the number of perturbed samples, m is the number of classification prediction categories output by the classification prediction model, i = 1, 2, ..., N, and j = 1, 2, ..., m.
It can be understood that the larger the TDS, i.e. the first stability parameter, the more stable the corresponding unlabeled sample datum in unlabeled sample A.
It should be noted that synonym perturbation and pre-trained language model perturbation may fail to generate valid perturbed samples, whereas translation perturbation can generate perturbed samples effectively; for example, through translation perturbation, two different languages can be chosen to generate two perturbed samples respectively. With the above data perturbation methods, the number N of generated perturbed samples may therefore take the value 2, 3, or 4. To address the problem that the number of perturbed samples corresponding to different unlabeled sample data is not uniform, the 1/N coefficient in the above first stability algorithm can be adjusted to ensure the accuracy of the data.
It can be understood that the corresponding stability data can be calculated for each unlabeled sample datum in unlabeled sample A. Referring to Fig. 8, it can be understood that step S400 includes, but is not limited to, the following steps:
Step S410: sort the unlabeled samples according to the stability data;
Step S420: screen the sorted unlabeled samples to obtain training samples to be labeled;
Step S430: label the training samples to be labeled according to the classification prediction model to obtain pre-labeled training samples.
It can be understood that the embodiment of the present application includes labeled samples and unlabeled samples. Since the number of unlabeled samples is usually large, directly labeling all the unlabeled samples manually would result in high labeling costs. Therefore, the embodiment of the present application uses step S100 to step S500 to label unlabeled samples with the classification prediction model during training, thereby obtaining pre-labeled training samples. It can be understood that the labeling process needs to be reviewed and confirmed by experts, that is, the correct label data is queried from experts to obtain the pre-labeled training samples.
In the embodiment of the present application, data perturbation processing is performed on the preset unlabeled samples to obtain perturbed samples, the perturbed samples and the unlabeled samples are then input into the classification prediction model to obtain perturbation probability distribution data and unlabeled probability distribution data, and the stability data corresponding to each unlabeled sample datum in the unlabeled samples is then calculated according to the first stability algorithm, the perturbation probability distribution data, and the unlabeled probability distribution data.
In the embodiment of the present application, the unlabeled sample data with poor stability are selected from the unlabeled samples and then submitted to experts for labeling and confirmation.
The unlabeled samples are sorted by the stability data, for example in ascending or descending order of the stability data. It can be understood that the embodiment of the present application needs to obtain the unlabeled sample data with poor stability in order to train the classification prediction model, so it is necessary to select the top-n unlabeled sample data with the smallest TDS, i.e. the smallest first stability parameter (indicating the worst stability). It can be understood that n can be adjusted according to the actual situation, for example 5% of the total data volume of the unlabeled samples. For example, if the total data volume of the unlabeled samples is 10000, the sorted unlabeled samples are screened to obtain the training samples to be labeled; that is, the 10000*5% unlabeled sample data with the smallest TDS, i.e. the smallest first stability parameter, are selected, yielding 500 training samples to be labeled. Afterwards, the training samples to be labeled are labeled according to the classification prediction model to obtain pre-labeled training samples, which experts can then review to save workload.
It can be understood that the embodiment of the present application fully considers the stability of each unlabeled sample datum in unlabeled sample A against data perturbation, thereby making the collected pre-labeled training samples more valuable. Subsequent experts only need to confirm or modify the obtained high-value pre-labeled training samples, for example by querying experts for the correct label data, thereby effectively reducing the labeling cost.
Referring to Fig. 5, it can be understood that step S100 includes, but is not limited to, the following step:
Step S110: when the current training round for training the classification prediction model is greater than a preset round threshold, determine the preset unlabeled samples corresponding to the current training round as the unlabeled target samples of the current training round, and determine the preset unlabeled samples corresponding to the next training round after the current training round as the unlabeled target samples of the next training round.
The embodiment of the present application calculates the stability data from the last k rounds of training of the classification prediction model.
In each of the last k rounds, the preset unlabeled samples corresponding to the current training round are predicted. Referring to Fig. 6, it can be understood that step S200 includes, but is not limited to, the following steps:
Step S210: for the current training round, input the corresponding unlabeled target samples into the classification prediction model to obtain current-round probability distribution data of the current training round;
Step S220: for the next training round, input the corresponding unlabeled target samples into the classification prediction model to obtain next-round probability distribution data of the next training round;
Step S230: by analogy, perform multiple rounds of training on the classification prediction model to obtain multiple current-round probability distribution data and multiple next-round probability distribution data.
It can be understood that the current-round probability distribution data is the probability distribution data of the classification prediction corresponding to the current training round, and the next-round probability distribution data is the probability distribution data of the classification prediction corresponding to the next training round.
Referring to Fig. 7, it can be understood that step S300 includes, but is not limited to, the following step:
Step S310: calculate the stability data according to a second stability algorithm, the current-round probability distribution data, and the next-round probability distribution data.
In the embodiment of the present application, the unlabeled sample data with poor stability are selected from the unlabeled target samples and then submitted to experts for labeling and confirmation. For example, step S410 to step S430 can be used to select from the unlabeled target samples; the implementation steps and effects are the same as described above and will not be repeated here.
In the last k rounds of training of the classification prediction model in the embodiment of the present application, each round predicts the unlabeled target samples corresponding to the current training round, obtaining the probability distribution data of the classification prediction of the current training round, i.e. the current-round probability distribution data, and the probability distribution data of the classification prediction of the next training round, i.e. the next-round probability distribution data.
It can be understood that training the classification prediction model once on all the training samples is called one round of training, and the training rounds are counted accordingly.
For example, assume that the classification prediction model ultimately needs to be trained for 20 rounds.
The preset round threshold k is set to 10. When the current training round x is 11, the current training round for training the classification prediction model is greater than the preset round threshold of 10, so from round x, i.e. round 11, onwards, the classification prediction model is used to predict the corresponding unlabeled target samples to obtain the corresponding probability distribution data of classification prediction.
It can be understood that the unlabeled target samples corresponding to the current training round and the next training round can be the same.
It can be understood that the embodiment of the present application needs to perform multiple rounds of training on the classification prediction model to obtain multiple current-round probability distribution data and multiple next-round probability distribution data. That is, for the current training round, the unlabeled target samples corresponding to the current training round are input into the classification prediction model to obtain the current-round probability distribution data of the current training round; for the next training round, the unlabeled target samples corresponding to the next training round are input into the classification prediction model to obtain the next-round probability distribution data of the next training round.
By inputting the corresponding unlabeled target samples into the classification prediction model that has been trained for 11 rounds, the current-round probability distribution data corresponding to the 11th round can be obtained, which can be denoted Q11; all the unlabeled target samples are input into the classification prediction model, which is trained for another round, the next training round after the 11th being the 12th; the next-round probability distribution data corresponding to the unlabeled target samples in the 12th round is re-predicted, which can be denoted Q12, and so on, finally obtaining Q20.
For the last k rounds, assuming an m-class problem, for Qx (x = r-k+1, r-k+2, ..., r), i.e. (Q(r-k+1), Q(r-k+2), ..., Qr), denote it as the current-round probability distribution data, a vector of length m; the next-round probability distribution data is then expressed as Q(x+1). For example, for the classification prediction model that has been trained for 11 rounds, the j-th component of Q11 corresponding to one unlabeled sample datum among the unlabeled target samples can be understood as the probability that this unlabeled sample datum belongs to the j-th class according to the classification prediction model trained for 11 rounds.
It follows that, for the second stability algorithm, when the preset round threshold is 10, i.e. for the last 10 rounds, at round 11 it is necessary to obtain Q11 and Q12; at round 12 it is necessary to obtain Q12 and Q13; and so on, until at round 19 it is necessary to obtain Q19 and Q20.
A second stability algorithm is defined, with which the second stability parameter of the unlabeled sample data in the unlabeled target samples over the last k rounds can be calculated.
The calculation formula of the second stability algorithm is given as an image in the original publication (Figure PCTCN2022130616-appb-000019) and is computed from the following quantities: LKS is the second stability parameter, which is used to characterize the stability data; r is the total number of training rounds of the classification prediction model; x is the current training round; k is the preset round threshold; m is the number of classification prediction categories output by the classification prediction model; Q^x_j is the current-round probability distribution data; Q^{x+1}_j is the next-round probability distribution data; and x = r-k+1, r-k+2, ..., r.
It can be understood that the larger the LKS, i.e. the second stability parameter, the more stable the corresponding unlabeled sample data in the unlabeled target samples.
It should be noted that the corresponding stability data can be calculated for each piece of unlabeled sample data in the unlabeled target samples corresponding to each round.
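As with TDS, the LKS formula is published only as an image, so the Python sketch below assumes a definition that averages a distance between the predictions of consecutive rounds over the last k rounds for a single sample; the function name, the L1 distance and the sign convention are illustrative assumptions, not the patented formula.

import numpy as np

def lks_stability(per_round, r, k):
    """Hypothetical LKS-style score for one sample over the last k training rounds.

    per_round: dict mapping round index x -> probability vector Q_x of length m.
    Returns a score that is larger when the predictions change less between rounds.
    """
    diffs = []
    for x in range(r - k + 1, r):  # compare round x with round x + 1
        q_x = np.asarray(per_round[x], dtype=float)
        q_next = np.asarray(per_round[x + 1], dtype=float)
        diffs.append(np.abs(q_x - q_next).sum())
    return -float(np.mean(diffs))  # negate so that a larger LKS means more stable

# Example with r=20, k=10: compares Q_11 with Q_12, Q_12 with Q_13, ..., Q_19 with Q_20.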
The unlabeled samples (i.e. the unlabeled target samples of this embodiment) are sorted by the stability data, for example in ascending or descending order of the stability data. It can be understood that the embodiments of the present application need to obtain the unlabeled sample data with poor stability in order to train the classification prediction model, so the top n pieces of unlabeled sample data with the smallest LKS, i.e. the smallest second stability parameter (indicating the worst stability), need to be selected. It can be understood that n can be adjusted according to the actual situation, for example to 5% of the total amount of unlabeled target sample data. For example, if the total amount of unlabeled target sample data is 10,000, the sorted unlabeled target samples are screened to obtain the training samples to be labeled, that is, the 10,000 × 5% = 500 pieces of unlabeled sample data with the smallest LKS are selected, yielding 500 training samples to be labeled. Afterwards, the training samples to be labeled are labeled according to the classification prediction model to obtain pre-labeled training samples, which experts can then review in order to save effort.
It can be understood that the embodiments of the present application fully consider stability over the course of training, so that the collected pre-labeled training samples are more valuable and better targeted. Subsequently, experts only need to confirm or modify the acquired high-value samples, which in this embodiment are the pre-labeled training samples, for example by being queried for the correct labeling data, which effectively reduces the labeling cost.
It can be understood that the total number of training rounds of the classification prediction model differs for each classification task. In the embodiments of the present application, the third test data of y consecutive rounds, for example 5 consecutive rounds, can be obtained to judge whether the third test data meets the expected test result. For example, when the third test data of y consecutive rounds never exceeds the test data of the candidate round r, the value of r can be determined. Suppose the classification prediction model trained for 22 rounds reaches a precision of 95%, a new high, and the precision of rounds 23 to 27 never exceeds 95%; then the total number of training rounds r is determined to be 22. The above preset round threshold can be adjusted according to the actual situation; for example, it is suggested to take r/2, rounded to an integer if r/2 is not an integer. For example, when the total number of training rounds is 22, the preset round threshold may be set to 11.
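This stopping rule can be sketched as a small helper. The sketch assumes that precision on a held-out test set is the metric being tracked and that the patience is y consecutive rounds; the function name and the dictionary-based interface are illustrative assumptions.

def choose_total_rounds(precision_by_round, patience=5):
    """Return the round r whose precision is never exceeded during the next
    `patience` rounds, together with the suggested round threshold r // 2."""
    best_round, best_precision = None, float("-inf")
    for current in sorted(precision_by_round):
        p = precision_by_round[current]
        if p > best_precision:
            best_round, best_precision = current, p
        elif current - best_round >= patience:
            break  # no new high for `patience` consecutive rounds
    return best_round, best_round // 2

# Example: precision peaks at 95% in round 22 and rounds 23 to 27 never exceed it,
# so r = 22 and the suggested preset round threshold is 11.
precisions = {21: 0.94, 22: 0.95, 23: 0.94, 24: 0.93, 25: 0.94, 26: 0.92, 27: 0.93}
print(choose_total_rounds(precisions))  # (22, 11)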
Referring to FIG. 9, it can be understood that step S500 further includes, but is not limited to, the following steps:
Step S501: inputting preset pre-labeled samples into the classification prediction model to obtain the probability distribution data corresponding to the pre-labeled samples;
Step S502: obtaining first pre-labeled training samples according to the probability distribution data corresponding to the pre-labeled samples, where the confidence corresponding to the first pre-labeled training samples is greater than or equal to a preset confidence;
Step S503: training the classification prediction model using the first pre-labeled training samples, the training weight values corresponding to the first pre-labeled training samples and the training set, where the training weight value is the product of the confidence corresponding to the first pre-labeled training sample and a preset hyperparameter.
If only the labeled samples and the pre-labeled training samples are used to train the classification prediction model, the remaining unlabeled samples that have not been manually reviewed and labeled cannot be used effectively. Therefore, the remaining unlabeled samples that have not been manually reviewed and labeled are defined as the preset pre-labeled samples.
It can be understood that the classification prediction model of the embodiments of the present application uses not only the labeled samples but also the pre-labeled samples. The preset pre-labeled samples are input into the classification prediction model to obtain the probability distribution data corresponding to the pre-labeled samples, and high-confidence samples are selected according to this probability distribution data. Specifically, the first pre-labeled training samples whose confidence is greater than or equal to the preset confidence are obtained according to the probability distribution data corresponding to the pre-labeled samples, that is, the first pre-labeled training samples are high-confidence samples.
It should be noted that, for the pre-labeled samples, the labeling data corresponding to the first pre-labeled training samples further obtained from them are pseudo labels given by the classification prediction model. The embodiments of the present application may further screen the pre-labeled samples according to their corresponding probability distribution data, selecting the first pre-labeled training samples whose confidence is greater than or equal to the preset confidence; samples whose confidence is below the preset confidence are discarded. Afterwards, the training weight value corresponding to each first pre-labeled training sample is obtained, where the training weight value is the product of the confidence of the first pre-labeled training sample and the preset hyperparameter, which can be written as w*σ. Here the confidence σ is the largest probability in the probability distribution data corresponding to the pre-labeled sample, i.e. the probability corresponding to the pseudo label (the first pre-labeled training sample); w is the preset hyperparameter, which can be set between 0 and 1 and whose specific value can be preset. It can be understood that the first pre-labeled training samples obtained in the embodiments of the present application do not need to be reviewed and confirmed by experts, which saves effort.
For example, take a preset confidence of 0.7 and a preset hyperparameter of 0.5, and assume the embodiment is a three-class sentiment classification problem, that is, the number m of classification prediction categories output by the classification prediction model is 3: positive, negative and neutral. For one pre-labeled sample, the classification prediction model predicts the probability distribution data (0.1, 0.6, 0.3), i.e. the probability of positive is 0.1, the probability of negative is 0.6 and the probability of neutral is 0.3. It can be understood that the order of the probability distribution data over the output classification categories is consistent between training and prediction. Since the largest probability in this distribution, i.e. the confidence σ, is 0.6, which is below the preset confidence of 0.7, this pre-labeled sample is discarded. Another pre-labeled sample has the probability distribution data (0.02, 0.9, 0.08), i.e. the probability of positive is 0.02, the probability of negative is 0.9 and the probability of neutral is 0.08; the pseudo label (which determines the first pre-labeled training sample) is the classification category corresponding to the largest probability 0.9, namely negative. It can be understood that positive, negative and neutral are the labeling data of this embodiment. In this way, the first pre-labeled training samples are screened out of the pre-labeled samples according to their corresponding probability distribution data. The training weight value corresponding to this first pre-labeled training sample is calculated as 0.5 × 0.9 = 0.45. Finally, based on the loss function, the classification prediction model is trained using the first pre-labeled training samples, their corresponding training weight values and the training set, with the loss function term of the first pre-labeled training sample multiplied by the corresponding training weight value of 0.45.
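The screening and weighting in this worked example can be sketched as follows. The per-sample weight w * sigma, the threshold of 0.7 and w = 0.5 follow the example above, while the function name and the list-based interface are illustrative assumptions.

import numpy as np

def pseudo_label_terms(prob_dists, threshold=0.7, w=0.5):
    """For each predicted distribution return (pseudo_label, weight) or None.

    The weight w * sigma multiplies that sample's loss term, so low-confidence
    pseudo labels contribute less than fully labeled samples (whose weight is 1.0).
    """
    terms = []
    for probs in prob_dists:
        probs = np.asarray(probs, dtype=float)
        sigma = float(probs.max())       # confidence = largest predicted probability
        if sigma < threshold:
            terms.append(None)           # discard low-confidence pseudo labels
        else:
            terms.append((int(probs.argmax()), w * sigma))
    return terms

# (0.1, 0.6, 0.3) is discarded; (0.02, 0.9, 0.08) yields class 1 ("negative") with weight 0.45.
print(pseudo_label_terms([[0.1, 0.6, 0.3], [0.02, 0.9, 0.08]]))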
In order to give the first pre-labeled training samples with lower confidence a lower training weight value, the embodiments of the present application use w to distinguish the labeled samples from the first pre-labeled training samples carrying pseudo labels. For example, when w is 0.5, the training weight of a first pre-labeled training sample is always below 0.5, which sets an upper bound on the weight of pseudo labels. In this embodiment, the preset confidence is 0.7 and the preset hyperparameter is 0.5.
The embodiments of the present application use a dynamic weighting method to exploit the first pre-labeled training samples and thereby make full use of the pre-labeled samples. When iteratively updating the classification prediction model, both the labeled samples and the pre-labeled samples are used. For the pre-labeled samples, the classification prediction model provides pseudo labels, that is, high-confidence first pre-labeled training samples are selected, and a dynamic weight based on the confidence of each first pre-labeled training sample is applied to its loss function term.
It can be understood that the embodiments of the present application can be applied to text classification and related classification tasks, such as news classification, sentiment analysis and text moderation, and can effectively reduce the amount of manual labeling. In addition, the embodiments of the present application can be combined with other existing screening methods: the scores of the individual screening methods are weighted and the samples are re-ranked, so as to select pre-labeled training samples with high overall value.
It can be understood that, in some cases, although there are approaches that train a model with unlabeled samples, they do not distinguish between real labels provided by manual labeling and pseudo labels predicted by a classification prediction model whose training is not yet complete. In contrast, the first pre-labeled training samples obtained in the embodiments of the present application are more valuable, better targeted and more representative. The embodiments of the present application also use the classification prediction model to automatically label the first pre-labeled training samples, which can effectively reduce the labeling cost.
In addition, an embodiment of the second aspect of the present application further provides a sample processing device, which includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
The processor and the memory may be connected by a bus or in other ways.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include a high-speed random access memory and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some implementations, the memory may include memories remotely located relative to the processor, and these remote memories may be connected to the processor through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The non-transitory software programs and instructions required to implement the sample processing method of the embodiments of the first aspect are stored in the memory and, when executed by the processor, carry out the sample processing method of the above embodiments, for example, performing the above-described method steps S100 to S500 in FIG. 1, method steps S101 to S102 in FIG. 2, method step S201 in FIG. 3, method step S301 in FIG. 4, method step S110 in FIG. 5, method steps S210 to S230 in FIG. 6, method step S310 in FIG. 7, method steps S410 to S430 in FIG. 8, method steps S501 to S503 in FIG. 9, method steps S510 to S530 in FIG. 10, and method steps S600 to S700 in FIG. 11.
The device embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or controller, for example by a processor in the above device embodiment, cause the processor to perform the sample processing method of the above embodiments, for example, performing the above-described method steps S100 to S500 in FIG. 1, method steps S101 to S102 in FIG. 2, method step S201 in FIG. 3, method step S301 in FIG. 4, method step S110 in FIG. 5, method steps S210 to S230 in FIG. 6, method step S310 in FIG. 7, method steps S410 to S430 in FIG. 8, method steps S501 to S503 in FIG. 9, method steps S510 to S530 in FIG. 10, and method steps S600 to S700 in FIG. 11.
The embodiments of the present application include: determining unlabeled target samples; inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of the classification prediction; calculating stability data according to the probability distribution data; obtaining pre-labeled training samples according to the stability data; and training the classification prediction model using the pre-labeled training samples and a preset training set until the classification prediction model meets a preset stop-training condition. Compared with some existing approaches, the pre-labeled training samples obtained in the embodiments of the present application are more stable and better targeted, which can effectively reduce the labeling cost of samples.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or an appropriate combination thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, it is well known to those of ordinary skill in the art that communication media typically contain computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or another transmission mechanism, and may include any information delivery media.
Several embodiments of the present application have been described above, but the present application is not limited to these embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present application, and such equivalent modifications or substitutions are included within the scope defined by the claims of the present application.

Claims (15)

  1. A sample processing method, comprising:
    determining unlabeled target samples;
    inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of a classification prediction;
    calculating stability data according to the probability distribution data;
    obtaining pre-labeled training samples according to the stability data; and
    training the classification prediction model using the pre-labeled training samples and a preset training set until the classification prediction model meets a preset stop-training condition.
  2. The method according to claim 1, wherein the determining unlabeled target samples comprises:
    performing data perturbation processing on preset unlabeled samples to obtain perturbed samples; and
    determining the perturbed samples and the unlabeled samples as the unlabeled target samples.
  3. The method according to claim 2, wherein the inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of a classification prediction comprises:
    inputting the perturbed samples and the unlabeled samples into the classification prediction model to obtain perturbed probability distribution data and unlabeled probability distribution data;
    wherein the perturbed probability distribution data is the probability distribution data of the classification prediction corresponding to the perturbed samples, and the unlabeled probability distribution data is the probability distribution data of the classification prediction corresponding to the unlabeled samples.
  4. The method according to claim 3, wherein the calculating stability data according to the probability distribution data comprises:
    calculating the stability data according to a first stability algorithm, the perturbed probability distribution data and the unlabeled probability distribution data.
  5. The method according to claim 4, wherein the calculation formula of the first stability algorithm is given as an image in the original publication (Figure PCTCN2022130616-appb-100001) and is computed from the following quantities:
    wherein the TDS is a first stability parameter, the first stability parameter is used to characterize the stability data, the N is the number of the perturbed samples, the m is the number of classification prediction categories output by the classification prediction model, Q^{R_i}_j denotes the perturbed probability distribution data, Q^A_j denotes the unlabeled probability distribution data, i = 1, 2, ..., N, and j = 1, 2, ..., m.
  6. The method according to claim 1, wherein the determining unlabeled target samples comprises:
    when a current training round in which the classification prediction model is trained is greater than a preset round threshold, determining preset unlabeled samples corresponding to the current training round as unlabeled target samples of the current training round, and determining preset unlabeled samples corresponding to a next training round after the current training round as unlabeled target samples of the next training round.
  7. The method according to claim 6, wherein the inputting the unlabeled target samples into a classification prediction model to obtain probability distribution data of a classification prediction comprises:
    for the current training round, inputting the corresponding unlabeled target samples into the classification prediction model to obtain current-round probability distribution data of the current training round;
    for the next training round, inputting the corresponding unlabeled target samples into the classification prediction model to obtain next-round probability distribution data of the next training round;
    and so on, performing multiple rounds of training on the classification prediction model to obtain a plurality of pieces of current-round probability distribution data and a plurality of pieces of next-round probability distribution data.
  8. The method according to claim 7, wherein the calculating stability data according to the probability distribution data comprises:
    calculating the stability data according to a second stability algorithm, the current-round probability distribution data and the next-round probability distribution data.
  9. The method according to claim 8, wherein the calculation formula of the second stability algorithm is given as an image in the original publication (Figure PCTCN2022130616-appb-100004) and is computed from the following quantities:
    wherein the LKS is a second stability parameter, the second stability parameter is used to characterize the stability data, the r is the total number of training rounds of the classification prediction model, the x is the current training round, the k is the preset round threshold, the m is the number of classification prediction categories output by the classification prediction model, Q^x_j denotes the current-round probability distribution data, Q^{x+1}_j denotes the next-round probability distribution data, and x = r-k+1, r-k+2, ..., r.
  10. The method according to any one of claims 2 to 9, wherein the obtaining pre-labeled training samples according to the stability data comprises:
    sorting the unlabeled samples according to the stability data;
    screening the sorted unlabeled samples to obtain training samples to be labeled; and
    labeling the training samples to be labeled according to the classification prediction model to obtain the pre-labeled training samples.
  11. The method according to any one of claims 1 to 9, wherein the training the classification prediction model using the pre-labeled training samples and a preset training set further comprises:
    inputting preset pre-labeled samples into the classification prediction model to obtain probability distribution data corresponding to the pre-labeled samples;
    obtaining first pre-labeled training samples according to the probability distribution data corresponding to the pre-labeled samples, wherein a confidence corresponding to the first pre-labeled training samples is greater than or equal to a preset confidence; and
    training the classification prediction model using the first pre-labeled training samples, training weight values corresponding to the first pre-labeled training samples and the training set, wherein the training weight value is a product of the confidence corresponding to the first pre-labeled training sample and a preset hyperparameter.
  12. The method according to any one of claims 1 to 9, wherein the training the classification prediction model using the pre-labeled training samples and a preset training set until the classification prediction model meets a preset stop-training condition comprises:
    training the classification prediction model using the pre-labeled training samples and the preset training set to obtain a candidate prediction model;
    inputting a preset test set into the candidate prediction model to obtain test data; and
    when the test data meets an expected test result, determining that the classification prediction model meets the preset stop-training condition.
  13. The method according to any one of claims 1 to 9, further comprising:
    obtaining target samples to be labeled; and
    labeling the target samples to be labeled according to the classification prediction model that meets the stop-training condition.
  14. A sample processing device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the sample processing method according to any one of claims 1 to 13.
  15. A computer-readable storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to perform the sample processing method according to any one of claims 1 to 13.
PCT/CN2022/130616 2021-11-15 2022-11-08 Sample processing method and device, and computer-readable storage medium WO2023083176A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111348688.9A CN114091595A (zh) 2021-11-15 2021-11-15 Sample processing method and device, and computer-readable storage medium
CN202111348688.9 2021-11-15

Publications (1)

Publication Number Publication Date
WO2023083176A1 true WO2023083176A1 (zh) 2023-05-19

Family

ID=80300847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130616 WO2023083176A1 (zh) 2021-11-15 2022-11-08 Sample processing method and device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114091595A (zh)
WO (1) WO2023083176A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091595A (zh) * 2021-11-15 2022-02-25 南京中兴新软件有限责任公司 样本处理方法、设备及计算机可读存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476256A (zh) * 2019-01-24 2020-07-31 北京京东尚科信息技术有限公司 Model training method and apparatus based on semi-supervised learning, and electronic device
US20210256420A1 (en) * 2020-02-19 2021-08-19 Microsoft Technology Licensing, Llc System and method for improving machine learning models by detecting and removing inaccurate training data
CN112308144A (zh) * 2020-10-30 2021-02-02 江苏云从曦和人工智能有限公司 Method, system, device and medium for screening samples
CN113590764A (zh) * 2021-09-27 2021-11-02 智者四海(北京)技术有限公司 Training sample construction method and apparatus, electronic device and storage medium
CN114091595A (zh) * 2021-11-15 2022-02-25 南京中兴新软件有限责任公司 Sample processing method and device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN114091595A (zh) 2022-02-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891970

Country of ref document: EP

Kind code of ref document: A1