CN115293271A - Training method, device and equipment of prediction model and storage medium - Google Patents

Training method, device and equipment of prediction model and storage medium

Info

Publication number
CN115293271A
Authority
CN
China
Prior art keywords
sample set
training
samples
target
prediction
Prior art date
Legal status
Pending
Application number
CN202210950841.3A
Other languages
Chinese (zh)
Inventor
Xie Guoping (谢国平)
Current Assignee
Qingniuzhisheng Technology Co ltd
Original Assignee
Qingniuzhisheng Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Qingniuzhisheng Technology Co ltd filed Critical Qingniuzhisheng Technology Co ltd
Priority to CN202210950841.3A
Publication of CN115293271A
Status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a training method for a prediction model, comprising the following steps: acquiring a first training sample set and a first prediction model trained on the first training sample set; acquiring samples to be predicted, predicting them with the first prediction model, and dividing them into a first sample set and a second sample set according to the prediction probability of the prediction result; predicting the samples to be predicted with a plurality of different second prediction models and selecting a third sample set accordingly; merging the first sample set and the third sample set, and performing a similarity calculation between the merged samples and the samples of the first training sample set to obtain a target sample set; labeling each sample of the target sample set to obtain a third training sample set; performing machine labeling on the second sample set to obtain a second training sample set; and merging the first, second and third training sample sets and training the first prediction model on the result to obtain a target prediction model.

Description

Training method, device and equipment of prediction model and storage medium
Technical Field
The present invention relates to the field of active learning, and in particular, to a method, an apparatus, a device, and a storage medium for training a prediction model.
Background
Active learning is an effective means of constructing samples, commonly used in sample labeling scenarios. Starting from the uncertainty of sample predictions, it lets humans label the samples that the model predicts inaccurately and finds relatively 'difficult', which greatly improves labeling efficiency and reduces the cost of manual labeling. Adding the manually labeled samples to the training set then improves the accuracy of the model.
Prior-art methods also optimize around the uncertainty principle of active learning, for example by screening 'difficult' samples under different criteria through different strategies for subsequent manual labeling. However, these methods do not exploit the differences between open-source pre-trained models and remain limited to improvements built on the uncertainty principle alone. If only uncertainty is considered, the screened samples may concentrate in a certain class, causing problems of data imbalance and poor diversity.
Based on this, there is a need in the art for a new training method, apparatus, device and storage medium for a prediction model that solve the above technical problems.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for training a prediction model, which can ensure that samples screened out for subsequent manual labeling are more balanced.
In order to solve the technical problems, the invention adopts a technical scheme that: provided is a training method of a prediction model, comprising the following steps:
acquiring a first training sample set and a first prediction model obtained based on the training of the first training sample set;
obtaining a sample to be predicted, predicting the sample to be predicted by adopting the first prediction model to obtain a first prediction result, wherein the first prediction result comprises a first class label and a corresponding prediction probability, and dividing the sample to be predicted into a first sample set and a second sample set according to the prediction probability;
predicting the samples to be predicted based on a plurality of different second prediction models to obtain second prediction results, and selecting a third sample set from the samples to be predicted according to the second prediction results;
merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation on the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result;
labeling each sample of the target sample set to obtain a third training sample set;
performing machine labeling processing on the second sample set according to the first class label of the second sample set to obtain a second training sample set;
merging the first training sample set, the second training sample set and the third training sample set to obtain a target training sample set;
and training the first prediction model by adopting the target training sample set to obtain a target prediction model.
Preferably, the merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation on the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result includes:
merging the first sample set and the third sample set to obtain a merged sample set;
characterizing each sample of the first training sample set as a first semantic vector, and characterizing each sample of the merged sample set as a second semantic vector;
normalizing the first semantic vector and the second semantic vector;
calculating the similarity between the normalized second semantic vector and the normalized first semantic vector;
and selecting a sample with the similarity lower than a preset threshold from the merged sample set according to the similarity calculation result to obtain the target sample set.
Preferably, the merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation on the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result includes:
merging the first sample set and the third sample set to obtain a merged sample set;
representing each sample of the first training sample set into a first semantic vector by using a SimBERT model;
establishing an index library based on the first semantic vector by using a Faiss library;
characterizing each sample of the merged sample set into a second semantic vector by using a SimBERT model;
inputting the second semantic vector into the index library to obtain the corresponding first semantic vector with the highest similarity, and calculating to obtain the similarity;
and selecting a sample with the similarity lower than a preset threshold from the merged sample set according to the similarity calculation result to obtain the target sample set.
Preferably, the merging the first training sample set, the second training sample set, and the third training sample set to obtain a target training sample set includes:
adjusting the number of samples of the second training sample set according to the total number of samples of the third training sample set;
and merging the first training sample set, the third training sample set and the adjusted second training sample set to obtain a target training sample set.
Preferably, the adjusting the number of samples of the second training sample set according to the total number of samples of the third training sample set comprises:
calculating the target number of samples of the second training sample set according to the total number of samples of the third training sample set and a preset proportion, wherein the preset proportion is the ratio of the total number of samples of the third training sample set to the target number of samples of the second training sample set;
and adjusting the number of samples in the second training sample set according to the target sample number calculation result.
Preferably, the adjusting the number of samples in the second training sample set according to the target sample number calculation result includes:
counting the number of samples of each class label of the third training sample set;
and adjusting the sample number of each class label of the second training sample set according to the statistical result and the target sample number calculation result.
Preferably, after the training of the first prediction model by using the target training sample set is performed to obtain a target prediction model, the method further includes:
predicting the samples of the target sample set by adopting the target prediction model to obtain a third prediction result;
comparing the third prediction result of the target sample set with the corresponding labeling processing result to obtain a comparison result;
and obtaining the sample labeled with the error in the third training sample set according to the comparison result, correcting the sample, and updating the third training sample set.
In order to solve the technical problem, the invention adopts another technical scheme that: there is provided a training apparatus of a predictive model, including:
the first training module is used for acquiring a first training sample set and a first prediction model obtained by training based on the first training sample set;
the first prediction module is used for acquiring a sample to be predicted, predicting the sample to be predicted by adopting the first prediction model to obtain a first prediction result, wherein the first prediction result comprises a first class label and a corresponding prediction probability, and dividing the sample to be predicted into a first sample set and a second sample set according to the prediction probability;
the second prediction module is used for predicting the samples to be predicted based on a plurality of different second prediction models to obtain second prediction results, and selecting a third sample set from the samples to be predicted according to the second prediction results;
a target sample set obtaining module, configured to combine the first sample set and the third sample set to obtain a combined sample set, perform similarity calculation on the combined sample set and the samples of the first training sample set, and obtain a target sample set according to a similarity calculation result;
the first labeling module is used for labeling each sample of the target sample set to obtain a third training sample set;
the second labeling module is used for performing machine labeling processing on the second sample set according to the first class label of the second sample set to obtain a second training sample set;
a merging module, configured to merge the first training sample set, the second training sample set, and the third training sample set to obtain a target training sample set;
and the second training module is used for training the first prediction model by adopting the target training sample set to obtain a target prediction model.
In order to solve the technical problem, the invention adopts another technical scheme that: there is provided a computer apparatus comprising a processor and a memory coupled to the processor, the memory storing program instructions for implementing a training method for a predictive model as described above.
In order to solve the technical problem, the invention adopts another technical scheme that: there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a predictive model as described above.
The invention has the beneficial effects that: through the above steps, the method screens out of the samples to be predicted those samples that are difficult for the first prediction model to predict, and removes highly similar samples through similarity calculation to obtain the target sample set.
Drawings
FIG. 1 is a flow chart illustrating a method for training a predictive model according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of step S104 of FIG. 1;
FIG. 3 is a schematic flow chart diagram illustrating another embodiment of step S104 in FIG. 1;
FIG. 4 is a flowchart illustrating a method for training a predictive model according to a second embodiment of the invention;
FIG. 5 is a flowchart illustrating a method for training a prediction model according to a third embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for training a prediction model according to a fourth embodiment of the present invention;
FIG. 7 is a flowchart illustrating a method for training a prediction model according to a fifth embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a training apparatus for a prediction model according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to implicitly indicate the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise. All directional indicators (such as upper, lower, left, right, front, rear, 8230; etc.) in the embodiments of the present invention are only used to explain the relative positional relationship between the components at a certain posture (as shown in the drawing), the motion, etc., and if the certain posture is changed, the directional indicator is correspondingly changed. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
Fig. 1 is a flowchart illustrating a method for training a prediction model according to a first embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the method comprises the steps of:
step S101: obtaining a first training sample set and a first prediction model obtained based on the training of the first training sample set.
In step S101, a large amount of data is first acquired from historical data as a training sample library D. A fixed number of samples D0 are randomly selected from the sample library D and manually labeled to obtain a first training sample set D0'. The pre-trained model is then trained with the first training sample set D0' to obtain a first prediction model. The pre-trained model of this embodiment may be selected according to the actual situation, such as BERT, ERNIE, or RoBERTa.
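As a minimal illustrative sketch of step S101 (not part of the original disclosure; the checkpoint name `bert-base-chinese`, the toy `texts`/`labels` data and the training hyperparameters are all assumptions), the first prediction model can be obtained by fine-tuning a pre-trained encoder on D0' with the Hugging Face transformers library:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

texts = ["sample text 1", "sample text 2"]      # D0: samples randomly drawn from library D
labels = [0, 1]                                 # manual labels, giving D0'

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)),
                    batch_size=2, shuffle=True)

optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for input_ids, attention_mask, y in loader:     # one epoch shown; repeat as needed
    loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```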
Step S102: obtaining a sample to be predicted, predicting the sample to be predicted by adopting the first prediction model to obtain a first prediction result, wherein the first prediction result comprises a first class label and a corresponding prediction probability, and dividing the sample to be predicted into a first sample set and a second sample set according to the prediction probability.
In step S102, a new batch of samples is selected from the sample library D as the samples to be predicted D1, which are then predicted with the first prediction model to obtain a first prediction result. The first prediction result of each sample in D1 comprises a class label and a corresponding prediction probability. In this embodiment, the first sample set E1 and the second sample set E2 are obtained by thresholding the prediction probability: samples whose prediction probability falls within a preset interval are regarded as samples that are hard for the first prediction model to predict and form the first sample set E1; the other samples are regarded as samples that the first prediction model predicts easily and accurately, and form the second sample set E2. For example, samples with prediction probabilities in the interval 0.3-0.7 are taken as the first sample set E1, and the others as the second sample set E2. In summary, step S102 divides the samples to be predicted D1 into a hard-to-predict first sample set E1 and an easy-to-predict second sample set E2 for processing in the following steps.
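A sketch of the split in step S102, assuming a helper `predict_proba` that returns the first prediction model's class probabilities for one sample (the helper name and the interval bounds are illustrative):

```python
def split_by_uncertainty(samples, predict_proba, low=0.3, high=0.7):
    """Divide D1 into hard samples E1 and easy samples E2 by top-class probability."""
    e1, e2 = [], []
    for s in samples:
        top_p = max(predict_proba(s))        # probability of the predicted class label
        (e1 if low <= top_p <= high else e2).append(s)
    return e1, e2
```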
Step S103: and predicting the samples to be predicted based on a plurality of different second prediction models to obtain second prediction results, and selecting a third sample set from the samples to be predicted according to the second prediction results.
In step S103, a plurality of second prediction models are trained on the first training sample set D0'; the pre-trained models may be selected according to the actual situation, such as BERT, ERNIE, or RoBERTa. The samples to be predicted D1 are then predicted with the resulting plurality of different pre-trained models, and their prediction results are compared to select the samples on which the predictions differ. In this embodiment, when the class labels that the plurality of pre-trained models predict for a sample are not completely consistent, that sample is judged to have differing prediction results. These samples form the third sample set E3.
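A sketch of the disagreement-based selection in step S103; `models` is an assumed list of callables, each returning a class label for one sample:

```python
def select_disagreement(samples, models):
    """Keep the samples whose predicted class labels are not completely consistent."""
    e3 = []
    for s in samples:
        predicted = {m(s) for m in models}   # distinct labels from the second prediction models
        if len(predicted) > 1:
            e3.append(s)
    return e3
```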
Step S104: and merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation on the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result.
In step S104, the first sample set E1 and the third sample set E3 are merged to obtain a merged sample set P1. In this step, the samples in the merged sample set P1 are also deduplicated to prevent duplicate samples from appearing in one set.
The samples in the merged sample set P1 are then screened. In this embodiment, the similarity between the samples of the merged sample set P1 and the samples D0 of the first training sample set D0' is calculated, and the samples of P1 with low similarity to the first training sample set are retained. This step screens out samples that differ from the original training samples and removes samples highly similar to them, yielding a high-information target sample set P2. It further improves the richness of the screened samples: the differences between samples are large, fewer samples need to be obtained, and the manual labeling workload is reduced. Richness here means the diversity or variability of the samples.
Specifically, in one embodiment, step S104 includes:
step S201: and combining the first sample set and the third sample set to obtain a combined sample set.
Step S202: and characterizing each sample of the first training sample set as a first semantic vector, and characterizing each sample of the merged sample set as a second semantic vector.
Step S203: and normalizing the first semantic vector and the second semantic vector.
Step S204: and calculating the similarity between the normalized second semantic vector and the normalized first semantic vector.
Step S205: and selecting a sample with the similarity lower than a preset threshold value from the merged sample set according to the similarity calculation result to obtain the target sample set.
Further, through the above steps, the first sample set E1 and the third sample set E3 are merged into the merged sample set P1, whose samples are deduplicated to prevent duplicates within one set. Each sample of the first training sample set D0' is then characterized as a first semantic vector, and each sample of the merged sample set P1 as a second semantic vector; that is, every sample of the two sets is characterized as a corresponding semantic vector, in particular using the SimBERT model. The first and second semantic vectors are normalized, and the similarity between the second semantic vector of each sample of the merged sample set P1 and the first semantic vector of each sample of the first training sample set D0' is calculated one by one.
Specifically, the similarity is calculated as follows: the dot products between the second semantic vector of a sample of the merged sample set P1 and all first semantic vectors are computed, giving a set of dot-product values; the larger the dot product, the higher the similarity between the two samples. The maximum dot-product value identifies the most similar vector, i.e., the most similar sample, and serves as the judgment reference. If the maximum dot product between a sample of the merged sample set P1 and the first semantic vectors is smaller than a preset threshold, the sample's similarity to the first training sample set D0' is below the threshold; it is a low-similarity sample and can be retained. If the maximum dot product is greater than or equal to the preset threshold, the similarity is above the threshold; it is a high-similarity sample and is discarded. Through these steps, the samples with similarity below the preset threshold are selected from the merged sample set P1 to obtain the target sample set P2.
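A sketch of steps S202-S205 with NumPy; `encode` stands in for the SimBERT encoder and the threshold value is an assumption. After L2 normalization the dot product equals cosine similarity, and a merged-set sample is retained only when even its most similar training sample falls below the threshold:

```python
import numpy as np

def filter_low_similarity(merged, train, encode, threshold=0.9):
    v_train = np.stack([encode(s) for s in train])          # first semantic vectors
    v_train /= np.linalg.norm(v_train, axis=1, keepdims=True)
    kept = []
    for s in merged:
        v = encode(s)
        v = v / np.linalg.norm(v)                           # second semantic vector
        if (v_train @ v).max() < threshold:                 # max dot product = judgment reference
            kept.append(s)
    return kept                                             # target sample set P2
```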
In another embodiment, step S104 includes:
step S301: and combining the first sample set and the third sample set to obtain a combined sample set.
Step S302: each sample of the first set of training samples is characterized as a first semantic vector using a SimBERT model.
Step S303: and establishing an index library based on the first semantic vector by utilizing a Faiss library.
Step S304: and characterizing each sample of the merged sample set as a second semantic vector by using a SimBERT model.
Step S305: and inputting the second semantic vector into the index library to obtain the corresponding first semantic vector with the highest similarity, and calculating to obtain the similarity.
Step S306: and selecting a sample with the similarity lower than a preset threshold value from the merged sample set according to the similarity calculation result to obtain the target sample set.
Through the above steps, an index library is built on the first semantic vectors using the Faiss library; that is, the existing training samples are indexed with Faiss. In the subsequent similarity calculation, a second semantic vector is queried against the index library to directly obtain the most similar sample, and its similarity is computed directly as the dot product between the second semantic vector of the merged-set sample and its most similar first semantic vector. If the dot-product value is smaller than the preset threshold, the sample's similarity to the first training sample set D0' is below the threshold; it is a low-similarity sample and can be retained. If the dot-product value is greater than or equal to the preset threshold, the similarity is above the threshold; it is a high-similarity sample and is discarded. Through these steps, the samples with similarity below the preset threshold are selected from the merged sample set P1 to obtain the target sample set P2. In this embodiment, building the index library further accelerates screening the low-similarity target sample set P2 out of the merged sample set P1.
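A sketch of steps S302-S306 using the Faiss library (the threshold and the `encode` stand-in for SimBERT are assumptions); an inner-product index over L2-normalized vectors returns each query's most similar training vector directly:

```python
import faiss
import numpy as np

def filter_with_faiss(merged, train, encode, threshold=0.9):
    xb = np.stack([encode(s) for s in train]).astype("float32")
    faiss.normalize_L2(xb)                   # in-place L2 normalization
    index = faiss.IndexFlatIP(xb.shape[1])   # inner product = cosine after normalization
    index.add(xb)                            # index library over first semantic vectors

    xq = np.stack([encode(s) for s in merged]).astype("float32")
    faiss.normalize_L2(xq)
    sims, _ = index.search(xq, 1)            # top-1 similarity for each merged-set sample
    return [s for s, sim in zip(merged, sims[:, 0]) if sim < threshold]
```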
Through steps S101-S104, the target sample set P2 is screened out of the samples to be predicted D1. The target sample set P2 contains samples that are hard for the first prediction model to predict and that differ greatly from the samples of the existing training set; these are the so-called 'difficult' samples of the first prediction model, and they carry high information richness.
Step S105: and labeling each sample of the target sample set to obtain a third training sample set.
In step S105, the target sample set P2 is manually labeled to obtain a third training sample set P2'.
Step S106: and performing machine labeling processing on the second sample set according to the first class label of the second sample set to obtain a second training sample set.
In step S106, since the samples of the second sample set E2 are easier to predict and their prediction results are more accurate, the prediction result of step S102 can be used directly as the class label of the samples of the second sample set E2. Because the prediction result of step S102 was produced by the first prediction model, this amounts to machine-labeling the second sample set E2 directly with the first prediction model. More specifically, the first prediction model predicts the second sample set E2 to obtain the prediction probability of the class label corresponding to each sample, and the class labels whose prediction probability exceeds a preset threshold are used as the final class labels, yielding the second training sample set E2'.
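A sketch of the machine labeling in step S106, reusing the assumed `predict_proba` helper; the confidence threshold is illustrative:

```python
def machine_label(e2, predict_proba, threshold=0.9):
    """Label easy samples with the first model's own predictions above the threshold."""
    labeled = []
    for s in e2:
        probs = predict_proba(s)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] > threshold:
            labeled.append((s, top))         # (sample, first class label)
    return labeled                           # second training sample set E2'
```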
Step S107: and merging the first training sample set, the second training sample set and the third training sample set to obtain a target training sample set.
In step S107, the first training sample set D0', the second training sample set E2', and the third training sample set P2' are combined to obtain a target training sample set P3'. The second training sample set E2' is added because adding only the training samples about which the model is uncertain (the third training sample set P2') cannot by itself improve the model; training samples the model predicts easily and confidently (the second training sample set E2') must also be added so that the training samples stay relatively balanced.
Step S108: and training the first prediction model by adopting the target training sample set to obtain a target prediction model.
In step S108, the first prediction model is trained by using the target training sample set P3', so as to obtain a target prediction model.
It should be noted that, in training a prediction model, multiple iterations are usually required for the model to achieve a good prediction effect, so steps S102-S108 need to be repeated, with new samples added for prediction and training in each iteration. After a certain number of iterations, the target prediction model is evaluated to see whether the accuracy of its predictions meets expectations; if not, steps S102-S108 are repeated until the accuracy meets expectations and iteration stops.
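The outer loop can be sketched as follows; `draw_batch`, `build_training_set` (steps S102-S107) and `retrain` (step S108) are assumed wrappers around the pieces above, and `evaluate` is an assumed accuracy metric:

```python
def iterate_training(model, draw_batch, build_training_set, retrain, evaluate, goal):
    while evaluate(model) < goal:            # stop once accuracy meets the expectation
        d1 = draw_batch()                    # new samples to be predicted from library D
        p3 = build_training_set(model, d1)   # target training sample set P3'
        model = retrain(model, p3)           # new target prediction model
    return model
```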
The training method of the prediction model of the first embodiment of the invention predicts a new batch of samples to be predicted with the first prediction model and, according to the prediction probability, divides them into a second sample set that the model determines easily and a first sample set that the model determines with difficulty; predicts the samples to be predicted with a plurality of different pre-trained models to obtain a third sample set with inconsistent prediction results; merges the first sample set and the third sample set into a merged sample set and screens out of it, as the target sample set, the samples with low similarity to the first training sample set; manually labels the target sample set and machine-labels the second sample set with the prediction results of the first prediction model; and finally combines the first, second and third training sample sets into a target training sample set with which the first prediction model is trained to obtain the target prediction model. Through these steps, the samples that the first prediction model finds difficult are screened out of the samples to be predicted, and highly similar samples are removed by similarity calculation to obtain the target sample set.
Further, referring to fig. 4, on the basis of the above embodiment, step S107 in one embodiment includes the following steps:
step S401: and adjusting the number of samples of the second training sample set according to the total number of samples of the third training sample set.
In step S401, in the actual training process, in order to ensure that the newly added training samples are more balanced, that is, that samples the first prediction model predicts easily and samples it predicts with difficulty are added together, the ratio of the number of samples in the third training sample set P2' to that in the second training sample set E2' needs to be controlled. Therefore, referring to fig. 5, step S401 specifically includes the following steps:
step S501: and calculating the number of target samples of the second training sample set according to the total number of samples of the third training sample set and a preset proportion, wherein the preset proportion is the proportion of the total number of samples of the third training sample set and the number of target samples of the second training sample set.
Step S502: and adjusting the number of samples in the second training sample set according to the target sample number calculation result.
In steps S501-S502, the preset proportion may be set according to the actual situation. In the early stage of training, i.e., when the number of iterations is small, the share of the third training sample set P2' is slightly larger; for example, the ratio of the total number of samples of the third training sample set P2' to the target number of samples of the second training sample set E2' is 1:7. In the later stage of training, i.e., after the number of iterations reaches a certain point, the share of the third training sample set P2' is gradually reduced, for example to a ratio of 1:9 or smaller. The target number of samples is calculated from the total number of samples of the third training sample set P2' and the preset proportion. For example, if the total number of samples of the third training sample set P2' is 100 and the preset proportion is 1:7, the target number of samples is 700; if the second training sample set E2' then contains 1000 samples, 300 samples need to be removed.
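A sketch of steps S501-S502 under the example 1:7 proportion (the trimming strategy, simple truncation, is an assumption; in practice the removal is class-aware, as described next):

```python
def adjust_easy_set(e2_labeled, p2_total, ratio=7):
    """Trim E2' to |P2'| * ratio samples, e.g. 100 * 7 = 700."""
    target = p2_total * ratio
    return e2_labeled[:target]               # drop the surplus (300 of 1000 in the example)
```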
Step S402: and merging the first training sample set, the third training sample set and the adjusted second training sample set to obtain a target training sample set.
Through steps S501-S502, the balance of the newly added training samples is further controlled, so that the proportion of hard-to-predict and easy-to-predict samples among the newly added samples is more balanced.
Further, referring to fig. 6, on the basis of the above embodiment, step S502 in one embodiment includes the following steps:
step S601: and counting the number of samples of each class label of the third training sample set.
Step S602: and adjusting the sample number of each class label of the second training sample set according to the statistical result and the target sample number calculation result.
In steps S601-S602, the class labels of the samples of the third training sample set P2' are counted. For a binary classification task, for example, the samples of the third training sample set P2' share one pair of class labels: positive and negative. Suppose the third training sample set P2' holds 100 samples, of which 40 are positive and 60 negative (positive samples are generally noticeably fewer than negative ones), a positive-to-negative ratio of 2:3. If the positive samples were the more numerous side, then among the samples retained in the second training sample set E2' the negative samples should be more numerous, and among the rejected samples the positive ones more numerous; that is, the second training sample set E2' would keep more negative samples among its remaining 700. In this implementation, the ratio of positive to negative samples among the newly added training samples (the third training sample set P2' and the second training sample set E2') is expected to stay within a preset bound. Since the positive-to-negative ratio of the third training sample set P2' is 2:3, the newly added samples can be made to meet the expectation by adjusting the number of samples of each class label of the second training sample set E2'. For example, to bring the overall positive-to-negative ratio of the newly added training samples toward 1:1 when the third training sample set P2' already contains 40 positive samples, the retained portion of the second training sample set E2' is skewed toward positive samples accordingly. Of course, these figures are only for ease of illustration and no exact ratio is mandated.
The specific numbers of positive and negative samples are then calculated from the target number of samples, and the samples of the second training sample set E2' are finally retained or removed accordingly, completing the adjustment of the second training sample set E2'.
In a multi-classification task, the ratio of the number of samples of the most frequent class label to the number of samples of the least frequent class label among the newly added training samples (the third training sample set P2' and the second training sample set E2') is likewise expected to stay within a preset bound. Of course, the ratio can be modified according to the actual situation.
Through steps S601-S602, the balance of the newly added training samples is further increased, so that the samples are more representative and the model's learning efficiency is improved.
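A sketch of steps S601-S602 for the binary example above (the equal per-label goal encodes an illustrative 1:1 target; labels absent from P2' are simply dropped here, a simplification):

```python
from collections import Counter

def balance_per_label(p2_labels, e2_labeled, target_total):
    counts = Counter(p2_labels)                       # e.g. {positive: 40, negative: 60}
    goal = (target_total + sum(counts.values())) // len(counts)   # per-label goal, e.g. 400
    budget = {lbl: max(goal - c, 0) for lbl, c in counts.items()} # e.g. {pos: 360, neg: 340}
    kept = []
    for sample, lbl in e2_labeled:
        if budget.get(lbl, 0) > 0:
            kept.append((sample, lbl))
            budget[lbl] -= 1
    return kept                                       # adjusted second training sample set E2'
```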
In an alternative embodiment, referring to fig. 7 on the basis of the above embodiment, after step S108, the following steps are further included:
step S701: and predicting the samples of the target sample set by adopting the target prediction model to obtain a third prediction result.
Step S702: and comparing the third prediction result of the target sample set with the corresponding labeling processing result to obtain a comparison result.
Step S703: and obtaining the sample labeled with the error in the third training sample set according to the comparison result, correcting the sample, and updating the target training sample set.
Through steps S701-S703, the target sample set P2 is predicted with the target prediction model to obtain a third prediction result. The class label predicted for each sample of the target sample set P2 is compared with its manually labeled class label, and samples for which the two differ are judged to be labeled incorrectly. The mislabeled samples are manually reviewed and corrected, and the third training sample set P2' is updated. This step rechecks the manually assigned class labels and prevents manual labeling errors from affecting model learning.
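A sketch of steps S701-S703; `target_model` is an assumed callable returning a class label:

```python
def flag_suspect_labels(p2_samples, manual_labels, target_model):
    """Flag target-set samples whose model prediction disagrees with the manual label."""
    suspects = []
    for s, y_manual in zip(p2_samples, manual_labels):
        y_pred = target_model(s)             # third prediction result
        if y_pred != y_manual:
            suspects.append((s, y_manual, y_pred))
    return suspects                          # candidates for manual review and correction
```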
Fig. 8 is a schematic structural diagram of a training apparatus 80 for a prediction model according to an embodiment of the present invention. As shown in fig. 8, the apparatus 80 includes a first training module 81, a first prediction module 82, a second prediction module 83, a target sample set obtaining module 84, a first labeling module 85, a second labeling module 86, a merging module 87, and a second training module 88.
The first training module 81 is configured to obtain a first training sample set and a first prediction model trained based on the first training sample set;
the first prediction module 82 is configured to obtain a sample to be predicted, predict the sample to be predicted by using the first prediction model to obtain a first prediction result, where the first prediction result includes a first class label and a corresponding prediction probability, and divide the sample to be predicted into a first sample set and a second sample set according to the prediction probability;
the second prediction module 83 is configured to predict the sample to be predicted based on a plurality of different second prediction models to obtain a second prediction result, and select a third sample set from the sample to be predicted according to the second prediction result;
the target sample set obtaining module 84 is configured to combine the first sample set and the third sample set to obtain a combined sample set, perform similarity calculation on the combined sample set and the samples of the first training sample set, and obtain a target sample set according to a similarity calculation result;
the first labeling module 85 is configured to label each sample of the target sample set to obtain a third training sample set;
the second labeling module 86 is configured to perform machine labeling processing on the second sample set according to the first class label of the second sample set, so as to obtain a second training sample set;
the merging module 87 is configured to merge the first training sample set, the second training sample set, and the third training sample set to obtain a target training sample set;
the second training module 88 is configured to train the first prediction model by using the target training sample set to obtain a target prediction model.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 9, the computer device 90 includes a processor 91 and a memory 92 coupled to the processor 91. The memory 92 stores program instructions for implementing the training method of the predictive model according to any of the embodiments described above. Processor 91 is operative to execute program instructions stored in memory 92.
The processor 91 may also be referred to as a CPU (Central Processing Unit). The processor 91 may be an integrated circuit chip having signal processing capabilities. The processor 91 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention. The computer storage medium of the embodiment of the present invention stores a computer program 101 capable of implementing all of the methods described above. The computer program 101 may be stored in the computer storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned computer storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, as well as terminal devices such as computers, servers, mobile phones, and tablets.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is only a logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for training a predictive model, comprising:
acquiring a first training sample set and a first prediction model obtained based on the training of the first training sample set;
obtaining a sample to be predicted, predicting the sample to be predicted by adopting the first prediction model to obtain a first prediction result, wherein the first prediction result comprises a first class label and a corresponding prediction probability, and dividing the sample to be predicted into a first sample set and a second sample set according to the prediction probability;
predicting the samples to be predicted based on a plurality of different second prediction models to obtain second prediction results, and selecting a third sample set from the samples to be predicted according to the second prediction results;
merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation on the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result;
labeling each sample of the target sample set to obtain a third training sample set;
performing machine labeling processing on the second sample set according to the first class label of the second sample set to obtain a second training sample set;
merging the first training sample set, the second training sample set and the third training sample set to obtain a target training sample set;
and training the first prediction model by adopting the target training sample set to obtain a target prediction model.
2. The method of claim 1, wherein the merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation between the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result comprises:
merging the first sample set and the third sample set to obtain a merged sample set;
characterizing each sample of the first training sample set as a first semantic vector, and characterizing each sample of the merged sample set as a second semantic vector;
normalizing the first semantic vector and the second semantic vector;
calculating the similarity between the normalized second semantic vector and the normalized first semantic vector;
and selecting a sample with the similarity lower than a preset threshold from the merged sample set according to the similarity calculation result to obtain the target sample set.
3. The method of claim 1, wherein the merging the first sample set and the third sample set to obtain a merged sample set, performing similarity calculation between the merged sample set and the samples of the first training sample set, and obtaining a target sample set according to a similarity calculation result comprises:
merging the first sample set and the third sample set to obtain a merged sample set;
representing each sample of the first training sample set into a first semantic vector by using a SimBERT model;
establishing an index library based on the first semantic vector by using a Faiss library;
characterizing each sample of the merged sample set into a second semantic vector by using a SimBERT model;
inputting the second semantic vector into the index library to obtain the corresponding first semantic vector with the highest similarity, and calculating to obtain the similarity;
and selecting a sample with the similarity lower than a preset threshold value from the merged sample set according to the similarity calculation result to obtain the target sample set.
4. The method for training the predictive model according to claim 1, wherein the combining the first training sample set, the second training sample set, and the third training sample set to obtain a target training sample set comprises:
adjusting the number of samples of the second training sample set according to the total number of samples of the third training sample set;
and merging the first training sample set, the third training sample set and the adjusted second training sample set to obtain a target training sample set.
5. The method for training the predictive model according to claim 4, wherein the adjusting the number of samples of the second training sample set according to the total number of samples of the third training sample set comprises:
calculating the number of target samples of the second training sample set according to the total number of samples of the third training sample set and a preset proportion, wherein the preset proportion is the proportion of the total number of samples of the third training sample set to the number of target samples of the second training sample set;
and adjusting the number of samples in the second training sample set according to the target sample number calculation result.
6. The training method of the prediction model according to claim 5, wherein the adjusting the number of samples in the second training sample set according to the target number of samples calculation result comprises:
counting the number of samples of each class label of the third training sample set;
and adjusting the sample number of each class label of the second training sample set according to the statistical result and the target sample number calculation result.
7. The method for training a prediction model according to claim 1, wherein after the training of the first prediction model with the target training sample set to obtain a target prediction model, the method further comprises:
predicting the samples of the target sample set by adopting the target prediction model to obtain a third prediction result;
comparing the third prediction result of the target sample set with the corresponding labeling processing result to obtain a comparison result;
and obtaining the sample labeled with the error in the third training sample set according to the comparison result, correcting the sample, and updating the third training sample set.
8. An apparatus for training a predictive model, comprising:
the system comprises a first training module, a second training module and a third prediction module, wherein the first training module is used for acquiring a first training sample set and a first prediction model obtained based on the training of the first training sample set;
the first prediction module is used for obtaining a sample to be predicted, predicting the sample to be predicted by adopting the first prediction model to obtain a first prediction result, wherein the first prediction result comprises a first class label and a corresponding prediction probability, and dividing the sample to be predicted into a first sample set and a second sample set according to the prediction probability;
the second prediction module is used for predicting the samples to be predicted based on a plurality of different second prediction models to obtain second prediction results, and selecting a third sample set from the samples to be predicted according to the second prediction results;
a target sample set obtaining module, configured to combine the first sample set and the third sample set to obtain a combined sample set, perform similarity calculation on the combined sample set and the samples of the first training sample set, and obtain a target sample set according to a similarity calculation result;
the first labeling module is used for labeling each sample of the target sample set to obtain a third training sample set;
the second labeling module is used for performing machine labeling processing on the second sample set according to the first class label of the second sample set to obtain a second training sample set;
a merging module, configured to merge the first training sample set, the second training sample set, and the third training sample set to obtain a target training sample set;
and the second training module is used for training the first prediction model by adopting the target training sample set to obtain a target prediction model.
9. A computer device comprising a processor and a memory coupled to the processor, wherein the memory stores program instructions for implementing a training method for a predictive model as claimed in any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out a method of training a prediction model according to any one of claims 1 to 7.
CN202210950841.3A 2022-08-09 2022-08-09 Training method, device and equipment of prediction model and storage medium Pending CN115293271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210950841.3A CN115293271A (en) 2022-08-09 2022-08-09 Training method, device and equipment of prediction model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210950841.3A CN115293271A (en) 2022-08-09 2022-08-09 Training method, device and equipment of prediction model and storage medium

Publications (1)

Publication Number Publication Date
CN115293271A 2022-11-04

Family

ID: 83828814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210950841.3A Pending CN115293271A (en) 2022-08-09 2022-08-09 Training method, device and equipment of prediction model and storage medium

Country Status (1)

Country Link
CN (1) CN115293271A (en)

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination