CN114169460A - Sample screening method, sample screening device, computer equipment and storage medium

Sample screening method, sample screening device, computer equipment and storage medium

Info

Publication number
CN114169460A
CN114169460A (application CN202111524368.4A)
Authority
CN
China
Prior art keywords
sample
model
training
screened
samples
Prior art date
Legal status
Pending
Application number
CN202111524368.4A
Other languages
Chinese (zh)
Inventor
李策
郝芳
李熠
黄寅
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN202111524368.4A
Publication of CN114169460A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application relates to the field of artificial intelligence, and in particular to a sample screening method, apparatus, computer device, storage medium, and computer program product. The method comprises the following steps: obtaining the model algorithm of a trained first model; determining a plurality of initialized sub-models according to the model algorithm of the first model, the model hyper-parameters of each of the plurality of sub-models being kept consistent; training each sub-model on the full sample set to be screened to obtain a plurality of test models; predicting each sample in the full sample set to be screened with the plurality of test models to obtain a plurality of prediction results corresponding to each sample; screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each sample; and screening target abnormal samples out of the candidate abnormal samples and removing them from the full sample set to be screened to obtain a normal sample set. By adopting the method, the efficiency of sample screening can be improved.

Description

Sample screening method, sample screening device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for screening a sample.
Background
With the development of machine learning technology, the selection of learning samples has become more and more important: it determines whether a learning model can be successfully constructed at all, and also how well the model performs once deployed. The massive data of the financial industry is limited by various human and system factors, so it inevitably contains some noise data, and this noise degrades the quality of the samples used by a machine learning model and thereby harms the training effect of the model.
Clearly, high-quality learning samples contain fewer noise samples, and a model can more easily learn valuable information from them. However, high-quality learning samples are expensive to obtain: removing noise from massive sample data generally requires a large amount of manpower and material resources, so this screening approach suffers from low sample-screening efficiency.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a sample screening method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the efficiency of sample screening.
In a first aspect, the present application provides a method of screening a sample. The method comprises the following steps:
obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
and screening a target abnormal sample from the candidate abnormal samples, and removing the target abnormal sample from the total sample set to be screened to obtain a normal sample set.
In one embodiment, the training step of the first model includes:
constructing a first model to be trained based on a target machine learning algorithm;
acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set;
and training the first model to be trained on the training set, testing on the test set until a training stopping condition is reached, and obtaining the trained first model.
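The first-model training step above can be sketched as follows, assuming scikit-learn is available. The decision-tree choice and the held-out test set follow the later embodiments; the function name, the 4:1 default split, and the simple scoring step are illustrative assumptions rather than part of the patent.

```python
# Hedged sketch of the first-model training step (assumes scikit-learn).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_first_model(X, y, test_size=0.2, seed=0):
    # Divide the full sample set to be screened into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)          # train on the training set
    score = model.score(X_test, y_test)  # test on the held-out test set
    return model, score
```

In practice the loop "train until the stopping condition is reached" would wrap this, retraining or tuning until the test-set score stops improving.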
In one embodiment, the training each sub-model through the full sample set to be screened to obtain a plurality of test models includes:
acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set;
forming an initial test model sequence by the plurality of sub models;
training each sub-model in the initial test model sequence on the training set by respectively adopting different random number seeds, and testing on the testing set to obtain a plurality of trained test models; and the trained test models corresponding to the plurality of sub-models jointly form a trained test model sequence.
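The sub-model training step can be sketched as follows, again assuming scikit-learn; the function name `train_test_model_sequence` and the use of `random_state` as the per-model random number seed are illustrative assumptions.

```python
# Hedged sketch: N identically configured sub-models, each trained with a
# different random number seed, form the trained test-model sequence.
from sklearn.tree import DecisionTreeClassifier

def train_test_model_sequence(X_train, y_train, n_models=10, **hyper):
    sequence = []
    for seed in range(n_models):
        # same hyper-parameters for every sub-model, different seed each time
        sub = DecisionTreeClassifier(random_state=seed, **hyper)
        sub.fit(X_train, y_train)
        sequence.append(sub)
    return sequence
```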
In one embodiment, the screening out candidate abnormal samples from the full-scale sample set to be screened based on the prediction result corresponding to each training sample respectively includes:
for each training sample, counting the number of prediction results representing insufficient prediction confidence in a plurality of prediction results corresponding to the corresponding training sample;
determining a judgment coefficient corresponding to a corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models;
and when the judgment coefficient is larger than a preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
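The candidate-screening rule above can be read as a vote fraction. A minimal sketch follows; the interpretation of the judgment coefficient as (low-confidence votes) / (total test models), and the 0.5 default threshold, are assumptions, since the patent leaves the threshold as a preset value.

```python
# Hedged sketch of the judgment-coefficient rule.
def judgment_coefficient(low_confidence_flags):
    """low_confidence_flags: one boolean per test model, True when that
    model's prediction of the sample showed insufficient confidence."""
    return sum(low_confidence_flags) / len(low_confidence_flags)

def is_candidate_abnormal(low_confidence_flags, threshold=0.5):
    # flagged as a candidate abnormal sample when the coefficient
    # exceeds the preset threshold
    return judgment_coefficient(low_confidence_flags) > threshold
```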
In one embodiment, the screening out the target abnormal sample from the candidate abnormal samples includes:
and checking the candidate abnormal samples one by one based on a preset checking rule, judging whether each candidate abnormal sample is a real abnormal sample, and if so, taking the real abnormal sample as a target abnormal sample.
In one embodiment, the screening a target abnormal sample from the candidate abnormal samples, and removing the target abnormal sample from the total to-be-screened sample set to obtain a normal sample set includes:
acquiring a second model obtained by training a full sample set to be screened;
candidate abnormal samples are removed one by one from the full sample set to be screened; each time a candidate is removed, the second model is retrained on the remaining samples of the training set of the sample set to be screened and tested on the test set of the sample set to be screened, to check whether the training effect improves;
if the training effect is improved, determining the candidate abnormal sample eliminated at the current time as a target abnormal sample, and if the training effect is reduced, determining the candidate abnormal sample eliminated at the current time as a normal sample;
after all the candidate abnormal samples in the training set of the sample set to be screened are removed, the training set and the test set of the sample set to be screened are divided again, and all the samples in the last test set are ensured to be in the training set at this time;
returning to the step of eliminating the candidate abnormal samples one by one from the full sample set to be screened and continuing to execute until the sample inspection of all the candidate abnormal samples is completed;
and (4) all the residual samples after the target abnormal samples in the total sample set to be screened are removed to form a normal sample set.
In a second aspect, the present application also provides a sample screening device. The device comprises:
the acquisition module is used for acquiring a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
the initialization module is used for determining a plurality of initialized sub-models according to the model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
the training module is used for respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
the prediction module is used for predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
the screening module is used for screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
the screening module is further configured to screen a target abnormal sample from the candidate abnormal samples, and remove the target abnormal sample from the total to-be-screened sample set to obtain a normal sample set.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
and screening a target abnormal sample from the candidate abnormal samples, and removing the target abnormal sample from the total sample set to be screened to obtain a normal sample set.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
and screening a target abnormal sample from the candidate abnormal samples, and removing the target abnormal sample from the total sample set to be screened to obtain a normal sample set.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
and screening a target abnormal sample from the candidate abnormal samples, and removing the target abnormal sample from the total sample set to be screened to obtain a normal sample set.
According to the sample screening method, apparatus, computer device, storage medium, and computer program product above, the model algorithm of the first model trained on the full sample set to be screened is obtained, a plurality of initialized sub-models are determined from that model algorithm, and each sub-model is trained on the full sample set to be screened to obtain a plurality of test models. A plurality of test models corresponding to the current full sample set to be screened are thereby constructed first, laying the foundation for the subsequent removal of target abnormal samples. Each sample in the full sample set to be screened is then predicted by the plurality of test models, so the prediction results of multiple test models on the same sample can be combined to judge quickly and accurately whether the sample is a suspected abnormal sample. Suspected samples undergo further screening and confirmation, so the real abnormal samples can finally be screened out of the full sample set to be screened quickly and accurately and eliminated, yielding a normal sample set.
Drawings
FIG. 1 is a diagram showing an environment in which the sample screening method is applied in one embodiment;
FIG. 2 is a schematic flow chart of a sample screening method according to an embodiment;
FIG. 3 is a block diagram showing the structure of a sample screening apparatus according to an embodiment;
FIG. 4 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The sample screening method provided by the embodiments of the present application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104, or located on the cloud or another network server. The sample screening method in the embodiments of the present application can be implemented by the terminal alone, by the server alone, or by the terminal and the server cooperatively. Taking cooperative implementation as an example: a user obtains, through the terminal, the model algorithm of a trained first model; a plurality of initialized sub-models are determined according to that model algorithm; each sub-model is trained on the full sample set to be screened to obtain a plurality of test models; each sample in the full sample set to be screened is predicted by the plurality of test models to obtain a plurality of prediction results corresponding to each sample; candidate abnormal samples are screened out of the full sample set to be screened based on the prediction results corresponding to each sample; target abnormal samples are screened out of the candidate abnormal samples; and finally the target abnormal samples are removed from the full sample set to be screened to obtain a normal sample set. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet-of-Things device, or portable wearable device; Internet-of-Things devices include smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices, and the like.
The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In an embodiment, as shown in fig. 2, a sample screening method is provided, which is described by taking an example that the method is applied to a computer device in fig. 1 (the computer device may specifically be a terminal or a server in fig. 1), and includes the following steps:
step S202, obtaining a model algorithm of the trained first model, wherein the first model is obtained by training a full sample set to be screened.
The model algorithm is used to construct the first model. Generally, a model algorithm with strong interpretability is selected; in particular, it can be a simple and fast machine learning algorithm, for example a decision tree, logistic regression, or linear regression, which is not limited herein. The full sample set to be screened comprises a large amount of sample data of uneven quality.
Specifically, the computer device obtains a model algorithm of the first model trained and completed through the full set of samples to be screened.
In one embodiment, the training step of obtaining the first model comprises: constructing a first model to be trained based on a target machine learning algorithm; acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; and training the first model to be trained on the training set, testing on the test set until a training stopping condition is reached, and obtaining the trained first model.
Wherein the target machine learning algorithm is the determined model algorithm of the first model. The training stopping condition is used to judge whether the trained model meets the requirements; specifically, it may be that training has reached a preset number of iterations, that the training performance cannot be further improved, or that the model performance has reached a preset performance index. For example, for a binary classification model commonly used in the financial industry, the evaluation index may be AUC (Area Under Curve), KS (Kolmogorov-Smirnov statistic), F1 score, and the like; it needs to be selected according to business requirements and is not limited herein.
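Of the metrics just named, KS is the least commonly pre-packaged, so a minimal computation is sketched below. The function name is illustrative, and the implementation assumes binary 0/1 labels; KS here is the maximum gap between the cumulative score distributions of the positive and negative classes.

```python
# Hedged sketch of the KS (Kolmogorov-Smirnov) statistic for a binary model.
import numpy as np

def ks_statistic(y_true, y_score):
    order = np.argsort(y_score)          # sort samples by predicted score
    y = np.asarray(y_true)[order]
    pos = np.cumsum(y) / max(y.sum(), 1)             # positive-class CDF
    neg = np.cumsum(1 - y) / max((1 - y).sum(), 1)   # negative-class CDF
    return float(np.max(np.abs(pos - neg)))
```

A perfectly separating model gives KS = 1.0; a model whose scores carry no class information gives KS near 0.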
Specifically, the computer device constructs a first model to be trained according to the determined target machine learning algorithm, and divides the acquired full sample set to be screened into a training set and a testing set. The computer equipment trains the first model to be trained on a training set of a full sample set to be screened, tests the first model on a testing set of the full sample set to be screened until a training stopping condition is met, so that the trained first model is obtained, and the computer equipment obtains a model algorithm of the trained first model from the trained first model.
In one embodiment, the computer device constructs a first model to be trained according to a decision tree algorithm and obtains a full sample set to be screened. The computer equipment divides a full sample set to be screened into a training set and a testing set according to the proportion of 4:1, trains a first model to be trained on the training set of the full sample set to be screened, tests the first model on the testing set of the full sample set to be screened until the testing effect of the first model on the testing set cannot be further improved, the first model at the moment is used as a trained first model, and the computer equipment obtains a model algorithm of the first model. It should be noted that the distribution ratio of the training set and the test set may also be other preset ratios, such as 3:2, and the embodiment of the present application does not limit this.
In the embodiment, the first model to be trained is constructed based on the determined target machine learning algorithm, the first model is trained and tested in the whole sample set to be screened, so that the trained first model is obtained, the corresponding model algorithm is obtained based on the trained first model, and the model algorithm lays a foundation for screening of subsequent samples.
Step S204, determining a plurality of initialized sub-models according to the model algorithm of the first model; the model hyper-parameters of each of the plurality of sub-models remain consistent.
The model hyper-parameters are predefined configuration parameters, can be directly specified by engineers, and can specify different model hyper-parameters according to different model algorithms.
Specifically, the computer device determines a plurality of initialized sub-models by using the same model algorithm or a model algorithm which is the same as or the same type as the model algorithm of the first model but is more simplified according to the model algorithm of the first model. And the computer device determines the model hyper-parameters of each sub-model according to the determined model algorithm used by the initialized sub-model, and the model hyper-parameters of each sub-model are kept consistent.
In one embodiment, the computer device determines a model hyper-parameter corresponding to a model algorithm based on the model algorithm of the first model, selects the same model algorithm and the corresponding model hyper-parameter, copies to obtain N sub-models, and initializes the N sub-models. Typically, N needs to be greater than 10.
In one embodiment, the computer device selects a model algorithm that is the same type as the model algorithm of the first model but is more simplified as the model algorithm of the sub-model based on the model algorithm of the first model, for example, if the model algorithm of the first model is the type of linear model or the type of decision tree model, then the model algorithm of the sub-model also selects the type of linear model or the type of decision tree model, but the model algorithm of the sub-model needs to be simpler than the model algorithm of the first model. And then determining the model hyper-parameter of the corresponding sub-model according to the model algorithm of the selected sub-model, determining a sub-model according to the model algorithm and the model hyper-parameter of the sub-model, copying the sub-model to obtain N sub-models, and initializing the N sub-models.
In one embodiment, each sub-model comprises a plurality of model hyper-parameters, part of the model hyper-parameters are influenced by the random number seeds and change along with the difference of the random number seeds, and part of the model hyper-parameters are not influenced by the random number seeds. The model hyper-parameters affected by the random number seeds may be different between the sub-models, while the model hyper-parameters not affected by the random number seeds remain the same.
And S206, training each sub-model through a full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training.
The random number seed can generate a series of random numbers, and different random number seeds can obtain different results even if models with the same parameters are trained on the same data.
Specifically, the computer device trains and tests each submodel in the full sample set to be screened respectively, so as to obtain a test model corresponding to each submodel.
In one embodiment, if the machine learning task is an image recognition task, the corresponding full sample set to be screened may be an image sample set, and the computer device trains each sub-model on the image sample set to obtain a plurality of test models.
In one embodiment, training each sub-model through a full sample set to be screened to obtain a plurality of test models, including: acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; forming an initial test model sequence by the plurality of sub models; training each sub-model in the initial test model sequence on a training set by respectively adopting different random number seeds, and testing on a testing set to obtain a plurality of trained test models; and the trained test models corresponding to the plurality of sub-models jointly form a trained test model sequence.
Specifically, the computer device first obtains a full sample set to be screened, and divides the full sample set to be screened into a training set and a testing set. The computer equipment jointly forms an initial test model sequence by the multiple submodels, trains each submodel in the initial test model sequence on a training set by adopting different random number seeds, and tests each submodel in the initial test model sequence on a test set, thereby obtaining multiple trained test models corresponding to each submodel in the initial test model sequence. And the trained test models corresponding to the plurality of sub-models jointly form a trained test model sequence.
In one embodiment, the computer device divides the acquired full sample set to be screened into a training set and a test set, trains the N sub-models on the training set with different random number seeds, and tests them on the test set; the N trained and tested sub-models are called trained test models, and together they form the trained test model sequence. Because of the different random number seeds, training sub-model A twice on the same samples yields two different test models A1 and A2.
In the above embodiment, training the sub-models on the training set of the full sample set to be screened with different random number seeds, and testing them on the test set, ensures that the trained test models do not respond identically to the same samples, which improves the accuracy of the subsequent screening.
And S208, predicting each training sample in the full sample set to be screened through a plurality of test models to obtain a plurality of prediction results respectively corresponding to each training sample.
The test model predicts each training sample in the full sample set to be screened, and substantially processes each training sample through the corresponding model algorithm and model parameters to obtain a processing result. For example, when the test model is a classification model, the test model predicting the training sample may specifically be that the test model performs feature extraction on the input training sample, performs classification processing based on the extracted features, and outputs the probability that the training sample belongs to the target class.
Specifically, the computer device predicts each sample in the total sample set to be screened by using a plurality of test models, and obtains a plurality of prediction results of the plurality of test models on the same sample.
In one embodiment, the computer device uses a binary classification model to realize the prediction of each sample by the N test models. The binary classification model treats a sample whose prediction probability p falls in the interval (0.5 − ε, 0.5 + ε) as a sample that may be abnormal, where ε is the model prediction confidence adjustment coefficient. The larger ε is, the stricter the requirement on correctly predicted samples: more abnormal samples can be screened out, but some normal samples may also be misjudged as abnormal. In general, ε = 0.05 is a recommended value that gives a good screening effect for binary classification models in most financial-industry scenarios. Based on the prediction probabilities of the binary classification model, the N prediction results of the N test models on the same sample can be obtained.
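A minimal sketch of this low-confidence check, assuming the interval is the open interval (0.5 − ε, 0.5 + ε) with the recommended ε = 0.05 (the function name is illustrative):

```python
EPSILON = 0.05  # recommended confidence adjustment coefficient from the text

def insufficient_confidence(p, epsilon=EPSILON):
    """True when a binary classifier's prediction probability p lies in
    (0.5 - epsilon, 0.5 + epsilon), i.e. too close to the decision
    boundary to be trusted; such a sample may be abnormal.
    Treating the interval as open is an assumption here."""
    return 0.5 - epsilon < p < 0.5 + epsilon
```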
Step S210: screening out candidate abnormal samples from the full sample set to be screened based on the prediction results respectively corresponding to each training sample.
Wherein, the candidate abnormal sample is a sample in which an abnormality may exist.
Specifically, the computer device screens out candidate abnormal samples from the total amount of samples to be screened according to the prediction result of the plurality of test models on each sample.
In one embodiment, screening out candidate abnormal samples from a total sample set to be screened based on a prediction result corresponding to each training sample respectively comprises: for each training sample, counting the number of prediction results representing insufficient prediction confidence in a plurality of prediction results corresponding to the corresponding training sample; determining a judgment coefficient corresponding to the corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models; and when the judgment coefficient is larger than the preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
Insufficient prediction confidence indicates that the test model considers the training sample highly likely to be a candidate abnormal sample; the determination coefficient is used to determine whether the training sample is a candidate abnormal sample.
Specifically, for each training sample, the results of prediction confidence of different test models may be different, the computer device counts the number of prediction results with insufficient prediction confidence in the plurality of prediction results corresponding to the corresponding training sample, and the determination coefficient corresponding to the corresponding training sample can be determined according to the number of prediction results with insufficient prediction confidence and the total number of test models. And comparing the judgment coefficient with a preset threshold value of the judgment coefficient, and if the judgment coefficient is larger than the preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
In one embodiment, the computer device uses the binary classification method to realize the prediction of each sample by the N test models. If n of the N test models' prediction probabilities p for the same sample fall into the interval (0.5 − ε, 0.5 + ε), that is, n of the N prediction results corresponding to the training sample represent insufficient prediction confidence, the determination coefficient δ corresponding to the training sample is defined as δ = n/N. With the preset threshold of the determination coefficient set to 0.5, the training sample is determined to be a candidate abnormal sample when δ > 0.5.
In this embodiment, the same training sample is predicted by a plurality of test models and a determination coefficient is computed from their results, which improves the accuracy of the abnormality prediction for the training sample and reduces prediction errors.
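The determination coefficient δ = n/N with a preset threshold of 0.5 can be sketched as follows (`is_candidate_abnormal` is an illustrative name, not the patent's):

```python
def is_candidate_abnormal(probs, epsilon=0.05, threshold=0.5):
    """probs: the N test models' prediction probabilities for one
    training sample. n counts the prediction results representing
    insufficient prediction confidence; delta = n / N is the
    determination coefficient, and the sample is a candidate abnormal
    sample when delta exceeds the preset threshold."""
    n = sum(1 for p in probs if 0.5 - epsilon < p < 0.5 + epsilon)
    delta = n / len(probs)
    return delta > threshold
```

For example, with four test models, three near-boundary probabilities give δ = 3/4 > 0.5, so the sample is flagged; a single near-boundary probability gives δ = 1/4 and the sample passes.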
Step S212: screening target abnormal samples from the candidate abnormal samples, and removing the target abnormal samples from the full sample set to be screened to obtain a normal sample set.
The normal sample set is used for training the machine learning model, and the quality of the normal sample set determines the quality of the machine learning model.
Specifically, the computer device screens out target abnormal samples from the candidate abnormal samples, eliminates the target abnormal samples from the total amount of sample sets to be screened, and finally obtains a normal sample set.
According to the sample screening method, a model algorithm of a first model obtained through training of a full sample set to be screened is obtained, a plurality of initialized sub-models are determined according to the model algorithm of the first model, then each sub-model is trained in the full sample set to be screened, and then a plurality of inspection models corresponding to each sub-model are obtained. And then, predicting each sample to be screened in the total sample set to be screened by a plurality of test models, so that the prediction results of the plurality of test models on the same sample can be integrated, and whether the sample is a suspected abnormal sample can be judged quickly and accurately. If so, further screening confirmation is carried out, and finally real abnormal samples can be quickly and accurately screened out from the total amount of samples to be screened and eliminated, so that a normal sample set is obtained.
In addition, the sample screening method of the present application can quickly and accurately screen abnormal samples out of a large number of samples to be screened, which reduces the steps the computer device must repeat because of inaccurate screening and thus saves the processing resources of the computer device. Moreover, quickly and accurately screening abnormal samples out of a large number of samples to be screened yields high-quality normal samples; when model training is subsequently performed on these normal samples, training efficiency and effect can be improved, and the waste of the computing and storage resources of the computer device caused by poor training results or repeated training is reduced. The computer device is thereby prevented, to a greater extent, from doing useless work, the usage frequency of the corresponding modules in the computer device is reduced, and the service life of those modules is prolonged.
In one embodiment, screening the target abnormal sample from the candidate abnormal samples comprises: and checking the plurality of candidate abnormal samples one by one based on a preset checking rule, judging whether each candidate abnormal sample is a real abnormal sample, and if so, taking the real abnormal sample as a target abnormal sample.
The preset checking rule clearly specifies which candidate abnormal sample is a real abnormal sample.
Specifically, the computer device checks the determined candidate abnormal samples one by one, judges whether each candidate abnormal sample is a real abnormal sample according to the existing checking rule, and if so, takes the real abnormal sample as a target abnormal sample.
In this embodiment, the target abnormal samples can be determined definitively and quickly according to the preset check rule, which speeds up sample screening.
In one embodiment, screening target abnormal samples from the candidate abnormal samples and removing them from the full sample set to be screened to obtain a normal sample set includes: acquiring a second model obtained by training on the full sample set to be screened; removing candidate abnormal samples one by one from the full sample set to be screened, and each time a candidate abnormal sample is removed, retraining the second model on the remaining samples in the training set of the sample set to be screened and testing it on the test set of the sample set to be screened to check whether the training effect is improved; if the training effect is improved, confirming the candidate abnormal sample removed this time as a target abnormal sample, and if the training effect is reduced, confirming it as a normal sample; after all candidate abnormal samples in the training set of the sample set to be screened have been processed, re-dividing the training set and the test set of the sample set to be screened so that all samples of the previous test set are placed into the new training set; returning to the step of removing candidate abnormal samples one by one from the full sample set to be screened and continuing until the sample check of all candidate abnormal samples is completed; and forming a normal sample set from all samples remaining after all target abnormal samples have been removed from the full sample set to be screened.
The second model is obtained by training in the full sample set to be screened according to a common machine learning algorithm, and the second model may also be the first model, which is not limited herein.
In one embodiment, the computer device obtains a second model obtained by training a full sample set to be screened, eliminates a candidate abnormal sample from the training set of the full sample set to be screened, retrains the second model in the rest samples in the training set, tests the second model in the testing set, and checks whether the training effect is improved. And if the training effect is improved, the candidate abnormal sample removed this time is confirmed as a target abnormal sample, and if the training effect is reduced, the candidate abnormal sample removed this time is confirmed as a normal sample, and the candidate abnormal sample can be placed back into the training set again. And removing one candidate abnormal sample from the training set again, continuously training the second model, circulating the steps until all the candidate abnormal samples in the training set of the sample set to be screened are removed, re-dividing the training set and the test set of the rest sample set to be screened by the computer equipment, completely dividing the samples in the last test set into the training set of the last time, and continuously removing the candidate abnormal samples in the training set one by one until the sample inspection of all the candidate abnormal samples is completed. And taking the residual samples after the target samples are removed as a normal sample set.
In this embodiment, the candidate abnormal samples in the full sample set to be screened are checked one by one using the second model; the real abnormal samples are removed and the normal samples are retained, yielding a high-quality normal sample set and laying a foundation for obtaining a high-quality machine learning model.
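A single pass of this per-sample check can be sketched as follows, under stated assumptions: `retrain_and_score` is a hypothetical callback standing in for retraining the second model and measuring its test-set effect, and the re-division of training and test sets described above is omitted from the sketch.

```python
def screen_targets_one_pass(train_set, test_set, candidates, retrain_and_score):
    """One pass of the check: remove a candidate from the training set,
    retrain the second model on the remaining samples, and keep the
    removal only if the test-set score improves; otherwise the candidate
    is put back and treated as normal. retrain_and_score is a
    hypothetical callback (train_samples, test_samples) -> score."""
    kept = list(train_set)
    targets = []
    baseline = retrain_and_score(kept, test_set)
    for c in candidates:
        trial = list(kept)
        trial.remove(c)
        score = retrain_and_score(trial, test_set)
        if score > baseline:  # training effect improved: true abnormal sample
            kept, baseline = trial, score
            targets.append(c)
    return kept, targets

# Toy example: "abnormal" samples are the large values; the hypothetical
# score is the fraction of clean (small) samples left in the training set.
score = lambda train, test: sum(1 for s in train if s < 10) / len(train)
kept, targets = screen_targets_one_pass(
    [1, 2, 100, 3, 200], [], candidates=[100, 3, 200], retrain_and_score=score)
```

In the toy run, removing 100 and 200 raises the score, so both become target abnormal samples, while removing 3 lowers it, so 3 is put back as a normal sample.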
In a specific embodiment, the computer device constructs a first model based on a determined machine learning algorithm, trains the first model on the training set of the corresponding full sample set to be screened, and tests it on the test set of the full sample set to be screened to obtain the model algorithm and model hyper-parameters of the trained first model. Based on the model algorithm and model hyper-parameters of the first model, it obtains a plurality of sub-models with the same model algorithm and model hyper-parameters, and initializes these sub-models to form an initial test model sequence. The computer device then trains each sub-model in the initial test model sequence on the full sample set to be screened using different random number seeds, obtaining a plurality of trained test models that form a test model sequence. Through the plurality of test models, the computer device predicts the abnormality probability of each sample in the full sample set to be screened, counts the number of prediction results representing insufficient prediction confidence among the prediction results corresponding to each sample, and takes the ratio of the number of test models with insufficient prediction confidence for the sample to the total number of test models as the determination coefficient of the sample; if the determination coefficient is greater than a preset threshold X, the corresponding training sample is determined to be a candidate abnormal sample. The computer device can then check the candidate abnormal samples one by one according to a preset check rule to determine whether each is a target abnormal sample and, if so, remove it from the full sample set to be screened.
The computer equipment can also obtain a second model trained by the full sample set to be screened again, repeatedly train the second model in the sample set to be screened, in which the candidate abnormal samples are removed one by one, and determine whether the candidate abnormal samples are real abnormal samples by checking the training effect of the second model after each candidate abnormal sample is removed. And after the candidate abnormal samples in the training set in the sample set to be screened are checked one by one, the computer equipment divides the sample set to be screened again, divides all samples in the test set into the training set, and then eliminates the candidate abnormal samples divided into the training set one by using the same method, so that all target abnormal samples in the whole sample set to be screened are eliminated, normal samples are reserved, and a normal sample set is formed.
In the above embodiment, the computer device obtains the model algorithm and the model hyper-parameter of the first model obtained by training the full amount of sample sets to be screened, and obtains a plurality of inspection models based on the model algorithm and the model hyper-parameter, so that a plurality of inspection models corresponding to the current full amount of sample sets to be screened are constructed based on the determined model algorithm, and a foundation is laid for the training of subsequent inspection model sequences. And then, the computer equipment predicts each sample to be screened in the total sample set to be screened through the plurality of test models, so that the prediction results of the plurality of test models on the same sample can be integrated, and whether the sample is a suspected abnormal sample can be judged quickly and accurately. If so, further screening confirmation is carried out, and finally, real abnormal samples can be quickly and accurately screened out from the total amount of samples to be screened and eliminated to obtain a normal sample set.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict ordering restriction on these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a sample screening device for realizing the sample screening method. The solution to the problem provided by the device is similar to the solution described in the above method, so the specific limitations in one or more embodiments of the sample screening device provided below can be referred to the limitations of the sample screening method in the above, and are not described herein again.
In one embodiment, as shown in fig. 3, there is provided a sample screening apparatus comprising: an acquisition module 301, an initialization module 302, a training module 303, a prediction module 304, and a screening module 305, wherein:
The obtaining module 301 is configured to obtain a model algorithm of a trained first model, where the first model is obtained by training on a full sample set to be screened.
The initialization module 302 is configured to determine a plurality of initialized sub-models according to the model algorithm of the first model; the model hyper-parameters of each of the plurality of sub-models remain consistent.
The training module 303 is configured to train each sub-model on the full sample set to be screened to obtain a plurality of test models, where each sub-model adopts a different random number seed during training.
The predicting module 304 is configured to predict each sample in the full sample set to be screened through the multiple inspection models, so as to obtain multiple prediction results corresponding to each sample.
The screening module 305 is configured to screen out candidate abnormal samples from the total number of samples to be screened based on the prediction result corresponding to each training sample.
The screening module 305 is further configured to screen a target abnormal sample from the candidate abnormal samples, and remove the target abnormal sample from the total to-be-screened sample set to obtain a normal sample set.
In one embodiment, the obtaining module 301 is further configured to construct a first model to be trained based on a target machine learning algorithm; acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; and training the first model to be trained on the training set, testing on the testing set until a training stopping condition is reached, and obtaining the trained first model.
In one embodiment, the training module 303 is further configured to obtain a total sample set to be screened, and divide the total sample set to be screened into a training set and a testing set; forming an initial test model sequence by the plurality of sub models; training each sub-model in the initial test model sequence on a training set by respectively adopting different random number seeds, and testing on a testing set to obtain a plurality of trained test models; and the trained test models corresponding to the plurality of sub-models jointly form a trained test model sequence.
In one embodiment, the screening module 305 is further configured to count, for each training sample, the number of prediction results that characterize insufficient prediction confidence in the plurality of prediction results corresponding to the corresponding training sample; determining a judgment coefficient corresponding to the corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models; and when the judgment coefficient is larger than the preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
In an embodiment, the screening module 305 is further configured to check the plurality of candidate abnormal samples one by one based on a preset check rule, determine whether each candidate abnormal sample is a real abnormal sample, and if so, take the real abnormal sample as the target abnormal sample.
In one embodiment, the screening module 305 is further configured to: acquire a second model obtained by training on the full sample set to be screened; remove candidate abnormal samples one by one from the full sample set to be screened, and each time a candidate abnormal sample is removed, retrain the second model on the remaining samples in the training set of the sample set to be screened and test it on the test set of the sample set to be screened to check whether the training effect is improved; if the training effect is improved, confirm the candidate abnormal sample removed this time as a target abnormal sample, and if the training effect is reduced, confirm it as a normal sample; after all candidate abnormal samples in the training set of the sample set to be screened have been processed, re-divide the training set and the test set of the sample set to be screened so that all samples of the previous test set are placed into the new training set; return to the step of removing candidate abnormal samples one by one from the full sample set to be screened and continue until the sample check of all candidate abnormal samples is completed; and form a normal sample set from all samples remaining after all target abnormal samples have been removed from the full sample set to be screened.
According to the sample screening device, the model algorithm of the first model obtained through training of the full-scale sample set to be screened is obtained, the plurality of initialized sub-models are determined according to the model algorithm of the first model, then each sub-model is trained in the full-scale sample set to be screened, and then the plurality of inspection models corresponding to each sub-model are obtained, so that a plurality of inspection models corresponding to the current full-scale sample set to be screened are constructed first, and a foundation is laid for subsequent elimination of target abnormal samples. And then, predicting each sample to be screened in the total sample set to be screened by a plurality of test models, so that the prediction results of the plurality of test models on the same sample can be integrated, and whether the sample is a suspected abnormal sample can be judged quickly and accurately. If so, further screening confirmation is carried out, and finally real abnormal samples can be quickly and accurately screened out from the total amount of samples to be screened and eliminated, so that a normal sample set is obtained.
In addition, the sample screening device of the present application can quickly and accurately screen abnormal samples out of a large number of samples to be screened, which reduces the steps the computer device must repeat because of inaccurate screening and thus saves the processing resources of the computer device. Moreover, quickly and accurately screening abnormal samples out of a large number of samples to be screened yields high-quality normal samples; when model training is subsequently performed on these normal samples, training efficiency and effect can be improved, and the waste of the computing and storage resources of the computer device caused by poor training results or repeated training is reduced. The computer device is thereby prevented, to a greater extent, from doing useless work, the usage frequency of the corresponding modules in the computer device is reduced, and the service life of those modules is prolonged.
The modules in the sample screening device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store sample data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of sample screening.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened; determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent; respectively training each sub-model through a full amount of sample sets to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training; predicting each sample in the full sample set to be screened through a plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample; screening out candidate abnormal samples from the whole to-be-screened sample set based on the prediction results corresponding to each training sample; and screening the target abnormal samples from the candidate abnormal samples, and removing the target abnormal samples from the total sample set to be screened to obtain a normal sample set.
In one embodiment, the processor, when executing the computer program, further performs the steps of: constructing a first model to be trained based on a target machine learning algorithm; acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; and training the first model to be trained on the training set, testing on the testing set until a training stopping condition is reached, and obtaining the trained first model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; forming an initial test model sequence by the plurality of sub models; training each sub-model in the initial test model sequence on a training set by respectively adopting different random number seeds, and testing on a testing set to obtain a plurality of trained test models; and the trained test models corresponding to the plurality of sub-models jointly form a trained test model sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of: for each training sample, counting the number of prediction results representing insufficient prediction confidence in a plurality of prediction results corresponding to the corresponding training sample; determining a judgment coefficient corresponding to the corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models; and when the judgment coefficient is larger than the preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and checking the plurality of candidate abnormal samples one by one based on a preset checking rule, judging whether each candidate abnormal sample is a real abnormal sample, and if so, taking the real abnormal sample as a target abnormal sample.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a second model obtained by training on the full sample set to be screened; removing candidate abnormal samples one by one from the full sample set to be screened, and each time a candidate abnormal sample is removed, retraining the second model on the remaining samples in the training set of the sample set to be screened and testing it on the test set of the sample set to be screened to check whether the training effect is improved; if the training effect is improved, confirming the candidate abnormal sample removed this time as a target abnormal sample, and if the training effect is reduced, confirming it as a normal sample; after all candidate abnormal samples in the training set of the sample set to be screened have been processed, re-dividing the training set and the test set of the sample set to be screened so that all samples of the previous test set are placed into the new training set; returning to the step of removing candidate abnormal samples one by one from the full sample set to be screened and continuing until the sample check of all candidate abnormal samples is completed; and forming a normal sample set from all samples remaining after all target abnormal samples have been removed from the full sample set to be screened.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened; determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent; respectively training each sub-model through a full amount of sample sets to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training; predicting each sample in the full sample set to be screened through a plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample; screening out candidate abnormal samples from the whole to-be-screened sample set based on the prediction results corresponding to each training sample; and screening the target abnormal samples from the candidate abnormal samples, and removing the target abnormal samples from the total sample set to be screened to obtain a normal sample set.
In one embodiment, the computer program when executed by the processor further performs the steps of: constructing a first model to be trained based on a target machine learning algorithm; acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; and training the first model to be trained on the training set, testing on the testing set until a training stopping condition is reached, and obtaining the trained first model.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring the full sample set to be screened and dividing it into a training set and a testing set; forming the plurality of sub-models into an initial test model sequence; training each sub-model in the initial test model sequence on the training set, each with a different random number seed, and testing on the testing set to obtain a plurality of trained test models; the trained test models corresponding to the plurality of sub-models together form the trained test model sequence.
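A minimal sketch of the multi-seed training step above, using a toy stand-in model (the patent names no concrete model class; `depth` is an assumed shared hyper-parameter):

```python
import random

class SubModel:
    """Toy stand-in for one initialized sub-model: the hyper-parameters
    are identical across sub-models, only the random number seed differs."""
    def __init__(self, depth, seed):
        self.depth = depth               # shared hyper-parameter
        self._rng = random.Random(seed)  # per-model random number seed
        self.state = None

    def fit(self, samples):
        # placeholder "training": the learned state depends on the seed,
        # so differently seeded sub-models yield different test models
        self.state = self._rng.random()
        return self

def build_test_model_sequence(n_models, depth, samples):
    """Train each sub-model on the same full sample set to be screened,
    each with a different random number seed, giving the trained test
    model sequence of the embodiment above."""
    return [SubModel(depth, seed).fit(samples) for seed in range(n_models)]
```

Because only the seed varies, any disagreement among the resulting test models about a given sample reflects sensitivity to training randomness rather than differing hyper-parameters.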
In one embodiment, the computer program when executed by the processor further performs the steps of: for each training sample, counting the number of prediction results representing insufficient prediction confidence in a plurality of prediction results corresponding to the corresponding training sample; determining a judgment coefficient corresponding to the corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models; and when the judgment coefficient is larger than the preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
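The judgment-coefficient step above reduces to a simple ratio; a hedged sketch follows, where the 0.5 threshold is an assumed value (the embodiment leaves the preset threshold unspecified):

```python
def judgment_coefficient(low_confidence_flags):
    """Judgment coefficient for one training sample: the number of
    prediction results indicating insufficient confidence divided by
    the total number of test models."""
    return sum(low_confidence_flags) / len(low_confidence_flags)

def is_candidate_anomaly(low_confidence_flags, threshold=0.5):
    """Flag the sample as a candidate abnormal sample when the judgment
    coefficient exceeds the preset threshold (0.5 is assumed here)."""
    return judgment_coefficient(low_confidence_flags) > threshold
```

For example, a sample that three of four test models predict with insufficient confidence has coefficient 0.75 and is flagged, while one flagged by only one of four models (coefficient 0.25) is not.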
In one embodiment, the computer program when executed by the processor further performs the steps of: and checking the plurality of candidate abnormal samples one by one based on a preset checking rule, judging whether each candidate abnormal sample is a real abnormal sample, and if so, taking the real abnormal sample as a target abnormal sample.
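The one-by-one rule check above can be abstracted as filtering by a caller-supplied predicate; the embodiment does not fix a concrete checking rule, so `check_rule` here is purely illustrative:

```python
def screen_target_anomalies(candidate_samples, check_rule):
    """Check the candidate abnormal samples one by one against a preset
    checking rule; candidates the rule confirms as real abnormal samples
    become the target abnormal samples."""
    return [s for s in candidate_samples if check_rule(s)]
```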
In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring a second model obtained by training on the full sample set to be screened; removing the candidate abnormal samples from the full sample set to be screened one by one and, each time a candidate abnormal sample is removed, retraining the second model on the remaining samples in the training set of the sample set to be screened and testing it on the test set of the sample set to be screened, so as to check whether the training effect is improved; if the training effect is improved, determining the candidate abnormal sample removed this time to be a target abnormal sample, and if the training effect is reduced, determining it to be a normal sample; after all candidate abnormal samples in the training set of the sample set to be screened have been checked, re-dividing the training set and the test set of the sample set to be screened so that all samples in the previous test set fall into the current training set; returning to the step of removing the candidate abnormal samples one by one from the full sample set to be screened and continuing until the checking of all candidate abnormal samples is completed; all samples remaining after the target abnormal samples are removed from the full sample set to be screened form the normal sample set.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of: obtaining the model algorithm of a trained first model, wherein the first model is obtained by training on a full sample set to be screened; determining a plurality of initialized sub-models according to the model algorithm of the first model, the model hyper-parameters of each of the plurality of sub-models being kept consistent; training each sub-model on the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts a different random number seed during training; predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results corresponding to each sample; screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample; and screening target abnormal samples from the candidate abnormal samples and removing them from the full sample set to be screened to obtain a normal sample set.
In one embodiment, the computer program when executed by the processor further performs the steps of: constructing a first model to be trained based on a target machine learning algorithm; acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set; and training the first model to be trained on the training set, testing on the testing set until a training stopping condition is reached, and obtaining the trained first model.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring the full sample set to be screened and dividing it into a training set and a testing set; forming the plurality of sub-models into an initial test model sequence; training each sub-model in the initial test model sequence on the training set, each with a different random number seed, and testing on the testing set to obtain a plurality of trained test models; the trained test models corresponding to the plurality of sub-models together form the trained test model sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of: for each training sample, counting the number of prediction results representing insufficient prediction confidence in a plurality of prediction results corresponding to the corresponding training sample; determining a judgment coefficient corresponding to the corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models; and when the judgment coefficient is larger than the preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
In one embodiment, the computer program when executed by the processor further performs the steps of: and checking the plurality of candidate abnormal samples one by one based on a preset checking rule, judging whether each candidate abnormal sample is a real abnormal sample, and if so, taking the real abnormal sample as a target abnormal sample.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring a second model obtained by training on the full sample set to be screened; removing the candidate abnormal samples from the full sample set to be screened one by one and, each time a candidate abnormal sample is removed, retraining the second model on the remaining samples in the training set of the sample set to be screened and testing it on the test set of the sample set to be screened, so as to check whether the training effect is improved; if the training effect is improved, determining the candidate abnormal sample removed this time to be a target abnormal sample, and if the training effect is reduced, determining it to be a normal sample; after all candidate abnormal samples in the training set of the sample set to be screened have been checked, re-dividing the training set and the test set of the sample set to be screened so that all samples in the previous test set fall into the current training set; returning to the step of removing the candidate abnormal samples one by one from the full sample set to be screened and continuing until the checking of all candidate abnormal samples is completed; all samples remaining after the target abnormal samples are removed from the full sample set to be screened form the normal sample set.
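The elimination-and-retraining loop described in the embodiments above can be sketched in miniature. Here `evaluate` is a caller-supplied stand-in that abstracts retraining the second model on the remaining samples and scoring it on the test set (higher is better); the patent fixes neither the model nor the metric, samples are assumed distinct, and the re-division of the train/test split between passes is omitted for brevity:

```python
def screen_by_retraining(samples, candidates, evaluate):
    """One elimination pass: remove each candidate abnormal sample in
    turn, retrain/rescore via `evaluate`, and keep the removal only if
    the training effect improved."""
    baseline = evaluate(samples)
    target_anomalies, normal = [], []
    for cand in candidates:
        remaining = [s for s in samples if s != cand]
        score = evaluate(remaining)
        if score > baseline:            # training effect improved
            target_anomalies.append(cand)
            samples, baseline = remaining, score
        else:                           # effect did not improve: normal
            normal.append(cand)
    # samples now holds everything left after removing target anomalies,
    # i.e. the normal sample set of the embodiment above
    return samples, target_anomalies, normal
```

With a toy score in which one large outlier hurts the "model", removing the outlier improves the score and it is classified as a target abnormal sample, while removing an ordinary sample does not improve the score and it is kept as normal.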
It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data for analysis, stored data, and displayed data) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase-change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of the present specification as long as it contains no contradiction.
The above-mentioned embodiments express only several implementations of the present application, and while their description is relatively specific and detailed, it should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of screening a sample, the method comprising:
obtaining a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
determining a plurality of initialized sub-models according to a model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
and screening a target abnormal sample from the candidate abnormal samples, and removing the target abnormal sample from the total sample set to be screened to obtain a normal sample set.
2. The method of claim 1, wherein the step of training the first model comprises:
constructing a first model to be trained based on a target machine learning algorithm;
acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set;
and training the first model to be trained on the training set, testing on the test set until a training stopping condition is reached, and obtaining the trained first model.
3. The method of claim 1, wherein the training of each of the submodels separately through the full set of samples to be screened results in a plurality of test models, comprising:
acquiring a total sample set to be screened, and dividing the total sample set to be screened into a training set and a testing set;
forming the plurality of sub-models into an initial test model sequence;
training each sub-model in the initial test model sequence on the training set by respectively adopting different random number seeds, and testing on the testing set to obtain a plurality of trained test models; and the trained test models corresponding to the plurality of sub-models jointly form a trained test model sequence.
4. The method according to claim 1, wherein the screening out candidate abnormal samples from the total set of samples to be screened based on the prediction result corresponding to each training sample comprises:
for each training sample, counting the number of prediction results representing insufficient prediction confidence in a plurality of prediction results corresponding to the corresponding training sample;
determining a judgment coefficient corresponding to a corresponding training sample based on the number of prediction results representing insufficient prediction confidence and the total number of the test models;
and when the judgment coefficient is larger than a preset threshold value, determining the corresponding training sample as a candidate abnormal sample.
5. The method of claim 1, wherein the screening of the candidate anomaly samples for a target anomaly sample comprises:
and checking the candidate abnormal samples one by one based on a preset checking rule, judging whether each candidate abnormal sample is a real abnormal sample, and if so, taking the real abnormal sample as a target abnormal sample.
6. The method according to claim 1, wherein the screening of the target abnormal sample from the candidate abnormal samples and the removing of the target abnormal sample from the total sample set to be screened to obtain a normal sample set comprises:
acquiring a second model obtained by training a full sample set to be screened;
removing the candidate abnormal samples from the full sample set to be screened one by one and, each time a candidate abnormal sample is removed, retraining the second model on the remaining samples in the training set of the sample set to be screened and testing it on the test set of the sample set to be screened, so as to check whether the training effect is improved;
if the training effect is improved, determining the candidate abnormal sample removed this time to be a target abnormal sample, and if the training effect is reduced, determining it to be a normal sample;
after all candidate abnormal samples in the training set of the sample set to be screened have been checked, re-dividing the training set and the test set of the sample set to be screened so that all samples in the previous test set fall into the current training set;
returning to the step of removing the candidate abnormal samples one by one from the full sample set to be screened and continuing until the checking of all candidate abnormal samples is completed;
forming the normal sample set from all samples remaining after the target abnormal samples are removed from the full sample set to be screened.
7. A sample screening device, comprising:
the acquisition module is used for acquiring a model algorithm of a trained first model, wherein the first model is obtained by training a full sample set to be screened;
the initialization module is used for determining a plurality of initialized sub-models according to the model algorithm of the first model; the model hyper-parameters of each sub-model in the plurality of sub-models are kept consistent;
the training module is used for respectively training each sub-model through the full sample set to be screened to obtain a plurality of test models, wherein each sub-model adopts different random number seeds during training;
the prediction module is used for predicting each sample in the full sample set to be screened through the plurality of test models to obtain a plurality of prediction results respectively corresponding to each sample;
the screening module is used for screening out candidate abnormal samples from the full sample set to be screened based on the prediction results corresponding to each training sample;
the screening module is further configured to screen a target abnormal sample from the candidate abnormal samples, and remove the target abnormal sample from the total to-be-screened sample set to obtain a normal sample set.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.
CN202111524368.4A 2021-12-14 2021-12-14 Sample screening method, sample screening device, computer equipment and storage medium Pending CN114169460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111524368.4A CN114169460A (en) 2021-12-14 2021-12-14 Sample screening method, sample screening device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111524368.4A CN114169460A (en) 2021-12-14 2021-12-14 Sample screening method, sample screening device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114169460A true CN114169460A (en) 2022-03-11

Family

ID=80486249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111524368.4A Pending CN114169460A (en) 2021-12-14 2021-12-14 Sample screening method, sample screening device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114169460A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358966A (en) * 2022-03-16 2022-04-15 希望知舟技术(深圳)有限公司 Production scheduling method and device based on machine learning and storage medium
CN114358966B (en) * 2022-03-16 2022-06-17 希望知舟技术(深圳)有限公司 Production scheduling method and device based on machine learning and storage medium
CN117313900A (en) * 2023-11-23 2023-12-29 全芯智造技术有限公司 Method, apparatus and medium for data processing
CN117313900B (en) * 2023-11-23 2024-03-08 全芯智造技术有限公司 Method, apparatus and medium for data processing

Similar Documents

Publication Publication Date Title
CN111124840B (en) Method and device for predicting alarm in business operation and maintenance and electronic equipment
US20190087737A1 (en) Anomaly detection and automated analysis in systems based on fully masked weighted directed
US11650968B2 (en) Systems and methods for predictive early stopping in neural network training
CN110135505B (en) Image classification method and device, computer equipment and computer readable storage medium
CN114169460A (en) Sample screening method, sample screening device, computer equipment and storage medium
US10394631B2 (en) Anomaly detection and automated analysis using weighted directed graphs
CN111625516A (en) Method and device for detecting data state, computer equipment and storage medium
CN109656818B (en) Fault prediction method for software intensive system
CN115587543A (en) Federal learning and LSTM-based tool residual life prediction method and system
CN111695624A (en) Data enhancement strategy updating method, device, equipment and storage medium
CN113379071A (en) Noise label correction method based on federal learning
CN111949530B (en) Test result prediction method and device, computer equipment and storage medium
CN112182056A (en) Data detection method, device, equipment and storage medium
US20210056675A1 (en) Image generation device, image generation method, and image generation program
CN109934352B (en) Automatic evolution method of intelligent model
CN117079017A (en) Credible small sample image identification and classification method
CN115861305A (en) Flexible circuit board detection method and device, computer equipment and storage medium
CN115907954A (en) Account identification method and device, computer equipment and storage medium
CN114330650A (en) Small sample characteristic analysis method and device based on evolutionary element learning model training
CN114528906A (en) Fault diagnosis method, device, equipment and medium for rotary machine
CN117561502A (en) Method and device for determining failure reason
CN113255927A (en) Logistic regression model training method and device, computer equipment and storage medium
CN114154548A (en) Sales data sequence classification method and device, computer equipment and storage medium
CN111753992A (en) Screening method and screening system
Iliashov Synthesis of algorithms for recognition of vulnerabilities in web resources using signatures of fuzzy linguistic features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination