CN111461164A - Sample data set capacity expansion method and model training method - Google Patents


Info

Publication number
CN111461164A
Authority
CN
China
Prior art keywords
sample
samples
data set
original
positive
Prior art date
Legal status
Granted
Application number
CN202010117161.4A
Other languages
Chinese (zh)
Other versions
CN111461164B (en)
Inventor
李丹
蒋藜薇
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202010117161.4A
Publication of CN111461164A
Application granted
Publication of CN111461164B
Active legal status
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/2155: Generating training patterns; bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24143: Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a sample data set capacity expansion method and a model training method. The sample data set capacity expansion method comprises the following steps: acquiring an original sample data set whose samples comprise a plurality of unlabeled samples and a plurality of original positive samples; determining the similarity of each unlabeled sample to the original positive samples; acquiring a random number which obeys a [0,1] uniform distribution; labeling each unlabeled sample as an updated positive sample or an updated negative sample based on the comparison of the sample's similarity with the random number; and obtaining a sample update data set based on the updated positive samples, the updated negative samples and the original positive samples, the positive samples of the sample update data set comprising the updated positive samples and the original positive samples. The model training method trains a model on the basis of the sample data set capacity expansion method. The capacity expansion method provided by the embodiments of the invention can mine additional positive samples and thereby expand the sample data set.

Description

Sample data set capacity expansion method and model training method
Technical Field
The invention relates to the technical field of machine learning, in particular to a sample data set capacity expansion method and a model training method.
Background
With the development of internet technology, positive sample learning is widely applied in many fields, such as credit card fraud detection and e-commerce recommendation. In positive sample learning, a model is usually trained on a sample data set, and the sample data set needs a certain number of positive samples to ensure a good training effect.
However, in practice only a small number of positive samples and a large number of unlabeled samples can be obtained; in many cases the collected positive samples amount to less than 5% of the total number of samples. In such an extremely imbalanced setting, positive sample learning cannot function accurately, and the trained model is not accurate enough. That is, the number of positive samples in the sample data set becomes an important factor restricting model training, and there is room for improvement.
Disclosure of Invention
Embodiments of the present invention provide a method for expanding a sample data set, which overcomes or at least partially solves the above problems.
In a first aspect, an embodiment of the present invention provides a method for expanding a sample data set, including: obtaining an original sample data set, wherein samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples; determining the similarity of each unlabeled sample and the original positive sample; acquiring random numbers which obey [0,1] uniform distribution; marking the unlabeled sample as an updated positive sample or an updated negative sample based on the comparison result of the similarity corresponding to the unlabeled sample and the random number; based on the updated positive sample, the updated negative sample, and the original positive sample, a sample update dataset is derived, the positive samples of the sample update dataset including the updated positive sample and the original positive sample.
In some embodiments, said determining the similarity of each unlabeled sample to the original positive samples comprises: acquiring k samples adjacent to each unlabeled sample; determining the proportion of original positive samples among the k samples; and taking the proportion of original positive samples as the similarity of the unlabeled sample to the original positive samples.
In some embodiments, said obtaining k samples adjacent to each of said unlabeled samples comprises: representing samples of the original sample data set by vectors; acquiring Euclidean distances between each unlabeled sample and other samples of the original sample data set; and selecting the k samples with the shortest Euclidean distance as the k samples adjacent to the unlabeled sample.
In some embodiments, the acquiring of random numbers which obey a [0,1] uniform distribution, the labeling of the unlabeled samples as updated positive or updated negative samples based on the comparison of their similarities with the random numbers, and the obtaining of the sample update data set comprise: labeling the unlabeled samples once for each random number, to obtain a plurality of sample update data sets in one-to-one correspondence with the random numbers.
In a second aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for expanding the sample data set provided in any one of the possible implementation schemes in the first aspect when executing the program.
In a third aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for expanding the sample data set provided in any one of the possible implementation schemes in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a model training method, where the model includes a plurality of sub-models, and the model training method includes: using the sample data set capacity expansion method provided by any one of the possible implementation schemes of the first aspect, obtaining a plurality of sample update data sets, where the sample label corresponding to the updated positive samples and the original positive samples is 1 and the sample label corresponding to the updated negative samples is -1; training the plurality of sub-models using the sample update data sets, with at least two of the sub-models using different sample update data sets; and determining the model based on the trained plurality of sub-models.
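The training step of this fourth aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the choice of decision trees as sub-models and of majority voting to combine them are ours, since the patent fixes neither the sub-model type nor the combination rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def train_ensemble(X, label_sets):
    """Fit one sub-model per sample update data set (labels are +1/-1).

    The sub-model type is not fixed by the method; a decision tree is
    used here purely for illustration.
    """
    return [DecisionTreeClassifier(random_state=0).fit(X, y)
            for y in label_sets]


def predict_majority(models, X):
    """Majority vote over the sub-models' +1/-1 predictions.

    One plausible way to determine the final model from the trained
    sub-models; the combination rule is an assumption here.
    """
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)
```

For instance, three sub-models trained on three (possibly different) label vectors over the same feature matrix can be combined into a single +1/-1 prediction per sample.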
In a fifth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the model training method according to the implementation of the fourth aspect.
In a sixth aspect, an embodiment of the present invention provides a method for applying a model, including: inputting data to be judged to the model to obtain a test result output by the model; wherein the model is obtained by training according to the model training method of the implementation scheme of the fourth aspect.
In a seventh aspect, an embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the model application method according to the implementation scheme of the sixth aspect.
According to the sample data set capacity expansion method, the electronic device, the model training method, the model application method and the non-transitory computer-readable storage media provided by the embodiments of the invention, the similarity between each unlabeled sample and the positive samples in the sample data set is compared with a random number drawn from the [0,1] uniform distribution, and the unlabeled sample is labeled as a positive or negative sample according to the comparison result. More positive samples can thus be mined, expanding the sample data set; the expanded sample data set helps positive sample learning function more accurately, improving the accuracy and efficiency of model training.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a sample data set capacity expansion method according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for expanding a sample data set according to another embodiment of the present invention;
fig. 3 is a flowchart of a method for expanding a sample data set according to another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device in an embodiment of the invention;
FIG. 5 is a schematic diagram of a model training method according to an embodiment of the present invention;
FIG. 6 is a flow chart of a model training method according to an embodiment of the present invention;
FIG. 7 is a flowchart of a method for applying a model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The following describes a method for expanding a sample data set according to an embodiment of the present invention with reference to fig. 1 to fig. 3.
As shown in fig. 1, the method for expanding the sample data set according to the embodiment of the present invention includes steps 100 to 500.
Step 100, an original sample data set is obtained, wherein samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples.
It can be understood that a sample data set theoretically consists of positive samples, negative samples and unlabeled samples. In some special scenarios, such as practical applications in credit card fraud detection or savings card recommendation, negative samples are difficult to collect, and only positive samples and a large number of unlabeled samples can be obtained; in many cases the collected positive samples amount to less than 5% of the total number of samples. To make the sample data set more effective, an expansion method should be adopted to label the unlabeled samples in the sample data set as positive or negative. That is, prior to labeling, the original sample data set includes a number of unlabeled samples and a number of original positive samples, with far more unlabeled samples than original positive samples.
Such as: in a credit card fraud scenario, a credit card fraud sample data set theoretically comprises customers known to have committed fraud, customers known not to have committed fraud, and customers with unknown conditions. In practical application, fraudulent customers can be identified through complaints, but it cannot be determined whether a customer who has not been complained about has committed fraud; the original credit card fraud sample data set therefore comprises many customer samples with unknown conditions and a smaller number of original fraudulent customer samples, with far more of the former. In this scenario, a sample in the credit card fraud sample data set is a customer sample, and the corresponding label is fraud or no fraud.
Also for example: in a savings card recommendation scenario, a savings card recommendation sample data set theoretically comprises customers willing to transact, customers unwilling to transact, and customers with unknown conditions. In practical application, willing customers can be determined from existing savings card transactions, but for users who have not transacted a savings card it cannot be determined whether they are willing; the original savings card recommendation sample data set therefore comprises many customer samples with unknown conditions and a smaller number of customer samples originally willing to transact, with far more of the former. In this scenario, a sample in the savings card recommendation sample data set is a customer sample, and the corresponding label is willing to transact or unwilling to transact.
And step 200, determining the similarity of each unlabeled sample and the original positive sample.
It can be understood that the higher the similarity between an unlabeled sample and the original positive samples, the more likely the unlabeled sample is to be labeled as a positive sample. The samples adjacent to an unlabeled sample can be found with the kd-tree algorithm in sklearn (scikit-learn), and the proportion of original positive samples among them is then determined to calculate the similarity between the unlabeled sample and the original positive samples.
Such as: in a credit card fraud scenario, in order to determine as far as possible whether fraud exists for a customer sample with unknown conditions, the similarity between the customer sample with unknown conditions and the original fraudulent customer samples is determined.
Also for example: in a savings card recommendation scenario, in order to determine as far as possible whether a customer sample with unknown conditions is willing to transact a savings card, the similarity between the customer sample with unknown conditions and the customer samples originally willing to transact is determined.
And step 300, obtaining random numbers which obey [0,1] uniform distribution.
It can be understood that a random number β is drawn from the [0,1] uniform distribution; this random number β is compared with the similarity calculated above for each unlabeled sample, so that the unlabeled samples can be labeled reasonably.
And step 400, marking the unlabeled samples as update positive samples or update negative samples based on the comparison result of the similarity corresponding to the unlabeled samples and the random number β.
It can be understood that the similarity between each unlabeled sample and the original positive samples calculated in step 200 is compared with the random number β drawn in step 300 from the [0,1] uniform distribution, and the unlabeled sample is labeled according to the comparison result as an updated positive sample or an updated negative sample.
The rule for labeling an unlabeled sample from the comparison of its similarity with the random number β may be either of the following: label the sample as an updated positive sample when the similarity is greater than or equal to β and as an updated negative sample when the similarity is less than β; or label it as an updated positive sample when the similarity is strictly greater than β and as an updated negative sample when the similarity is less than or equal to β.
Such as: in a credit card fraud scenario, the similarity between a customer sample with unknown conditions and the fraudulent customer samples is compared with the random number β drawn from the [0,1] uniform distribution, and the customer sample with unknown conditions is labeled according to the comparison result as either an updated fraudulent customer sample or a fraud-free customer sample.
Also for example: in a savings card recommendation scenario, the similarity between a customer sample with unknown conditions and the customer samples willing to transact is compared with the random number β drawn from the [0,1] uniform distribution, and the customer sample with unknown conditions is labeled according to the comparison result as either an updated willing-to-transact customer sample or an unwilling-to-transact customer sample.
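The labeling rule above can be sketched as follows. This is a minimal illustration using the "greater than or equal" convention; the helper name `label_unlabeled` and the fixed β value are ours, not the patent's.

```python
import numpy as np


def label_unlabeled(similarities, beta):
    # First convention from the text: similarity >= beta gives an
    # updated positive sample (+1), similarity < beta gives an updated
    # negative sample (-1). The strict '>' variant differs only when
    # the similarity exactly equals beta.
    similarities = np.asarray(similarities, dtype=float)
    return np.where(similarities >= beta, 1, -1)


# Three unlabeled samples with similarities 0.8, 0.3 and 0.5, compared
# against a threshold beta = 0.5 (fixed here for clarity; in the method
# beta is drawn from the [0,1] uniform distribution).
labels = label_unlabeled([0.8, 0.3, 0.5], beta=0.5)
```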
Step 500, obtaining a sample update data set based on the update positive sample, the update negative sample and the original positive sample, wherein the positive sample of the sample update data set comprises the update positive sample and the original positive sample.
It can be understood that step 400 has obtained updated positive and negative samples by labeling the unlabeled samples, so no unlabeled samples remain in the sample update data set; the data set contains the original positive samples, the updated positive samples and the updated negative samples, and the number of positive samples in it is thereby increased.
Such as: in a credit card fraud scenario, after the labeling of the customer samples with unknown conditions, the credit card fraud sample update data set contains no customer samples with unknown conditions; it contains the original fraudulent customer samples, the updated fraudulent customer samples and the fraud-free customer samples, so additional fraudulent customer samples have been mined from the credit card fraud sample data set.
Also for example: in a savings card recommendation scenario, after the labeling of the customer samples with unknown conditions, the savings card recommendation sample update data set contains no customer samples with unknown conditions; it contains the customer samples originally willing to transact, the updated willing-to-transact customer samples and the unwilling-to-transact customer samples, so additional willing-to-transact customer samples have been mined from the savings card recommendation sample data set.
According to the sample data set capacity expansion method provided by the embodiment of the invention, the similarity between each unlabeled sample and the positive samples in the sample data set is compared with a random number β drawn from the [0,1] uniform distribution, and the unlabeled sample is labeled as a positive or negative sample according to the comparison result. More positive samples can thus be mined, expanding the sample data set; the expanded sample data set helps positive sample learning function more accurately, improving the accuracy and efficiency of model training.
As shown in fig. 2, in an embodiment of the present invention, the determining of the similarity of each unlabeled sample to the original positive samples in step 200 may include steps 210 to 230.
And step 210, acquiring k samples adjacent to each unlabeled sample.
It can be understood that in the sample data set, the unlabeled samples and the positive samples are arranged in a certain positional relationship, where k samples adjacent to each unlabeled sample are obtained for determining the similarity between the unlabeled sample and the original positive sample.
Such as: in a credit card fraud scene, in an original credit card fraud sample data set, an unknown customer sample and an original fraud customer sample are arranged in a certain positional relationship, and k customer samples adjacent to each unknown customer sample are obtained here and used for judging the similarity between the unknown customer sample and the fraud customer sample.
Also for example: in a savings card recommendation scene, in an original savings card recommendation sample data set, customer samples with unknown conditions and customer samples willing to be transacted originally are arranged in a certain position relationship, and k customer samples adjacent to each customer sample with unknown conditions are obtained here and used for judging the similarity between the customer samples with unknown conditions and the customer samples willing to be transacted originally.
Step 220, determine the proportion of the original positive samples in the k samples.
It can be understood that some of the k samples obtained in step 210 are original positive samples and some are unlabeled samples; the proportion of original positive samples among the k samples is calculated and used to determine the similarity between the unlabeled sample and the original positive samples.
Such as: in a credit card fraud scene, a part of the k obtained customer samples is original customer samples with fraud conditions, and a part of the k obtained customer samples is unknown customer samples, at the moment, the proportion of the original customer samples with fraud conditions in the k customer samples is calculated, and the proportion is used for judging the similarity between the unknown customer samples and the original customer samples with fraud conditions.
Also for example: in a situation of recommending a deposit card, a part of the k acquired customer samples are customer samples which are willing to be transacted originally, and a part of the k acquired customer samples are customer samples with unknown conditions, and at the moment, the proportion of the customer samples which are willing to be transacted originally in the k customer samples is calculated and used for judging the similarity between the customer samples with unknown conditions and the customer samples which are willing to be transacted originally.
And step 230, taking the proportion of the original positive sample as the similarity of the unlabeled sample and the original positive sample.
It can be understood that the ratio of the original positive samples in the k samples calculated in step 220 is taken as the similarity between the unlabeled samples and the original positive samples.
Such as: and under the credit card fraud scene, taking the calculated proportion of the original fraud client sample in the k client samples as the similarity between the unknown client sample and the original fraud client sample.
Also for example: and under the situation of recommending the savings card, taking the calculated proportion of the customer samples which are willing to be transacted originally in the k customer samples as the similarity between the customer samples which are not in the condition and the customer samples which are willing to be transacted originally.
This method of determining the similarity is simple, and the similarity between an unlabeled sample and the positive samples can be obtained conveniently and quickly.
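Steps 210 to 230 can be sketched with scikit-learn's `NearestNeighbors` (kd-tree, Euclidean distance) as follows. This is an illustrative reading, not the patent's code; in particular, excluding a query point from its own neighbour list by dropping the first returned neighbour is our assumption (it holds when samples are distinct).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def positive_fraction_similarity(X, is_positive, k):
    """Similarity of each unlabeled sample to the original positives.

    For each unlabeled sample, the similarity is the fraction of
    original positive samples among its k nearest neighbours
    (Euclidean distance, kd-tree), as in steps 210-230.
    X: (n, d) array of all samples; is_positive: length-n boolean
    array (True for original positives). Returns one similarity per
    unlabeled sample, in their order of appearance in X.
    """
    X = np.asarray(X, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm="kd_tree").fit(X)
    unlabeled_idx = np.flatnonzero(~is_positive)
    # The k+1 neighbours include the query point itself at distance 0;
    # drop that first column to keep only the k true neighbours.
    _, idx = nn.kneighbors(X[unlabeled_idx])
    neighbour_idx = idx[:, 1:]
    return is_positive[neighbour_idx].mean(axis=1)
```

For example, an unlabeled point surrounded only by original positives gets similarity 1.0, while one whose k neighbours are half positive gets 0.5.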
As shown in fig. 3, in an embodiment of the present invention, the obtaining of k samples adjacent to each unlabeled sample in step 210 may include steps 211 to 213.
Step 211, representing the samples of the original sample data set by vectors.
It can be understood that, in order to represent the positions of the samples in the original sample data set in space, it is convenient to find samples adjacent to the unlabeled samples, and the samples of the original sample data set are represented by vectors.
Such as: in a credit card fraud scene, in order to express the position relation of the customer samples in the credit card fraud sample data set, the customer samples adjacent to the customer samples with unknown conditions can be conveniently found out, and the customer samples in the original credit card fraud sample data set are expressed through vectors.
Also for example: in a savings card recommendation scene, in order to show the position relationship of the customer samples in the savings card recommendation sample data set, the customer samples adjacent to the customer samples with unknown conditions can be conveniently found out, and the customer samples in the original savings card recommendation sample data set are shown through vectors.
Step 212, the euclidean distance between each unlabeled sample and other samples of the original sample data set is obtained.
It can be understood that, in order to represent the position relationship of the samples in the original sample data set and thereby conveniently find the samples adjacent to an unlabeled sample, the Euclidean distance between each unlabeled sample and the other samples of the original sample data set is obtained. The Euclidean distance is the ordinary straight-line distance between two points in Euclidean space, i.e. the true distance between two points in m-dimensional space; the natural length of a vector (the distance from the point to the origin) is its Euclidean norm, and in two- and three-dimensional space the Euclidean distance is the actual distance between two points.
Such as: in a credit card fraud scene, in order to show the position relationship of the client samples in the original credit card fraud sample data set, the client samples adjacent to the client samples with unknown conditions can be conveniently found out through the position relationship, and the Euclidean distance between each client sample with unknown conditions and other samples in the original credit card fraud sample data set is obtained.
Also for example: in a savings card recommendation scene, in order to show the position relationship of the customer samples in the original savings card recommendation sample data set, the customer samples adjacent to the customer samples with unknown conditions can be conveniently found through the position relationship, and the Euclidean distance between each customer sample with unknown conditions and other samples in the original savings card recommendation sample data set is obtained.
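As a minimal illustration of step 212, the Euclidean distance between two sample vectors can be computed directly from its definition (the helper name is ours):

```python
import numpy as np


# Euclidean distance between two sample vectors x and y in
# m-dimensional space: d(x, y) = sqrt(sum_i (x_i - y_i)^2).
def euclidean_distance(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))


# A 3-4-5 right triangle in two dimensions.
d = euclidean_distance([0.0, 3.0], [4.0, 0.0])
```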
And step 213, selecting the k samples with the shortest Euclidean distance as the k samples adjacent to the unlabeled sample.
It can be understood that, with the Euclidean distance as the distance metric and the Euclidean distances between each unlabeled sample and the other samples of the original sample data set obtained in step 212, the k samples with the shortest Euclidean distance to the unlabeled sample are selected.
It should be noted that, for each unlabeled sample, the kd-tree algorithm in sklearn (a machine learning library) may be used to find the k samples with the shortest Euclidean distance.
Such as: in a credit card fraud scenario, with the Euclidean distances between each customer sample with unknown conditions and the other customer samples of the original credit card fraud sample data set obtained, the k customer samples with the shortest Euclidean distance to the customer sample with unknown conditions are selected.
Also for example: in a savings card recommendation scene, under the condition that the Euclidean distance between each customer sample with unknown condition and other customer samples of the original savings card recommendation sample data set is obtained, k customer samples with the shortest Euclidean distance to the customer sample with unknown condition are selected.
The method for obtaining the samples adjacent to an unlabeled sample provided by the embodiment of the invention is well-founded, and can quickly and accurately determine the samples adjacent to the unlabeled sample.
In some embodiments, the acquiring random numbers β that obey a [0,1] uniform distribution, the labeling unlabeled samples as updated positive samples or updated negative samples based on a comparison between the similarity corresponding to each unlabeled sample and the random number β, and the obtaining an updated sample data set based on the updated positive samples, the updated negative samples, and the original positive samples, wherein the positive samples of the updated sample data set comprise the updated positive samples and the original positive samples, comprise: labeling the unlabeled samples once for each random number β, so as to obtain a plurality of sample update data sets in one-to-one correspondence with the plurality of random numbers β.
It can be understood that, each time a random number β is acquired, the similarity corresponding to each unlabeled sample in the sample data set is compared with that random number β and the unlabeled sample is labeled accordingly; that is, each acquisition of a random number β yields one sample update data set. Since the random number β is acquired num times, the sample data set undergoes the expansion processing of steps 100-500 num times, so that num sample update data sets in one-to-one correspondence with the num random numbers β are obtained.
Note that, owing to the randomness of β, the labels assigned to the same unlabeled sample across multiple labelings are not necessarily identical; however, the expected number of times a sample is labeled positive is num × similarity.
For example, in a credit card fraud scenario, the unlabeled customer samples are labeled once for each random number β to obtain a plurality of credit card fraud sample update data sets in one-to-one correspondence with the plurality of random numbers β.
As another example, in a savings card recommendation scenario, the unlabeled customer samples are labeled once for each random number β to obtain a plurality of savings card recommendation sample update data sets in one-to-one correspondence with the plurality of random numbers β.
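The repeated random-threshold labeling can be sketched as below; the function names and the use of Python's random module with a fixed seed are illustrative assumptions. Each draw of β yields one label set, and a sample with similarity s is labeled positive with probability s per draw, so across num draws the expected positive count is num × s.

```python
# Sketch of the repeated labeling: one uniform random number beta per pass;
# a sample is labeled positive (1) when its similarity exceeds beta,
# otherwise negative (-1). Names are illustrative assumptions.
import random

def label_once(similarities, beta):
    """Label every unlabeled sample against a single threshold beta."""
    return [1 if s > beta else -1 for s in similarities]

def expand(similarities, num, seed=0):
    """Draw num random numbers beta uniformly from [0,1] and return num
    label sets, one per beta, in one-to-one correspondence."""
    rng = random.Random(seed)
    return [label_once(similarities, rng.uniform(0.0, 1.0)) for _ in range(num)]
```

Over many draws, the fraction of label sets in which a given sample is positive approaches its similarity, which is the expectation stated above.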
The embodiment of the invention thus labels the sample data set multiple times to obtain a plurality of sample update data sets, which can improve the accuracy of the capacity expansion of the sample data set.
An embodiment of the present invention provides a sample data set capacity expansion device, where the device includes: the device comprises an acquisition unit, a first processing unit, a second processing unit, a third processing unit and an output unit.
The acquiring unit is used for acquiring an original sample data set, wherein samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples.
The first processing unit is used for determining the similarity of each unlabeled sample and the original positive sample.
The second processing unit is used for acquiring random numbers which obey [0,1] uniform distribution.
And the third processing unit is used for marking the unlabeled sample as an updated positive sample or an updated negative sample based on the comparison result of the similarity corresponding to the unlabeled sample and the random number.
The output unit is used for obtaining a sample update data set based on the update positive sample, the update negative sample and the original positive sample, wherein the positive sample of the sample update data set comprises the update positive sample and the original positive sample.
The sample data set capacity expansion device provided in the embodiment of the present invention is used to execute the capacity expansion method for the sample data set, and a specific implementation manner of the sample data set capacity expansion device is consistent with that of the method, and details are not described here.
The sample data set capacity expansion device provided by the embodiment of the invention compares, via the first processing unit, the second processing unit, and the third processing unit, the similarity between each unlabeled sample and the positive samples in the sample data set with an acquired random number uniformly distributed on [0,1], and labels the unlabeled sample as a positive or negative sample according to the comparison result. More positive samples can thereby be mined to expand the sample data set, and the expanded sample data set enables positive-sample learning to function more accurately, improving the accuracy and efficiency of model training.
Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor (processor)410, a communication Interface 420, a memory (memory)430 and a communication bus 440, wherein the processor 410, the communication Interface 420 and the memory 430 are communicated with each other via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform a method of expanding a sample data set, the method comprising: acquiring an original sample data set, wherein samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples; determining the similarity of each unlabeled sample and the original positive sample; acquiring random numbers which obey [0,1] uniform distribution; labeling the unlabeled sample as an updated positive sample or an updated negative sample based on the comparison result of the similarity corresponding to the unlabeled sample and the random number; a sample update dataset is derived based on the update positive samples, the update negative samples, and the original positive samples, the positive samples of the sample update dataset comprising the update positive samples and the original positive samples.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 410, the communication interface 420, the memory 430, and the communication bus 440 shown in fig. 4, where the processor 410, the communication interface 420, and the memory 430 complete mutual communication through the communication bus 440, and the processor 410 may call the logic instruction in the memory 430 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute a method of expanding a sample data set, the method comprising: acquiring an original sample data set, wherein samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples; determining the similarity of each unlabeled sample and the original positive samples; acquiring random numbers which obey a [0,1] uniform distribution; labeling the unlabeled sample as an updated positive sample or an updated negative sample based on the comparison result of the similarity corresponding to the unlabeled sample and the random number; and deriving a sample update data set based on the updated positive samples, the updated negative samples, and the original positive samples, the positive samples of the sample update data set comprising the updated positive samples and the original positive samples.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform a method for expanding a sample data set when executed by a processor, and the method includes: acquiring an original sample data set, wherein samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples; determining the similarity of each unlabeled sample and the original positive sample; acquiring random numbers which obey [0,1] uniform distribution; labeling the unlabeled sample as an updated positive sample or an updated negative sample based on the comparison result of the similarity corresponding to the unlabeled sample and the random number; a sample update dataset is derived based on the update positive samples, the update negative samples, and the original positive samples, the positive samples of the sample update dataset comprising the update positive samples and the original positive samples.
In another aspect, an embodiment of the present invention provides a model training method, and fig. 5 illustrates a basic principle of the model training method.
As shown in FIG. 6, the model training method of the present invention includes steps 600-800.
Step 600: by using the sample data set expansion method in the above embodiment, a plurality of sample update data sets are obtained, and the sample tags corresponding to the update positive sample and the original positive sample are 1, and the sample tag corresponding to the update negative sample is-1.
It can be understood that, since the random number β is acquired q times in the capacity expansion method of the above embodiment, the sample data set undergoes the expansion processing of steps 100-500 q times, so that q sample update data sets in one-to-one correspondence with the q random numbers β are obtained. Owing to the randomness of β, the labels assigned to the same unlabeled sample across multiple labelings are not necessarily identical, but the expected number of times a sample is labeled positive is q × similarity.
For example: in a credit card fraud scenario, the capacity expansion method of the above embodiment is applied to the credit card fraud sample data set to obtain a plurality of credit card fraud sample update data sets; the sample labels corresponding to the updated fraud-bearing customer samples and the original fraud-bearing customer samples are 1, and the sample labels corresponding to the updated non-fraud-bearing customer samples are -1.
As another example: in a savings card recommendation scenario, the capacity expansion method of the above embodiment is applied to the savings card recommendation sample data set to obtain a plurality of savings card recommendation sample update data sets; the sample labels corresponding to the updated willing-to-transact customer samples and the original willing-to-transact customer samples are 1, and the sample labels corresponding to the updated unwilling-to-transact customer samples are -1.
Step 700: the data sets are updated using the samples, a plurality of submodels are trained, and at least two submodels respectively update the data sets using different samples.
It should be noted that the plurality of submodels all serve the same purpose, namely classifying the samples in the sample data set; the different submodels take the same type of sample data set as input and produce the same type of output.
For example: in a credit card fraud scenario, a plurality of credit card fraud submodels are trained using the credit card fraud sample update data sets, and at least two credit card fraud submodels use different credit card fraud sample update data sets respectively; the credit card fraud submodels all serve the same purpose, take credit card fraud sample data sets as input, and produce the same type of output.
As another example: in a savings card recommendation scenario, a plurality of savings card recommendation submodels are trained using the savings card recommendation sample update data sets, and at least two savings card recommendation submodels use different savings card recommendation sample update data sets respectively; the savings card recommendation submodels all serve the same purpose, take savings card recommendation sample data sets as input, and produce the same type of output.
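Step 700 can be sketched as follows; the choice of logistic regression as the submodel and the variable names are illustrative assumptions — the embodiment does not prescribe a particular submodel type.

```python
# Sketch of step 700: one submodel per sample update data set, so that
# at least two submodels are trained on different update data sets.
# Logistic regression is an illustrative submodel choice.
from sklearn.linear_model import LogisticRegression

def train_submodels(X, label_sets):
    """Fit one classifier per label set (one label set per random number beta)."""
    return [LogisticRegression().fit(X, y) for y in label_sets]
```

Because the label sets differ across β draws, the submodels differ even though they share the same features and the same purpose.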
Step 800: and determining the model based on the trained sub-models.
It should be noted that the plurality of submodels trained in step 700 are combined to form the final model; when the combined model is used and data to be judged is input, the output of the model is the average of the outputs of the plurality of submodels.
For example: in a credit card fraud scenario, the plurality of trained credit card fraud submodels are combined to form the final credit card fraud model. When a credit card fraud sample data set is input to the combined model, each credit card fraud submodel may output a different result for a single customer sample to be predicted, and the average of the results output by the credit card fraud submodels is the result output by the credit card fraud model.
As another example: in a savings card recommendation scenario, the plurality of trained savings card recommendation submodels are combined to form the final savings card recommendation model. When a savings card recommendation sample data set is input to the combined model, each savings card recommendation submodel may output a different result for a single customer sample to be predicted, and the average of the results output by the savings card recommendation submodels is the result output by the savings card recommendation model.
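The combination in step 800 can be sketched as follows; the class name is an illustrative assumption. The combined model simply averages the outputs of its submodels on the data to be judged.

```python
# Sketch of step 800: the final model wraps the trained submodels and
# outputs the average of their individual outputs. Names are illustrative.
import numpy as np

class AveragedModel:
    def __init__(self, submodels):
        self.submodels = submodels

    def predict(self, X):
        # each submodel produces one output per sample; average across submodels
        outputs = np.array([np.asarray(m.predict(X), dtype=float)
                            for m in self.submodels])
        return outputs.mean(axis=0)
```

Any object exposing a `predict` method can serve as a submodel here, so the classifiers trained in step 700 plug in directly.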
According to the model training method provided by the embodiment of the invention, a plurality of submodels are trained in one-to-one correspondence with the plurality of sample update data sets obtained by the sample data set capacity expansion method of the above embodiment, and are finally combined into the model, so that positive-sample learning can function more accurately, improving the accuracy and efficiency of model training.
The embodiment of the invention provides a model training device, which comprises: the device comprises an acquisition module, a processing module and an output module.
The obtaining module is configured to obtain a plurality of sample update data sets by using the sample data set expansion method in the foregoing embodiment, where sample tags corresponding to the update positive sample and the original positive sample are 1, and a sample tag corresponding to the update negative sample is-1.
The processing module is used for training a plurality of submodels using the sample update data sets, at least two submodels using different sample update data sets respectively.
The output module is used for determining the model based on the trained sub models.
The model training device provided in the embodiment of the present invention is used for executing the above model training method, and the specific implementation manner thereof is consistent with the implementation manner of the method, and is not described herein again.
The model training device provided by the embodiment of the invention, through the acquisition module, the processing module, and the output module, uses the sample data set capacity expansion method of the above embodiment, trains a plurality of submodels in one-to-one correspondence with the plurality of sample update data sets, and finally combines the submodels into the model, so that positive-sample learning can function more accurately, improving the accuracy and efficiency of model training.
An embodiment of the present invention provides an electronic device, which may include: the system comprises a processor (processor), a communication Interface (communication Interface), a memory (memory) and a communication bus, wherein the processor, the communication Interface and the memory are communicated with each other through the communication bus. The processor may call logic instructions in the memory to perform an application method of the model, the method comprising: inputting data to be judged into the model to obtain a test result output by the model; the model is obtained by training according to the model training method in the embodiment.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices as long as the structure includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus, and the processor may call a logic instruction in the memory to execute the method. The embodiment does not limit the specific implementation form of the electronic device.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute an application method of the model, the method comprising: inputting data to be judged into the model to obtain a test result output by the model; the model is obtained by training according to the model training method in the above embodiment.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a model training method, the method comprising: obtaining a plurality of sample update data sets using the sample data set capacity expansion method provided by the embodiment of the invention, wherein the sample labels corresponding to the updated positive samples and the original positive samples are 1, and the sample labels corresponding to the updated negative samples are -1; training a plurality of submodels using the sample update data sets, wherein at least two submodels use different sample update data sets respectively; and determining the model based on the trained submodels.
As shown in fig. 7, the method for applying the model provided by the embodiment of the present invention may include steps 910 to 920.
Wherein, step 910: and inputting the data to be judged into the model.
The model is obtained by training according to the model training method in the embodiment.
Step 920: and obtaining a test result output by the model.
It should be noted that, when data to be judged is input into the model, each submodel in the model processes the data once and outputs a result, and the model averages the plurality of results to output the final prediction.
For example: in a credit card fraud scenario, when a credit card fraud sample data set to be predicted is input into the credit card fraud model, each credit card fraud submodel in the model processes the data set once and outputs a result, and the credit card fraud model averages the plurality of results to output the predicted credit card fraud sample data set.
As another example: in a savings card recommendation scenario, when a savings card recommendation sample data set to be predicted is input into the savings card recommendation model, each savings card recommendation submodel in the model processes the data set once and outputs a result, and the savings card recommendation model averages the plurality of results to output the predicted savings card recommendation sample data set.
According to the model application method provided by the embodiment of the invention, the model trained with the sample data set capacity expansion method of the above embodiment is applied in practice, so that positive-sample learning can function more accurately, the accuracy and efficiency of model training are improved, and the result output by the model is more accurate.
The embodiment of the invention provides an application device of a model, which comprises: an input module and an output module.
The input module is used for inputting data to be judged to the model.
The model is obtained by training according to the model training method in the embodiment.
The output module is used for obtaining the test result output by the model.
The application apparatus of the model provided in the embodiment of the present invention is used for executing the application method of the model, and the specific implementation manner of the application apparatus of the model is consistent with the implementation manner of the method, and is not described herein again.
According to the application device of the model provided by the embodiment of the invention, the input module and the output module apply, in practice, the model trained with the sample data set capacity expansion method of the above embodiment, so that positive-sample learning can function more accurately, the accuracy and efficiency of model training are improved, and the result output by the model is more accurate.
An embodiment of the present invention provides an electronic device, which may include: the system comprises a processor (processor), a communication Interface (communication Interface), a memory (memory) and a communication bus, wherein the processor, the communication Interface and the memory are communicated with each other through the communication bus. The processor may call logic instructions in the memory to perform an application method of the model, the method comprising: inputting data to be judged into the model to obtain a test result output by the model; the model is obtained by training according to the model training method in the embodiment.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices as long as the structure includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus, and the processor may call a logic instruction in the memory to execute the method. The embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute an application method of the model, the method comprising: inputting data to be judged into the model to obtain a test result output by the model; the model is obtained by training according to the model training method in the above embodiment.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing an application method of the model, the method comprising: inputting data to be judged into the model to obtain a test result output by the model; the model is obtained by training according to the model training method in the above embodiment.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
It is worth mentioning that the capacity expansion method of the invention is applied to the credit card fraud scene, and the following scheme can be obtained:
in a first aspect, an embodiment of the present invention provides a method for expanding a credit card fraud sample data set, where in some embodiments, the method includes: obtaining an original credit card fraud sample data set, wherein the customer samples of the original credit card fraud sample data set comprise a plurality of unlabeled customer samples and a plurality of original fraud-bearing customer samples; determining a similarity of each unlabeled customer sample to the original fraud-bearing customer samples; acquiring random numbers which obey a [0,1] uniform distribution; labeling the unlabeled customer sample as an updated fraud-bearing customer sample or an updated non-fraud-bearing customer sample based on the comparison result of the similarity corresponding to the unlabeled customer sample and the random number; and obtaining a credit card fraud sample update data set based on the updated fraud-bearing customer samples, the updated non-fraud-bearing customer samples, and the original fraud-bearing customer samples, the fraud-bearing customer samples of the credit card fraud sample update data set including the updated fraud-bearing customer samples and the original fraud-bearing customer samples.
In some embodiments, said determining a similarity of each unlabeled customer sample to the original fraud-bearing customer samples comprises: acquiring k customer samples adjacent to each unlabeled customer sample; determining the proportion of original fraud-bearing customer samples among the k customer samples; and taking the proportion of original fraud-bearing customer samples as the similarity of the unlabeled customer sample to the original fraud-bearing customer samples.
In some embodiments, said acquiring k customer samples adjacent to each unlabeled customer sample comprises: representing the customer samples of the original credit card fraud sample data set by vectors; obtaining the Euclidean distance between each unlabeled customer sample and the other samples of the original credit card fraud sample data set; and selecting the k customer samples with the shortest Euclidean distance as the k customer samples adjacent to the unlabeled customer sample.
In some embodiments, the acquiring random numbers which obey a [0,1] uniform distribution, the labeling the unlabeled customer sample as an updated fraud-bearing customer sample or an updated non-fraud-bearing customer sample based on the comparison result of the similarity corresponding to the unlabeled customer sample and the random number, and the obtaining a credit card fraud sample update data set based on the updated fraud-bearing customer samples, the updated non-fraud-bearing customer samples, and the original fraud-bearing customer samples, the fraud-bearing customer samples of the sample update data set including the updated fraud-bearing customer samples and the original fraud-bearing customer samples, comprise: labeling the unlabeled customer samples once for each random number to obtain a plurality of credit card fraud sample update data sets in one-to-one correspondence with the random numbers.
In a second aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where in some embodiments, the processor implements the steps of the method for expanding a credit card fraud sample data set according to any of the above embodiments when executing the program.
In a third aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where in some embodiments, the computer program is executed by a processor to implement the steps of the method for expanding the credit card fraud sample data set according to any of the above embodiments.
In a fourth aspect, embodiments of the present invention provide a credit card fraud model training method, in some embodiments the model comprising a plurality of submodels, the method comprising: obtaining a plurality of credit card fraud sample update data sets using the capacity expansion method for a credit card fraud sample data set according to any of the above embodiments, wherein the sample labels corresponding to the updated fraud-bearing customer samples and the original fraud-bearing customer samples are 1, and the sample labels corresponding to the updated non-fraud-bearing customer samples are -1; training the plurality of submodels using the credit card fraud sample update data sets, at least two of the submodels using different credit card fraud sample update data sets respectively; and determining the model based on the trained plurality of submodels.
In a fifth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium on which a computer program is stored. In some embodiments, the computer program, when executed by a processor, implements the steps of the credit card fraud model training method according to any of the above embodiments.
In a sixth aspect, embodiments of the present invention provide a credit card fraud model application method. In some embodiments, the application method includes: inputting the data to be evaluated into the model to obtain a test result output by the model, where the model is trained according to the credit card fraud model training method of the above embodiments.
In a seventh aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored. In some embodiments, the computer program, when executed by a processor, implements the steps of the model application method described in the above embodiments.
Meanwhile, the capacity expansion method of the present invention can also be applied to a savings card recommendation scenario, yielding the following scheme:
in a first aspect, an embodiment of the present invention provides a capacity expansion method for a savings card recommendation sample data set. In some embodiments, the method includes: obtaining an original savings card recommendation sample data set, where the customer samples of the original savings card recommendation sample data set include a plurality of unknown-status customer samples and a plurality of original willing customer samples, i.e., customers known to be willing to transact the savings card; determining the similarity between each unknown-status customer sample and the original willing customer samples; obtaining random numbers that follow a uniform distribution on [0,1]; labeling each unknown-status customer sample as an updated willing customer sample or an updated unwilling customer sample based on a comparison between the similarity corresponding to that customer sample and the random number; and obtaining a savings card recommendation sample update data set based on the updated willing customer samples, the updated unwilling customer samples, and the original willing customer samples, where the willing customer samples of the savings card recommendation sample update data set include the updated willing customer samples and the original willing customer samples.
In some embodiments, the determining of the similarity between each unknown-status customer sample and the original willing customer samples includes: obtaining the k customer samples adjacent to each unknown-status customer sample; determining the proportion of original willing customer samples among the k customer samples; and taking that proportion as the similarity between the unknown-status customer sample and the original willing customer samples.
In some embodiments, the obtaining of the k customer samples adjacent to each unknown-status customer sample includes: representing the customer samples of the original savings card recommendation sample data set as vectors; obtaining the Euclidean distance between each unknown-status customer sample and every other sample of the original savings card recommendation sample data set; and selecting the k customer samples with the shortest Euclidean distances as the k customer samples adjacent to the unknown-status customer sample.
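The two embodiments above describe a k-nearest-neighbor similarity: represent each sample as a vector, find the k nearest neighbors of each unknown-status sample by Euclidean distance, and take the share of original positive (willing) samples among those neighbors. A minimal sketch with NumPy follows; the function name and the boolean-mask interface are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def knn_similarity(samples, is_positive, is_unlabeled, k=3):
    """For each unlabeled (unknown-status) sample, return the fraction of
    original positive samples among its k nearest neighbors by Euclidean
    distance. `samples` is an (n, d) feature matrix; `is_positive` and
    `is_unlabeled` are boolean masks of length n."""
    sims = {}
    for i in np.where(is_unlabeled)[0]:
        # distances from sample i to every sample in the original data set
        dists = np.linalg.norm(samples - samples[i], axis=1)
        dists[i] = np.inf                        # exclude the sample itself
        neighbors = np.argsort(dists)[:k]        # indices of the k nearest
        sims[i] = is_positive[neighbors].mean()  # share of positives among them
    return sims
```

An unknown-status sample sitting inside a cluster of original positives gets similarity close to 1, while one far from all positives gets similarity close to 0, which is exactly the quantity compared against the uniform random number in the labeling step.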
In some embodiments, the obtaining of random numbers that follow a uniform distribution on [0,1], the labeling of each unknown-status customer sample as an updated willing customer sample or an updated unwilling customer sample based on a comparison between the similarity corresponding to that customer sample and the random number, and the obtaining of a savings card recommendation sample update data set based on the updated willing customer samples, the updated unwilling customer samples, and the original willing customer samples, wherein the willing customer samples of the sample update data set include the updated willing customer samples and the original willing customer samples, together comprise: for each random number, labeling the unknown-status customer samples once, thereby obtaining a plurality of savings card recommendation sample update data sets in one-to-one correspondence with the random numbers.
In a second aspect, an embodiment of the present invention provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor. In some embodiments, the processor, when executing the computer program, implements the steps of the capacity expansion method for a savings card recommendation sample data set according to any one of the above embodiments.
In a third aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored. In some embodiments, the computer program, when executed by a processor, implements the steps of the capacity expansion method for a savings card recommendation sample data set according to any of the above embodiments.
In a fourth aspect, an embodiment of the present invention provides a savings card recommendation model training method, where the model includes a plurality of sub-models, and the savings card recommendation model training method includes: using the capacity expansion method for a savings card recommendation sample data set according to any of the above embodiments, obtaining the plurality of savings card recommendation sample update data sets, where the sample labels corresponding to the updated willing customer samples and the original willing customer samples are 1, and the sample labels corresponding to the updated unwilling customer samples are -1; training the plurality of sub-models using the savings card recommendation sample update data sets, with at least two sub-models trained on different savings card recommendation sample update data sets; and determining the model based on the trained plurality of sub-models.
In a fifth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium on which a computer program is stored. In some embodiments, the computer program, when executed by a processor, implements the steps of the savings card recommendation model training method according to any of the above embodiments.
In a sixth aspect, an embodiment of the present invention provides a savings card recommendation model application method. In some embodiments, the application method includes: inputting the data to be evaluated into the model to obtain a test result output by the model, where the model is trained according to the savings card recommendation model training method of the above embodiments.
In a seventh aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored. In some embodiments, the computer program, when executed by a processor, implements the steps of the model application method described in the above embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A capacity expansion method for a sample data set, characterized by comprising:
obtaining an original sample data set, wherein the samples of the original sample data set comprise a plurality of unlabeled samples and a plurality of original positive samples;
determining the similarity between each unlabeled sample and the original positive samples;
obtaining random numbers that follow a uniform distribution on [0,1];
labeling the unlabeled sample as an updated positive sample or an updated negative sample based on a comparison between the similarity corresponding to the unlabeled sample and the random number; and
obtaining a sample update data set based on the updated positive samples, the updated negative samples, and the original positive samples, the positive samples of the sample update data set comprising the updated positive samples and the original positive samples.
2. The capacity expansion method for a sample data set according to claim 1, wherein the determining of the similarity between each unlabeled sample and the original positive samples comprises:
obtaining the k samples adjacent to each unlabeled sample;
determining the proportion of original positive samples among the k samples; and
taking the proportion of original positive samples as the similarity between the unlabeled sample and the original positive samples.
3. The capacity expansion method for a sample data set according to claim 2, wherein the obtaining of the k samples adjacent to each unlabeled sample comprises:
representing the samples of the original sample data set as vectors;
obtaining the Euclidean distance between each unlabeled sample and every other sample of the original sample data set; and
selecting the k samples with the shortest Euclidean distances as the k samples adjacent to the unlabeled sample.
4. The capacity expansion method for a sample data set according to any one of claims 1 to 3, wherein the obtaining of random numbers that follow a uniform distribution on [0,1], the labeling of the unlabeled sample as an updated positive sample or an updated negative sample based on a comparison between the similarity corresponding to the unlabeled sample and the random number, and the obtaining of a sample update data set based on the updated positive samples, the updated negative samples, and the original positive samples, the positive samples of the sample update data set comprising the updated positive samples and the original positive samples, together comprise:
for each random number, labeling the unlabeled samples once, thereby obtaining a plurality of sample update data sets in one-to-one correspondence with the random numbers.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the capacity expansion method for a sample data set according to any one of claims 1 to 4.
6. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the capacity expansion method for a sample data set according to any one of claims 1 to 4.
7. A model training method, wherein the model comprises a plurality of sub-models, the model training method comprising:
using the capacity expansion method for a sample data set according to claim 4, obtaining the plurality of sample update data sets, wherein the sample labels corresponding to the updated positive samples and the original positive samples are 1, and the sample labels corresponding to the updated negative samples are -1;
training the plurality of sub-models using the sample update data sets, wherein at least two of the sub-models are trained on different sample update data sets; and
determining the model based on the trained plurality of sub-models.
8. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the model training method according to claim 7.
9. A model application method, characterized by comprising:
inputting data to be evaluated into the model to obtain a test result output by the model;
wherein the model is trained according to the model training method of claim 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the model application method according to claim 9.
CN202010117161.4A 2020-02-25 2020-02-25 Sample data set capacity expansion method and model training method Active CN111461164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010117161.4A CN111461164B (en) 2020-02-25 2020-02-25 Sample data set capacity expansion method and model training method

Publications (2)

Publication Number Publication Date
CN111461164A true CN111461164A (en) 2020-07-28
CN111461164B CN111461164B (en) 2024-04-12

Family

ID=71684124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010117161.4A Active CN111461164B (en) 2020-02-25 2020-02-25 Sample data set capacity expansion method and model training method

Country Status (1)

Country Link
CN (1) CN111461164B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170323202A1 (en) * 2016-05-06 2017-11-09 Fujitsu Limited Recognition apparatus based on deep neural network, training apparatus and methods thereof
CN109242165A (en) * 2018-08-24 2019-01-18 蜜小蜂智慧(北京)科技有限公司 A kind of model training and prediction technique and device based on model training

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHUAI WANG et al.: "Impact of Network Topology on the Performance of DML: Theoretical Analysis and Practical Factors", IEEE Xplore, 17 June 2019 (2019-06-17) *
WANG Bin et al.: "A Single-Sample Face Recognition Method Based on Generative Score Space", Journal of Shanghai Jiao Tong University, no. 02, page 3 *
YI Yang et al.: "Remote Sensing Image Classification Method Based on Positive and Unlabeled Samples", Computer Engineering and Applications, no. 04 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766320A (en) * 2020-12-31 2021-05-07 平安科技(深圳)有限公司 Classification model training method and computer equipment
CN112766320B (en) * 2020-12-31 2023-12-22 平安科技(深圳)有限公司 Classification model training method and computer equipment
CN112784883A (en) * 2021-01-07 2021-05-11 厦门大学 Cold water coral distribution prediction method and system based on sample selection expansion
CN112784903A (en) * 2021-01-26 2021-05-11 上海明略人工智能(集团)有限公司 Method, device and equipment for training target recognition model
CN112784903B (en) * 2021-01-26 2023-12-12 上海明略人工智能(集团)有限公司 Method, device and equipment for training target recognition model
CN114418752A (en) * 2022-03-28 2022-04-29 北京芯盾时代科技有限公司 Method and device for processing user data without type label, electronic equipment and medium
CN116204567A (en) * 2023-04-28 2023-06-02 京东科技控股股份有限公司 Training method and device for user mining and model, electronic equipment and storage medium
CN116204567B (en) * 2023-04-28 2023-09-05 京东科技控股股份有限公司 Training method and device for user mining and model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111461164B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN110472090B (en) Image retrieval method based on semantic tags, related device and storage medium
CN111461164A (en) Sample data set capacity expansion method and model training method
CN107969156B (en) Neural network for processing graphics data
CN110110213B (en) Method and device for mining user occupation, computer readable storage medium and terminal equipment
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN107291774B (en) Error sample identification method and device
CN112860993A (en) Method, device, equipment, storage medium and program product for classifying points of interest
CN112712086A (en) Data processing method, data processing device, computer equipment and storage medium
CN111626311B (en) Heterogeneous graph data processing method and device
CN115439192A (en) Medical commodity information pushing method and device, storage medium and computer equipment
CN115982463A (en) Resource recommendation method, device, equipment and storage medium
US20130121558A1 (en) Point Selection in Bundle Adjustment
CN112767190A (en) Phase sequence identification method and device for transformer area based on multilayer stacked neural network
CN111966836A (en) Knowledge graph vector representation method and device, computer equipment and storage medium
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
CN110414845B (en) Risk assessment method and device for target transaction
CN113569070A (en) Image detection method and device, electronic equipment and storage medium
CN114186168A (en) Correlation analysis method and device for intelligent city network resources
CN113591881A (en) Intention recognition method and device based on model fusion, electronic equipment and medium
CN114283300A (en) Label determining method and device, and model training method and device
CN113779276A (en) Method and device for detecting comments
CN108701206A (en) System and method for facial alignment
CN111723247A (en) Graph-based hypothetical computation
CN113936141B (en) Image semantic segmentation method and computer-readable storage medium
JP2016031678A (en) Cluster extraction device, method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant