CN109344862B - Positive sample acquisition method, device, computer equipment and storage medium

Info

Publication number: CN109344862B
Application number: CN201810956661.XA
Authority: CN
Inventors: 黄移军
Original assignee: Ping An Life Insurance Company of China Ltd
Current assignee: Ping An Life Insurance Company of China Ltd
Priority date: 2018-08-21
Filing date: 2018-08-21
Publication date: 2023-11-28
Anticipated expiration: 2038-08-21
Also published as: CN109344862A

Abstract

The application relates to the field of big data processing and provides a positive sample acquisition method, a device, computer equipment and a storage medium. The method comprises the following steps: inputting a first positive sample set and a first negative sample set from a sample set into a preset model for training to obtain a first initial model; inputting the remaining samples in the sample set into the first initial model to calculate, for each remaining sample, a first probability value that the sample is a positive sample; judging, according to the first probability value, whether the sample is a second positive sample; if so, performing an authenticity test on the second positive sample according to a first preset rule; and if the second positive sample is verified to be a real positive sample, keeping its label as a positive sample by default, and otherwise changing its label to a negative sample. In this way the scale of the positive samples is gradually enlarged through big data analysis and processing, and the manpower and material cost of collecting positive samples is reduced.

Description

Positive sample acquisition method, device, computer equipment and storage medium

Technical Field

The present application relates to the field of big data processing technologies, and in particular, to a method and apparatus for acquiring a positive sample, a computer device, and a storage medium.

Background

Training a model with a supervised algorithm requires explicitly labeled positive and negative samples, but in some situations it is difficult to obtain enough positive samples, or obtaining them requires a large amount of manpower and material resources. Take account registration as an example: black-market operators need a large number of accounts, such as the doctor's eye, for profit. A benefit can generally be claimed only once per account, so obtaining more benefit requires as many accounts as possible, which creates a large demand for malicious registration. To identify malicious and normal registrations accurately with a model at registration time, as many malicious registration samples as possible are needed for training, yet such sample labels are generally hard to obtain. The conventional approach is to analyze and investigate the registered accounts one by one to determine each sample's positive or negative label, which consumes a great deal of manpower and material resources and is therefore costly.

Disclosure of Invention

The main object of the present invention is to provide a positive sample acquisition method, apparatus, computer device and storage medium, which reduce the cost of acquiring a sample.

The invention provides a positive sample acquisition method, which comprises the following steps: inputting a first positive sample set and a first negative sample set in the sample set into a preset model for training to obtain a first initial model;

inputting the remaining samples in the sample set into the first initial model for calculation to obtain a first probability value that each remaining sample in the sample set is a positive sample;

judging whether the sample is a second positive sample according to the first probability value;

if yes, performing an authenticity test on the second positive sample according to a first preset rule;

and if the second positive sample is verified to be a true positive sample, defaulting the label of the second positive sample to be a positive sample, otherwise, modifying the label of the second positive sample to be a negative sample.

Further, the step of determining whether the sample is a second positive sample according to the first probability value includes:

judging whether the first probability value exceeds a preset probability value or not;

if yes, the sample is judged to be the second positive sample, and if not, the sample is judged to be the second negative sample.

Further, before the step of inputting the first positive sample set and the first negative sample set in the sample set to a preset model to train to obtain a first initial model, the method includes:

obtaining a plurality of samples according to a second preset rule to form the sample set;

obtaining part of first positive samples in the sample set to form the first positive sample set;

and selecting, from the remaining samples of the sample set, a number of first negative samples corresponding to the number of first positive samples, according to the number of first positive samples in the first positive sample set and the preset ratio of the first positive sample set to the first negative sample set, so as to form the first negative sample set.

Further, the second positive sample is a malicious registered account, and the step of performing the authenticity test on the second positive sample according to the first preset rule includes:

banning the account corresponding to the second positive sample;

if a user response requesting that the second positive sample be unbanned is received after the ban, judging that the second positive sample is not a real positive sample; and if no user response is received after the ban, judging that the second positive sample is a real positive sample.

Further, after the step of defaulting the label of the second positive sample to a positive sample, the method includes:

adding the second positive sample into the first positive sample set to form a second positive sample set;

selecting, from the remaining samples of the sample set, a number of second negative samples corresponding to the number of second positive samples, according to the number of second positive samples in the second positive sample set and the preset ratio of the second positive sample set to the second negative sample set, so as to form the second negative sample set;

and inputting the second positive sample set and the second negative sample set into the preset model for training to obtain a second initial model.

Further, after the step of inputting the second positive sample set and the second negative sample set into the preset model to train to obtain a second initial model, the method includes:

inputting the test sample into the second initial model to calculate to obtain a second probability value;

judging whether the difference value between the second probability value and a preset third probability value exceeds a preset threshold value, wherein the third probability value is determined by business personnel based on their experience with the test sample;

if not, taking the second initial model by default as the result model to be built.

Further, before the step of performing the authenticity test on the second positive sample according to the first preset rule, the method includes:

grading the second positive sample according to a range exceeding the preset probability value;

and respectively calling the authenticity test corresponding to the grade according to the grade of the second positive sample.

The invention also provides a positive sample acquisition device, which comprises:

the training module is used for inputting a first positive sample set and a first negative sample set in the sample set into a preset model for training to obtain a first initial model;

the calculation module is used for inputting the remaining samples in the sample set into the first initial model for calculation to obtain a first probability value that each remaining sample in the sample set is a positive sample;

the judging module is used for judging whether the sample is a second positive sample according to the first probability value;

the testing module is used for testing the authenticity of the second positive sample according to a first preset rule when the sample is judged to be the second positive sample;

and the correction module is used for defaulting the label of the second positive sample to be a positive sample when the second positive sample is verified to be a real positive sample, otherwise, modifying the label of the second positive sample to be a negative sample.

The invention also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.

The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.

The beneficial effects of the invention are as follows: by starting from a small number of positive samples and combining iterative model training with grayscale testing, the scale of the positive samples is gradually enlarged and the manpower and material cost of collecting positive samples is reduced; the sample set for training the model is also enlarged, which improves the accuracy of the model's calculations.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a method for acquiring a positive sample according to an embodiment of the present invention;

FIG. 2 is a block diagram schematically illustrating a positive sample acquiring apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic block diagram illustrating a judging module according to an embodiment of the present invention;

FIG. 4 is a schematic block diagram of a test module according to an embodiment of the present invention;

FIG. 5 is a block diagram schematically illustrating a positive sample acquiring apparatus according to another embodiment of the present invention;

FIG. 6 is a block diagram schematically illustrating a positive sample acquiring apparatus according to another embodiment of the present invention;

FIG. 7 is a block diagram schematically illustrating a positive sample acquiring apparatus according to another embodiment of the present invention;

FIG. 8 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

When a supervised learning model is built, the training data comprises positive samples and negative samples. The positive sample acquisition method is mainly intended for situations in which positive samples are difficult to obtain. For example, when malicious registered accounts are the positive samples, information such as the traces left by logging in to a website with each account and user feedback must be collected account by account, which consumes a large amount of manpower and material resources.

Referring to fig. 1, the method for acquiring a positive sample in the present embodiment includes:

step S1: inputting a first positive sample set and a first negative sample set in the sample set into a preset model for training to obtain a first initial model;

step S2: inputting the remaining samples in the sample set into the first initial model for calculation to obtain a first probability value that each remaining sample in the sample set is a positive sample;

step S3: judging whether the sample is a second positive sample according to the first probability value;

step S4: if yes, performing an authenticity test on the second positive sample according to a first preset rule;

step S5: and if the second positive sample is verified to be a true positive sample, defaulting the label of the second positive sample to be a positive sample, otherwise, modifying the label of the second positive sample to be a negative sample.

In step S1, the sample set may consist of all samples within a period of time or a given range, such as the samples from one month or from one region. The sample set includes positive samples and negative samples for training the model, and the number of positive samples is far smaller than the number of negative samples. The first positive sample set is formed from some of the positive samples in the sample set, and the first negative sample set is drawn from the samples that remain after those positive samples are taken out. Even if the remaining samples include some positive samples, negative samples far outnumber positive samples, so the proportion of positive samples in the selected first negative sample set is small and has very little influence on training. After the first positive sample set and the first negative sample set are determined, they are input into a preset model for training to obtain a first initial model, where the preset model is a supervised learning model.

In step S2, after the first initial model is obtained, the remaining samples in the sample set are input into it for calculation. The first initial model is used for preliminary identification of sample labels: its input is a sample, and its output is the probability that the sample is a positive sample. To enlarge the scale of the positive samples, the positive samples among the remaining samples are identified through the first initial model. For example, one remaining sample is input into the first initial model, which calculates a first probability value that this sample is a positive sample; this is repeated until every remaining sample has been input and has its own first probability value. The calculation performed by the first initial model depends on the algorithm used. For example, to build a model that tests whether a registered account is a malicious registered account, a logistic regression algorithm may be used, in which the model parameters are fitted by maximum likelihood estimation and the probability value is then computed from them.
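The logistic regression example in steps S1 and S2 can be illustrated with a minimal sketch. It is not the patented implementation: the two features, the sample values, the learning rate and the epoch count are all hypothetical, and plain gradient ascent on the log-likelihood is used here as one simple way to carry out maximum likelihood estimation.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(samples, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by gradient ascent on the mean log-likelihood."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = y - p  # derivative of the log-likelihood w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w + lr * g / n for w, g in zip(weights, grad_w)]
        bias += lr * grad_b / n
    return weights, bias

def predict_proba(model, x):
    """Return the probability that sample x is a positive sample."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical features per account: [registrations per IP (scaled), night-time activity share]
first_positive_set = [[0.9, 0.8], [0.8, 0.9], [0.95, 0.7]]
first_negative_set = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.05, 0.1]]

# Step S1: train the first initial model on the two labeled sets.
features = first_positive_set + first_negative_set
labels = [1] * len(first_positive_set) + [0] * len(first_negative_set)
first_initial_model = train_logistic(features, labels)

# Step S2: compute a first probability value for every remaining sample.
remaining_samples = [[0.85, 0.9], [0.1, 0.15]]
first_probabilities = [predict_proba(first_initial_model, x) for x in remaining_samples]
```

In practice a library implementation of logistic regression would normally replace the hand-written training loop; the sketch only shows the flow from the labeled sets to the per-sample probabilities.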

In step S3, whether each remaining sample input into the first initial model is a positive sample is determined from the first probability value output by the model. Such a positive sample is called a second positive sample to distinguish it from the first positive samples in the first positive sample set.

Further, the step S3 of determining whether the sample is a second positive sample according to the first probability value includes:

step S30: judging whether the first probability value exceeds a preset probability value or not;

step S31: if yes, the sample is judged to be the second positive sample, and if not, the sample is judged to be the negative sample.

The output of the first initial model is the probability that the input sample is a positive sample. If this probability exceeds a certain value, the input sample is likely to be a positive sample and can be judged as one. A probability value serving as the judgment criterion is therefore preset before the judgment; generally, the preset probability value is 0.5. In step S31, when the first probability value output by the first initial model is greater than 0.5, the input sample is judged to be a second positive sample; when the first probability value is less than 0.5, the input sample is judged to be a negative sample.
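Steps S30 and S31 amount to a simple threshold comparison, sketched below. The account names and scores are made up for illustration, and treating a probability of exactly 0.5 as a negative sample is an assumption, since the description only covers values above and below the preset probability value.

```python
PRESET_PROBABILITY = 0.5  # typical value named in the description

def judge_sample(first_probability, preset=PRESET_PROBABILITY):
    """Steps S30/S31: a sample is a second positive sample only if its
    first probability value exceeds the preset probability value."""
    # Assumption: a probability exactly equal to the preset value counts as negative.
    return "second_positive" if first_probability > preset else "negative"

# Hypothetical first probability values output by the first initial model.
first_probabilities = {"acct_a": 0.91, "acct_b": 0.50, "acct_c": 0.12}
judged = {name: judge_sample(p) for name, p in first_probabilities.items()}
```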

In step S4, if the sample input into the first initial model is judged to be a second positive sample, an authenticity test is performed on it according to a first preset rule. Because the first initial model is trained on only a small number of samples from the sample set, it is unstable, its results are not accurate enough, and there is a risk of misjudgment. Therefore, even when a sample is judged to be a second positive sample, an authenticity test is still needed to verify whether it is a real positive sample. The first preset rule is set according to the nature of the samples: different kinds of samples call for different authenticity tests, and hence different first preset rules.

In a specific embodiment, the step S4 of performing the authenticity test on the second positive sample according to the first preset rule includes:

step S40: banning the account corresponding to the second positive sample;

step S41: if a user response requesting that the second positive sample be unbanned is received after the ban, judging that the second positive sample is not a real positive sample; and if no user response is received after the ban, judging that the second positive sample is a real positive sample.

In this embodiment, an initial model is to be built for testing whether a registered account is a malicious registered account. The positive samples are malicious registered accounts, the negative samples are normal registered accounts, and the first preset rule is to perform an account-ban test on the registered account. When the first initial model identifies a registered account as malicious, the authenticity test is performed on this second positive sample according to the first preset rule, that is, the account is banned. If a user response requesting unbanning is received after the ban (for example, the user's application to unban the account), the second positive sample is judged not to be a real positive sample. If no user response is received after the ban, the second positive sample is judged to be a real positive sample, because unbanning has a cost and the user of a malicious registered account generally does not respond.

In another embodiment, when a model is to be built for judging whether an account is a credential-stuffing account, the positive samples are credential-stuffing accounts, and in this step the authenticity test can likewise be performed on the positive samples by banning the accounts.

In step S5, the test result determines whether the second positive sample is a real positive sample. If it is, the label of the second positive sample is kept as a positive sample by default and does not need to be modified; if it is not, the label of the second positive sample is changed from positive sample to negative sample.
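The ban-based authenticity test of steps S40 and S41 and the label correction of step S5 can be sketched as follows. The `SecondPositive` record and its `unban_requested` flag are hypothetical stand-ins for whatever the account system reports after a ban; how the ban is applied and how long a response is awaited are outside the sketch.

```python
from dataclasses import dataclass

@dataclass
class SecondPositive:
    account_id: str
    unban_requested: bool    # hypothetical flag: True if the user asked for unbanning after the ban
    label: str = "positive"  # label proposed by the first initial model

def is_real_positive(sample: SecondPositive) -> bool:
    """Step S41: no user response after the ban means a real positive sample."""
    return not sample.unban_requested

def correct_label(sample: SecondPositive) -> SecondPositive:
    """Step S5: keep the positive label if verified, otherwise change it to negative."""
    if not is_real_positive(sample):
        sample.label = "negative"
    return sample

candidates = [
    SecondPositive("acct_001", unban_requested=False),  # nobody asked for unbanning
    SecondPositive("acct_002", unban_requested=True),   # the owner applied for unbanning
]
final_labels = {c.account_id: correct_label(c).label for c in candidates}
```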

In one embodiment, before step S1 of inputting the first positive sample set and the first negative sample set in the sample set to a preset model to perform training to obtain a first initial model, the method includes:

step S01': obtaining a plurality of samples according to a second preset rule to form the sample set;

step S02': obtaining part of first positive samples in the sample set to form the first positive sample set;

step S03': and selecting, from the remaining samples of the sample set, a number of first negative samples corresponding to the number of first positive samples, according to the number of first positive samples in the first positive sample set and the preset ratio of the first positive sample set to the first negative sample set, so as to form the first negative sample set.

Before step S1, a sample set is required; it is a set of samples whose labels are unknown. Specifically, a plurality of samples are acquired according to a second preset rule to form the sample set. For example, the second preset rule may specify the samples from a period of time, such as one month, or the samples acquired in a region. In the above example, a model is built for testing whether a registered account is a malicious registered account, where malicious registered accounts are positive samples and normal registered accounts are negative samples. The registered accounts obtained over a period of time form the sample set: if hundreds of thousands of accounts were registered within one month, those hundreds of thousands of registered accounts form the sample set.

In step S02', the number of samples in the sample set is large and their labels are unknown, so the positive and negative samples used for training must be determined. Training requires many positive and negative samples, and collecting positive samples one by one consumes a large amount of manpower and material resources, so only some of the positive samples are collected first to form the first positive sample set.

In step S03', first negative samples are selected from the remaining samples of the sample set according to the number of first positive samples in the first positive sample set and the preset ratio of the number of first positive samples to the number of first negative samples; the selected samples form the first negative sample set. Specifically, because negative samples far outnumber positive samples in the sample set, most of the samples remaining after the first positive samples are taken out are negative samples. A number of samples matching the collected positive samples is therefore drawn from the remaining samples according to the preset ratio, and these samples are treated as negative samples by default to form the first negative sample set.
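Step S03' can be sketched as ratio-based random sampling. The description does not say how the negative samples are picked from the remaining samples, so uniform random sampling is an assumption here, as are the account identifiers and the set sizes; the 3:7 ratio is the example value the description gives for the preset ratio.

```python
import random

def select_negative_set(remaining_samples, num_positive, ratio=(3, 7), seed=0):
    """Draw default-negative samples so that positives : negatives matches `ratio`."""
    positive_part, negative_part = ratio
    wanted = round(num_positive * negative_part / positive_part)
    wanted = min(wanted, len(remaining_samples))  # cannot draw more than what remains
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(remaining_samples, wanted)

# Hypothetical sample set: 10,000 registered accounts, 300 of them confirmed positive.
sample_set = [f"acct_{i:06d}" for i in range(10_000)]
first_positive_set = sample_set[:300]
remaining_samples = sample_set[300:]

first_negative_set = select_negative_set(remaining_samples, len(first_positive_set))
```

With 300 first positive samples and a 3:7 ratio, 700 samples are drawn and treated as negative by default.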

Further, after the step S5 of defaulting the label of the second positive sample to a positive sample, the method includes:

step S6: adding the second positive sample into the first positive sample set to form a second positive sample set;

step S7: selecting, from the remaining samples of the sample set, a number of second negative samples corresponding to the number of second positive samples, according to the number of second positive samples in the second positive sample set and the preset ratio of the second positive sample set to the second negative sample set, so as to form the second negative sample set;

step S8: and inputting the second positive sample set and the second negative sample set into the preset model for training to obtain a second initial model.

In steps S6-S8, once the label of the second positive sample has been kept as a positive sample by default, that is, once the second positive sample has been determined to be a real positive sample, it is added to the first positive sample set to form a second positive sample set, which enlarges the sample size available for training. Second negative samples are then selected from the remaining samples of the sample set according to the number of positive samples in the second positive sample set and the preset ratio of the number of samples in the second positive sample set to the number in the second negative sample set; this ratio may follow the preset ratio in step S03'. For example, if the preset ratio of first positive samples to first negative samples is 3:7, then once the number of second positive samples is known, the corresponding number of second negative samples can be selected from the remaining samples to form the second negative sample set. Because the second positive sample set and the second negative sample set contain more samples than the first sets, they are input into the preset model for training again to obtain a second initial model.
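The bookkeeping in steps S6 and S7 can be sketched with the 3:7 example ratio. The account identifiers and counts are hypothetical, and skipping samples that are already in the first positive sample set is an assumption added to keep the merged set free of duplicates.

```python
def expand_positive_set(first_positive_set, verified_second_positives):
    """Step S6: add verified second positive samples to the first positive sample set."""
    second_positive_set = list(first_positive_set)
    seen = set(first_positive_set)
    for sample in verified_second_positives:
        if sample not in seen:  # assumption: ignore samples that are already present
            seen.add(sample)
            second_positive_set.append(sample)
    return second_positive_set

def negatives_needed(num_positive, ratio=(3, 7)):
    """Step S7: number of negative samples implied by the preset positive:negative ratio."""
    positive_part, negative_part = ratio
    return round(num_positive * negative_part / positive_part)

first_positive_set = [f"acct_{i}" for i in range(300)]
verified_second_positives = [f"acct_{i}" for i in range(290, 450)]  # 10 overlap with the first set

second_positive_set = expand_positive_set(first_positive_set, verified_second_positives)
second_negative_count = negatives_needed(len(second_positive_set))
```

Here the second positive sample set holds 450 samples, so a 3:7 ratio calls for 1050 second negative samples before retraining in step S8.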

Still further, after step S8 of inputting the second positive sample set and the second negative sample set into a preset model to perform training, the method further includes:

step S9: inputting the test sample into the second initial model to calculate to obtain a second probability value;

step S10: judging whether the difference value between the second probability value and a preset third probability value exceeds a preset threshold value, wherein the third probability value is determined by business personnel based on their experience with the test sample;

step S11: if not, taking the second initial model by default as the result model to be built.

To obtain a stable model whose output is correct, the trained model must be tested before it is used. In step S9, a test sample is input into the second initial model for calculation to obtain a second probability value. The test sample is a known sample used to test whether the second initial model is stable and accurate, and the preset third probability value is determined by business personnel familiar with the test sample based on their experience. In step S10, the second probability value is compared with the preset third probability value to obtain their difference, and it is judged whether the difference exceeds a preset threshold; for example, if the preset threshold is 0.01, it is judged whether the difference exceeds 0.01. In step S11, if the difference does not exceed the preset threshold, the second probability value is regarded as consistent with the third probability value, which indicates that the output of the second initial model matches the experience-based judgment and that the model is stable and accurate; the second initial model is therefore taken by default as the result model to be built.

If the difference exceeds the preset threshold, the second probability value output by the second initial model differs considerably from the preset third probability value; that is, the result is unstable and the data is inaccurate, and more positive and negative samples are needed to train the model. Steps S3-S5 can therefore be repeated, and the real positive samples obtained are added to the second positive sample set to form a third positive sample set. A third negative sample set is then obtained from the sample set according to steps S6-S8, further enlarging the sample scale for training, and the third positive sample set and the third negative sample set are used to train a third initial model. The third initial model is tested according to steps S9-S11 to judge whether it is stable and accurate. If it is unstable, the above steps are repeated until the probability value calculated by the model is consistent with the preset probability value, at which point the model is judged to be stable. In this way a model with accurate results can be obtained without consuming a large amount of manpower to collect positive samples.
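The acceptance check of steps S9-S11 and the surrounding retraining loop can be sketched as below. Several details are assumptions: the difference is taken as an absolute value, a `max_rounds` cap is added so the loop always terminates, and `train_round` and `score_test_sample` are hypothetical callables standing in for retraining on the enlarged sets and scoring the known test sample.

```python
def model_is_acceptable(second_probability, third_probability, threshold=0.01):
    """Step S10: accept when the difference does not exceed the preset threshold."""
    # Assumption: the difference is measured as an absolute value.
    return abs(second_probability - third_probability) <= threshold

def build_result_model(train_round, score_test_sample, third_probability,
                       threshold=0.01, max_rounds=10):
    """Retrain on progressively larger sample sets until the model passes the test."""
    for round_no in range(1, max_rounds + 1):
        model = train_round(round_no)           # retrain on the enlarged sets
        probability = score_test_sample(model)  # step S9: score the known test sample
        if model_is_acceptable(probability, third_probability, threshold):
            return model, round_no              # step S11: take it as the result model
    return None, max_rounds                     # no stable model within the cap

# Stubs for illustration only: the "model" is just its round number, and the
# scores imitate a model that approaches the expert estimate as samples accumulate.
stub_scores = {1: 0.70, 2: 0.88, 3: 0.945}
result_model, rounds_used = build_result_model(
    train_round=lambda round_no: round_no,
    score_test_sample=lambda model: stub_scores[model],
    third_probability=0.95,
)
```

With the stub scores, the first two rounds differ from the expert estimate of 0.95 by more than 0.01 and are rejected; the third round differs by about 0.005 and is accepted.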

In one embodiment, before the step S4 of performing the authenticity test on the second positive sample according to the first preset rule, the method includes:

step S40': grading the second positive sample according to a range exceeding the preset probability value;

step S41': and respectively calling the authenticity test corresponding to the grade according to the grade of the second positive sample.

To further reduce the cost of obtaining positive samples, in step S40', before the authenticity test is performed according to the first preset rule, the second positive samples may be graded according to how far their probability exceeds the preset probability value. Since a remaining sample is judged to be a second positive sample because its probability exceeds the preset probability value, the grades can be defined as probability ranges. For example, with samples whose probability is greater than or equal to 0.5 judged to be positive samples, the positive samples identified by the model can be divided into three grades: a probability above 0.85 is the first grade, 0.65-0.85 is the second grade, and 0.5-0.65 is the third grade. The authenticity tests are correspondingly divided into three grades whose severity decreases from the first grade to the third. When the authenticity test is performed, the test matching the grade of the positive sample is called: a first-grade positive sample, which has the highest probability, receives the strictest and most costly test; a second-grade positive sample, with a probability of 0.65-0.85, receives a test of medium severity; and a third-grade positive sample, with a probability of 0.5-0.65, receives the lightest and least costly test. This reduces the cost of authenticity testing for the positive samples identified by the model.

For example, positive samples of different grades receive different authenticity tests. In the above example, malicious registered accounts of different grades can be banned to different degrees. A first-level ban can be lifted only through a complaint by the account's user. A second-level ban can be lifted only through an uplink short message from the account, that is, the user sends an unban short message on their own initiative. A third-level ban can be lifted only through a downlink short message, which is sent to the user by the company operating the service: the user provides a mobile phone number that can receive short messages, the company sends a verification code to that number, and the user enters the code to unlock the account; alternatively, the ordinary verification code may be replaced by a verification code made up of digits, characters or the like, which the user enters to unlock. Thus a first-grade positive sample is tested with the first-level ban, which has the highest cost and requires a user complaint to lift; a second-grade sample is tested with the second-level ban, which requires an uplink short message from the user; and a third-grade sample is tested with the third-level ban, which requires a downlink short message. Because the cost decreases from the first-level ban to the third-level ban, graded testing lowers the overall testing cost and further reduces the cost of obtaining positive samples.
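The grading in steps S40' and S41' can be sketched as a lookup from probability range to ban level. The ranges and unban channels follow the example above; assigning the shared boundary values 0.65 and 0.85 to the second grade is an assumption, because the description does not say which grade they belong to, and the function and table names are hypothetical.

```python
# Unban channel for each ban level, as described in the example.
UNBAN_CHANNEL = {
    1: "user complaint",                   # strictest, highest cost
    2: "uplink SMS sent by the user",      # medium
    3: "downlink SMS verification code",   # lightest, lowest cost
}

def grade_second_positive(probability):
    """Step S40': map a first probability value to a grade (1 = highest)."""
    if probability > 0.85:
        return 1
    if probability >= 0.65:  # assumption: 0.65 and 0.85 themselves fall in grade 2
        return 2
    if probability >= 0.5:
        return 3
    return None  # below the preset probability value: not a second positive sample

def select_authenticity_test(probability):
    """Step S41': pick the ban level (and its unban channel) matching the grade."""
    grade = grade_second_positive(probability)
    if grade is None:
        return None
    return grade, UNBAN_CHANNEL[grade]

assigned = {p: select_authenticity_test(p) for p in (0.93, 0.70, 0.55, 0.30)}
```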

Referring to fig. 2, the positive sample acquiring apparatus in the present embodiment includes:

the training module 100 is configured to input a first positive sample set and a first negative sample set in a sample set into a preset model for training, so as to obtain a first initial model;

the calculation module 200 is configured to input the samples remaining in the sample set into the first initial model to perform calculation to obtain a first probability value that each of the samples remaining in the sample set is a positive sample;

a judging module 300, configured to judge whether the sample is a second positive sample according to the first probability value;

the testing module 400 is configured to perform a plausibility test on the second positive sample according to a first preset rule when the sample is determined to be the second positive sample;

and the correction module 500 is configured to default the label of the second positive sample to be a positive sample when the second positive sample is verified to be a true positive sample, and otherwise, modify the label of the second positive sample to be a negative sample.

The sample set may be selected from all samples within a time period or a range, such as a month or a region, and includes positive samples and negative samples for training the model, wherein the number of positive samples in the sample set is far less than the number of negative samples. The first positive sample set is a set formed by part of the positive samples in the sample set, and the first negative sample set is taken from the samples in the sample set other than those partial positive samples. Even if the remaining samples of the sample set include positive samples, because the negative samples far outnumber the positive samples, the proportion of positive samples contained in the selected first negative sample set is small and has very little influence on model training. After the first positive sample set and the first negative sample set are determined, the training module 100 inputs them into a preset model for training to obtain a first initial model, where the preset model is a model based on a supervised algorithm.

After the first initial model is obtained, the remaining samples in the sample set may be input into it for calculation. Specifically, the first initial model is used for preliminarily identifying the label of a sample: its input data is a sample and its output data is the probability value that the sample is a positive sample. In order to enlarge the scale of the positive samples, the positive samples among the remaining samples in the sample set can be identified through the first initial model. For example, one of the remaining samples in the sample set is input into the first initial model, and the calculation module 200 calculates a first probability value that this sample is a positive sample; in this way, each remaining sample in the sample set is input into the first initial model in turn to obtain its first probability value of being a positive sample. The calculation process of the calculation module 200 differs according to the algorithm adopted. For example, if a model for testing whether a registered account is a malicious registered account needs to be established, a logistic regression algorithm can be adopted, in which the corresponding parameters are calculated by maximum likelihood estimation and the probability value is then obtained from them.
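As a rough illustration of the logistic regression case, the sketch below estimates the parameters by gradient ascent on the log-likelihood (one common way of carrying out maximum likelihood estimation) and then outputs the probability of being a positive sample. The function names, the single toy feature, the learning rate and the number of epochs are all assumptions made for the example and are not taken from the embodiment.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def positive_probability(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability that sample x is positive; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(features: Sequence[Sequence[float]], labels: Sequence[int],
                 learning_rate: float = 0.1, epochs: int = 2000) -> List[float]:
    """Estimate the weights by gradient ascent on the average log-likelihood."""
    weights = [0.0] * (len(features[0]) + 1)
    n = len(features)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(features, labels):
            error = y - positive_probability(weights, x)
            gradient[0] += error
            for j, v in enumerate(x):
                gradient[j + 1] += error * v
        weights = [w + learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Toy training data with one hypothetical feature per account.
train_x = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
train_y = [0, 0, 0, 1, 1, 1]  # 1 = positive sample, 0 = negative sample
first_initial_model = fit_logistic(train_x, train_y)
```

Calling `positive_probability(first_initial_model, sample)` for each remaining sample then plays the role of the first probability value.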

Based on the first probability value output by the first initial model, the judging module 300 may judge whether a remaining sample input into the first initial model is a positive sample; such a positive sample is designated a second positive sample to distinguish it from the first positive samples in the first positive sample set.

Referring to fig. 3, the judging module 300 includes:

a judging sub-module 310, configured to judge whether the first probability value exceeds a preset probability value;

a determining submodule 320, configured to determine that the sample is the second positive sample when the first probability value exceeds the preset probability value, and to determine that the sample is a negative sample when the first probability value does not exceed the preset probability value.

The higher the probability value output by the first initial model, the more likely it is that the sample input into the first initial model is a positive sample; if the probability value exceeds a certain value, the input sample can be judged to be a positive sample. A probability value serving as the judgment standard is therefore preset before the judgment, and generally the preset probability value is 0.5. When the judging sub-module 310 judges that the first probability value output by the first initial model is greater than 0.5, the determining submodule 320 determines that the sample input into the first initial model is the second positive sample; if the first probability value output by the first initial model is less than 0.5, the determining submodule 320 determines that the sample input into the first initial model is a negative sample.
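The threshold judgment can be sketched as follows. The names are hypothetical, and treating a probability exactly equal to the preset value as not exceeding it is an assumption, since the passage above only covers values greater than or less than 0.5.

```python
from typing import Dict, List, Tuple

PRESET_PROBABILITY = 0.5  # the usual preset value named in the text


def is_second_positive(first_probability: float,
                       preset: float = PRESET_PROBABILITY) -> bool:
    """True when the first probability value exceeds the preset value."""
    # Assumption: a value exactly equal to the preset does not exceed it.
    return first_probability > preset


def split_by_threshold(
        probabilities: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split sample ids into (second positive samples, negative samples)."""
    second_positive = [sid for sid, p in probabilities.items()
                       if is_second_positive(p)]
    negative = [sid for sid, p in probabilities.items()
                if not is_second_positive(p)]
    return second_positive, negative
```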

If the judging module 300 judges that the sample input into the first initial model is a second positive sample, the testing module 400 performs the authenticity test on the second positive sample according to a first preset rule. Because the first initial model is trained on only a small number of samples in the sample set, it is unstable, its calculation result is not accurate enough, and there is a risk of misjudgment. Therefore, even if the sample input into the first initial model is judged to be a second positive sample, the second positive sample still needs to undergo an authenticity test to verify whether it is a real positive sample. The first preset rule is set according to the nature of the sample; different samples call for different authenticity tests, that is, different first preset rules.

In a specific embodiment, referring to fig. 4, the second positive sample is a malicious registered account, and the test module 400 includes:

a sealing sub-module 410, configured to seal the second positive sample;

a response sub-module 420, configured to determine that the second positive sample is not a real positive sample when a response of the user to unseal the second positive sample is obtained after the account is sealed, and to determine that the second positive sample is a real positive sample when no user response is obtained after the account is sealed.

In this embodiment, an initial model for testing whether a registered account is a malicious registered account is established, where the positive sample is a malicious registered account, the negative sample is a normal registered account, and the first preset rule is to perform a sealing test on the registered account. When the first initial model identifies a registered account as a malicious registered account, the sealing sub-module 410 seals that account. If, after the sealing, the response sub-module 420 obtains a response of the user to unseal the second positive sample, for example receives the user's application to unseal the account, the second positive sample is determined not to be a real positive sample. If the response sub-module 420 obtains no user response after the sealing, the second positive sample is determined to be a real positive sample, because unsealing incurs a cost and the user of a malicious registered account will generally not respond at all.

In another embodiment, when a model for judging whether an account is a credential-stuffing (library collision) account needs to be established, the positive sample is the credential-stuffing account, and the authenticity test can likewise be performed on the positive sample by sealing the account.

Through the test, the obtained test result determines whether the sample is a real positive sample, that is, whether the second positive sample is verified to be a real positive sample. If so, the correction module 500 defaults the label of the second positive sample to a positive sample, and the sample label need not be modified; if not, the sample label needs to be corrected, that is, the correction module 500 changes the label of the second positive sample from a positive sample to a negative sample.
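The sealing test and the subsequent label correction can be sketched together. The `Sample` class, the `apply_sealing_test` function and the representation of user responses as a set of account ids are hypothetical choices made for illustration; how responses are actually collected is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Sample:
    account_id: str
    label: str = "positive"  # a second positive sample starts as positive


def apply_sealing_test(candidates: Iterable[Sample],
                       unseal_requests: Set[str]) -> List[Sample]:
    """Return the samples verified as real positives after sealing.

    unseal_requests holds the ids of sealed accounts whose users responded
    by asking for the account to be unsealed.
    """
    verified = []
    for sample in candidates:
        if sample.account_id in unseal_requests:
            # A user responded: not a real positive sample, correct the label.
            sample.label = "negative"
        else:
            # No response: the positive label is kept by default.
            verified.append(sample)
    return verified
```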

In one embodiment, referring to fig. 5, the positive sample acquiring device further includes:

a first forming module 001, configured to obtain a plurality of samples according to a second preset rule to form the sample set;

a first obtaining module 002, configured to obtain a portion of the first positive samples in the sample set, to form the first positive sample set;

the first selecting module 003 is configured to select, from the remaining samples in the sample set, a first negative sample corresponding to the first positive sample number according to the first positive sample number in the first positive sample set and a preset ratio of the first positive sample number to the first negative sample number, so as to form the first negative sample set.

The first forming module 001 obtains a plurality of samples according to a second preset rule to form the sample set. For example, the second preset rule may specify that the samples are those obtained within a period of time, such as one month, or those within a region; these samples form the sample set. In the above example, a model is established for testing whether a registered account is a malicious registered account, wherein the malicious registered account is a positive sample and the normal registered account is a negative sample; the registered accounts obtained over a period of time form the sample set, so that if there are hundreds of thousands of registered accounts within one month, those hundreds of thousands of registered accounts form the sample set.

Since the number of samples in the sample set is relatively large and the sample labels are unknown, the positive samples and negative samples for training the model need to be determined. Training the model requires a large number of positive and negative samples, and collecting positive samples one by one consumes a large amount of manpower and material resources; the first obtaining module 002 may therefore first obtain some of the positive samples to form the first positive sample set.

The first selecting module 003 selects, from the remaining samples in the sample set, first negative samples corresponding in number to the first positive samples, according to the number of first positive samples in the first positive sample set and a preset ratio of the number of first positive samples to the number of first negative samples; the first negative samples so selected form the first negative sample set. Specifically, since the negative samples in the sample set far outnumber the positive samples, most of the samples remaining after the first positive samples are acquired are negative samples. Corresponding to the collected positive samples, a portion of the remaining samples is taken out according to the preset ratio, and these samples are treated as negative samples by default, forming the first negative sample set.
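A minimal sketch of this selection step is given below. The function name is hypothetical, random sampling is an assumption (the text does not say how the portion is taken out), and the default 3:7 ratio of positive to negative samples is used only as an illustrative preset.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_negatives(remaining: Sequence[T], positive_count: int,
                     positive_parts: int = 3, negative_parts: int = 7,
                     seed: int = 0) -> List[T]:
    """Pick default-negative samples so that positives:negatives matches the ratio."""
    negative_count = positive_count * negative_parts // positive_parts
    if negative_count > len(remaining):
        raise ValueError("not enough remaining samples for the preset ratio")
    # Assumption: the portion is drawn at random, without replacement.
    return random.Random(seed).sample(list(remaining), negative_count)
```

With 30 first positive samples and a 3:7 ratio, for example, 70 of the remaining samples are taken out and treated as negative by default.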

Further, referring to fig. 6, the positive sample acquiring apparatus further includes:

a second forming module 600, configured to add the second positive sample to the first positive sample set to form a second positive sample set;

a second selecting module 700, configured to select, from the remaining samples in the sample set, a second negative sample corresponding to the second positive sample number according to the second positive sample number in the second positive sample set and a preset ratio of the second positive sample number to the second negative sample number, so as to form the second negative sample set;

the second training module 800 is configured to input the second positive sample set and the second negative sample set into the preset model for training, so as to obtain a second initial model.

After the label of the second positive sample is defaulted to a positive sample, that is, after the second positive sample is determined to be a real positive sample, the second forming module 600 adds the second positive sample to the first positive sample set to form a second positive sample set, in order to enlarge the sample size for training the model. The second selecting module 700 then determines the number of second negative samples according to the number of positive samples in the second positive sample set and the preset ratio of the number of second positive samples to the number of second negative samples, where the preset ratio may follow that used by the first selecting module 003, and selects that number of second negative samples from the remaining samples of the sample set to form the second negative sample set. For example, if the number of second positive samples is known and the preset ratio of the number of second positive samples to the number of second negative samples is 3:7, the number of second negative samples to be selected follows from the ratio. The second training module 800 then inputs the second positive sample set and the second negative sample set into the preset model for training to obtain a second initial model.

Still further, referring to fig. 6, the positive sample acquiring apparatus further includes:

the second calculation module 900 is configured to input a test sample into the second initial model to calculate a second probability value;

a second judging module 1000, configured to judge whether a difference between the second probability value and a preset third probability value exceeds a preset threshold, where the third probability value is obtained by judging according to experience of a service person of the test sample;

and a default module 1100, configured to default the second initial model to be a result model to be built when it is determined that the difference between the second probability value and the preset third probability value does not exceed a preset threshold.

In order to obtain a stable model with a correct output result, before the model is used, the second calculation module 900 inputs a test sample into the second initial model to calculate a second probability value. The test sample is a known sample used for testing whether the second initial model is stable and accurate, and the preset third probability value is obtained from the experience-based judgment of business personnel familiar with the test sample. After the second probability value is obtained, the second judging module 1000 compares it with the preset third probability value to obtain the difference between the two, and then judges whether the difference exceeds a preset threshold; for example, if the preset threshold is set to 0.01, it judges whether the difference exceeds 0.01. If the difference does not exceed the preset threshold, the second probability value and the third probability value are regarded as consistent. Because the third probability value comes from the long-accumulated experience of business personnel familiar with the test sample and therefore has a certain accuracy, this consistency indicates that the output of the second initial model is stable and accurate, and the default module 1100 defaults the second initial model to be the result model to be built. If the difference exceeds the preset threshold, the second probability value output by the second initial model deviates too far from the preset third probability value, that is, the result is unstable and the data inaccurate, and more positive and negative samples are needed to train the model. In that case, the real positive samples obtained are again added into the second positive sample set to form a third positive sample set, a third negative sample set is then obtained from the sample set, and, with the sample scale for training thus further enlarged, the model is trained with the third positive sample set and the corresponding negative sample set to obtain a third initial model. The third initial model is then tested to judge whether it is stable and accurate; if it is not, the cycle is repeated according to the above steps until the probability value calculated by the model is judged to be consistent with the preset probability value, that is, until the model is stable. In this way a result model with stable output and accurate data can be obtained without consuming a large amount of manpower and material resources to collect samples.
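The stability check and the repeated retraining can be sketched as a loop. All names are hypothetical: `train_round` stands for one round of enlarging the sample sets and retraining, and returns a scoring function. Comparing the absolute difference and capping the number of rounds with `max_rounds` are assumptions added so that the sketch is well defined and terminates.

```python
from typing import Callable, Sequence, Tuple

Scorer = Callable[[Sequence[float]], float]


def is_stable(second_probability: float, third_probability: float,
              threshold: float = 0.01) -> bool:
    """True when the model output is consistent with the expert-set value."""
    # Assumption: the "difference" is taken as an absolute difference.
    return abs(second_probability - third_probability) <= threshold


def train_until_stable(train_round: Callable[[int], Scorer],
                       test_sample: Sequence[float],
                       third_probability: float,
                       threshold: float = 0.01,
                       max_rounds: int = 10) -> Tuple[int, Scorer]:
    """Retrain with enlarged sample sets until the test sample is scored stably."""
    for round_index in range(1, max_rounds + 1):
        scorer = train_round(round_index)  # enlarge sets, retrain, return model
        if is_stable(scorer(test_sample), third_probability, threshold):
            return round_index, scorer     # default this model as the result model
    raise RuntimeError("model did not become stable within max_rounds")
```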

In one embodiment, referring to fig. 7, the positive sample acquiring device further includes:

a classification module 410, configured to classify the second positive sample according to a range exceeding the preset probability value;

and a calling module 420, configured to call the authenticity test corresponding to the level according to the level of the second positive sample.

In order to further reduce the cost of obtaining positive samples, before the second positive sample is subjected to the authenticity test according to the first preset rule, the classification module 410 grades the second positive samples according to the range by which they exceed the preset probability value. Since a remaining sample is judged to be a second positive sample because its probability exceeds the preset probability value, the grades can be set by probability range. For example, if a sample is judged to be a positive sample when its probability is greater than or equal to 0.5, the positive samples judged by the model can be divided into three grades: a probability above 0.85 is grade one, a probability of 0.65-0.85 is grade two, and a probability of 0.5-0.65 is grade three. The authenticity test is correspondingly divided into three grades, with the severity of the authenticity test decreasing from grade one to grade three according to the probability. When the authenticity test is performed, the calling module 420 calls the authenticity test corresponding to the grade of the positive sample: a grade-one positive sample, which has the highest probability, receives the most severe test; a grade-two positive sample, with a probability of 0.65-0.85, receives a test of medium severity; and a grade-three positive sample, with a probability of 0.5-0.65, receives the lightest test. This saves the cost of the authenticity test for the positive samples identified by the model.

Referring to fig. 8, a computer device is further provided in an embodiment of the present invention, where the computer device may be a server, and the internal structure of the computer device may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the positive sample acquisition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a positive sample acquisition method.

The processor executes the steps of the positive sample acquisition method: inputting a first positive sample set and a first negative sample set in the sample set into a preset model for training to obtain a first initial model; inputting the remaining samples in the sample set into the first initial model for calculation to obtain a first probability value that each remaining sample in the sample set is a positive sample; judging whether the sample is a second positive sample according to the first probability value; if yes, performing an authenticity test on the second positive sample according to a first preset rule; and if the second positive sample is verified to be a real positive sample, defaulting the label of the second positive sample to be a positive sample, otherwise, modifying the label of the second positive sample to be a negative sample.

The step, performed by the computer device, of judging whether the sample is a second positive sample according to the first probability value comprises the following steps: judging whether the first probability value exceeds a preset probability value; if yes, judging the sample to be the second positive sample, and if not, judging the sample to be a negative sample.

In one embodiment, before the step of inputting the first positive sample set and the first negative sample set in the sample set to a preset model to perform training to obtain a first initial model, the method includes: obtaining a plurality of samples according to a second preset rule to form the sample set; obtaining part of first positive samples in the sample set to form the first positive sample set; and selecting first negative samples corresponding to the first positive sample number from the remaining samples of the sample set according to the number of the first positive samples in the first positive sample set and the preset ratio of the first positive sample set to the first negative sample set, so as to form the first negative sample set.

In one embodiment, the second positive sample is a malicious registered account, and the step of performing the authenticity test on the second positive sample according to the first preset rule includes: sealing the second positive sample; if the response of the user for deblocking the second positive sample is obtained after the number is sealed, judging that the second positive sample is not a real positive sample; and if the user response is not obtained after the number is sealed, judging that the second positive sample is a real positive sample.

In one embodiment, after the step of defaulting the label of the second positive sample to a positive sample, the method includes: adding the second positive sample into the first positive sample set to form a second positive sample set; selecting a second negative sample corresponding to the second positive sample number from the remaining samples of the sample set according to the number of the second positive samples in the second positive sample set and the preset ratio of the second positive sample set to the second negative sample set, so as to form the second negative sample set; and inputting the second positive sample set and the second negative sample set into the preset model for training to obtain a second initial model.

In one embodiment, after the step of inputting the second positive sample set and the second negative sample set into the preset model to perform training to obtain a second initial model, the method includes: inputting the test sample into the second initial model to calculate to obtain a second probability value; judging whether the difference value between the second probability value and a preset third probability value exceeds a preset threshold value, wherein the third probability value is obtained by judging according to experience of service personnel of the test sample; if not, defaulting the second initial model to a result model to be built.

In one embodiment, before the step of performing the authenticity test on the second positive sample according to the first preset rule, the method includes: grading the second positive sample according to a range exceeding the preset probability value; and respectively calling the authenticity test corresponding to the grade according to the grade of the second positive sample.

It will be appreciated by those skilled in the art that the architecture shown in fig. 8 is merely a block diagram of a portion of the architecture in connection with the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.

An embodiment of the present application further provides a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements a method for acquiring a positive sample, specifically: inputting a first positive sample set and a first negative sample set in the sample set into a preset model for training to obtain a first initial model; inputting the remaining samples in the sample set into the first initial model for calculation to obtain a first probability value that each remaining sample in the sample set is a positive sample; judging whether the sample is a second positive sample according to the first probability value; if yes, performing an authenticity test on the second positive sample according to a first preset rule; and if the second positive sample is verified to be a real positive sample, defaulting the label of the second positive sample to be a positive sample, otherwise, modifying the label of the second positive sample to be a negative sample.

The step of determining whether the sample is a second positive sample according to the first probability value includes: judging whether the first probability value exceeds a preset probability value; if yes, judging the sample to be the second positive sample, and if not, judging the sample to be a negative sample.

In one embodiment, before the step of inputting the first positive sample set and the first negative sample set of the sample sets into the preset model to perform training, the method includes: obtaining a plurality of samples according to a second preset rule to form the sample set; obtaining part of first positive samples in the sample set to form the first positive sample set; and selecting first negative samples corresponding to the first positive sample number from the remaining samples of the sample set according to the number of the first positive samples in the first positive sample set and the preset ratio of the first positive sample set to the first negative sample set, so as to form the first negative sample set.

Those skilled in the art will appreciate that all or part of the above described methods may be implemented by a computer program instructing the relevant hardware, the computer program being stored on a non-transitory computer readable storage medium and, when executed, comprising the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in embodiments may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the invention.

Claims

1. A method for obtaining a positive sample, comprising:

inputting a first positive sample set and a first negative sample set in a sample set into a preset model for training to obtain a first initial model, wherein the first positive sample set is a set formed by partial positive samples in the sample set, the first negative sample set is taken from the samples in the sample set other than the partial positive samples, and the number of the positive samples is less than that of the negative samples;

if the second positive sample is verified to be a real positive sample, defaulting the label of the second positive sample to be a positive sample, otherwise modifying the label of the second positive sample to be a negative sample;

after the step of defaulting the label of the second positive sample to a positive sample, the method comprises:

inputting the second positive sample set and the second negative sample set into the preset model for training to obtain a second initial model;

The step of inputting the first positive sample set and the first negative sample set in the sample set into a preset model for training to obtain a first initial model comprises the following steps:

selecting a first negative sample corresponding to the first positive sample number from the remaining samples of the sample set according to the number of the first positive samples in the first positive sample set and the preset ratio of the first positive sample set to the number of the samples of the first negative sample set, so as to form the first negative sample set;

the step of inputting the second positive sample set and the second negative sample set into the preset model to train to obtain a second initial model includes:

if not, defaulting the second initial model to a result model to be built.

2. The method according to claim 1, wherein the step of determining whether the sample is a second positive sample according to the first probability value comprises:

3. The method for obtaining positive samples according to claim 1, wherein the second positive sample is a malicious registered account, and the step of performing the authenticity test on the second positive sample according to the first preset rule includes:

sealing the second positive sample;

4. The method for obtaining positive samples according to claim 2, wherein before the step of performing the authenticity test on the second positive samples according to the first preset rule, comprises:

5. A positive sample acquisition device, comprising:

the training module is used for inputting a first positive sample set and a first negative sample set in a sample set into a preset model for training to obtain a first initial model, wherein the first positive sample set is a set formed by a part of the positive samples in the sample set, the first negative sample set is taken from the samples remaining in the sample set after that part of the positive samples is excluded, and the number of positive samples is smaller than the number of negative samples;

the correction module is used for defaulting the label of the second positive sample to be a positive sample when the second positive sample is verified to be a real positive sample, otherwise, modifying the label of the second positive sample to be a negative sample;

the second forming module is used for adding the second positive sample to the first positive sample set to form a second positive sample set;

the second selecting module is used for selecting, from the remaining samples of the sample set, second negative samples in a number corresponding to the number of second positive samples, according to the number of second positive samples in the second positive sample set and a preset ratio of the number of second positive samples to the number of second negative samples, so as to form the second negative sample set;

the second training module is used for inputting the second positive sample set and the second negative sample set into the preset model for training to obtain a second initial model;

the first forming module is used for obtaining a plurality of samples according to a second preset rule to form the sample set;

the first acquisition module is used for acquiring part of first positive samples in the sample set to form the first positive sample set;

the first selecting module is used for selecting, from the remaining samples of the sample set, first negative samples in a number corresponding to the number of first positive samples, according to the number of first positive samples in the first positive sample set and a preset ratio of the number of first positive samples to the number of first negative samples, so as to form the first negative sample set;

the second calculation module is used for inputting the test sample into the second initial model to calculate to obtain a second probability value;

the second judging module is used for judging whether the difference value between the second probability value and a preset third probability value exceeds a preset threshold value, wherein the third probability value is obtained from a judgment of the test sample made by business personnel on the basis of experience;

and the default module is used for defaulting the second initial model to a result model to be established when the difference value between the second probability value and the preset third probability value does not exceed a preset threshold value.

6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.

7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
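
For illustration only, and forming no part of the claims, the following is a minimal sketch of two steps recited above: selecting negative samples according to a preset ratio, and deciding whether a trained model is accepted as the result model. The function names `select_negatives` and `accept_model`, the parameter `negatives_per_positive`, the use of uniform random sampling, and the use of the absolute difference are all assumptions made for this sketch; the claims do not specify how the negative samples are drawn or how the difference value is computed.

```python
import random


def select_negatives(remaining, num_positives, negatives_per_positive, seed=0):
    """Draw negatives from the remaining samples according to a preset ratio.

    The ratio is expressed here as negatives per positive (e.g. 4 means 1:4).
    Uniform random sampling is an assumption of this sketch.
    """
    if negatives_per_positive < 1:
        raise ValueError("expected at least one negative per positive")
    num_negatives = num_positives * negatives_per_positive
    if num_negatives > len(remaining):
        raise ValueError("not enough remaining samples for the preset ratio")
    rng = random.Random(seed)
    # random.sample draws without replacement, so no sample is picked twice.
    return rng.sample(remaining, num_negatives)


def accept_model(second_probability, third_probability, threshold):
    """Return True when the model output is close enough to the expert value.

    second_probability: value computed by the second initial model.
    third_probability: preset value judged by business personnel.
    The absolute difference is an assumption of this sketch.
    """
    return abs(second_probability - third_probability) <= threshold


if __name__ == "__main__":
    remaining_samples = list(range(100))  # hypothetical sample identifiers
    negatives = select_negatives(remaining_samples, num_positives=3,
                                 negatives_per_positive=4)
    print(len(negatives))
    print(accept_model(0.82, 0.80, threshold=0.05))
```

With three positive samples and a hypothetical ratio of 1:4, the sketch draws twelve negative samples, which keeps the number of positive samples below the number of negative samples as the claims require.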