CN115545214A - User screening method, device, computer equipment, storage medium and program product - Google Patents

Info

Publication number
CN115545214A
Authority
CN
China
Prior art keywords
sample data
data set
training sample
model
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211257122.XA
Other languages
Chinese (zh)
Inventor
徐林嘉
陈李龙
袁如怡
李睿琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211257122.XA
Publication of CN115545214A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a user screening method, a user screening device, computer equipment, a storage medium and a program product in the technical field of big data. The method comprises: obtaining historical behavior information of a plurality of candidate users; inputting the historical behavior information of the plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users, where the screening results represent the probability that each candidate user is a target user; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on a training sample data set and a prediction sample data set; the training sample data set is a sample data set with data set offset, and the prediction sample data set is a sample data set without data set offset; and determining a target user from the plurality of candidate users based on the screening results. The method and the device improve the accuracy of the screened target users.

Description

User screening method, device, computer equipment, storage medium and program product
Technical Field
The present application relates to the field of big data technologies, and in particular, to a user screening method, apparatus, computer device, storage medium, and program product.
Background
With the development of artificial intelligence, machine learning models are applied more and more widely. Machine learning models play an important role in fields such as computer vision, natural language processing, voice recognition, intelligent risk control, precision marketing and smart cities.
In general, a machine learning model is trained on a training sample data set to obtain the model that is actually used. However, the data distribution in the training sample data set often differs from the data distribution in the prediction sample data set, i.e. there is a data set offset. Because of this data set offset, a model obtained by training on the training sample data set performs poorly. Consequently, when such a model is used to screen target users out of a large number of users, the screened target users are inaccurate.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a user screening method, apparatus, computer device, storage medium, and program product capable of improving the accuracy of the screened target users.
In a first aspect, the present application provides a user screening method. The method comprises the following steps:
acquiring historical behavior information of a plurality of candidate users;
inputting the historical behavior information of the candidate users into a target preset model for user screening to generate screening results of the candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated after data screening is performed on a training sample data set and a prediction sample data set; the training sample data set contains historical behavior information of a first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains historical behavior information of a second sample user, and the prediction sample data set is a sample data set without data set offset;
and determining a target user from the candidate users based on the screening results of the candidate users.
In one embodiment, the method further comprises:
acquiring the training sample data set;
acquiring the prediction sample data set;
performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set;
inputting the target training sample data set into a preset model for training, adjusting preset model parameters of the preset model, and generating adjusted model parameters;
and generating a target preset model based on the adjusted model parameters.
In one embodiment, performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set includes:
calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set;
and performing data screening on the training sample data set according to the similarity to generate a target training sample data set.
In one embodiment, said calculating a similarity between training sample data in said training sample data set and prediction sample data in said prediction sample data set comprises:
inputting the training sample data set into a preset data calibration model for similarity calculation, and generating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained on the training sample data set and the prediction sample data set.
In one embodiment, the method further comprises:
inputting the training sample data set and the prediction sample data set into an initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between training sample data and prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
calculating the value of a loss function according to the prediction result of the training sample data and the labeling result of the training sample data aiming at each training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set;
and adjusting parameters of the initial data calibration model according to the value of the loss function to obtain the preset data calibration model.
In one embodiment, the acquiring the training sample data set includes:
acquiring an initial training sample data set;
determining whether a data set offset exists in an initial training sample data set corresponding to the preset model;
and if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as the training sample data set.
In one embodiment, the determining whether there is a data set offset in the initial training sample data set corresponding to the preset model includes:
obtaining a model stability index of the preset model;
and determining whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability index.
In one embodiment, the model stability index includes at least one of a population stability index, a discrimination evaluation index, a classification evaluation index and a relevant business index; determining whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability index includes:
judging whether the model stability index meets a preset condition corresponding to the model stability index;
if not, determining that the initial training sample data set corresponding to the preset model has data set offset.
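As an illustration of the check above, the sketch below computes a population stability index (PSI) with the commonly used formula, the sum over bins of (actual − expected) × ln(actual / expected), and compares it with a preset condition. The application does not specify the formula, the binning or the threshold; the function names, the bin edges, the smoothing constant `eps` and the 0.25 threshold (a common rule of thumb) are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def bin_fractions(scores: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Return the fraction of scores in each bin defined by ascending edges."""
    counts = [0] * (len(edges) - 1)
    for s in scores:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if edges[i] <= s < edges[i + 1] or (is_last and s == edges[-1]):
                counts[i] += 1
                break
    return [c / len(scores) for c in counts]


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,  # assumed smoothing constant to avoid log(0)
) -> float:
    """PSI between a reference (expected) and a current (actual) score sample."""
    psi = 0.0
    for e, a in zip(bin_fractions(expected, edges), bin_fractions(actual, edges)):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi


def has_data_set_offset(psi: float, threshold: float = 0.25) -> bool:
    """Assumed preset condition: PSI must not exceed the threshold."""
    return psi > threshold


edges = [0.0, 0.5, 1.0]
psi_same = population_stability_index([0.1, 0.2, 0.6, 0.7], [0.1, 0.2, 0.6, 0.7], edges)
psi_shifted = population_stability_index([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9], edges)
```

Here `psi_same` is zero because the two samples fall into the bins identically, while `psi_shifted` is large because every score moves from the first bin to the second.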
In a second aspect, the application further provides a user screening device. The device comprises:
the historical behavior information acquisition module is used for acquiring historical behavior information of a plurality of candidate users;
the user screening module is used for inputting the historical behavior information of the candidate users into a target preset model for user screening to generate screening results of the candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated after data screening is performed on a training sample data set and a prediction sample data set; the training sample data set contains historical behavior information of a first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains historical behavior information of a second sample user, and the prediction sample data set is a sample data set without data set offset;
and the target user determining module is used for determining a target user from the candidate users based on the screening results of the candidate users.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor; when the processor executes the computer program, it implements the steps of the method in any of the embodiments of the first aspect described above.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, carries out the steps of the method in any of the embodiments of the first aspect described above.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program that, when executed by a processor, performs the steps of the method in any of the embodiments of the first aspect described above.
According to the user screening method, device, computer equipment, storage medium and program product described above, historical behavior information of a plurality of candidate users is acquired; the historical behavior information of the plurality of candidate users is input into a target preset model for user screening to generate screening results of the plurality of candidate users, where the screening results represent the probability that each candidate user is a target user; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains historical behavior information of a first sample user and is a sample data set with data set offset; the prediction sample data set contains historical behavior information of a second sample user and is a sample data set without data set offset; and a target user is determined from the plurality of candidate users based on the screening results. Because the training sample data set with data set offset is screened according to the prediction sample data set without data set offset, a target training sample data set with a smaller data set offset, that is, a more accurate target training sample data set, can be obtained. Training on this target training sample data set produces a more accurate target preset model, and screening the acquired historical behavior information of the candidate users with that model produces more accurate screening results.
A target user can then be determined more accurately from the plurality of candidate users according to these screening results, so the accuracy of the screened target users is improved compared with the traditional method.
Drawings
FIG. 1 is a diagram of an application environment of a user screening method in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a user screening method in one embodiment;
FIG. 3 is a schematic flow chart illustrating the steps of generating a target preset model in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating the data screening step in one embodiment;
FIG. 5 is a schematic flow chart diagram illustrating a user screening method in another embodiment;
FIG. 6 is a schematic flow diagram illustrating training of an initial data calibration model using an adversarial learning model in one embodiment;
FIG. 7 is a flowchart illustrating the training sample data set determining step in one embodiment;
FIG. 8 is a schematic flow chart of the data set offset determination step in one embodiment;
FIG. 9 is a flowchart illustrating the data set offset determination step in one embodiment;
FIG. 10 is a schematic flow chart diagram illustrating a user screening method in one embodiment;
FIG. 11 is a flowchart illustrating an overall process of the target pre-set model generation step in one embodiment;
FIG. 12 is a block diagram showing the construction of a user screening apparatus according to an embodiment;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
With the development of artificial intelligence, machine learning models are applied more and more widely. Machine learning models play an important role in fields such as computer vision, natural language processing, voice recognition, intelligent risk control, precision marketing and smart cities.
In general, a machine learning model is trained on a training sample data set to obtain the model that is actually used. However, the data distribution in the training sample data set often differs from the data distribution in the prediction sample data set, i.e. there is a data set offset. Because of this data set offset, a model obtained by training on the training sample data set performs poorly. The problem of data set offset therefore often arises when a machine learning model is trained on a training sample data set. Further, when a model trained on data with data set offset is used to select target users from a large number of users, the selected target users are inaccurate.
The user screening method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. The data storage system may store data that the computer device 120 needs to process. The data storage system may be integrated on the computer device 120, or may be located on the cloud or on another network server. The computer device 120 obtains historical behavior information of a plurality of candidate users; inputs the historical behavior information of the plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users, where the screening results represent the probability that each candidate user is a target user; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains the historical behavior information of the first sample user and is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user and is a sample data set without data set offset; and determines a target user from the plurality of candidate users based on the screening results. The computer device 120 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, internet of things device or portable wearable device. The internet of things device may be a smart speaker, smart television, smart air conditioner, smart car-mounted device, and the like. The portable wearable device may be a smart watch, a smart bracelet, a head-mounted device, and the like. 
The computer device 120 may also be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, a user screening method is provided, which is illustrated by applying the method to the computer device 120 in fig. 1, and includes the following steps:
Step 220, obtaining historical behavior information of a plurality of candidate users.
Specifically, the computer device 120 may obtain historical behavior information of a plurality of candidate users within a preset time period from a preset platform. Optionally, the preset time period may be the most recent 1 year, 3 years, 5 years, and so on, which is not limited in this application. The historical behavior information of a candidate user is information that characterizes the user's personal historical behavior, and includes, but is not limited to, the user's proportion of disposable income, bank card guarantee amount, credit history, credit investigation record, credit condition, credit limit and the like. In addition, the acquisition, storage, use and processing of data in the embodiments of the application all comply with relevant national laws and regulations.
Step 240, inputting historical behavior information of a plurality of candidate users into a target preset model for user screening, and generating screening results of the plurality of candidate users; screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains the historical behavior information of the first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset.
Specifically, the data set offset refers to a phenomenon that a certain difference exists between data distribution in the training sample data set and data distribution in the prediction sample data set, and the data set offset may cause a poor effect of a model obtained by training a machine learning model based on the training sample data set with the data set offset. The training sample data set in this embodiment contains the historical behavior information of the first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset. Wherein the historical behavior information of the first sample user represents the historical behavior information of the candidate user with the data set offset, and the historical behavior information of the second sample user represents the historical behavior information of the candidate user without the data set offset. The computer device 120 may perform data screening from the training sample data set and the prediction sample data set, and use a training sample data set formed by the training sample data after the data screening as a target training sample data set. Thereafter, the computer device 120 may perform training based on the target training sample data set, generating a target preset model. Thus, the computer device 120 may input the historical behavior information of the plurality of candidate users into the target preset model for user screening, so as to generate a screening result of the plurality of candidate users. The screening results of the candidate users are used for representing the probability that the candidate users are the target users. 
The higher the probability in a candidate user's screening result, the more likely that candidate user is a target user; the lower the probability, the less likely that candidate user is a target user.
Step 260, determining a target user from the candidate users based on the screening results of the candidate users.
Specifically, since the screening results represent the probability that each candidate user is a target user, the computer device 120 may take as target users those candidate users whose probability in the screening result is higher than a preset threshold. The preset threshold is a preset probability value, and its specific value is not limited in the embodiment of the present application.
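Step 260 amounts to a threshold filter over the screening results. The sketch below illustrates it under stated assumptions: the function name, the user identifiers, the probabilities and the 0.5 threshold are hypothetical, since the application leaves the threshold value open.

```python
from typing import Dict, List


def select_target_users(screening_results: Dict[str, float], threshold: float) -> List[str]:
    """Return the IDs of candidate users whose probability exceeds the threshold."""
    return [user_id for user_id, prob in screening_results.items() if prob > threshold]


# Hypothetical screening results: candidate user ID -> probability of being a target user.
results = {"u1": 0.91, "u2": 0.35, "u3": 0.78}
targets = select_target_users(results, threshold=0.5)
```

With these example values, `targets` contains `u1` and `u3`, the two candidates whose probability is above 0.5.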
In the user screening method, historical behavior information of a plurality of candidate users is obtained; the historical behavior information of the plurality of candidate users is input into a target preset model for user screening to generate screening results of the plurality of candidate users, where the screening results represent the probability that each candidate user is a target user; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains the historical behavior information of the first sample user and is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user and is a sample data set without data set offset; and a target user is determined from the plurality of candidate users based on the screening results. Because the training sample data set with data set offset is screened according to the prediction sample data set without data set offset, a target training sample data set with a smaller data set offset, that is, a more accurate target training sample data set, can be obtained. Training on this target training sample data set produces a more accurate target preset model, and screening the acquired historical behavior information of the candidate users with that model produces more accurate screening results.
A target user can then be determined more accurately from the candidate users according to these screening results, so the accuracy of the screened target users is improved compared with the traditional method.
In one embodiment, as shown in fig. 3, the user screening method further includes:
Step 302, a training sample data set is obtained.
Step 304, a prediction sample data set is obtained.
Specifically, the computer device 120 may select sample data with a data set offset from the historical behavior information of multiple candidate users, and form the training sample data set with the sample data with the data set offset. The computer device 120 may also select sample data for which no data set offset exists from the historical behavior information of the plurality of candidate users, and construct the prediction sample data set from the sample data for which no data set offset exists.
Step 306, performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set.
Specifically, after the computer device 120 obtains the initial training sample data set and the prediction sample data set corresponding to the preset model, the computer device 120 may perform data screening on the training sample data set with data set offset according to the prediction sample data set without data set offset, so as to obtain a target training sample data set with small data set offset. Optionally, the computer device 120 may filter an initial training sample data set corresponding to the preset model according to the effect of the preset model, so as to obtain a target training sample data set with a smaller data set offset. The computer device 120 may also obtain the target training sample data set by calculating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set, and then performing sample data screening on the training sample data set according to the calculated similarity. Of course, the embodiment of the present application does not limit the specific method for data screening.
Step 308, inputting the target training sample data set into a preset model for training, adjusting preset model parameters of the preset model, and generating adjusted model parameters.
Step 310, generating a target preset model based on the adjusted model parameters.
Specifically, the computer device 120 may input the screened target training sample data set into the preset model for training to obtain an output result for each training sample data in the target training sample data set. For each training sample data, the value of the loss function is calculated from the output result and the real result of that training sample data. The computer device 120 may then adjust the parameters of the preset model according to the calculated value of the loss function, and recalculate the value of the loss function using the new model parameters. When the value of the loss function reaches its minimum, the model parameters corresponding to that value are taken as the adjusted model parameters, and the target preset model is obtained from the adjusted model parameters.
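The loop of computing outputs, evaluating a loss, adjusting parameters and stopping at the minimum can be sketched as below. The application does not name the preset model or the loss function; logistic regression, log loss, plain gradient descent and the stopping tolerance used here are assumptions chosen only to keep the example small, and the one-feature data is invented.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Output result of the (assumed) logistic model for one sample."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, xs, ys) -> float:
    """Mean log loss between output results and real results."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(weights, bias, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def train(xs, ys, lr: float = 0.5, max_iters: int = 2000, tol: float = 1e-9) -> Tuple[List[float], float]:
    """Adjust parameters by gradient descent until the loss stops decreasing."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    prev_loss = log_loss(weights, bias, xs, ys)
    n = len(xs)
    for _ in range(max_iters):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss = log_loss(weights, bias, xs, ys)
        if prev_loss - loss < tol:  # assumed stopping rule: loss no longer improves
            break
        prev_loss = loss
    return weights, bias


# Invented target training samples (one feature each) and their real results.
xs, ys = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
initial_loss = log_loss([0.0], 0.0, xs, ys)
weights, bias = train(xs, ys)
final_loss = log_loss(weights, bias, xs, ys)
```

In practice the minimum of the loss is only approximated, so a stopping rule such as the tolerance above, or a fixed iteration budget, stands in for "reaches its minimum".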
In this embodiment, a training sample data set and a prediction sample data set are obtained; data screening is performed on the training sample data set according to the prediction sample data set to generate a target training sample data set; the target training sample data set is input into a preset model for training, the preset model parameters of the preset model are adjusted, and adjusted model parameters are generated; and a target preset model is generated based on the adjusted model parameters. Because the preset model is trained on the screened target training sample data set, which has a smaller data set offset, the problem of data set offset in the training sample data set can be addressed and a more accurate target preset model can be generated.
The above embodiment describes performing data screening on a training sample data set according to a prediction sample data set to generate a target training sample data set; a specific method is described below. In one embodiment, as shown in fig. 4, performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set includes:
Step 420, calculating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set.
In particular, according to the prediction sample data set in which there is no data set offset and the training sample data set in which there is a data set offset, the computer device 120 may calculate a similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set. Optionally, the computer device 120 may directly compare the training sample data in the training sample data set with the prediction sample data in the prediction sample data set, so as to obtain the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set. The computer device 120 may also input the training sample data set into a data calibration model trained based on the training sample data set and the prediction sample data set to perform similarity calculation, so as to obtain a similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set. The data calibration model is used for calculating the similarity between the training sample data and the prediction sample data based on the training sample data set and the prediction sample data set. Of course, the embodiment of the present application does not limit the specific method for calculating the similarity.
Step 440, performing data screening on the training sample data set according to the similarity to generate a target training sample data set.
Specifically, the computer device 120 may perform sample data screening on the training sample data set according to the calculated similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set, screen out training sample data in the training sample data set whose similarity with the prediction sample data in the prediction sample data set is higher than a preset threshold, and generate the target training sample data set according to the training sample data in the training sample data set whose similarity is higher than the preset threshold.
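The threshold-based screening of step 440 can be sketched as below. The function name `screen_training_samples`, the sample identifiers, the similarity values and the threshold of 0.5 are all hypothetical; the application does not fix any of them.

```python
def screen_training_samples(training_samples, similarities, threshold):
    """Keep the training samples whose similarity is higher than the threshold."""
    if len(training_samples) != len(similarities):
        raise ValueError("one similarity value is required per training sample")
    return [s for s, sim in zip(training_samples, similarities) if sim > threshold]

# Toy data: four training samples with made-up similarity values.
training = ["user_a", "user_b", "user_c", "user_d"]
scores = [0.91, 0.12, 0.55, 0.40]
target_training = screen_training_samples(training, scores, threshold=0.5)
```

Here `target_training` plays the role of the target training sample data set.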
In this embodiment, the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set is calculated, and data screening is then performed on the training sample data set according to the similarity to generate a target training sample data set. Because data screening is performed according to the similarity between the training sample data and the prediction sample data, target training sample data with higher similarity to the prediction sample data can be obtained; that is, a target training sample data set with smaller data set offset can be obtained, and a more accurate target preset model can in turn be obtained.
The above embodiment describes calculating the similarity between training sample data in a training sample data set and prediction sample data in a prediction sample data set, and a specific method thereof is described below. In one embodiment, calculating a similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set comprises:
inputting the training sample data set into a preset data calibration model for similarity calculation, and generating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained on a training sample data set and a prediction sample data set.
Specifically, the computer device 120 may input the training sample data set into the preset data calibration model for similarity calculation, so as to obtain the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set. The preset data calibration model is a model obtained by training based on a training sample data set and a prediction sample data set, and is used for calculating the similarity between training sample data and prediction sample data based on prediction sample data in the prediction sample data set without data set offset and training sample data in the training sample data set with data set offset. Specifically, the preset data calibration model is a model obtained by training an adversarial learning model based on the training sample data set and the prediction sample data set. The adversarial learning model is used as follows: first, a model (e.g., a convolutional neural network model) is obtained that produces an output result for each input data, such that the output result is as consistent with the actual result as possible. A discriminator is used during adversarial learning to identify whether each result comes from the output of the model or from the real results. The parameters of the model are then adjusted based on the discriminator to obtain the preset data calibration model.
In the embodiment, a training sample data set is input into a preset data calibration model for similarity calculation, and the similarity between training sample data in the training sample data set and prediction sample data in a prediction sample data set is generated; the preset data calibration model is a model trained based on a training sample data set and a prediction sample data set. In this embodiment, the training sample data set is input into the preset data calibration model for similarity calculation, so that the accuracy of similarity calculation can be improved, and the similarity between the training sample data and the prediction sample data can be accurately obtained. In addition, preparation can be made for the subsequent step of data screening according to the similarity between the training sample data and the prediction sample data.
The above embodiment describes calculating the similarity between training sample data in a training sample data set and prediction sample data in a prediction sample data set; another embodiment of the user screening method, which trains the preset data calibration model, is described below. In one embodiment, as shown in fig. 5, there is provided a user screening method, further including:
Step 520, inputting the training sample data set and the prediction sample data set into an initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model.
Specifically, the computer device 120 may assign the label 0 to the training sample data set, assign the label 1 to the original prediction sample data set of the preset model, and input the training sample data set carrying the label 0 and the prediction sample data set carrying the label 1 into the initial data calibration model for training, so as to obtain the prediction result of each training sample data in the training sample data set carrying the label 0. The prediction result of each training sample data in the training sample data set refers to the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set. The initial data calibration model is a model constructed based on a machine learning model. Optionally, the machine learning model may be an adversarial learning model, a random forest model, a neural network model, a gradient boosting model, or the like, which is not limited in this application.
Specifically, assuming that the machine learning model adopted in this embodiment is an adversarial learning model, the specific training process is as follows. Referring to fig. 6, fig. 6 is a schematic diagram illustrating a process of training the initial data calibration model using an adversarial learning model according to an embodiment. First, the computer device 120 assigns the label 0 to the original training sample data set of the preset model, assigns the label 1 to the original prediction sample data set of the preset model, and constructs the adversarial learning model using the GBDT (Gradient Boosting Decision Tree) algorithm.
The GBDT algorithm builds each new round's model on the basis of the model established in the previous round, in the direction in which the loss function descends fastest, i.e. along its negative gradient; the model is therefore called a gradient boosting tree. By iterating continuously and fitting the residual with the negative gradient value in each round, the GBDT algorithm makes the predicted value converge to the true value. GBDT can be viewed as a strong learner formed by combining several weak learners. Assume that the integration result of the weak learners from the first $t-1$ rounds is $f_{t-1}(x)$ and the loss function is $L(y, f_{t-1}(x))$, where $y$ is the true value of the sample data and $x$ is the input sample data. Training in round $t$ finds a weak learner $h_t(x)$ that minimizes the loss function of round $t$. The overall loss function is calculated as shown in the following formula (1):
$$L(y, f_t(x)) = L\bigl(y, f_{t-1}(x) + h_t(x)\bigr) \tag{1}$$
the negative gradient of the ith sample of the tth round is calculated as shown in the following equation (2):
Figure BDA0003890046690000091
The specific process of constructing the adversarial learning model using the GBDT (Gradient Boosting Decision Tree) algorithm is as follows:
First, a training sample set $T$ is input, as shown in the following formula (3):
$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \tag{3}$$
A regression decision tree $f_0(x)$ is then initialized, as shown in the following formula (4):
Figure BDA0003890046690000092
where $c$ is the output value that minimizes the loss function, $N$ represents the total number of input sample data, and $y_i$ is the true value of the $i$-th sample data.
Secondly, for each sample data $i$, the negative gradient value of the loss function under the current model is used as the residual value of that sample data, and the resulting residuals are used for training a new regression tree.
Thirdly, all sample data in each leaf node are traversed, and the output value of the $t$-th tree that minimizes the loss function is determined, as shown in the following formula (5):
Figure BDA0003890046690000093
The fitting function of the $t$-th regression tree is thereby obtained, as shown in the following formula (6):
Figure BDA0003890046690000094
Fourthly, the strong learner after $t$ rounds of training is obtained from the learner $f_{t-1}(x)$ of the first $t-1$ rounds and the fitting function of the regression tree of round $t$, as shown in the following formula (7):
Figure BDA0003890046690000095
Fifthly, the above steps are repeated to determine the final GBDT strong learner.
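Formulas (2) and (4) to (7) appear only as figure references in this text. For orientation, the textbook formulation of gradient boosting with regression trees usually writes the corresponding quantities as below. This is offered as a hedged reading aid rather than a transcription of the figures: the leaf regions $R_{tj}$, the leaf count $J$, the indicator $I(\cdot)$ and the symbols $r_{ti}$ and $c_{tj}$ are notation assumed here, and the application's own formulas may differ in detail.

```latex
\begin{align*}
r_{ti} &= -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{t-1}(x)}
  && \text{negative gradient of sample } i \text{ in round } t \\
f_0(x) &= \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)
  && \text{initial regression tree} \\
c_{tj} &= \arg\min_{c} \sum_{x_i \in R_{tj}} L\bigl(y_i, f_{t-1}(x_i) + c\bigr)
  && \text{output value of leaf } j \text{ of tree } t \\
h_t(x) &= \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj})
  && \text{fitting function of tree } t \\
f_t(x) &= f_{t-1}(x) + \sum_{j=1}^{J} c_{tj}\, I(x \in R_{tj})
  && \text{strong learner after } t \text{ rounds}
\end{align*}
```

With a squared-error loss $L(y, f) = \tfrac{1}{2}(y - f)^2$, the negative gradient reduces to the ordinary residual $y_i - f_{t-1}(x_i)$, which is why the text speaks of fitting residuals.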
Secondly, the training sample data set carrying the label 0 and the prediction sample data set carrying the label 1 are input into the adversarial learning model (the initial data calibration model) for training, so as to obtain the prediction result of each training sample data in the training sample data set carrying the label 0. The prediction result of each training sample data in the training sample data set refers to the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set. Then, for each training sample data, the value of the loss function is calculated according to the prediction result of the training sample data and the labeling result of the training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set. Finally, the parameters of the initial data calibration model are adjusted according to the value of the loss function to obtain the preset data calibration model.
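The label-0/label-1 training described above can be illustrated with a very small gradient-boosting sketch. It rests on several assumptions that are not in the application: a squared-error loss on the 0/1 labels (so the negative gradient is the plain residual), depth-one trees (stumps) as weak learners, the model output used directly as the similarity, and hypothetical names (`fit_stump`, `fit_gbdt`, `predict`) and toy values throughout.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split with the lowest squared error."""
    best = None
    for f in range(len(xs[0])):
        for thr in sorted({x[f] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[f] <= thr]
            right = [r for x, r in zip(xs, residuals) if x[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    return best[1:] if best else None

def fit_gbdt(xs, ys, rounds, lr=0.3):
    """Boost stumps on residuals; lr is a shrinkage factor (an assumption)."""
    base = sum(ys) / len(ys)  # constant that minimises the squared loss
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of 1/2 * (y - f)^2 with respect to f is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        if stump is None:
            break
        f, thr, lm, rm = stump
        stumps.append(stump)
        preds = [p + lr * (lm if x[f] <= thr else rm) for x, p in zip(xs, preds)]
    return base, lr, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + sum(lr * (lm if x[f] <= thr else rm) for f, thr, lm, rm in stumps)

# Toy one-feature data: training samples get label 0, prediction samples label 1.
training_set = [[0.1], [0.2], [0.3], [0.9]]
prediction_set = [[0.8], [0.85], [0.95], [1.0]]
xs = training_set + prediction_set
ys = [0] * len(training_set) + [1] * len(prediction_set)

# A few rounds only, so the toy model does not simply memorise the labels.
model = fit_gbdt(xs, ys, rounds=3)
similarity = [predict(model, x) for x in training_set]
kept = [x for x, s in zip(training_set, similarity) if s > 0.5]
```

In this toy run the training sample at 0.9, which lies among the prediction samples, receives a higher score than the samples near 0.1, so a threshold on the score keeps it and drops the others.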
Step 540, aiming at each training sample data, calculating the value of the loss function according to the prediction result of the training sample data and the labeling result of the training sample data; and the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set.
Specifically, the computer device 120 may calculate, for each training sample data in the training sample data set carrying the label 0, a value of the loss function according to the prediction result of each training sample data in the training sample data set and the labeling result of each training sample data in the training sample data set. The labeling result of each training sample data in the training sample data set refers to the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, that is, the true similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set. In this embodiment, the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set may be calculated manually in advance.
Specifically, a loss function can be used to characterize the degree of difference between the predicted data and the actual data. The loss function may include a Softmax loss function, a cross-entropy loss function, a hinge loss function, a regression loss function, and the like. The loss function measures the predictive ability of the model: the smaller the value of the loss function, the better the trained model. The loss function can be constructed according to the prediction result of each training sample data and the labeling result of each training sample data in the training sample data set, and the value of the loss function is calculated.
Step 560, adjusting parameters of the initial data calibration model according to the value of the loss function to obtain a preset data calibration model.
In particular, the computer device 120 may adjust the parameters of the initial data calibration model according to the calculated values of the loss function. Thereafter, for each training sample data, the value of the new loss function is calculated again using the new model parameters. And when the value of the new loss function reaches the minimum value, taking the model parameter corresponding to the value of the loss function at the moment as a target model parameter, thereby obtaining a trained preset data calibration model according to the target model parameter at the moment.
In the embodiment, a training sample data set and a prediction sample data set are input into an initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model; calculating the value of a loss function according to the prediction result of the training sample data and the labeling result of the training sample data aiming at each training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set; and adjusting parameters of the initial data calibration model according to the value of the loss function to obtain a preset data calibration model. According to the method, the initial data calibration model is trained through the training sample data set and the prediction sample data set, parameters of the initial data calibration model are adjusted according to values of loss functions, so that a more accurate preset data calibration model is obtained, the more accurate preset data calibration model can be provided through model training and model parameter adjustment, and preparation is made for the subsequent step of inputting the training sample data set into the preset data calibration model to perform more accurate similarity calculation.
In one embodiment, as shown in fig. 7, obtaining a training sample data set includes:
Step 720, obtaining an initial training sample data set.
Step 740, determining whether the initial training sample data set corresponding to the preset model has data set offset.
Step 760, if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as the training sample data set.
Specifically, the computer device 120 may obtain an initial training sample data set from historical behavior information of a plurality of candidate users of a preset platform. The initial training sample data set is an initial training sample data set corresponding to the preset model, and the initial training sample data set corresponding to the preset model is an initial training sample data set used when the preset model is trained. Thereafter, the computer device 120 may determine whether the initial training sample data set corresponding to the preset model has data set offset according to the effect of the preset model. Optionally, the computer device 120 may determine an effect (for example, an effect such as stability) of the preset model according to the model index of the preset model, and then determine whether the initial training sample data set corresponding to the preset model has a data set offset according to the effect of the preset model. The computer device 120 may also determine whether the data set offset exists in the initial training sample data set corresponding to the preset model according to the effect of the preset model by monitoring the effect (such as the effect of accuracy) of the preset model during actual use. Of course, the specific method for determining whether the initial training sample data set corresponding to the preset model has the data set offset is not limited in the embodiment of the present application. And if the data set offset of the initial training sample data set corresponding to the preset model is determined, taking the initial training sample data set as a training sample data set.
In the embodiment, an initial training sample data set is obtained; determining whether a data set offset exists in an initial training sample data set corresponding to a preset model; and if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as a training sample data set. Therefore, the training sample data set with the data set offset can be obtained by determining whether the initial training sample data set corresponding to the preset model has the data set offset and using the initial training sample data set with the data set offset as the training sample data set.
The above embodiment describes determining whether the initial training sample data set corresponding to the preset model has data set offset, and a specific method thereof is described below. In one embodiment, as shown in fig. 8, determining whether there is a data set offset in the initial training sample data set corresponding to the preset model includes:
Step 820, obtaining a model stability index of the preset model.
Specifically, the computer device 120 may obtain a model stability index of the preset model according to the actual application condition of the preset model. Model stability indicates whether the prediction capability of the model is consistent over time, and is used for representing the deviation between the distribution of the prediction sample data and the distribution of the historical sample data in the model. The model stability index is an index for evaluating the stability of the model; in the embodiment of the present application, the model stability index includes, but is not limited to, a Population Stability Index (PSI).
Step 840, determining whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability index.
Specifically, the computer device 120 may determine the model stability of the preset model according to the model stability index of the preset model. Since the model stability represents the deviation between the predicted sample data distribution and the historical sample data/training sample data distribution in the model, the computer device 120 may determine whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability of the preset model.
In this embodiment, by obtaining a model stability index of the preset model and determining whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability index, preparation can be made for obtaining the training sample data set corresponding to the preset model with data set offset and the prediction sample data set without data set offset when it is determined that the initial training sample data set corresponding to the preset model has data set offset.
In the above embodiment, whether the initial training sample data set corresponding to the preset model has data set offset is determined according to the model stability index, and a specific method thereof is described below. In one embodiment, as shown in fig. 9, the model stability index includes at least one of a population stability index, a discrimination evaluation index, a classification evaluation index, and a related business index; determining whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability index, including:
Step 920, judging whether the model stability index meets a preset condition corresponding to the model stability index.
Step 940, if not, determining that the initial training sample data set corresponding to the preset model has data set offset.
Specifically, whether the model stability index of the preset model meets a preset condition corresponding to the model stability index is judged by calculating the model stability index of the preset model. The model stability index comprises at least one index of a group stability index, a discrimination evaluation index, a classification evaluation index and a related business index. When any one of the group stability index, the discrimination evaluation index, the classification evaluation index and the related service index does not satisfy the preset condition corresponding to each model stability index, the computer device 120 determines that the initial training sample data set corresponding to the preset model has data set offset. Of course, when any two of the group stability index, the discrimination evaluation index, the classification evaluation index, and the related service index do not satisfy the preset condition corresponding to each model stability index, the computer device 120 may determine that the data set offset exists in the initial training sample data set corresponding to the preset model; or, when any three of the group stability index, the discrimination evaluation index, the classification evaluation index and the related service index do not satisfy the preset condition corresponding to each model stability index, the computer device 120 determines that the initial training sample data set corresponding to the preset model has data set offset; or, when all of the group stability index, the discrimination evaluation index, the classification evaluation index, and the related service index do not satisfy the preset condition corresponding to each model stability index, the computer device 120 determines that the initial training sample data set corresponding to the preset model has a data set offset, which is not limited in the present application.
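The "any one", "any two", "any three" or "all" variants described above differ only in how many indexes must fail their preset conditions. A minimal sketch, assuming a hypothetical function `has_data_set_offset` and a hypothetical parameter `min_failures` that encodes that count:

```python
def has_data_set_offset(index_checks, min_failures=1):
    """index_checks maps an index name to True if its preset condition is met.

    Returns whether data set offset is determined, plus the failing indexes.
    """
    failures = [name for name, ok in index_checks.items() if not ok]
    return len(failures) >= min_failures, failures

# Toy outcome of the four checks (made-up values).
checks = {"PSI": False, "KS": True, "AUC": True, "business": True}
offset_any_one, failed = has_data_set_offset(checks)            # "any one" variant
offset_any_two, _ = has_data_set_offset(checks, min_failures=2)  # "any two" variant
```

Setting `min_failures` to the number of indexes gives the "all indexes fail" variant.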
The Population Stability Index (PSI) is an index representing the distribution difference between test sample data and model training sample data, and is the most common model stability evaluation index. The calculation formula of the population stability index is shown in the following formula (8):
Figure BDA0003890046690000121
wherein A is i Representing the number proportion of the ith training sample data in the prediction sample data set (actual data), E i The number of the ith training sample data in the training verification sample data set (expected data) is expressed, and n represents n sample data in the training sample data. The preset condition corresponding to the population stability index is that the calculated PSI value is less than a preset PSI threshold.
Generally, when the value of PSI is greater than or equal to 0.25, it is determined that the preset model is not stable at this time. Of course, the preset condition is not limited in the embodiment of the present application.
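Formula (8) is only available as a figure reference here. PSI is commonly defined as the sum over groups of $(A_i - E_i)\ln(A_i / E_i)$, and the sketch below assumes that definition. The function name, the small floor `eps` that guards against empty groups, and the toy counts are all hypothetical.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI over matching groups: sum of (A_i - E_i) * ln(A_i / E_i)."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both data sets must use the same groups")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # proportion in training verification data
        a_pct = max(a / a_total, eps)  # proportion in prediction data
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

# Toy group counts (made-up values).
expected = [25, 25, 25, 25]
same = population_stability_index(expected, [25, 25, 25, 25])
shifted = population_stability_index(expected, [5, 15, 30, 50])
is_unstable = shifted >= 0.25  # threshold quoted in the text
```

Every term of the sum is non-negative, so identical distributions give a PSI of zero and any shift increases it.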
The discrimination evaluation index (Kolmogorov-Smirnov, KS) is an index reflecting the optimal discrimination effect of the model, and is a commonly used model evaluation index. In general, the larger the KS index, the greater the discriminative power of the model. The calculation formula of the discrimination evaluation index is shown in the following formula (9):
$$KS = \max \lvert R_1 - R_2 \rvert \tag{9}$$
where $R_1$ denotes the cumulative percentage of good samples and $R_2$ denotes the cumulative percentage of bad samples, both taken at a given bin. Binning smooths the value of a sample data by consulting the values of the surrounding sample data; the binning mode may be any one of equal-frequency binning, equal-width binning, chi-square binning, decision-tree binning and the like.
Whether the preset model is stable is judged by comparing the KS value of the preset model on the training verification sample data set with its KS value on the prediction sample data set. The preset condition corresponding to the discrimination evaluation index is that the difference between the value of KS in the prediction sample data set and the value of KS in the training verification sample data set is smaller than or equal to a preset KS threshold value. Generally, when the difference between the value of KS in the prediction sample data set and the value of KS in the training verification sample data set is more than 0.05, it can be determined that the preset model is unstable at this time. Of course, the preset condition is not limited in the embodiment of the present application.
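A minimal sketch of formula (9) and the KS comparison follows. It makes assumptions the text leaves open: each distinct score is treated as its own bin, label 1 marks a bad sample and label 0 a good one, and the "difference" between the two KS values is taken as an absolute difference. The function names are hypothetical.

```python
def ks_statistic(scores, labels):
    """Max gap between cumulative good and cumulative bad percentages."""
    n_bad = sum(1 for y in labels if y == 1)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both good and bad samples are required")
    pairs = sorted(zip(scores, labels))
    cum_good = cum_bad = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all samples sharing the same score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_good / n_good - cum_bad / n_bad))
        i = j
    return ks

def ks_is_stable(ks_validation, ks_prediction, max_diff=0.05):
    """Preset condition: the KS difference does not exceed the threshold."""
    return abs(ks_prediction - ks_validation) <= max_diff

# Toy scores (made-up values).
ks_separated = ks_statistic([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1])
ks_mixed = ks_statistic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
```

With coarser bins, the same comparison would be made once per bin rather than once per distinct score.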
The classification evaluation index (AUC index) is the area under the receiver operating characteristic curve (ROC curve), and is a comprehensive index for evaluating binary classification models. Whether the preset model is stable is judged by comparing the AUC value of the preset model on the training verification sample data set with its AUC value on the prediction sample data set. The preset condition corresponding to the classification evaluation index is that the difference between the value of the AUC in the prediction sample data set and the value of the AUC in the training verification sample data set is less than or equal to a preset AUC threshold value. Generally, when the difference between the AUC value in the prediction sample data set and the AUC value in the training verification sample data set is more than 0.1, it can be determined that the preset model is unstable at this time. Of course, the preset condition is not limited in the embodiment of the present application.
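The AUC comparison can be sketched in the same way. The sketch uses the pairwise-ranking form of AUC (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as half), which equals the area under the ROC curve. The function names and the use of an absolute difference are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_is_stable(auc_validation, auc_prediction, max_diff=0.1):
    """Preset condition: the AUC difference does not exceed the threshold."""
    return abs(auc_prediction - auc_validation) <= max_diff

# Toy scores (made-up values).
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable on large data sets.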
The relevant service index is an index determined by behavior data generated when a user performs a relevant service, and the relevant service index can judge whether the preset model is stable according to the actual application condition. The actual application condition may be default rate of the user, click rate of the user, or the like. The preset condition corresponding to the relevant service index is that the value of the relevant service index is smaller than a preset service threshold value in actual application.
In this embodiment, the model stability index includes at least one of a group stability index, a discrimination evaluation index, a classification evaluation index, and a related business index, and therefore, when determining whether the model stability index satisfies a preset condition corresponding to the model stability index, it is specifically determined whether at least one of the group stability index, the discrimination evaluation index, the classification evaluation index, and the related business index satisfies a preset condition corresponding thereto. Namely, whether each index meets the preset condition of each index is respectively judged from a plurality of dimensions such as a group stability index, a discrimination evaluation index, a classification evaluation index, a related service index and the like. And if at least one of the group stability index, the discrimination evaluation index, the classification evaluation index and the related service index does not meet the preset condition corresponding to each model stability index, determining that the initial training sample data set corresponding to the preset model has data set offset, so that whether the initial training sample data set corresponding to the preset model has data set offset can be comprehensively and accurately determined.
In a specific embodiment, as shown in fig. 10, there is provided a user screening method applied to a computer device 120, including:
step 1002, obtaining historical behavior information of a plurality of candidate users;
step 1004, acquiring an initial training sample data set;
step 1006, obtaining a model stability index of a preset model;
step 1008, the model stability index includes at least one of a group stability index, a discrimination evaluation index, a classification evaluation index and a related business index; judging whether the model stability index meets a preset condition corresponding to the model stability index;
step 1010, if not, determining that the initial training sample data set corresponding to the preset model has data set offset;
step 1012, if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as a training sample data set;
step 1014, acquiring a prediction sample data set;
step 1016, inputting the training sample data set and the prediction sample data set into the initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
step 1018, calculating a value of the loss function according to the prediction result of the training sample data and the labeling result of the training sample data for each training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set;
step 1020, adjusting parameters of the initial data calibration model according to the value of the loss function to obtain a preset data calibration model;
step 1022, inputting the training sample data set into a preset data calibration model for similarity calculation, and generating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model obtained by training based on a training sample data set and a prediction sample data set;
step 1024, performing data screening on the training sample data set according to the similarity to generate a target training sample data set;
step 1026, inputting the target training sample data set into a preset model for training, adjusting the preset model parameters of the preset model, and generating adjusted model parameters;
step 1028, generating a target preset model based on the adjusted model parameters;
step 1030, inputting historical behavior information of a plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated after data screening is performed on a training sample data set and a prediction sample data set; the training sample data set contains the historical behavior information of the first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset;
step 1032, determining a target user from the plurality of candidate users based on the screening results of the plurality of candidate users.
Specifically, referring to fig. 11, fig. 11 is a schematic overall flowchart of the target preset model generation step in a specific embodiment. First, the computer device 120 obtains an initial training sample data set and obtains a model stability index of a preset model; the model stability index includes at least one of a group stability index, a discrimination evaluation index, a classification evaluation index and a related business index. Second, it is judged whether the model stability index meets a preset condition corresponding to the model stability index. Third, if the model stability index does not meet the preset condition corresponding to the model stability index, it is determined that the initial training sample data set corresponding to the preset model has a data set offset.
Fourth, when the initial training sample data set corresponding to the preset model has a data set offset, the initial training sample data set is taken as the training sample data set, so that the training sample data set and a prediction sample data set are obtained; the prediction sample data set is a sample data set without data set offset. Fifth, a label of 0 is assigned to the training sample data set and a label of 1 is assigned to the original prediction sample data set of the preset model, and the training sample data set carrying the label 0 and the prediction sample data set carrying the label 1 are input into the initial data calibration model for adversarial learning training, so as to obtain a prediction result for each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model. Sixth, for each training sample data, the value of the loss function is calculated according to the prediction result of the training sample data and the labeling result of the training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set. Seventh, the parameters of the initial data calibration model are adjusted according to the value of the loss function to obtain the preset data calibration model.
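The fifth to seventh steps can be sketched as follows. This is only an illustration under stated assumptions: the application says only that the initial data calibration model is built on a machine learning model, so the logistic classifier, the binary cross-entropy loss, the gradient-descent update, and all names and hyperparameters below are hypothetical choices.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_calibration_model(train_rows, pred_rows, epochs=200, lr=0.1):
    """Fit a logistic classifier: label 0 = training sample, 1 = prediction sample.

    Returns the parameters (w, b) and the mean loss recorded at each epoch.
    """
    data = [(x, 0.0) for x in train_rows] + [(x, 1.0) for x in pred_rows]
    n, dim = len(data), len(data[0][0])
    w, b, losses = [0.0] * dim, 0.0, []
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * dim, 0.0, 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            p_c = min(max(p, 1e-12), 1 - 1e-12)  # clamp before taking logs
            loss -= y * math.log(p_c) + (1 - y) * math.log(1 - p_c)
            for j in range(dim):
                grad_w[j] += (p - y) * x[j]
            grad_b += p - y
        losses.append(loss / n)
        # Adjust the model parameters according to the value of the loss gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return (w, b), losses


def similarity(model, x):
    """Predicted probability that row `x` resembles the prediction sample data."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Under this sketch, a training row that the classifier cannot tell apart from the prediction sample data receives a score closer to 1, which is the sense in which the output is read as a similarity.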
Eighth, the training sample data set is input into the preset data calibration model for similarity calculation, generating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained based on the training sample data set and the prediction sample data set. Ninth, the type of the data set offset is determined according to the obtained similarity. There are three common types of data set offset: 1) covariate shift (Covariate Shift): an offset of the sample data x input to the model; 2) prior probability shift (Prior Probability Shift): an offset of the prediction result y output by the model; 3) concept shift (Concept Shift): an offset of the correspondence between the input sample data x and the output prediction result y. When the distributions of the training sample data and the prediction sample data differ greatly, that is, when the similarity is lower than a preset similarity threshold, the offset appears as a covariate shift; when the distributions of the training sample data and the prediction sample data are similar, that is, when the similarity is higher than the preset similarity threshold, but the model stability is not high, the offset appears as a concept shift. Data screening is then performed on the training sample data set according to the obtained similarity to generate the target training sample data set.
Tenth, the target training sample data set is input into the preset model for training, the preset model parameters of the preset model are adjusted, and adjusted model parameters are generated. Eleventh, a target preset model is generated based on the adjusted model parameters.
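The ninth step and the subsequent data screening can be sketched in Python as below. The application does not state how the per-sample similarities are aggregated or which samples are retained, so averaging the similarities and keeping the rows at or above the threshold are assumptions made for illustration; the function names are hypothetical.

```python
def classify_offset(similarities, similarity_threshold, model_is_stable):
    """Name the offset type from the similarities and the stability check.

    Assumption: the per-sample similarities are summarized by their mean.
    """
    mean_similarity = sum(similarities) / len(similarities)
    if mean_similarity < similarity_threshold:
        return "covariate"  # input distributions differ greatly
    if not model_is_stable:
        return "concept"    # inputs look alike, but the x -> y relation moved
    return "none"


def screen_training_set(train_rows, similarities, similarity_threshold):
    """Keep the training rows that resemble the prediction sample data.

    Assumption: a row is retained when its similarity reaches the threshold.
    """
    return [
        row
        for row, s in zip(train_rows, similarities)
        if s >= similarity_threshold
    ]
```

The retained rows would then play the role of the target training sample data set used to retrain the preset model.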
In the embodiment, historical behavior information of a plurality of candidate users is obtained; inputting historical behavior information of a plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users; screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains historical behavior information of a first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset; and determining a target user from the candidate users based on the screening results of the candidate users. According to the method and the device, the training sample data set with the data set offset is subjected to data screening according to the prediction sample data set without the data set offset, so that a target training sample data set with the smaller data set offset, namely a more accurate target training sample data set can be obtained. Therefore, training is carried out based on the accurate target training sample data set, an accurate target preset model can be generated, and then the acquired historical behavior information of a plurality of candidate users is subjected to user screening by using the accurate target preset model, so that the screening results of the candidate users can be accurately generated. 
Furthermore, the accurate target user can be determined from the candidate users according to the accurate screening results of the candidate users, so that the accuracy of the screened target user is improved compared with the traditional method.
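As a minimal sketch of the final determination step, the following Python function selects target users from the screening results. The application states only that the target user is determined based on the screening results, which represent probabilities; the fixed probability threshold, the descending ordering, and the function name are assumptions for illustration.

```python
def determine_target_users(screening_results, probability_threshold=0.5):
    """Return candidate IDs whose probability meets the threshold, highest first.

    `screening_results` maps a candidate user ID to the probability, output by
    the target preset model, that the candidate is a target user.
    """
    selected = [
        (user_id, p)
        for user_id, p in screening_results.items()
        if p >= probability_threshold
    ]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [user_id for user_id, _ in selected]
```

A top-k selection would be an equally plausible reading; the threshold form is used here only because it is the shortest to state.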
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown in sequence as indicated by the arrows, the steps are not necessarily executed in the sequence indicated by the arrows. Unless explicitly stated herein, the steps are not limited to the exact order illustrated and may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the above embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a user screening device for realizing the user screening method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so the specific limitations in one or more embodiments of the user screening device provided below can be referred to the limitations of the user screening method in the foregoing, and are not described herein again.
In one embodiment, as shown in fig. 12, there is provided a user screening apparatus 1200, including: a historical behavior information obtaining module 1220, a user screening module 1240 and a target user determining module 1260, wherein:
a historical behavior information obtaining module 1220, configured to obtain historical behavior information of multiple candidate users;
the user screening module 1240 is used for inputting the historical behavior information of the plurality of candidate users into the target preset model for user screening to generate screening results of the plurality of candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains the historical behavior information of the first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset;
a target user determination module 1260, configured to determine a target user from the plurality of candidate users based on the screening results of the plurality of candidate users.
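The cooperation of the three modules can be pictured with the following Python sketch, which is one possible software arrangement and not the structure claimed by this application. The class name, field names, and callable signatures are hypothetical; each field stands in for the behavior of the module noted in its comment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class UserScreeningApparatus:
    # Stands in for the historical behavior information obtaining module 1220.
    obtain_history: Callable[[Sequence[str]], Dict[str, List[float]]]
    # Target preset model used by the user screening module 1240.
    target_model: Callable[[List[float]], float]
    # Stands in for the target user determining module 1260.
    determine_targets: Callable[[Dict[str, float]], List[str]]

    def run(self, candidate_ids: Sequence[str]) -> List[str]:
        # Obtain historical behavior information for every candidate user.
        history = self.obtain_history(candidate_ids)
        # Score each candidate: probability of being a target user.
        results = {uid: self.target_model(feats) for uid, feats in history.items()}
        # Determine the target users from the screening results.
        return self.determine_targets(results)
```

Passing the modules in as callables keeps the sketch neutral about whether each module is realized in software, hardware, or a combination of the two.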
In one embodiment, the user screening apparatus 1200 further includes:
the training sample data set acquisition module is used for acquiring a training sample data set;
the prediction sample data set acquisition module is used for acquiring a prediction sample data set;
the target training sample data set generating module is used for carrying out data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set;
the model parameter adjusting module is used for inputting a target training sample data set into a preset model for training, adjusting preset model parameters of the preset model and generating adjusted model parameters;
and the target preset model generating module is used for generating a target preset model based on the adjusted model parameters.
In one embodiment, the target training sample data set generation module includes:
the similarity calculation unit is used for calculating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set;
and the target training sample data set generating unit is used for performing data screening on the training sample data set according to the similarity and generating a target training sample data set.
In one embodiment, the similarity calculation unit includes:
the similarity calculation subunit is used for inputting the training sample data set into a preset data calibration model for similarity calculation to generate the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained based on a training sample data set and a prediction sample data set.
In one embodiment, the user screening apparatus 1200 further includes:
the prediction result generation unit is used for inputting the training sample data set and the prediction sample data set into the initial data calibration model for training to obtain the prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
the loss function calculation unit is used for calculating the value of the loss function according to the prediction result of the training sample data and the labeling result of the training sample data aiming at each training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set;
and the preset data calibration model generation unit is used for adjusting the parameters of the initial data calibration model according to the values of the loss function to obtain the preset data calibration model.
In one embodiment, the training sample data set obtaining module includes:
an initial training sample data set obtaining unit, configured to obtain an initial training sample data set;
the data set offset determining unit is used for determining whether the initial training sample data set corresponding to the preset model has data set offset;
and the training sample data set determining unit is used for taking the initial training sample data set as the training sample data set if the initial training sample data set corresponding to the preset model has data set offset.
In one embodiment, the data set offset determination unit includes:
the stability index obtaining subunit is used for obtaining a model stability index of the preset model;
and the data set offset determining subunit is used for determining whether the data set offset exists in the initial training sample data set corresponding to the preset model according to the model stability index.
In one embodiment, the model stability index includes at least one of a group stability index, a discrimination evaluation index, a classification evaluation index and a related business index; a data set offset determination subunit comprising:
a preset condition judgment subunit, configured to judge whether the model stability index satisfies a preset condition corresponding to the model stability index;
and the data set offset determining subunit is used for determining that the data set offset exists in the initial training sample data set corresponding to the preset model if the model stability index does not meet the preset condition corresponding to the model stability index.
The modules in the user screening apparatus may be implemented wholly or partially by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data involved in data screening. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a user screening method.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring historical behavior information of a plurality of candidate users;
inputting historical behavior information of a plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains the historical behavior information of the first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset;
and determining a target user from the candidate users based on the screening results of the candidate users.
In one embodiment, the processor when executing the computer program further performs the steps of:
acquiring a training sample data set;
acquiring a prediction sample data set;
performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set;
inputting a target training sample data set into a preset model for training, adjusting preset model parameters of the preset model, and generating adjusted model parameters;
and generating a target preset model based on the adjusted model parameters.
In one embodiment, in performing data screening on the training sample data set according to the prediction sample data set to generate the target training sample data set, the processor, when executing the computer program, further implements the following steps:
calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set;
and performing data screening on the training sample data set according to the similarity to generate a target training sample data set.
In one embodiment, in calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set, the processor, when executing the computer program, further implements the following steps:
inputting the training sample data set into a preset data calibration model for similarity calculation to generate similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained on a training sample data set and a prediction sample data set.
In one embodiment, the processor when executing the computer program further performs the steps of:
inputting a training sample data set and a prediction sample data set into an initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
calculating the value of the loss function according to the prediction result of the training sample data and the labeling result of the training sample data aiming at each training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set;
and adjusting parameters of the initial data calibration model according to the value of the loss function to obtain a preset data calibration model.
In one embodiment, in obtaining the training sample data set, the processor, when executing the computer program, further implements the following steps:
acquiring an initial training sample data set;
determining whether a data set offset exists in an initial training sample data set corresponding to a preset model;
and if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as a training sample data set.
In one embodiment, in determining whether the initial training sample data set corresponding to the preset model has a data set offset, the processor, when executing the computer program, further implements the following steps:
obtaining a model stability index of a preset model;
and determining whether the initial training sample data set corresponding to the preset model has data set offset or not according to the model stability index.
In one embodiment, the model stability index includes at least one of a group stability index, a discrimination evaluation index, a classification evaluation index, and a related business index; in determining, according to the model stability index, whether the initial training sample data set corresponding to the preset model has a data set offset, the processor, when executing the computer program, further implements the following steps:
judging whether the model stability index meets a preset condition corresponding to the model stability index;
if not, determining that the initial training sample data set corresponding to the preset model has data set offset.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of:
acquiring historical behavior information of a plurality of candidate users;
inputting historical behavior information of a plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains the historical behavior information of the first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset;
and determining a target user from the candidate users based on the screening results of the candidate users.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a training sample data set;
acquiring a prediction sample data set;
performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set;
inputting a target training sample data set into a preset model for training, adjusting preset model parameters of the preset model, and generating adjusted model parameters;
and generating a target preset model based on the adjusted model parameters.
In one embodiment, in performing data screening on the training sample data set according to the prediction sample data set to generate the target training sample data set, the computer program, when executed by the processor, further implements the following steps:
calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set;
and performing data screening on the training sample data set according to the similarity to generate a target training sample data set.
In one embodiment, in calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set, the computer program, when executed by the processor, further implements the following steps:
inputting the training sample data set into a preset data calibration model for similarity calculation, and generating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained based on a training sample data set and a prediction sample data set.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting a training sample data set and a prediction sample data set into an initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
calculating the value of a loss function according to the prediction result of the training sample data and the labeling result of the training sample data aiming at each training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set;
and adjusting parameters of the initial data calibration model according to the value of the loss function to obtain a preset data calibration model.
In one embodiment, in obtaining the training sample data set, the computer program, when executed by the processor, further implements the following steps:
acquiring an initial training sample data set;
determining whether a data set offset exists in an initial training sample data set corresponding to a preset model;
and if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as a training sample data set.
In one embodiment, in determining whether the initial training sample data set corresponding to the preset model has a data set offset, the computer program, when executed by the processor, further implements the following steps:
obtaining a model stability index of a preset model;
and determining whether the initial training sample data set corresponding to the preset model has data set offset or not according to the model stability index.
In one embodiment, the model stability index includes at least one of a group stability index, a discrimination evaluation index, a classification evaluation index, and a related business index; in determining, according to the model stability index, whether the initial training sample data set corresponding to the preset model has a data set offset, the computer program, when executed by the processor, further implements the following steps:
judging whether the model stability index meets a preset condition corresponding to the model stability index;
if not, determining that the initial training sample data set corresponding to the preset model has data set offset.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring historical behavior information of a plurality of candidate users;
inputting historical behavior information of a plurality of candidate users into a target preset model for user screening to generate screening results of the plurality of candidate users; screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated by performing data screening on the training sample data set and the prediction sample data set; the training sample data set contains historical behavior information of a first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains the historical behavior information of the second sample user, and the prediction sample data set is the sample data set without data set offset;
and determining a target user from the candidate users based on the screening results of the candidate users.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a training sample data set;
acquiring a prediction sample data set;
performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set;
inputting a target training sample data set into a preset model for training, adjusting preset model parameters of the preset model, and generating adjusted model parameters;
and generating a target preset model based on the adjusted model parameters.
In one embodiment, in performing data screening on the training sample data set according to the prediction sample data set to generate the target training sample data set, the computer program, when executed by the processor, further implements the following steps:
calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set;
and performing data screening on the training sample data set according to the similarity to generate a target training sample data set.
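As a minimal sketch of similarity-based screening, the example below scores each training sample by its distance to the nearest prediction sample and keeps the samples that score above a threshold. The distance-based similarity, the function names, and the 0.5 threshold are assumptions chosen for illustration; the application itself does not fix a similarity measure at this point.

```python
import math
from typing import List, Sequence


def nearest_similarity(
    sample: Sequence[float], prediction_set: List[Sequence[float]]
) -> float:
    """Similarity in (0, 1]: 1 / (1 + distance to the closest prediction sample)."""
    nearest = min(math.dist(sample, p) for p in prediction_set)
    return 1.0 / (1.0 + nearest)


def screen_by_similarity(
    training_set: List[Sequence[float]],
    prediction_set: List[Sequence[float]],
    min_similarity: float = 0.5,  # assumed threshold
) -> List[Sequence[float]]:
    """Keep only training samples that resemble the prediction data."""
    return [
        s
        for s in training_set
        if nearest_similarity(s, prediction_set) >= min_similarity
    ]


training = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
prediction = [[1.0, 1.2], [0.9, 0.8]]
target_training = screen_by_similarity(training, prediction)
```

Here only the middle training sample lies close enough to the prediction data to be retained.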
In one embodiment, for calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set, the computer program, when executed by the processor, further performs the steps of:
inputting the training sample data set into a preset data calibration model for similarity calculation to generate similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained based on a training sample data set and a prediction sample data set.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting the training sample data set and the prediction sample data set into an initial data calibration model for training to obtain a prediction result for each training sample data in the training sample data set; the prediction result represents the predicted similarity between the training sample data and the corresponding prediction sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
for each training sample data, calculating the value of a loss function according to the prediction result and the labeling result of the training sample data; the labeling result represents the labeled similarity between the training sample data and the corresponding prediction sample data in the prediction sample data set;
and adjusting parameters of the initial data calibration model according to the value of the loss function to obtain a preset data calibration model.
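One common way to realize a calibration model of this kind is a domain classifier: a model trained to tell training-set rows from prediction-set rows, whose output for a training row is then read as its similarity to the prediction data. The sketch below takes that approach with a small logistic regression and a cross-entropy loss. This is an assumption, not the method stated in the application; the 0/1 labeling scheme, the function names, and the hyperparameters are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(p: float, y: float) -> float:
    """Loss between a predicted similarity p and a labeled similarity y."""
    eps = 1e-12  # keeps log() away from zero
    return -(y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps))


def train_calibration_model(
    training_set: List[Sequence[float]],
    prediction_set: List[Sequence[float]],
    epochs: int = 200,
    lr: float = 0.1,
) -> Tuple[List[float], float, float]:
    """Fit a logistic model; return (weights, bias, mean loss of the last epoch)."""
    # Assumed labeling: 0.0 for training-set rows, 1.0 for prediction-set rows.
    data = [(x, 0.0) for x in training_set] + [(x, 1.0) for x in prediction_set]
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    mean_loss = 0.0
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            total += cross_entropy(p, y)
            grad = p - y  # gradient of the cross-entropy w.r.t. the logit
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
        mean_loss = total / len(data)
    return weights, bias, mean_loss


def predicted_similarity(
    x: Sequence[float], weights: List[float], bias: float
) -> float:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


train_rows = [[0.0], [0.2], [0.4], [3.0]]
pred_rows = [[2.8], [3.1], [3.3]]
w, b, loss = train_calibration_model(train_rows, pred_rows)
scores = [predicted_similarity(x, w, b) for x in train_rows]
```

In this toy data the training row `[3.0]` sits among the prediction rows, so it receives the highest similarity score of the four training rows.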
In one embodiment, for acquiring the training sample data set, the computer program, when executed by the processor, further performs the steps of:
acquiring an initial training sample data set;
determining whether a data set offset exists in an initial training sample data set corresponding to a preset model;
and if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as a training sample data set.
In one embodiment, for determining whether the initial training sample data set corresponding to the preset model has data set offset, the computer program, when executed by the processor, further performs the steps of:
obtaining a model stability index of a preset model;
and determining whether the initial training sample data set corresponding to the preset model has data set offset or not according to the model stability index.
In one embodiment, the model stability index includes at least one of a population stability index, a discrimination evaluation index, a classification evaluation index, and a related business index; for determining, according to the model stability index, whether the initial training sample data set corresponding to the preset model has data set offset, the computer program, when executed by the processor, further performs the steps of:
judging whether the model stability index meets a preset condition corresponding to the model stability index;
if not, determining that the initial training sample data set corresponding to the preset model has data set offset.
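The check described above can be illustrated with the group (population) stability index, PSI, which compares how a score is distributed in a baseline sample and in a recent sample. The sketch below uses equal-width bins and treats a PSI at or below 0.25 as the preset condition. Both choices are assumptions: 0.25 is a widely used rule of thumb rather than a value from the application, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """PSI between a baseline sample (expected) and a recent sample (actual)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share so that empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def has_dataset_offset(psi_value: float, threshold: float = 0.25) -> bool:
    """Offset is flagged when the preset condition (PSI <= threshold) is not met."""
    return psi_value > threshold


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
stable_flag = has_dataset_offset(psi(baseline, baseline))
shifted_flag = has_dataset_offset(psi(baseline, shifted))
```

When the flag is raised, the initial training sample data set would be taken as the training sample data set to be screened, as the preceding embodiment describes.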
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered within the scope of this specification as long as it involves no contradiction.
The above examples express only several embodiments of the present application. They are described in relatively specific detail, but this should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (12)

1. A method for user screening, the method comprising:
acquiring historical behavior information of a plurality of candidate users;
inputting the historical behavior information of the candidate users into a target preset model for user screening to generate screening results of the candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated after data screening is performed on a training sample data set and a prediction sample data set; the training sample data set contains historical behavior information of a first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains historical behavior information of a second sample user, and the prediction sample data set is a sample data set without data set offset;
and determining a target user from the candidate users based on the screening results of the candidate users.
2. The method of claim 1, further comprising:
acquiring the training sample data set;
acquiring the prediction sample data set;
performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set;
inputting the target training sample data set into a preset model for training, adjusting preset model parameters of the preset model, and generating adjusted model parameters;
and generating a target preset model based on the adjusted model parameters.
3. The method of claim 2, wherein the performing data screening on the training sample data set according to the prediction sample data set to generate a target training sample data set comprises:
calculating the similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set;
and performing data screening on the training sample data set according to the similarity to generate a target training sample data set.
4. The method according to claim 3, wherein said calculating a similarity between training sample data in the training sample data set and prediction sample data in the prediction sample data set comprises:
inputting the training sample data set into a preset data calibration model for similarity calculation, and generating the similarity between the training sample data in the training sample data set and the prediction sample data in the prediction sample data set; the preset data calibration model is a model trained on the training sample data set and the prediction sample data set.
5. The method of claim 4, further comprising:
inputting the training sample data set and the prediction sample data set into an initial data calibration model for training to obtain a prediction result of each training sample data in the training sample data set; the prediction result is used for representing the prediction similarity between training sample data and prediction sample data corresponding to the training sample data in the prediction sample data set, and the initial data calibration model is a model constructed based on a machine learning model;
for each training sample data, calculating the value of a loss function according to the prediction result and the labeling result of the training sample data; the labeling result is used for representing the labeling similarity between the training sample data and the prediction sample data corresponding to the training sample data in the prediction sample data set;
and adjusting parameters of the initial data calibration model according to the value of the loss function to obtain the preset data calibration model.
6. The method of claim 2, wherein the acquiring the training sample data set comprises:
acquiring an initial training sample data set;
determining whether the initial training sample data set corresponding to the preset model has data set offset or not;
and if the initial training sample data set corresponding to the preset model has data set offset, taking the initial training sample data set as the training sample data set.
7. The method of claim 6, wherein the determining whether the initial training sample data set corresponding to the preset model has data set offset comprises:
obtaining a model stability index of the preset model;
and determining whether the initial training sample data set corresponding to the preset model has data set offset or not according to the model stability index.
8. The method of claim 7, wherein the model stability index comprises at least one of a population stability index, a discrimination evaluation index, a classification evaluation index, and a related business index; the determining whether the initial training sample data set corresponding to the preset model has data set offset according to the model stability index includes:
judging whether the model stability index meets a preset condition corresponding to the model stability index;
if not, determining that the initial training sample data set corresponding to the preset model has data set offset.
9. A user screening apparatus, the apparatus comprising:
the historical behavior information acquisition module is used for acquiring historical behavior information of a plurality of candidate users;
the user screening module is used for inputting the historical behavior information of the candidate users into a target preset model for user screening to generate screening results of the candidate users; the screening results of the candidate users are used for representing the probability that the candidate users are target users; the target preset model is generated by training based on a target training sample data set, and the target training sample data set is generated after data screening is performed on a training sample data set and a prediction sample data set; the training sample data set contains historical behavior information of a first sample user, and the training sample data set is a sample data set with data set offset; the prediction sample data set contains historical behavior information of a second sample user, and the prediction sample data set is a sample data set without data set offset;
and the target user determining module is used for determining a target user from the candidate users based on the screening results of the candidate users.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202211257122.XA 2022-10-14 2022-10-14 User screening method, device, computer equipment, storage medium and program product Pending CN115545214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211257122.XA CN115545214A (en) 2022-10-14 2022-10-14 User screening method, device, computer equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211257122.XA CN115545214A (en) 2022-10-14 2022-10-14 User screening method, device, computer equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115545214A true CN115545214A (en) 2022-12-30

Family

ID=84734047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211257122.XA Pending CN115545214A (en) 2022-10-14 2022-10-14 User screening method, device, computer equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115545214A (en)

Similar Documents

Publication Publication Date Title
CN110738527A (en) feature importance ranking method, device, equipment and storage medium
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN110634060A (en) User credit risk assessment method, system, device and storage medium
CN114358197A (en) Method and device for training classification model, electronic equipment and storage medium
CN112884569A (en) Credit assessment model training method, device and equipment
CN116629435A (en) Risk prediction method, risk prediction device, computer equipment and storage medium
CN114169460A (en) Sample screening method, sample screening device, computer equipment and storage medium
CN114881343B (en) Short-term load prediction method and device for power system based on feature selection
CN110991247A (en) Electronic component identification method based on deep learning and NCA fusion
CN116258923A (en) Image recognition model training method, device, computer equipment and storage medium
CN114784795A (en) Wind power prediction method and device, electronic equipment and storage medium
CN115545214A (en) User screening method, device, computer equipment, storage medium and program product
CN114170000A (en) Credit card user risk category identification method, device, computer equipment and medium
CN114529136A (en) Electronic part component evaluation method and device based on principal component analysis and Topsis
CN114529399A (en) User data processing method, device, computer equipment and storage medium
CN112884028A (en) System resource adjusting method, device and equipment
CN115392594B (en) Electrical load model training method based on neural network and feature screening
CN112801362B (en) Academic early warning method based on artificial neural network and LSTM network
CN115907969A (en) Account risk assessment method and device, computer equipment and storage medium
CN115660810A (en) Resource attribute evaluation method, device, computer equipment, storage medium and product
CN116452308A (en) Risk assessment method, apparatus, computer device, storage medium, and program product
CN115985430A (en) Semen donator abstinence day number determination method and device based on semen donation qualification rate
Liu et al. An Integrated Learning-Based Prediction Model for Purchasing Propensity of Jingdong Visitors
CN117575772A (en) Abnormal user detection method and device, computer equipment and storage medium
CN116975621A (en) Model stability monitoring method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination