CN113610132A - User equipment identification method and device and computer equipment - Google Patents


Info

Publication number
CN113610132A
CN113610132A
Authority
CN
China
Prior art keywords
training
data set
risk model
user equipment
risk
Prior art date
Legal status
Pending
Application number
CN202110865160.2A
Other languages
Chinese (zh)
Inventor
沈赟
杨雪君
朱维娜
Current Assignee
Shanghai Qiyue Information Technology Co Ltd
Original Assignee
Shanghai Qiyue Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Qiyue Information Technology Co Ltd filed Critical Shanghai Qiyue Information Technology Co Ltd
Priority to CN202110865160.2A priority Critical patent/CN113610132A/en
Publication of CN113610132A publication Critical patent/CN113610132A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/24323: Tree-organised classifiers
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/0635: Risk analysis of enterprise or organisation activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Educational Administration (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a user equipment authentication method and apparatus for an internet service, and a computer device. The method comprises the following steps: obtaining device data and device internet service performance data of historical user equipment, screening variable feature data, and determining positive and negative samples to establish a first training data set D1 and a test data set D3 for a first risk model; performing validation training on the first risk model to establish a second training data set D2; training a second risk model on the second training data set D2 using a gradient harmonizing mechanism; and using the trained second risk model to identify new user equipment applying for the internet service. The invention can improve the model's ability to distinguish positive and negative samples and improve its prediction accuracy without tuning model hyper-parameters; risky devices can be effectively rejected, and losses of the internet service platform can be effectively reduced.

Description

User equipment identification method and device and computer equipment
Technical Field
The invention relates to the field of computer information processing, in particular to a user equipment authentication method and device and computer equipment.
Background
Risk control (often abbreviated as "wind control" in Chinese internet finance) means that risk managers take various measures and methods to eliminate or reduce the possibility of a risk event occurring, or to reduce the losses caused when a risk event occurs. Risk control is widely applied in the internet industry, for example to company transactions, merchant transactions, or personal transactions.
In the field of internet risk control, with the implementation of regulatory measures and the growing awareness of default among internet resource users, fewer default samples are available in internet services. In particular, fraud identification in specific scenarios (for example, first-time overdue scenarios) suffers from a severe class-imbalance problem. This covers both imbalance between the positive and negative sample classes and imbalance in how easily samples are identified: overall, a large number of negative samples are easy to identify while a large number of positive samples are hard to identify. Both kinds of imbalance can be summarized as attribute imbalance (or class imbalance), which leads to problems such as sampling overfitting, inaccurate model parameters, and low model accuracy when modeling on an imbalanced sample data set.
In the prior art, the modeling problem of an imbalanced sample data set is generally addressed as follows: 1) At the data level, starting from the original data set, a sampling method (such as oversampling, undersampling, or mixed random sampling) is used to change the sample distribution of the data set and thereby the degree of imbalance between classes. 2) At the algorithm level, without changing the sample distribution, the algorithm is made more sensitive to minority samples by setting sample weights or modifying existing classifiers; common approaches include Boosting ensemble algorithms and cost-sensitive methods. 3) At the evaluation level, imbalanced sample data sets are generally evaluated with indicators such as the confusion matrix, sensitivity, specificity, and AUC. In addition, there remains considerable room for improvement in risk prediction, model parameter estimation, model accuracy, feature extraction, and data updating for user-associated devices.
Therefore, there is a need for an improved method of authenticating a user equipment.
Disclosure of Invention
The invention aims to solve the following problems: screening features more accurately, improving model prediction accuracy, accurately quantifying the risk of user equipment, effectively identifying risks of new user equipment, and reducing losses of the internet service platform.
A first aspect of the present invention provides a user equipment authentication method for an internet service, comprising: obtaining device data and device internet service performance data of historical user equipment, screening variable feature data, and determining positive and negative samples to establish a first training data set D1 and a test data set D3 for a first risk model; performing validation training on the first risk model to establish a second training data set D2; training a second risk model on the second training data set D2 using a gradient harmonizing mechanism; and using the trained second risk model to identify new user equipment applying for the internet service.
According to an optional embodiment of the invention, training the second risk model on the second training data set D2 using the gradient harmonizing mechanism comprises: fitting the gradient distribution of the second training data set D2 to calculate the gradient density harmonizing parameter βi.
According to an optional embodiment of the invention, training the second risk model on the second training data set D2 using the gradient harmonizing mechanism comprises: according to the gradient density harmonizing parameter βi, calculating the loss gradient of each training sample, i.e., the gap between the predicted value and the true value, with the following expressions (reconstructed here in standard gradient-harmonizing notation):

L = (1/N) · Σi βi · LCE(pi, pi*),  with βi = N / GD(gi) and gi = |pi − pi*|

where N is the total number of training samples in the second training data set; pi ∈ [0, 1] is the predicted probability computed with the second risk model; pi* ∈ {0, 1} is the class label indicating whether the user device is a risky device; LCE(pi, pi*) is the cross-entropy loss of each training sample; and GD(g) is the gradient density of the training samples of the second training data set:

GD(g) = (1/lε(g)) · Σk δε(gk, g)

whose physical meaning is the number of samples per unit gradient-norm length at g. Here δε(gk, g) counts the training samples whose gradient norm gk falls in the interval [g − ε/2, g + ε/2), and lε(g) = min(g + ε/2, 1) − max(g − ε/2, 0) is the length of that interval.
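The gradient harmonizing computation described above can be sketched in plain Python: the gradient norm of each sample is bucketed into equal-width intervals (the usual histogram approximation of the gradient density GD), and each sample receives the weight βi = N / GD(gi). The function name and the default bin count are illustrative choices, not taken from the patent.

```python
def ghm_weights(pred_probs, labels, bins=10):
    """Gradient-harmonizing sample weights (illustrative sketch).

    gradient norm:  g_i = |p_i - p_i*|
    density:        GD(g) = (count of samples in g's bin) / bin width
    weight:         beta_i = N / GD(g_i)
    """
    n = len(pred_probs)
    eps = 1.0 / bins                      # width of each gradient-norm bin
    grads = [abs(p, ) if False else abs(p - y) for p, y in zip(pred_probs, labels)]
    counts = [0] * bins
    for g in grads:
        idx = min(int(g / eps), bins - 1)  # clamp g = 1.0 into the last bin
        counts[idx] += 1
    betas = []
    for g in grads:
        idx = min(int(g / eps), bins - 1)
        gd = counts[idx] / eps             # samples per unit gradient length
        betas.append(n / gd)               # sparse-density samples get upweighted
    return grads, betas
```

Samples whose gradient norm falls in a sparsely populated bin (outliers or very hard samples) receive larger weights, which is how the mechanism rebalances the contributions of easy and hard samples without hyper-parameter tuning.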
According to an optional embodiment of the invention, the method comprises: ending the training of the second risk model when the proportion of training samples whose loss gradient is larger than a set value exceeds a specified ratio of the total number of samples in the second training data set.
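This stopping rule can be stated in a few lines; the function name, threshold, and ratio below are illustrative, since the patent leaves the "set value" and "specified ratio" open.

```python
def should_stop(loss_gradients, threshold, ratio):
    """Return True when the share of samples whose loss gradient exceeds
    `threshold` is greater than `ratio` of the training set, as the
    embodiment describes (illustrative sketch)."""
    n_large = sum(1 for g in loss_gradients if g > threshold)
    return n_large / len(loss_gradients) > ratio
```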
According to an optional embodiment of the invention, the validation training of the first risk model comprises: splitting the first training data set D1 into a training set D11 and a validation set D12 using a K-fold cross-validation algorithm, where K is 5 to 10; and performing validation training on the first risk model using the training set D11 and the validation set D12.
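The K-fold split of D1 into D11/D12 can be sketched as a plain-Python index generator; the patent does not specify an implementation, so the helper name and shuffling-free fold layout are illustrative.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for K-fold cross-validation,
    splitting indices 0..n_samples-1 into k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        valid_idx = folds[i]                                  # D12 for this round
        train_idx = [j for f in range(k) if f != i            # D11 = all other folds
                     for j in folds[f]]
        yield train_idx, valid_idx
```

Each round trains the first risk model on D11 and scores D12, so every sample of D1 is validated exactly once across the K rounds.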
According to an optional embodiment of the invention, the predicted values of the validation set D12 from each cross-validation round are concatenated to obtain discrimination features and risk features related to the user equipment, which are used as input features of the second risk model; the label value of the second risk model is characterized using label values generated by quantifying at least two of the following features: APP fraud data of the device, overdue data of the device, multi-head feature data of the device, and relationship network feature data of users associated with the device.
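The concatenation of per-round validation predictions (out-of-fold stacking) might look like the following sketch; `model_factory` and the fit/predict interface are assumptions for illustration, not specified by the patent.

```python
def oof_predictions(model_factory, X, y, splits):
    """Train one first-risk-model instance per fold and place its
    validation-set predictions into a vector aligned with the rows of
    D1; the result can serve as an input feature for the second model."""
    oof = [None] * len(y)
    for train_idx, valid_idx in splits:
        model = model_factory()
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in valid_idx])
        for i, p in zip(valid_idx, preds):
            oof[i] = p                      # each row predicted exactly once
    return oof
```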
According to an optional embodiment of the present invention, identifying new user equipment applying for the internet service using the trained second risk model comprises: when a resource service request from the new user equipment to the internet service platform is received, obtaining the device data of the new user equipment, inputting the device data into the second risk model, and outputting a predicted value for the new user equipment; and judging whether the new user equipment is a risky device according to the calculated predicted value.
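Applying the trained second risk model to a new device could be sketched as below; the `predict_proba` interface and the 0.5 decision threshold are illustrative assumptions, since the patent does not fix either.

```python
def assess_new_device(model, device_features, threshold=0.5):
    """Score a new user device with the trained second risk model and
    flag it as risky when the predicted value reaches the threshold."""
    score = model.predict_proba(device_features)
    return {"score": score, "is_risk_device": score >= threshold}
```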
Further, a second aspect of the present invention provides a user equipment authentication apparatus for an internet service, comprising: a screening processing module for obtaining device data and device internet service performance data of historical user equipment, screening variable feature data, and determining positive and negative samples to establish a first training data set D1 and a test data set D3 for a first risk model; a first training module for performing validation training on the first risk model to establish a second training data set D2; a second training module for training a second risk model on the second training data set D2 using a gradient harmonizing mechanism; and an identification processing module for identifying new user equipment applying for the internet service using the trained second risk model.
According to an optional embodiment of the invention, the second training module fits the gradient distribution of the second training data set D2 to calculate the gradient density harmonizing parameter βi.
Furthermore, a third aspect of the present invention provides a computer device comprising a processor and a memory storing a computer-executable program; when the program is executed by the processor, the processor performs the user equipment authentication method for an internet service according to the first aspect of the invention.
Furthermore, a fourth aspect of the present invention provides a computer program product storing a computer executable program which, when executed, implements the user equipment authentication method according to the first aspect of the present invention.
Advantageous effects
Compared with the prior art, the invention screens variable feature data to establish a first training data set D1 and a test data set D3 for the first risk model, which effectively selects a feature set with high coverage, clear discrimination of the target variable, and large information gain. Validation training of the first risk model on the first training data set establishes a second training data set D2, providing accurate training sample data for subsequent modeling. Training the second risk model on D2 with the gradient harmonizing mechanism adjusts the gradient contributions of different samples during training, which improves the model's ability to distinguish positive and negative samples and its prediction accuracy without tuning model hyper-parameters. Using the trained second risk model, new user equipment applying for the internet service can be effectively identified, risky devices effectively rejected, and losses of the internet service platform effectively reduced.
Furthermore, by setting screening rules and screening the variable feature data, a feature set with high coverage, clear discrimination of the target variable, and large information gain can be selected, providing accurate training sample data for subsequent model building. The first risk model is validation-trained on the training and validation sets obtained by K-fold cross-validation, and the second training data set D2 is generated from the output of the first risk model, yielding a more accurate D2 and optimizing the training process. A decision tree algorithm is used to calculate the feature importance of each variable feature, and the variable features are screened according to the results, so that features with high risk discrimination and strong interpretability can be selected.
Drawings
In order to make the technical problems solved, the technical means adopted, and the technical effects achieved by the present invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below illustrate only exemplary embodiments of the invention, from which those skilled in the art can derive further embodiments without inventive effort.
Fig. 1 is a flowchart of an example of a user equipment authentication method of an internet service of embodiment 1 of the present invention.
Fig. 2 is a flowchart of another example of a user equipment authentication method of an internet service according to embodiment 1 of the present invention.
Fig. 3 is a flowchart of still another example of a user equipment authentication method of an internet service according to embodiment 1 of the present invention.
Fig. 4 is a schematic diagram of an example of a user equipment authentication apparatus of an internet service according to embodiment 2 of the present invention.
Fig. 5 is a schematic view of another example of the user equipment authentication apparatus of the internet service of embodiment 2 of the present invention.
Fig. 6 is a schematic view of still another example of a user equipment authentication apparatus of an internet service of embodiment 2 of the present invention.
Fig. 7 is a block diagram of an exemplary embodiment of a computer device according to the present invention.
Fig. 8 is a block diagram of an exemplary embodiment of a computer program product according to the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art. The same reference numerals in the drawings denote the same or similar elements, components, or parts, and thus their repetitive description will be omitted.
Features, structures, characteristics or other details described in a particular embodiment do not preclude the fact that the features, structures, characteristics or other details may be combined in a suitable manner in one or more other embodiments in accordance with the technical idea of the invention.
In describing particular embodiments, features, structures, characteristics, or other details are set forth to provide a thorough understanding of the embodiments. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of these specific features, structures, characteristics, or other details.
The flow diagrams depicted in the figures are merely exemplary and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
The block diagrams depicted in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these terms should not be construed as limiting. These phrases are used to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention.
The term "and/or" includes any and all combinations of one or more of the associated listed items.
In view of the above, the present invention screens variable feature data to build a first training data set D1 and a test data set D3 for the first risk model, effectively selecting a feature set with high coverage, clear discrimination of the target variable, and large information gain. Validation training of the first risk model on the first training data set establishes a second training data set D2, providing accurate training sample data for subsequent modeling. Training the second risk model on D2 with the gradient harmonizing mechanism adjusts the gradient contributions of different samples during training, improving the model's ability to distinguish positive and negative samples and its prediction accuracy without tuning model hyper-parameters. Using the trained second risk model, new user equipment applying for the internet service can be effectively identified, risky devices effectively rejected, and losses of the internet service platform effectively reduced.
It should be noted that the innovation of the present invention lies in making the risk identification process for user equipment more automated and efficient, and in reducing labor costs, based on the interaction between the user equipment and the internet service platform (i.e., information interaction between objects). For convenience, the invention is described using the identification of new user equipment in an internet service as an example, but this should not be construed as limiting. The specific procedure of the new user equipment identification method is described in detail below.
Example 1
Hereinafter, an embodiment of a user equipment authentication method of an internet service of the present invention will be described with reference to fig. 1 to 3.
Fig. 1 is a flowchart of an example of a user equipment authentication method of an internet service of the present invention. As shown in fig. 1, the user equipment authentication method includes the following steps.
Step S101: obtain device data and device internet service performance data of historical user equipment, screen variable feature data, and determine positive and negative samples to establish a first training data set D1 and a test data set D3 for a first risk model.
Step S102: perform validation training on the first risk model to establish a second training data set D2.
Step S103: train the second risk model on the second training data set D2 using a gradient harmonizing mechanism.
Step S104: use the trained second risk model to identify new user equipment applying for the internet service.
In order to accurately identify the risk of new user equipment, feature screening is performed on the device internet service performance data collected during the internet service period, and the screened features (for example, features with high coverage, high discrimination (KS), and large information gain values) provide training sample data for model building, so that risks of new user equipment can be accurately identified, risky devices effectively rejected, and losses of the internet service platform reduced.
In the present invention, the internet service provides internet service resources, such as shopping, ride-hailing, maps, takeout, or shared bicycles, in response to an application from the user equipment (or user-associated equipment) to the internet service platform. Examples include resource allocation services, resource usage services, resource guarantee or mutual-aid services, resource raising services, and group-buying or bus-hailing services. Here, resources refer to any available material, information, or time; information resources include computing resources and various types of data resources, and data resources include various private data in various domains. User equipment (or user-associated equipment) refers to the equipment associated with a registered user when applying for services on an internet service platform, generally represented by a device ID.
The specific process of the method of the present invention will be described below by taking an internet resource allocation service as an example.
First, in step S101, device data and device internet service performance data of historical user equipment are obtained, variable feature data are screened, positive and negative samples are determined, and a first training data set D1 and a test data set D3 are established for the first risk model.
As a specific implementation, in a scenario where user equipment applies for resource allocation from an internet resource allocation service, device data and device internet service performance data of historical user equipment are obtained. The device data includes the device ID, device identification code, and device name, as well as data such as shutdown, number change, or suspended use during resource use, uninstalling the APP, and rejecting customer service calls. The device internet service performance data includes the internet service application frequency and the number of internet service uses within a specific time period, data on unreturned internet resources, overdue data of the device, APP fraud data of the device, multi-head feature data of the device, relationship network feature data of users associated with the device, data on users associated with the same device, and the number of device-associated users; the overdue data of the device includes whether the user equipment returned the internet service resources within a specific time period from the resource return time.
Specifically, the specific time period is 5 to 30 days, for example 5, 7, 15, 20, or 30 days.
More specifically, the device-associated user data includes basic user information, personal information, multi-head information, and various operation behavior information from the internet resource service APP.
Further, the above information is fused, for example into wide-table variables, and data cleaning is performed on the related data to ensure the stability and accuracy of the later model.
Specifically, the data cleaning process includes at least two of the following: variable missing-rate analysis, PSI analysis, outlier handling, continuous-variable discretization and WOE conversion, discrete-variable WOE conversion and dummy-variable conversion, text-variable processing, and feature derivation.
Next, a process of screening the variable characteristic data will be specifically described.
Preferably, the variable characteristic data is filtered by setting a filtering rule.
Specifically, the screening rule includes determining a screening parameter, where the screening parameter includes at least one of the following parameters: the variable coverage, the single-value coverage, the correlation or significance with the target variable, the variable stability PSI, the discrimination (KS) of the target variable, the information gain value (IV), the characteristic importance and the like.
In one embodiment, the screening parameters are determined to be the information gain value (IV), the variable coverage, and the discrimination (KS) of the target variable. The screening rule selects the variable features with the largest information gain values, the variable features whose coverage is greater than a set value, and the variable features whose discrimination (KS) is greater than a specified value.
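The coverage/KS part of this screening rule can be illustrated with a standard KS computation and a threshold filter; the helper names and threshold values are illustrative, since the patent leaves the set values unspecified.

```python
def ks_statistic(scores, labels):
    """Kolmogorov-Smirnov discrimination of a feature/score against a
    binary target: the maximum gap between the cumulative distributions
    of the positive and negative classes (standard formulation)."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    cum_pos = cum_neg = 0
    ks = 0.0
    for _, y in pairs:
        if y == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        ks = max(ks, abs(cum_pos / n_pos - cum_neg / n_neg))
    return ks

def screen_features(features, min_coverage=0.8, min_ks=0.05):
    """Keep features whose coverage and KS both exceed their thresholds.
    `features` maps name -> (coverage, ks)."""
    return [name for name, (cov, ks) in features.items()
            if cov > min_coverage and ks > min_ks]
```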
Further, the screening rule also includes: binning each model feature variable and performing WOE conversion to calculate the information gain value (IV) of each variable feature and the correlation coefficient (CORR) between variable features.
For example, each model feature variable is binned using chi-square binning and WOEi is calculated; the specific calculation formula is as follows.
WOEi = ln( (#good(i) / #good(T)) / (#bad(i) / #bad(T)) )    (1)

where WOEi is the WOE value of the i-th bin; #good(i) is the number of samples labeled good in the i-th bin; #good(T) is the total number of good samples in all bins; #bad(i) is the number of samples labeled bad in the i-th bin; and #bad(T) is the total number of bad samples in all bins.
Further, based on the calculated WOEi, the information gain value IV of each variable feature is calculated with the following formula.

IV = Σ (i = 1 to n) ( #good(i) / #good(T) − #bad(i) / #bad(T) ) × WOEi    (2)

where IV is the information gain value of the variable feature and n is the number of bins; WOEi, #good(i), #good(T), #bad(i), and #bad(T) have the same meanings as in formula (1), so their description is omitted here.
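Formulas (1) and (2) can be computed directly from per-bin label counts. This sketch assumes every bin contains at least one good and one bad sample (otherwise the logarithm is undefined); the function name is illustrative.

```python
import math

def woe_iv(bin_goods, bin_bads):
    """Per-bin WOE values and the total IV of one variable feature,
    following formulas (1) and (2) in the description."""
    total_good = sum(bin_goods)
    total_bad = sum(bin_bads)
    woes, iv = [], 0.0
    for good, bad in zip(bin_goods, bin_bads):
        good_rate = good / total_good      # #good(i) / #good(T)
        bad_rate = bad / total_bad         # #bad(i) / #bad(T)
        woe = math.log(good_rate / bad_rate)
        woes.append(woe)
        iv += (good_rate - bad_rate) * woe
    return woes, iv
```

Each IV term is a product of a rate difference and its log-odds, so every term is non-negative and larger IV indicates a more discriminative binning.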
Specifically, the variable features are screened according to the screening rule (selecting the variable features with the largest information gain values, those whose coverage is greater than a set value, and those whose discrimination (KS) is greater than a specified value), yielding first variable features with coverage greater than a specified value, second variable features with discrimination (KS) greater than a specific threshold, and third variable features with information gain values (IV) greater than a set value.
Optionally, cross-combination operations are performed on the screened variable features to expand them and generate further derived variable features.
Therefore, by setting a screening rule and screening the variable feature data, a feature set with high coverage, a clear target-variable discrimination effect, and a large information gain can be screened out, providing accurate training sample data for subsequent model building.
In the present example, the screening rule is set for the extraction and screening of the variable features, but the present invention is not limited to this; in other examples, both an extraction rule and a screening rule may be set, or only an extraction rule, or only a screening rule. The foregoing is by way of example only and is not to be construed as limiting the present invention.
Next, positive and negative samples are determined to establish a first training data set D1 and a test data set D3 of the first risk model, where the first risk model is a primary classifier comprising an XGBoost model, a LightGBM model, and a GBDT model.
Specifically, a first training data set D1 and a test data set D3 are established based on the screened variable features, the test data set D3 being used for scoring and prediction of the newly extracted variable features (e.g., the first variable features, second variable features, and third variable features).
More specifically, the first training data set D1 includes device data of historical user devices tagged with a first risk label, a first variable feature, a second variable feature, and a third variable feature.
As a specific example, good and bad samples (i.e., positive and negative samples) are defined, and a first training data set D1 of the first risk model is established, where the first risk label takes the value 0 or 1. The first risk label is determined by a specified probability of whether the user equipment returned the internet service resource within a certain time period (e.g., within 7 days) of the first resource return: specifically, 1 denotes a sample in which the user equipment returned the internet service resource within the certain time period (e.g., within 7 days) of the first resource return, and 0 denotes a sample in which the user equipment did not return the internet service resource within that period and whose return probability is less than Y. Generally, the higher the return probability of the user equipment, the better the internet resource recovery and the efficiency of capital use, and the lower the risk level of the asset, and vice versa.
Further, a test data set D3 of the first risk model is also established.
In this example, the ratio of the number of bad samples (i.e., negative samples) to the number of good samples (i.e., positive samples) is less than 1%.
In order to balance the good and bad samples, for example, the bad samples (i.e., the minority class) are oversampled using the SMOTE algorithm or the like, and the good samples (i.e., the majority class) are undersampled, so that the number ratio of good to bad samples reaches a specified ratio (e.g., 1:3 to 8:9), i.e., a training data set with the specified ratio can be established. Thereby, the first training data set D1 and the test data set D3 of the first risk model can be accurately established.
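For illustration only, a minimal SMOTE-style oversampling step might look like the following sketch (names are hypothetical; a production system would typically use a library such as imbalanced-learn rather than this hand-rolled version):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """SMOTE-style sketch: synthesize n_new minority-class samples by
    interpolating a randomly picked minority sample toward one of its
    k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest minority neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                     # random minority sample
        j = neighbours[i, rng.integers(k)]      # one of its neighbours
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Each synthetic point lies on the segment between a minority sample and a neighbour, so the oversampled set stays inside the minority class's convex hull.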
The above description is only for illustrative purposes and is not intended to limit the present invention.
In step S102, the first risk model is subjected to verification training to establish a second training data set D2.
Specifically, the first training data set D1 is split into a training set D11 and a validation set D12 using a K-fold cross validation algorithm, and the first risk model is subjected to verification training using the training set D11 and the validation set D12.
Note that the data included in the training set D11 and the validation set D12 is the same as the data included in the first training data set D1; therefore, the description of this portion is omitted.
The data is divided using K-fold cross validation, where K is 5 to 10.
For example, K = 5. Specifically, the first training data set D1 is split by a five-fold cross validation algorithm to form the training set D11 and the validation set D12. In other words, a 5-fold CV is performed on D1: all data in D1 is divided into 5 parts, of which 4 parts are used as training data (training set D11) and the remaining 1 part is used as validation data (validation set D12).
For example, K = 10. Specifically, the first training data set D1 is split by a ten-fold cross validation algorithm to form the training set D11 and the validation set D12. In other words, a 10-fold CV is performed on D1: all data in D1 is divided into 10 parts, of which 9 parts are used as training data (training set D11) and the remaining 1 part is used as validation data (validation set D12).
Therefore, the first risk model is subjected to verification training through the training set and validation set obtained by dividing the data with K-fold cross validation, and the second training data set D2 is generated using the output results of the first risk model, so that a more accurate second training data set D2 can be obtained and the training process can be optimized.
Further, a second training data set D2 is generated using the output results of the first risk model.
In a preferred embodiment, step S102 is split into a "perform verification training on the first risk model" step S102 and an "establish a second training data set D2" step S201; see in particular fig. 2.
In step S201, a second training data set D2 is established.
Specifically, the second training data set D2 is generated by splicing the output results of the first risk model.
In one embodiment, the predicted values of the validation set D12 in each cross validation are spliced to obtain the discrimination features and risk features related to the user equipment, which are used as input features of the second risk model.
In particular, scoring prediction is performed on the test data set D3, and the scores generated by the ten rounds of cross validation are averaged as the value of the new variable feature (i.e., the derived variable feature of the present invention) in the test data set.
For example, if there are m first risk models (i.e., primary classifiers) and the number of samples in the training set D11 is n, then n rows and m columns of new variable features (i.e., corresponding to the derived variable features of the present invention) are generated and used as input features of the second risk model (i.e., the secondary classifier) to generate the second training data set D2.
As for the values of the derived variable features of the test data set D3: each time a primary classifier is trained during cross validation, the test data set D3 is scored, and the scores of the primary classifier generated by the 10 rounds of cross validation are finally averaged as the value of the derived variable feature of the test data set D3.
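The out-of-fold splicing described above can be sketched as follows (the model-factory interface and all names are hypothetical, not prescribed by the patent): each primary classifier contributes one column of new variable features built from K-fold out-of-fold predictions on the training set, while its test-set scores from the K fold-models are averaged.

```python
import numpy as np

def oof_stack(models, X_train, y_train, X_test, k=10, seed=0):
    """Build the stacked features for D2 (n rows x m columns of out-of-fold
    predictions) and averaged test-set scores.

    models: list of factories; each call returns a fresh classifier with
            sklearn-style fit / predict_proba (hypothetical interface).
    """
    n, m = len(X_train), len(models)
    oof = np.zeros((n, m))                 # new variable features for D2
    test_scores = np.zeros((len(X_test), m))
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)         # K-fold split of D1
    for j, make_model in enumerate(models):
        for fold in folds:
            mask = np.ones(n, dtype=bool)
            mask[fold] = False             # train on K-1 parts
            clf = make_model()
            clf.fit(X_train[mask], y_train[mask])
            # score the held-out fold: the out-of-fold prediction column
            oof[fold, j] = clf.predict_proba(X_train[fold])[:, 1]
            # average the K fold-models' scores on the test set
            test_scores[:, j] += clf.predict_proba(X_test)[:, 1] / k
    return oof, test_scores
```

The `oof` matrix (plus any other retained features) would then serve as input to the secondary classifier.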
In another embodiment, the derived variable features generated from the validation set D12 and/or the test data set D3 are used, together with other features, as input features of the secondary classifier to generate the second training data set D2.
In another embodiment, based on the splicing result, feature screening is further performed using the RFE recursive feature elimination method, and new derived variable features are generated as input features of the second risk model and used as a portion of the data of the second training data set D2.
In particular, the derived variable characteristics include discriminative, risk, explanatory and complementary characteristics relating to characterizing user equipment risk or fraud.
Thus, by splicing the predicted values of the validation set D12 in each cross validation as input features of the second risk model; by further performing feature screening on the splicing result with the RFE recursive feature elimination method to generate new derived variable features as input features of the second risk model; and by using the validation set D12 and/or the test data set D3 to generate derived variable features as input features of the second risk model, the second training data set D2 is generated, and a second training data set D2 containing a variety of variable features can be obtained.
In an embodiment, the second training data set D2 includes device data (e.g., device ID, device identification code) of historical user devices labeled with a second risk label (e.g., probability of fraud or probability of default), wherein the fraud probability is quantitatively characterized by at least two of the following features: resource request data containing false information, the number of repeated applications from the same device within a specific time being greater than a set value, APP fraud data of the device, overdue data of the device, multi-head feature data of the device, relational network feature data of users associated with the device, and the like.
Optionally, the fraud probability is quantitatively characterized by at least two of the following features: APP fraud data of the device, overdue data of the device, multi-head feature data of the device, relational network feature data of users associated with the device, and the like.
It should be noted that the settings of the first risk label and the second risk label may be interchanged, that is, the first risk label is the fraud probability or default probability, and the second risk label is the designated probability of whether the ue has returned the internet service resource within a certain time period (for example, within 7 days) after the first resource return. In other examples, other features may also be used to set the first risk label or the second risk label. The foregoing is illustrative only and is not to be construed as limiting the invention.
In order to improve the model's ability to distinguish positive and negative samples and improve its prediction accuracy, a gradient harmonizing mechanism is used to adjust the gradient contributions of different samples during model training. The aim is to avoid over-weighting both the easily separable samples and the particularly hard-to-separate samples (i.e., outliers); the loss function is redesigned by embedding the gradient harmonizing mechanism into the classification loss, without adjusting the model's hyper-parameters. This part is described in detail below.
In step S103, a gradient harmonizing mechanism is used, and the second risk model is trained using the second training data set D2.
In one embodiment, the second risk model is established using an XGBoost model or an LR model.
For example, class label values generated from the device's APP fraud data and overdue data are quantized to characterize the label values of the second risk model, and the input features include the device ID of the user equipment.
In particular, according to the second training data set D2 generated (or established) in step S102, the second training data set D2 is fitted to calculate the gradient density harmonizing parameter βi.
The gradient density harmonizing parameter βi is calculated using the following expression:

βi = N / GD(gi)

where βi is the gradient density harmonizing parameter; N is the total number of training samples of the second training data set; and GD(gi) is the gradient density of the training samples of the second training data set, i.e., the number of samples per unit gradient-norm length around gi, with the gradient norm g = |p − p*|.
Further, according to the gradient density harmonizing parameter βi, the harmonized classification loss is calculated using the following expression, where each sample's gradient norm gi = |pi − pi*| is the difference between the predicted value and the true value:

L_GHM-C = (1/N) Σi=1..N βi · L_CE(pi, pi*) = Σi=1..N L_CE(pi, pi*) / GD(gi)

where N is the total number of training samples of the second training data set; pi ∈ [0, 1] is the predicted probability calculated using the second risk model; pi* ∈ {0, 1} is the class label used to determine whether the user device is a risky device; L_CE(pi, pi*) is the cross-entropy loss of each training sample; and GD(g) is the gradient density of the training samples of the second training data set,

GD(g) = (1/lε(g)) Σk=1..N δε(gk, g)

whose physical meaning is the number of samples per unit gradient-norm length around g; δε(gk, g) counts the training samples whose gradient norm gk falls in the interval (g − ε/2, g + ε/2); and lε(g) is the length of that interval.
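As an illustrative sketch of this computation (the function names and the unit-region approximation of GD(g) via a fixed number of bins are assumptions, not the patent's exact procedure), the harmonizing weights and the weighted cross-entropy loss can be computed as:

```python
import numpy as np

def ghm_weights(p, p_star, bins=10):
    """Gradient harmonizing weights: g = |p - p*| is each sample's gradient
    norm; [0, 1] is split into `bins` unit regions to approximate the gradient
    density GD(g); the harmonizing weight is beta_i = N / GD(g_i)."""
    p, p_star = np.asarray(p, float), np.asarray(p_star, float)
    g = np.abs(p - p_star)                         # g = |p - p*|
    n = len(g)
    edges = np.linspace(0.0, 1.0, bins + 1)
    region = np.clip(np.digitize(g, edges[1:-1]), 0, bins - 1)
    counts = np.bincount(region, minlength=bins)   # delta_eps counts per region
    gd = counts * bins                             # GD(g): count / region length
    return n / gd[region]                          # beta_i = N / GD(g_i)

def ghm_c_loss(p, p_star, bins=10, eps=1e-12):
    """L_GHM-C = (1/N) * sum_i beta_i * L_CE(p_i, p_i*)."""
    p, p_star = np.asarray(p, float), np.asarray(p_star, float)
    beta = ghm_weights(p, p_star, bins)
    ce = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    return float(np.mean(beta * ce))
```

Samples crowded into a dense gradient region receive small weights, while samples in sparse regions are up-weighted, which is the harmonizing effect described above.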
When the second risk model is trained using the second training data set D2, whether to end the training of the second risk model is judged according to the loss gradients calculated for the training samples.
In one embodiment, when the proportion of training samples in the second training data set D2 whose loss gradient is greater than a set value reaches or exceeds a specified ratio of the total number of samples in D2, the training of the second risk model is ended.
In another embodiment, when the proportion of training samples in the second training data set D2 whose loss gradient is greater than a set value does not exceed the specified ratio of the total number of samples in D2, training continues until that proportion exceeds the specified ratio, at which point the training of the second risk model is ended.
Therefore, by using the gradient harmonizing mechanism to adjust the gradient contributions of different samples during model training, the model's ability to distinguish positive and negative samples and its prediction accuracy can be improved, without adjusting the model hyper-parameters.
The above description is only for illustrative purposes and is not intended to limit the present invention.
Next, in step S104, using the trained second risk model, a new user equipment applying for the internet service is authenticated.
In an embodiment, when a resource service request from the new user equipment to an internet service platform is received, the device data of the new user equipment is acquired, the device data is input into the second risk model, and the predicted value of the new user equipment is output.
Specifically, whether the new user equipment is a risk equipment is judged according to the calculated predicted value.
In one embodiment, the calculated predicted value is compared with a set threshold value, and when the calculated predicted value is less than or equal to the set threshold value, it is determined that the internet service resource is provided to the new user equipment.
In another embodiment, when the calculated predicted value is greater than a set threshold, it is determined that the internet service resource is not provided to the new user equipment.
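A minimal sketch of this threshold decision (the function name and the threshold value are hypothetical; the threshold itself would be a business-configured setting):

```python
def grant_resource(predicted_value: float, threshold: float) -> bool:
    """Decision rule of step S104 (sketch): provide the internet service
    resource when the second risk model's predicted value is less than or
    equal to the set threshold; otherwise treat the device as risky."""
    return predicted_value <= threshold

# illustrative usage with an assumed threshold of 0.5
decision = "provide resource" if grant_resource(0.12, 0.5) else "reject as risky device"
```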
Therefore, by using the trained second risk model to identify new user equipment applying for the internet service, the risk of the user equipment can be accurately quantified while risk identification is performed on the new user equipment, effectively reducing losses of the internet service platform and the like.
The above description is only for illustrative purposes and is not intended to limit the present invention.
In another example, in order to screen out features with high risk differentiation degree and strong interpretability, the feature importance degree of each variable feature is calculated, and variable feature screening is carried out according to the calculation result. That is, step S101 in fig. 1 is split into step S101 and step S301, see fig. 3 in particular.
In step S301, feature importance of each feature is calculated using a decision tree algorithm according to the acquired relevant data of the historical user equipment. For example, using algorithms such as C4.5, CART, etc.
In a first embodiment, the second risk model is built using a decision tree algorithm. Configuring risk labels according to a feature group (the feature group comprises at least one feature) in the acquired data, grouping historical user equipment, and establishing a plurality of training data sets and a plurality of testing data sets based on the risk labels. For example, the training data set includes device data for historical user devices with a first risk label (a specified probability of whether the user device returned internet service resources within a particular time period (e.g., within 7 days) of the first time the resources were returned). For example, the training data set includes device data of historical user devices having a second risk label (whether the probability of fraud for the user device is greater than a set value).
Specifically, for each feature group and the corresponding training data set, the following steps are performed:
1) Divide the sample data in the corresponding training data set according to each feature group, so that each training data set realizes the user equipment classification process (i.e., user equipment grouping), that is, generate a multilayer class tree containing class nodes.
2) Calculate the model classification accuracy of each feature group, screen out the feature groups whose calculated model classification accuracy is greater than a specified value (for example, 85% to 90%), calculate the information gain values IV of all screened variable features as feature importance, and rank the feature groups and features in order of feature importance from high to low.
3) Select a specific number of features from the ranking to complete the variable feature screening. For example, the top 5 features are selected.
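For example, this per-group screening might be sketched as follows with a CART-style tree (an illustrative sketch using scikit-learn; the tree's impurity-based importance stands in for the IV computation, and all names, the holdout split, and the depth limit are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def screen_feature_groups(groups, X, y, acc_threshold=0.85, top_n=5):
    """Sketch of steps 1)-3): fit a CART tree per feature group on a training
    half, keep groups whose holdout classification accuracy exceeds
    acc_threshold, rank the surviving features by the tree's importance
    (stand-in for IV), and keep the top_n feature columns.

    groups: mapping of group name -> list of column indices in X.
    """
    half = len(y) // 2
    ranked = []
    for name, cols in groups.items():
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[:half][:, cols], y[:half])
        acc = tree.score(X[half:][:, cols], y[half:])  # model classification accuracy
        if acc > acc_threshold:                        # step 2) group screening
            for c, imp in zip(cols, tree.feature_importances_):
                ranked.append((imp, c))                # feature importance per feature
    ranked.sort(reverse=True)                          # high to low
    return [c for _, c in ranked[:top_n]]              # step 3) take the top features
```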
Therefore, the decision tree algorithm is adopted to calculate the feature importance of each variable feature, and the variable feature screening is carried out according to the calculation result, so that the features with high risk discrimination and strong interpretability can be screened out.
The above description is only for illustrative purposes and is not intended to limit the present invention.
The above-described procedure of the user equipment authentication method is only for explanation of the present invention, and the order and number of steps are not particularly limited. In addition, the steps in the method can be split into two or three steps, or some steps can be combined into one step, and the steps are adjusted according to practical examples.
Those skilled in the art will appreciate that all or part of the steps to implement the above-described embodiments are implemented as programs (computer programs) executed by a computer data processing apparatus. When the computer program is executed, the method provided by the invention can be realized. Furthermore, the computer program may be stored in a computer readable storage medium, which may be a readable storage medium such as a magnetic disk, an optical disk, a ROM, a RAM, or a storage array composed of a plurality of storage media, such as a magnetic disk or a magnetic tape storage array. The storage medium is not limited to centralized storage, but may be distributed storage, such as cloud storage based on cloud computing.
Compared with the prior art, screening the variable feature data to establish the first training data set D1 and the test data set D3 of the first risk model can effectively screen out a feature set with high coverage, a clear target-variable discrimination effect, and a large information gain; performing verification training on the first risk model through the first training data set and establishing the second training data set D2 can provide accurate training sample data for subsequent model building; training the second risk model with the gradient harmonizing mechanism using the second training data set D2 allows the gradient contributions of different samples to be adjusted during model training, improving the model's ability to distinguish positive and negative samples and its prediction accuracy without adjusting the model hyper-parameters; and using the trained second risk model to identify new user equipment applying for the internet service can effectively reject risky devices and reduce losses of the internet service platform.
Furthermore, by setting a screening rule and screening the variable feature data, a feature set with high coverage, a clear target-variable discrimination effect, and a large information gain can be screened out, providing accurate training sample data for subsequent model building; performing verification training on the first risk model through the training set and validation set obtained by K-fold cross validation and generating the second training data set D2 from the output results of the first risk model yields a more accurate second training data set D2 and optimizes the training process; and calculating the feature importance of each variable feature with a decision tree algorithm and screening the variable features according to the calculation results screens out features with high risk discrimination and strong interpretability.
Example 2
Embodiments of the apparatus of the present invention are described below, which may be used to perform method embodiments of the present invention. The details described in the device embodiments of the invention should be regarded as complementary to the above-described method embodiments; reference may be made to the above-described method embodiments for details not disclosed in the apparatus embodiments of the present invention.
With reference to figures 4, 5 and 6, the present invention also provides a user equipment authentication apparatus 400 for an internet service, the user equipment authentication apparatus 400 comprising: a screening processing module 401, configured to acquire device data and device internet service performance data of historical user devices, screen the variable feature data, and determine positive and negative samples to establish a first training data set D1 and a test data set D3 of the first risk model; a first training module 402, configured to perform verification training on the first risk model to establish a second training data set D2; a second training module 403, configured to train a second risk model using a gradient harmonizing mechanism and the second training data set D2; and an authentication processing module 404, configured to perform authentication processing on new user equipment applying for the internet service using the trained second risk model.
In one embodiment, as shown in fig. 5, the user equipment authentication apparatus 400 includes a data splitting module 501, and the data splitting module 501 is configured to split the first training data set D1 into a training set D11 and a validation set D12 using a K-fold cross validation algorithm, where K is 5 to 10, and to perform verification training on the first risk model using the training set D11 and the validation set D12.
Further, the predicted values of the validation set D12 in each cross validation are spliced to obtain the discrimination features and risk features related to the user equipment, which are used as input features of the second risk model; the label value of the second risk model is quantitatively characterized using at least two of the following features: APP fraud data of the device, overdue data of the device, multi-head feature data of the device, and relational network feature data of users associated with the device.
In another embodiment, as shown in fig. 6, the user equipment authentication apparatus 400 further comprises a calculation processing module 601, and the calculation processing module 601 is configured to fit the second training data set D2 to calculate the gradient density harmonizing parameter βi.
The gradient density harmonizing parameter βi is calculated using the following expression:

βi = N / GD(gi)

where βi is the gradient density harmonizing parameter; N is the total number of training samples of the second training data set; and GD(gi) is the gradient density of the training samples of the second training data set, i.e., the number of samples per unit gradient-norm length around gi, with the gradient norm g = |p − p*|.
In particular, according to the gradient density harmonizing parameter βi, the harmonized classification loss is calculated using the following expression, where each sample's gradient norm gi = |pi − pi*| is the difference between the predicted value and the true value:

L_GHM-C = (1/N) Σi=1..N βi · L_CE(pi, pi*) = Σi=1..N L_CE(pi, pi*) / GD(gi)

where N is the total number of training samples of the second training data set; pi ∈ [0, 1] is the predicted probability calculated using the second risk model; pi* ∈ {0, 1} is the class label used to determine whether the user device is a risky device; L_CE(pi, pi*) is the cross-entropy loss of each training sample; and GD(g) is the gradient density of the training samples of the second training data set,

GD(g) = (1/lε(g)) Σk=1..N δε(gk, g)

whose physical meaning is the number of samples per unit gradient-norm length around g; δε(gk, g) counts the training samples whose gradient norm gk falls in the interval (g − ε/2, g + ε/2); and lε(g) is the length of that interval.
In an embodiment, the training of the second risk model is ended when the proportion of training samples in the second training data set with a loss gradient greater than a set value to the total amount of samples in the second training data set is above a specified ratio.
Specifically, the identifying, using the trained second risk model, a new user equipment applying for the internet service includes: when a resource service request of the new user equipment to an internet service platform is received, acquiring equipment data of the new user equipment, inputting the equipment data into the second risk model, and outputting a predicted value of the new user equipment; and judging whether the new user equipment is risk equipment or not according to the calculated predicted value.
In another example, in order to screen out features with high risk differentiation degree and strong interpretability, the feature importance degree of each variable feature is calculated, and variable feature screening is carried out according to the calculation result.
Specifically, the feature importance of each feature is calculated using a decision tree algorithm. For example, using algorithms such as C4.5, CART, etc.
The second risk model is built, for example, using a decision tree algorithm. Configuring risk labels according to a feature group (the feature group comprises at least one feature) in the acquired data, grouping historical user equipment, and establishing a plurality of training data sets and a plurality of testing data sets based on the risk labels. For example, the training data set includes device data for historical user devices with a first risk label (a specified probability of whether the user device returned internet service resources within a particular time period (e.g., within 7 days) after the first resource return). For example, the training data set includes device data of historical user devices having a second risk label (whether the probability of fraud for the user device is greater than a set value).
Specifically, for each feature group and the corresponding training data set, the following steps are performed:
1) Divide the sample data in the corresponding training data set according to each feature group, so that each training data set realizes the user equipment classification process (i.e., user equipment grouping), that is, generate a multilayer class tree containing class nodes.
2) Calculate the model classification accuracy of each feature group, screen out the feature groups whose calculated model classification accuracy is greater than a specified value (for example, 85% to 90%), calculate the information gain values IV of all screened variable features as feature importance, and rank the feature groups and features in order of feature importance from high to low.
3) Select a specific number of features from the ranking to complete the variable feature screening. For example, the top 5 features are selected.
In embodiment 2, the same portions as those in embodiment 1 are not described.
Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or may be further split into a plurality of sub-modules.
Compared with the prior art, screening the variable feature data to establish the first training data set D1 and the test data set D3 of the first risk model can effectively screen out a feature set with high coverage, a clear target-variable discrimination effect, and a large information gain; performing verification training on the first risk model through the first training data set and establishing the second training data set D2 can provide accurate training sample data for subsequent model building; training the second risk model with the gradient harmonizing mechanism using the second training data set D2 allows the gradient contributions of different samples to be adjusted during model training, improving the model's ability to distinguish positive and negative samples and its prediction accuracy without adjusting the model hyper-parameters; and using the trained second risk model to effectively identify new user equipment applying for the internet service can effectively reject risky devices and reduce losses of the internet service platform.
Further, performing verification training on the first risk model through the training set and validation set obtained by K-fold cross validation and generating the second training data set D2 from the output results of the first risk model yields a more accurate second training data set D2 and optimizes the training process; calculating the feature importance of each variable feature with a decision tree algorithm and screening the variable features according to the calculation results screens out features with high risk discrimination and strong interpretability.
Example 3
The following describes embodiments of the computer device of the present invention, which may be regarded as specific physical implementations of the above-described method and apparatus embodiments. Details described in connection with the computer device embodiment are supplementary to the above method or apparatus embodiments; for details not disclosed in the computer device embodiment, reference may be made to the above method or apparatus embodiments.
Fig. 7 is a block diagram of an exemplary embodiment of a computer device according to the present invention. The computer apparatus 200 according to this embodiment of the present invention is described below with reference to fig. 7. The computer device 200 shown in fig. 7 is only an example and should not bring any limitation to the function and the scope of use of the embodiments of the present invention.
As shown in FIG. 7, computer device 200 is in the form of a general purpose computing device. The components of the computer device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different device components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit stores program code executable by the processing unit 210, causing the processing unit 210 to perform the steps according to the various exemplary embodiments of the present invention described in the method section of this specification. For example, the processing unit 210 may perform the steps shown in fig. 1.
The storage unit 220 may include readable media in the form of volatile memory units, such as a random access memory (RAM) unit 2201 and/or a cache memory unit 2202, and may further include a read-only memory (ROM) unit 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The computer device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the computer device 200, and/or with any devices (e.g., router, modem, etc.) that enable the computer device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, computer device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through network adapter 260. Network adapter 260 may communicate with other modules of computer device 200 via bus 230. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the computer device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described above can be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions that, when executed by a computing device (which can be a personal computer, a server, or a network device, etc.), enable that device to carry out the above-mentioned method of the invention.
Fig. 8 is a block diagram of an exemplary embodiment of a computer program product according to the present invention.
As shown in fig. 8, the computer program may be stored on one or more computer program products. The computer program product may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer program product include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer program product may comprise a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer program product may be transmitted, propagated, or transported for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer program product may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In summary, the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that some or all of the functionality of some or all of the components in embodiments in accordance with the invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP). The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such a program implementing the invention may be stored on a computer program product or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
While the foregoing detailed description has described the objects, aspects and advantages of the present invention in further detail, it should be appreciated that the present invention is not inherently tied to any particular computer, virtual machine, or computer device, as various general purpose devices may implement it. The invention is not to be considered as limited to the specific embodiments thereof; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be embraced therein.

Claims (10)

1. A method for identifying a user device for an internet service, comprising:
obtaining device data and device internet service performance data of historical user equipment, screening variable feature data, and determining positive and negative samples to establish a first training data set D1 and a test data set D3 of a first risk model;
performing verification training on the first risk model to establish a second training data set D2;
training a second risk model using a gradient harmonizing mechanism on the second training data set D2; and
identifying new user equipment applying for the internet service using the trained second risk model.
2. The method of claim 1, wherein training the second risk model using the gradient harmonizing mechanism and the second training data set D2 comprises:
fitting the gradient distribution of the second training data set D2 to calculate the gradient density harmonizing parameter βi.
3. The method of claim 2, wherein training the second risk model using the gradient harmonizing mechanism and the second training data set D2 further comprises:
calculating, according to the gradient density harmonizing parameter βi, the loss of each training sample by using the following expression, in which the gradient norm is the difference between the predicted value and the true value:

L = (1/N) · Σ_{i=1}^{N} βi · L_CE(pi, pi*), with βi = N / GD(gi),

wherein N is the total number of training samples of the second training data set; pi ∈ [0, 1] is the predicted probability calculated using the second risk model; pi* ∈ {0, 1} is the class label used to determine whether the user device is a risky device; L_CE(pi, pi*) = −(pi* · log pi + (1 − pi*) · log(1 − pi)) is the cross-entropy loss of each training sample; gi = |pi − pi*| is the gradient norm of the training sample; and GD(g) is the gradient density of the training samples of the second training data set,

GD(g) = (1/lε(g)) · Σ_{k=1}^{N} δε(gk, g),

whose physical meaning is the number of samples per unit gradient norm length around g; δε(gk, g) counts the training samples whose gradient norm gk falls in the interval [g − ε/2, g + ε/2); and lε(g) = min(g + ε/2, 1) − max(g − ε/2, 0) is the length of that interval.
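A minimal NumPy sketch of the gradient harmonizing computation recited in claim 3, following the published GHM-C formulation: the gradient density is estimated with a fixed-width histogram over the gradient norm, and each sample is reweighted by βi = N / GD(gi). The bin count and the toy sample values are illustrative assumptions, not part of the claim.

```python
import numpy as np

def ghm_weights(p, p_star, bins=10):
    """Gradient density harmonizing parameters beta_i = N / GD(g_i).

    p      : predicted probabilities p_i in [0, 1]
    p_star : binary labels p_i* in {0, 1}
    """
    g = np.abs(p - p_star)                      # gradient norm of each sample
    n = len(g)
    eps = 1.0 / bins                            # width of each unit region
    idx = np.minimum((g / eps).astype(int), bins - 1)
    counts = np.bincount(idx, minlength=bins)   # samples per gradient-norm bin
    gd = counts[idx] / eps                      # gradient density GD(g_i)
    return n / gd                               # beta_i

def ghm_loss(p, p_star, bins=10):
    """Harmonized loss: (1/N) * sum_i beta_i * cross_entropy_i."""
    beta = ghm_weights(p, p_star, bins)
    ce = -(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))
    return float(np.mean(beta * ce))

p = np.array([0.9, 0.8, 0.2, 0.6])
y = np.array([1.0, 1.0, 0.0, 1.0])
print(ghm_loss(p, y))
```

Samples in densely populated gradient-norm regions (the easy, well-classified majority) receive smaller weights, while samples in sparse regions contribute relatively more, which is the down-weighting behavior the claim relies on.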
4. The user equipment identification method according to claim 3, comprising:
ending the training of the second risk model when the proportion of training samples in the second training data set whose loss gradient is greater than a set value exceeds a specified ratio of the total number of samples in the second training data set.
5. The method of claim 1 or 2, wherein the verification training of the first risk model comprises:
splitting the first training data set D1 into a training set D11 and a validation set D12 using a K-fold cross-validation algorithm, wherein K is 5-10; and
performing verification training on the first risk model using the training set D11 and the validation set D12.
6. The user equipment identification method according to claim 5, wherein:
the predicted values of the validation set D12 from each cross-validation are spliced to obtain a discrimination feature and a risk feature related to the user equipment, which serve as input features of the second risk model; and
the label value of the second risk model is generated by quantifying at least two of the following features:
APP fraud data of the device, overdue data of the device, multi-platform (multi-head) borrowing feature data of the device, and relationship network feature data of users associated with the device.
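The K-fold verification training and prediction splicing recited in claims 5-6 amounts to generating out-of-fold predictions from the first risk model and stacking them as input features for the second risk model. A sketch under assumptions: the base model, data, and K = 5 are illustrative stand-ins, not the patent's actual choices.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 4))                 # toy first training data set D1
y = (X[:, 0] > 0.5).astype(int)

K = 5                                    # claim 5 recites K in the range 5-10
oof = np.zeros(len(X))                   # out-of-fold predicted probabilities
for train_idx, valid_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Fit the (stand-in) first risk model on the training folds D11 ...
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # ... and predict the held-out validation fold D12; splicing the K fold
    # predictions together covers every sample exactly once.
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

# The spliced predictions become an input feature of the second risk model.
second_level_features = oof.reshape(-1, 1)
print(second_level_features.shape)
```

Because every prediction comes from a model that never saw that sample during fitting, the spliced feature avoids the target leakage that would occur if the first model simply scored its own training data.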
7. The method of claim 1, wherein identifying a new user equipment applying for the internet service using the trained second risk model comprises:
when a resource service request from the new user equipment to an internet service platform is received, acquiring device data of the new user equipment, inputting the device data into the second risk model, and outputting a predicted value for the new user equipment; and
judging whether the new user equipment is a risky device according to the calculated predicted value.
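The identification step in claim 7 can be sketched as scoring the new device with the trained second risk model and comparing the predicted value against a cut-off. The stand-in model, toy training data, and the 0.5 threshold are assumptions for illustration; the patent does not specify them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "trained second risk model": fit on toy historical device data.
rng = np.random.default_rng(1)
X_hist = rng.random((300, 4))
y_hist = (X_hist[:, 1] > 0.6).astype(int)
second_risk_model = LogisticRegression().fit(X_hist, y_hist)

RISK_THRESHOLD = 0.5          # assumed cut-off for judging a device risky

def identify_device(device_data):
    """Return True if the new user device is judged to be a risky device."""
    score = second_risk_model.predict_proba(device_data.reshape(1, -1))[0, 1]
    return score >= RISK_THRESHOLD

# Score a new device applying for the internet service.
print(identify_device(np.array([0.1, 0.9, 0.2, 0.3])))
```

In practice the threshold would be tuned on the test data set D3 to trade off rejected risky devices against falsely rejected legitimate ones.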
8. An apparatus for identifying a user equipment for an internet service, comprising:
a screening processing module for obtaining device data and device internet service performance data of historical user equipment, screening variable feature data, and determining positive and negative samples to establish a first training data set D1 and a test data set D3 of a first risk model;
a first training module for performing verification training on the first risk model to establish a second training data set D2;
a second training module for training a second risk model using a gradient harmonizing mechanism on the second training data set D2; and
an identification processing module for identifying new user equipment applying for the internet service using the trained second risk model.
9. The apparatus as claimed in claim 8, further comprising a calculation processing module for fitting the gradient distribution of the second training data set D2 to calculate the gradient density harmonizing parameter βi.
10. A computer device comprising a processor and a memory for storing a computer-executable program, wherein the processor performs the user equipment identification method for an internet service according to any one of claims 1-7 when the computer program is executed by the processor.
CN202110865160.2A 2021-07-29 2021-07-29 User equipment identification method and device and computer equipment Pending CN113610132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110865160.2A CN113610132A (en) 2021-07-29 2021-07-29 User equipment identification method and device and computer equipment

Publications (1)

Publication Number Publication Date
CN113610132A true CN113610132A (en) 2021-11-05

Family

ID=78306020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110865160.2A Pending CN113610132A (en) 2021-07-29 2021-07-29 User equipment identification method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN113610132A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926261A (en) * 2022-04-26 2022-08-19 厦门大学 Method and medium for predicting fraud probability of automobile financial user application

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019191203A1 (en) * 2018-03-27 2019-10-03 Alibaba Group Holding Limited Risky transaction identification method and apparatus, server, and storage medium
CN111932269A (en) * 2020-08-11 2020-11-13 中国工商银行股份有限公司 Equipment information processing method and device
CN112016788A (en) * 2020-07-14 2020-12-01 北京淇瑀信息科技有限公司 Risk control strategy generation and risk control method and device and electronic equipment
CN112118551A (en) * 2020-10-16 2020-12-22 同盾控股有限公司 Equipment risk identification method and related equipment
CN112270546A (en) * 2020-10-27 2021-01-26 上海淇馥信息技术有限公司 Risk prediction method and device based on stacking algorithm and electronic equipment
CN112348520A (en) * 2020-10-21 2021-02-09 上海淇玥信息技术有限公司 XGboost-based risk assessment method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SFXIANG: "攻克目标检测难点秘籍五,解决样本不均衡问题" ("Conquering Object Detection Difficulties, Part 5: Solving the Sample Imbalance Problem"), pages 1 - 8, Retrieved from the Internet <URL:https://mp.weixin.qq.com/s/kbrRvv-27KmCLxe4bWjVYA> *

Similar Documents

Publication Publication Date Title
CN116506217B (en) Analysis method, system, storage medium and terminal for security risk of service data stream
CN112270545A (en) Financial risk prediction method and device based on migration sample screening and electronic equipment
CN112270546A (en) Risk prediction method and device based on stacking algorithm and electronic equipment
WO2012084320A2 (en) Method and system for predictive modeling
Zhang et al. Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing
CN111325619A (en) Credit card fraud detection model updating method and device based on joint learning
CN112508580A (en) Model construction method and device based on rejection inference method and electronic equipment
CN110659985A (en) Method and device for fishing back false rejection potential user and electronic equipment
CN110706096A (en) Method and device for managing credit line based on salvage-back user and electronic equipment
CN110688536A (en) Label prediction method, device, equipment and storage medium
CN111582645B (en) APP risk assessment method and device based on factoring machine and electronic equipment
CN112017042A (en) Resource quota determining method and device based on tweed distribution and electronic equipment
CN109214912A (en) Processing method, behavior prediction method, apparatus, equipment and the medium of behavioral data
CN111582315A (en) Sample data processing method and device and electronic equipment
CN113610132A (en) User equipment identification method and device and computer equipment
CN113570222A (en) User equipment identification method and device and computer equipment
CN110689425A (en) Method and device for pricing quota based on income and electronic equipment
CN110363394B (en) Wind control service method and device based on cloud platform and electronic equipment
CN112508690A (en) Risk assessment method and device based on joint distribution adaptation and electronic equipment
CN115277205B (en) Model training method and device and port risk identification method
CN111582647A (en) User data processing method and device and electronic equipment
JP4994199B2 (en) Machine learning apparatus and machine learning method
CN115660101A (en) Data service providing method and device based on service node information
CN111582649B (en) Risk assessment method and device based on user APP single-heat coding and electronic equipment
US20220374401A1 (en) Determining domain and matching algorithms for data systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 1109, No. 4, Lane 800, Tongpu Road, Putuo District, Shanghai, 200062

Applicant after: Shanghai Qiyue Information Technology Co.,Ltd.

Address before: Room a2-8914, 58 Fumin Branch Road, Hengsha Township, Chongming District, Shanghai, 201500

Applicant before: Shanghai Qiyue Information Technology Co.,Ltd.

Country or region before: China
