CN110737641A - Construction method, device and system of confidence and audit models - Google Patents

Construction method, device and system of confidence and audit models

Info

Publication number
CN110737641A
CN110737641A (application number CN201810708485.8A)
Authority
CN
China
Prior art keywords
value
data set
feature
response value
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810708485.8A
Other languages
Chinese (zh)
Inventor
解智
郭汝元
孙乐为
张锋
乔森
庞敏辉
邱慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youxuan Beijing Information Technology Co ltd
Original Assignee
Shanghai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Information Technology Co Ltd filed Critical Shanghai Information Technology Co Ltd
Priority to CN201810708485.8A priority Critical patent/CN110737641A/en
Publication of CN110737641A publication Critical patent/CN110737641A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Abstract

The embodiments of the present application disclose a construction method, device and system for a credit review model. In the scheme shown in the embodiments, historical data is split in advance and screened to remove modeling individuals with a high missing rate and features with poor predictive power; the remaining data is used as a feature candidate set, and the credit review model is built as an ensemble of Logistic Regression models. The scheme can greatly improve the timeliness and efficiency of credit review, can analyze the borrower's repayment ability and repayment willingness from multiple aspects and dimensions based on historical data, can identify malicious fraud, and improves the accuracy and robustness of prediction.

Description

Construction method, device and system of confidence and audit models
Technical Field
The present invention relates to the technical field of computers, and in particular to a construction method, device and system for a credit review model.
Background
With the continuous development of science and technology and the rapid rise of the Internet across industries, Internet finance has kept pace with the times, and online lending has entered thousands of households. Borrowers can raise funds quickly, while investors and guarantors can obtain satisfactory returns, and with social thinking added to the platforms, complete online financial ecosystems have been created. Of course, as platform business keeps expanding and the number of users keeps rising, lawless persons devise crooked schemes to commit fraud on the platforms, causing great difficulty and loss to platform operation.
In order to reduce platform losses, a large number of personnel with professional background knowledge are required to take part in the credit approval process, and the credit assessment of a potential borrower takes a fixed amount of time. As business grows, the volume of credit approvals increases, and the timeliness of the traditional approach is difficult to guarantee. Platforms therefore construct a credit review model by sorting and analyzing historical data: application information is collected in the credit approval process, the application information is input into the credit review model, and the credit review model finally outputs an evaluation result according to the application information.
In existing methods for constructing a credit review model, the evaluation results and the application information are input into a computer together, and the computer builds the credit review model from the relationship between the application information and the evaluation results. However, because of the particularity of financial application information, directly inputting the application information and evaluation results into a computer-built model makes it difficult to guarantee the accuracy of the evaluation results output by the credit review model.
Disclosure of Invention
The invention aims to provide a construction method, device and system for a credit review model, so as to solve the technical problem that the accuracy of the evaluation results output by credit review models built with the construction methods in the prior art is difficult to guarantee.
A first aspect of the embodiments of the present application provides a method for constructing a credit review model, the method comprising:
acquiring historical data and splitting the historical data into a data set, wherein the data set comprises a training set and a test set;
screening the data set to obtain a feature candidate set, wherein the feature candidate set comprises a training feature candidate set and a test feature candidate set;
and constructing a credit review model according to the features in the feature candidate set and the evaluation results.
Optionally, the step of screening the data set to obtain a feature candidate set includes:
deleting useless features in the data set to obtain data to be processed;
and converting time-format data into day-of-week features to obtain the feature candidate set.
Optionally, the step of deleting the useless features in the data set to obtain the data to be processed includes:
counting a response value of a prediction result, an unresponsive value of the prediction result, and a characteristic response value and a characteristic unresponsive value of each characteristic in the data set;
according to the predicted result response value, the predicted result non-response value, the feature response value and the feature non-response value, the WOE value of each feature is calculated;
determining whether the WOE value is less than a WOE threshold;
if so, determining the feature that produced the WOE value as a useless feature;
and deleting useless features in the data set to obtain data to be processed.
Optionally, the step of deleting the useless features in the data set to obtain the data to be processed includes:
counting a characteristic response value and a characteristic unresponsive value of each characteristic in the data set;
calculating actual passing probability according to the characteristic response value and the characteristic non-response value;
calculating theoretical values according to the actual passing probability, the characteristic response value of each characteristic and the characteristic unresponsive value, wherein the theoretical values comprise a theoretical response value and a theoretical unresponsive value;
calculating a chi-square value according to the theoretical response value, the theoretical unresponsive value, the characteristic response value and the characteristic unresponsive value;
judging whether the chi-square value is larger than a critical threshold value or not;
if so, determining the feature generating the chi-square value as a useless feature;
and deleting useless features in the data set to obtain data to be processed.
Optionally, the step of screening the data set to obtain a feature candidate set includes:
counting the missing rate of features in each modeling individual in the data set;
and if the missing rate is greater than a missing-rate threshold, deleting the modeling individuals generating that missing rate to obtain a feature candidate set.
Optionally, the step of constructing a credit review model according to the features in the feature candidate set and the evaluation results includes:
taking the features in the feature candidate set as a dividing basis to generate a divided data set;
respectively calculating the information gain rate of each divided data set;
determining the divided data set generating the maximum information gain rate as a target data set;
and constructing a credit review model according to the features of the target data set and the evaluation results.
Optionally, the step of determining the divided data set that yields the largest information gain rate as the target data set comprises:
determining the divided data set which generates the maximum information gain rate as a data set to be processed;
counting a segmented prediction-result response value and a segmented prediction-result non-response value of the data set to be processed, and a segmented response value and a segmented non-response value of each feature of the data set to be processed;
calculating the IVi value of each feature according to the segmented prediction result response value, the segmented prediction result non-response value, the segmented response value and the segmented non-response value;
calculating an IV value of the data set to be processed according to the IVi value of each feature;
judging whether the IV value is smaller than an IV threshold value;
if it is less than the IV threshold, determining the data set to be processed that generated the IV value as a useless data set to be processed;
and deleting the useless data set to be processed to obtain a target data set.
A second aspect of the embodiments of the present application provides an apparatus for constructing a credit review model, the apparatus comprising:
a segmentation unit, configured to acquire historical data and split the historical data into a data set, where the data set comprises a training set and a test set;
a screening unit, configured to screen the data set to obtain a feature candidate set, where the feature candidate set comprises a training feature candidate set and a test feature candidate set;
and a construction unit, configured to construct a credit review model according to the features in the feature candidate set and the evaluation results.
Optionally, the screening unit comprises:
the deleting unit is used for deleting the useless features in the data set to obtain data to be processed;
and a conversion unit, configured to convert time-format data into day-of-week features to obtain the feature candidate set.
Optionally, the deleting unit includes:
a first statistical unit, configured to count the prediction-result response value, the prediction-result non-response value, and the feature response value and feature non-response value of each feature in the data set;
a WOE value calculating unit, configured to calculate the WOE value of each feature according to the prediction-result response value, the prediction-result non-response value, the feature response value and the feature non-response value;
a first judgment unit, configured to judge whether the WOE value is less than a WOE threshold;
a first useless-feature determination unit, configured to determine, if it is less, the feature that produced the WOE value as a useless feature;
and a first deleting unit, configured to delete the useless features in the data set to obtain the data to be processed.
Optionally, the deleting unit includes:
the second statistical unit is used for counting the characteristic response value and the characteristic non-response value of each characteristic in the data set;
the passing probability calculation unit is used for calculating the actual passing probability according to the characteristic response value and the characteristic non-response value;
a theoretical value calculating unit, configured to calculate theoretical values according to the actual passing probability and the feature response value and feature non-response value of each feature, where the theoretical values include a theoretical response value and a theoretical non-response value;
the chi-square value calculating unit is used for calculating a chi-square value according to the theoretical response value, the theoretical non-response value, the characteristic response value and the characteristic non-response value;
the second judgment unit is used for judging whether the chi-square value is larger than a critical threshold value or not;
a second useless-feature determination unit, configured to determine, if it is greater, the feature that produced the chi-square value as a useless feature;
and the second deleting unit is used for deleting the useless features in the data set to obtain the data to be processed.
Optionally, the screening unit comprises:
a third statistical unit, configured to count the missing rate of the features in each modeling individual in the data set;
and a modeling-individual deleting unit, configured to delete, if the missing rate is greater than a missing-rate threshold, the modeling individuals generating that missing rate to obtain the feature candidate set.
Optionally, the building unit comprises:
the dividing unit is used for generating a divided data set by taking the features in the feature candidate set as dividing bases;
an information gain ratio calculation unit for calculating an information gain ratio of each of the divided data sets, respectively;
a target data set determination unit for determining the divided data set that yields the largest information gain rate as a target data set;
and a first construction unit, configured to construct a credit review model according to the features of the target data set and the evaluation results.
Optionally, the target data set determination unit comprises:
a to-be-processed data set determining unit for determining the divided data set that will produce the largest information gain rate as a to-be-processed data set;
a fourth statistical unit, configured to count a segmented prediction-result response value and a segmented prediction-result non-response value of the data set to be processed, and a segmented response value and a segmented non-response value of each feature of the data set to be processed;
the IVi value calculating unit is used for calculating the IVi value of each feature according to the segmented prediction result response value, the segmented prediction result non-response value, the segmented response value and the segmented non-response value;
an IV value calculation unit, configured to calculate an IV value of the data set to be processed according to the IVi value of each feature;
a third judging unit, configured to judge whether the IV value is smaller than an IV threshold;
a useless to-be-processed data set determining unit, configured to determine, if the IV value is smaller than the IV threshold, the data set to be processed that generated the IV value as a useless data set to be processed;
and the third deleting unit is used for deleting the useless data set to be processed to obtain a target data set.
A third aspect of the embodiments of the present application provides a construction system for a credit review model, the system comprising:
an application platform server and a data storage server connected with the application platform server, wherein the data storage server is arranged inside the application platform server or set up independently, and the application platform server is connected with a terminal through the Internet;
the terminal displays the evaluation result;
the application platform server is used for realizing the method shown in the embodiment of the application;
the method is used for data sampling, feature engineering and model integration;
wherein the feature engineering comprises: data cleaning and data preprocessing, feature selection, feature discretization and combination, and model training and evaluation;
and the data storage server is used for storing related data.
According to the above technical solutions, the embodiments of the present application provide a construction method, device and system for a credit review model. The method comprises: acquiring historical data and splitting it into a data set comprising a training set and a test set; screening the data set to obtain a feature candidate set comprising a training feature candidate set and a test feature candidate set; and constructing a credit review model according to the features in the feature candidate set and the evaluation results.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1-1 is a block diagram of a credit review model construction system according to a preferred embodiment;
FIG. 1-2 is a block diagram of a credit review model construction system according to yet another preferred embodiment;
FIG. 2 is a block diagram of an application platform server according to a preferred embodiment;
FIG. 3 is a flow chart of a method for constructing a credit review model according to a preferred embodiment;
FIG. 4 is a detailed flowchart of step S102 according to a preferred embodiment;
FIG. 5 is a detailed flowchart of step S10211 according to a preferred embodiment;
FIG. 6 is a detailed flowchart of step S10211 according to yet another preferred embodiment;
FIG. 7 is a detailed flowchart of step S102 according to yet another preferred embodiment;
FIG. 8 is a detailed flowchart of step S103 according to a preferred embodiment;
FIG. 9 is a detailed flowchart of step S1033 according to a preferred embodiment;
FIG. 10 is a block diagram of an apparatus for constructing a credit review model according to a preferred embodiment;
FIG. 11 is a block diagram of a server according to a preferred embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them.
Referring to fig. 1-1 and fig. 1-2, a first aspect of the embodiments of the present application provides a credit review model construction system;
the system comprises:
an application platform server 31, a data storage server 32 connected with the application platform server 31, wherein the data storage server 32 is arranged in the platform server 31 or is arranged independently, and the application platform server 31 is connected with a terminal 33 through the internet;
the terminal 33 displays the evaluation result;
referring to fig. 2, the application platform server 31 is configured to implement the method according to the embodiment of the present application;
for historical data sampling, feature engineering, off-line models, evaluation methods, and on-line models;
(I) Data sampling
Partial data is screened from the historical data according to purchase type and time, and the partial data is split into a training set and a test set. Two splitting methods are used: one is a hold-out method that splits by time period, and the other is a bootstrap method that samples with replacement.
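A minimal illustrative sketch of the two splitting strategies, assuming pandas/NumPy and a hypothetical time column named apply_time:

```python
import numpy as np
import pandas as pd

def holdout_split_by_time(df: pd.DataFrame, time_col: str, cutoff: str):
    """Hold-out split: records before the cut-off date form the training set,
    records on or after it form the test set. time_col is assumed to hold
    datetimes or ISO-format date strings."""
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

def bootstrap_split(df: pd.DataFrame, seed: int = 0):
    """Bootstrap split: draw len(df) rows with replacement as the training set;
    the rows never drawn (out-of-bag samples) form the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(df), size=len(df))          # sample with replacement
    train = df.iloc[idx]
    test = df[~df.index.isin(df.index[np.unique(idx)])]   # out-of-bag rows
    return train, test

# Hypothetical usage:
# train, test = holdout_split_by_time(history, "apply_time", "2018-01-01")
# train, test = bootstrap_split(history)
```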
(II) feature engineering
The main purpose of this stage is to extract useful features from the data to be processed so as to best express the relationship between the features and the evaluation results. It includes data cleaning, data preprocessing, feature selection, feature discretization, feature combination, and the like.
(1) Data cleaning and data preprocessing
Firstly, useless features (comprising useless data, error data, features with poor prediction capability and the like) are removed, and then fields of a time format, a JSON format and the like are analyzed.
(2) Feature selection
Because of the particularity of the financial credit review business, many features have missing values, so the features with a high missing rate are removed first, and the remaining features are then further filtered with the following methods to obtain the final feature candidate set. The feature selection methods include:
information gain rate;
IV and WOE values;
chi-square test.
(3) Feature discretization and combination
Continuous variables appear in the feature candidate set. To give the model better robustness, the continuous features need to be discretized; MDLP together with the information gain rate is used to discretize continuous values. Feature combination is then performed; in this technical solution, feature crossing is used for feature combination.
(4) Model training and evaluation
Before model training, one-hot encoding needs to be performed on the feature candidate set produced by feature engineering. The hyperparameter C of Logistic Regression is then determined by cross-validation, and the quality of the model is evaluated with the AUC value and the K-S curve.
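A minimal illustrative sketch of this training-and-evaluation step, assuming scikit-learn and SciPy; the column list categorical_cols, the candidate values of C and the data frames are hypothetical:

```python
from scipy.stats import ks_2samp
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# categorical_cols, X_train, y_train, X_test, y_test are assumed to exist
pipe = Pipeline([
    ("onehot", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough")),
    ("lr", LogisticRegression(max_iter=1000)),
])

# choose the regularisation hyperparameter C by cross-validation
search = GridSearchCV(pipe, {"lr__C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# evaluate with the AUC value and the K-S statistic
scores = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
ks = ks_2samp(scores[y_test == 1], scores[y_test == 0]).statistic
print(f"AUC={auc:.3f}  KS={ks:.3f}")
```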
(III) model integration;
according to the technical scheme, different Logistic Regression models (confidence and audit models) can be generated according to different behavior characteristics (repayment capacity, repayment willingness, malicious fraud and the like) of the borrower, then a voting committee mode is adopted, each model votes pass or reject, the confidence and audit pass if all models vote to pass, and the confidence and audit reject if models vote to reject.
The data storage server 32 is used for storing relevant data.
The embodiments of the present application provide a construction system for a credit review model, in which the application platform server is used for historical data sampling, feature engineering, the offline model, the evaluation method and the online model. The historical data is split in advance and screened to remove modeling individuals with a high missing rate and features with poor predictive power, and the remaining data is used as the feature candidate set.
Example 2:
Referring to fig. 3, a second aspect of the embodiments of the present application provides a method for constructing a credit review model, the method comprising:
S101, acquiring historical data and splitting the historical data into a data set, wherein the data set comprises a training set and a test set;
The historical data consists of a plurality of modeling individuals, each comprising credit review data and an evaluation result, where the evaluation result is converted into a form the computer can recognize, for example pass = '1' and fail = '0';
The credit review data comprises a large number of features such as the user's basic information, credit application information, credit investigation reports and user behavior information;
Partial data is screened from the historical data according to purchase type and time, and the partial data is split into a training set and a test set. Two splitting methods are used: one is a hold-out method that splits by time period, and the other is a bootstrap method that samples with replacement.
S102, screening the data set to obtain a feature candidate set, wherein the feature candidate set comprises a training feature candidate set and a test feature candidate set;
Modeling individuals with a high missing rate and features with poor predictive power are removed, and the remaining data is taken as the feature candidate set.
S103, constructing a credit review model according to the features in the feature candidate set and the evaluation results.
The credit review model is built as an ensemble of Logistic Regression models. In the field of financial credit review, modeling with Logistic Regression is simple and reliable, robust, and highly interpretable, and integrating several good but different Logistic Regression models makes credit evaluation and prediction more accurate.
In the scheme shown in the embodiments of the present application, the historical data is split in advance and screened to remove modeling individuals with a high missing rate and features with poor predictive power; the remaining data is used as the feature candidate set, and the credit review model is built as an ensemble of Logistic Regression models. Adopting this technical solution can greatly improve the timeliness and efficiency of credit review, can analyze the borrower's repayment ability and repayment willingness from multiple aspects and dimensions based on historical data, can identify malicious fraud, and improves the accuracy and robustness of prediction.
Example 3:
In the financial industry, the day of the week is an important feature and generally has a considerable influence on the evaluation result; for example, the contribution of a weekend application to the credit review model may be greater than that of a weekday (working day) application. The day of the week should therefore be input into the computer as an important feature during modeling and take part in the construction of the model.
To address the above problem, the embodiments of the present application provide a data transformation manner; specifically, please refer to fig. 4. Embodiment 3 has steps similar to the technical solution shown in embodiment 2; the only difference is that the step of screening the data set to obtain the feature candidate set includes:
S10211, deleting useless features in the data set to obtain data to be processed;
The first step is to remove useless features (including useless data, erroneous data, features with poor predictive power, etc.).
S10212, converting time-format data into day-of-week features to obtain the feature candidate set.
The time-format data is converted into a day-of-week feature according to the correspondence between date and day of week; for example, 26 June 2018 is converted into the day-of-week feature 'Tuesday'.
In this technical solution, the time-format data is converted into day-of-week features, and the day-of-week features take part as important features in the construction of the credit review model, which helps guarantee the accuracy of the constructed credit review model.
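A minimal illustrative sketch of this conversion, assuming pandas and a hypothetical column named apply_time:

```python
import pandas as pd

df = pd.DataFrame({"apply_time": ["2018-06-26", "2018-06-30"]})
df["apply_time"] = pd.to_datetime(df["apply_time"])
df["day_of_week"] = df["apply_time"].dt.day_name()   # "Tuesday", "Saturday", ...
```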
Example 4:
In general, features whose predictive power for the credit review result is poor are useless information; using such information as model-building features in the construction of the model undoubtedly increases the amount of data the computer must process and reduces the bandwidth of the system and the utilization rate of resources.
To address the above problem, the embodiments of the present application provide a method for determining useless features; specifically, please refer to fig. 5:
The technical solution shown in embodiment 4 has steps similar to the technical solution shown in embodiment 3; the only difference is that the step of deleting the useless features in the data set to obtain the data to be processed includes:
S1021111, counting the prediction-result response value, the prediction-result non-response value, and the feature response value and feature non-response value of each feature in the data set;
S1021112, calculating the WOE value of each feature according to the prediction-result response value, the prediction-result non-response value, the feature response value and the feature non-response value;
WOE stands for 'Weight of Evidence' and is a form of encoding for features in the historical data.
To perform WOE encoding on a variable, the variable first needs to be grouped (binned), that is, the data taking a certain feature value is clustered into groups;
WOEi = ln(pyi / pni) = ln((yi / yT) / (ni / nT))
where pyi is the proportion of the feature response value of group i (in the credit review model, the clients passing the credit review, i.e. individuals whose target value is 'yes' or 1) among all responding clients in the sample (the prediction-result response value), and pni is the proportion of the non-responding clients in the group among all non-responding clients in the sample (the prediction-result non-response value);
yi is the feature response value of the group, ni is the feature non-response value of the group, yT is the prediction-result response value, and nT is the prediction-result non-response value.
WOE represents the difference between 'the proportion of the feature response value to the prediction-result response value in the current group' and 'the proportion of the feature non-response value to the prediction-result non-response value'.
WOE can also be understood as the difference between the ratio of responding to non-responding clients in the current group and the same ratio over all samples; this difference is expressed as the logarithm of the ratio of the two ratios. The larger the WOE, the greater the difference and the more likely the samples in the group are to respond; the smaller the WOE, the smaller the difference and the less likely the samples in the group are to respond.
S1021113, determining whether the WOE value is less than a WOE threshold;
S1021114, if it is less, determining the feature that produced the WOE value as a useless feature;
S1021115, deleting the useless features in the data set to obtain the data to be processed.
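A minimal illustrative sketch of the WOE calculation, assuming pandas; how the per-group WOE values are aggregated into a single value per feature is not specified above, so the sketch simply returns the WOE of every group:

```python
import numpy as np
import pandas as pd

def woe_per_group(df: pd.DataFrame, feature: str, target: str) -> pd.Series:
    """WOE_i = ln((y_i / y_T) / (n_i / n_T)) for every group (bin) of `feature`.
    `target` is 1 for a responding (approved) sample, 0 otherwise."""
    y_total = (df[target] == 1).sum()
    n_total = (df[target] == 0).sum()
    grouped = df.groupby(feature)[target]
    y_i = grouped.sum()                      # responders per group
    n_i = grouped.count() - y_i              # non-responders per group
    # a small constant could be added to y_i / n_i to avoid division by zero
    return np.log((y_i / y_total) / (n_i / n_total))

# Groups whose WOE falls below the chosen WOE threshold would mark the
# corresponding feature as useless in the procedure above.
```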
The method shown in the embodiments of the present application removes useless features with small response, namely features whose WOE value is smaller than the WOE threshold. Such features have little influence on the evaluation result; using them as independent variables in the construction of the model would not increase the predictive accuracy of the model but would increase the amount of data the computer must process.
The method determines whether a feature is useless by comparing its WOE value with the WOE threshold; if the feature is determined to be useless, it is deleted directly and does not take part in the construction of the model. The method thus deletes useless features while ensuring the accuracy of the constructed credit review model.
Example 5:
In general, some features are independent of the credit review result and have no stable influence on it; using such useless information as model-building features increases the amount of data the computer must process, reduces the bandwidth of the system and lowers the utilization rate of resources.
To address the above problem, the embodiments of the present application provide a method for determining useless features; specifically, please refer to fig. 6:
The technical solution shown in embodiment 5 has steps similar to the technical solution shown in embodiment 3; the only difference is that the step of deleting the useless features in the data set to obtain the data to be processed includes:
S1021121, counting the feature response value and the feature non-response value of each feature in the data set;
The following is a typical fourfold (2×2) chi-square test: we want to know whether having a loan record affects a user's chances of passing the credit review:
Observed counts:
                   Passed review   Did not pass   Total   Pass rate
Has loan record    43              96             139     30.94%
No loan record     28              84             112     25.00%
Total              71              180            251     28.29%
Simple statistics show that the credit review pass rates of users with and without a loan record are 30.94% and 25.00% respectively; this difference could be caused by sampling error, or it could reflect a real influence of the loan record on the pass rate.
S1021122, calculating the actual passing probability according to the feature response value and the feature non-response value;
To determine the true reason, we first assume that the loan record has no effect on the pass rate, i.e. the loan record is independent of the credit review pass rate. Under this assumption the overall (actual) pass probability is (43+28)/(43+28+96+84) = 28.29%.
S1021123, calculating theoretical values according to the actual passing probability, the feature response value and the feature non-response value of each feature, wherein the theoretical values include a theoretical response value and a theoretical non-response value;
Under the independence assumption the theoretical table is as follows:
                   Theoretical passed   Theoretical did not pass   Total
Has loan record    39.3231              99.6769                    139
No loan record     31.6848              80.3152                    112
S1021124, calculating a chi-square value based on the theoretical response value, the theoretical non-response value, the feature response value and the feature non-response value;
the calculation formula of chi-square test is as follows:
χ² = Σ (A - T)² / T
wherein A is the observed feature response value and T is the theoretical response value.
χ² measures the degree of difference between the feature (observed) response values and the theoretical response values, which is the core idea of the chi-square test, and it contains the following two pieces of information:
1. the absolute magnitude of the deviation of the characteristic response value from the theoretical response value (the difference is exaggerated due to the presence of the square);
2. the relative magnitude of the degree of difference and the theoretical response value.
χ² = (43 - 39.3231)²/39.3231 + (28 - 31.6848)²/31.6848 + (96 - 99.6769)²/99.6769 + (84 - 80.3152)²/80.3152 ≈ 1.077;
S1021125, judging whether the chi-square value is larger than a critical threshold;
The critical threshold is obtained by looking up the chi-square distribution table for the given degrees of freedom;
The degrees of freedom are V = (number of rows - 1) × (number of columns - 1); here V = 1.
S1021126, if it is larger, determining the feature that generated the chi-square value as a useless feature;
S1021127, deleting the useless features in the data set to obtain the data to be processed.
For V = 1, the critical value of the chi-square distribution at the 95% confidence level is 3.84; that is, only if the chi-square value exceeds 3.84 can the loan record and the credit review result be considered related with 95% confidence.
Clearly 1.077 < 3.84: the critical value of the chi-square distribution is not reached, so the assumption that the loan record and the credit review pass probability are independent cannot be rejected.
In this data set, therefore, the feature 'presence or absence of a loan record' is a useless feature and is deleted.
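A minimal illustrative sketch reproducing the example above, assuming SciPy; correction=False disables the Yates continuity correction so the statistic matches the hand calculation:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# rows: has / has no loan record; columns: passed / did not pass the review
observed = np.array([[43, 96],
                     [28, 84]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2_stat, 3))   # ~1.077
print(expected)              # theoretical values ~[[39.32, 99.68], [31.68, 80.32]]

critical = chi2.ppf(0.95, dof)   # ~3.84 for one degree of freedom
if chi2_stat < critical:
    print("independence cannot be rejected -> treat the loan-record feature as useless")
```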
the method shown in the embodiment of the application deletes features independent from each other according to the crediting passing probability (i.e. feature deletion irrelevant to the evaluation result),
that is, the useless features are removed, and the useless features are not related to the confidence pass probability, and the useless features are used as arguments for constructing the model, so that the accuracy of the prediction capability of the model is not increased, and the processing amount of computer data is increased.
The method shown in the embodiment of the application determines whether the characteristic is a useless characteristic or not by comparing a certain characteristic chi-square with a critical threshold, and if the characteristic is determined to be the useless characteristic, the characteristic is directly deleted without participating in the construction of the model.
Example 6:
the credit trial data comprises a large number of characteristics such as basic information of a user, credit trial application information, credit investigation reports and user behavior information, and the characteristics are required to be screened in order to improve the accuracy and timeliness of model prediction.
In order to solve the above problem, the embodiment of the present application shows a deletion method of invalid modeled individuals, specifically, please refer to fig. 7:
the technical solution shown in embodiment 6 is similar to the technical solution shown in embodiment 2 except that is the only difference between the technical solution shown in embodiment 2, in which the step of screening the data set to obtain the feature candidate set includes:
S10221, counting the missing rate of the features in each modeling individual in the data set;
For example, suppose each modeling individual contains 100 features; for each modeling individual, the number of features whose value is null divided by 100 is taken as that individual's missing rate;
S10222, if the missing rate is greater than a missing-rate threshold, deleting the modeling individuals generating that missing rate to obtain the feature candidate set.
If the missing rate of a certain modeling individual is larger than the missing-rate threshold, that individual's missing rate is considered too high and the individual is removed.
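A minimal illustrative sketch of this filtering step, assuming pandas, with the modeling individuals as rows and a hypothetical threshold of 0.5:

```python
import pandas as pd

def drop_high_missing_individuals(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Missing rate of a row = number of null features / number of features
    (e.g. nulls / 100 when each individual has 100 features)."""
    missing_rate = df.isnull().mean(axis=1)
    return df[missing_rate <= threshold]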
The method shown in the embodiments of the present application deletes modeling individuals with a high missing rate while ensuring the accuracy of the constructed credit review model, which reduces the amount of data the computer must process and improves the bandwidth of the system and the utilization rate of resources.
Example 7:
Referring to fig. 8, the technical solution shown in embodiment 7 has steps similar to the technical solution shown in embodiment 2; the only difference is that the step of constructing the credit review model according to the features in the feature candidate set and the evaluation results includes:
S1031, taking the features in the feature candidate set as the division basis and generating divided data sets;
First, the purity of a divided data set is judged.
Information Entropy
Entropy is a parameter used to assess the purity of a divided data set: given a divided data set, its samples may belong to many different classes or to only one class; if they belong to many different classes the set is impure, and if they belong to only one class the set is pure.
Entropy is the measure used to quantify how pure or impure the data in a divided data set is.
Ent(D) = -Σk pk · log2(pk), where pk is the proportion of samples of class k in the data set D.
Features generally contain continuous variables, and if a continuous variable is increasing, the construction of the model would require the output result to have a corresponding increasing relationship;
however, because of the particularity of credit review data, it generally cannot be guaranteed that all increasing data have a corresponding increasing relationship with the output result, so the method shown in the embodiments of the present application discretizes continuously increasing data.
Take the feature 'latest loan amount', with value range [0, 100000] yuan.
Using the feature 'latest loan amount' as the division basis, the generated division schemes are as follows:
Division scheme 1: <100 yuan, [100, 200), [200, 500), >500 yuan;
Division scheme 2: <500 yuan, >500 yuan.
S1032, calculating the information gain rate of each divided data set respectively;
Under division scheme 2, the feature 'latest loan amount' takes two values, less than 500 and greater than 500. Dividing the data set D accordingly gives two subsets, D(<500) and D(>500); each subset contains both samples that passed the credit review and samples that failed it, so the purity of each subset can be calculated, and after the calculation the information entropies of the two subsets are weighted and averaged:
Σv (|Dv| / |D|) · Ent(Dv)
The difference between the entropy of the original data set and this weighted-average entropy after division is the improvement in purity, i.e. the information gain. For example, with the feature 'latest loan amount' used as the division basis under division scheme 2 (<500 yuan and >500 yuan):
Gain(D, a) = Ent(D) - Σv=1..V (|Dv| / |D|) · Ent(Dv)
the method is characterized in that the data set is a data set, a is a selected feature, in the data set a has V values, the data set D is divided by the V values to obtain data sets D1 to Dv respectively, the information entropies of the V data sets are obtained respectively, and the information entropies are obtained by weighted averaging.
Whether the data set D needs to be divided by the characteristic a can be judged according to the magnitude of the information gain value, if the obtained information gain is larger, the characteristic is a better characteristic for dividing the data set D, otherwise, the characteristic is not suitable for dividing the data set D.
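A minimal illustrative sketch of the entropy and information gain calculations, assuming NumPy:

```python
import numpy as np

def entropy(labels) -> float:
    """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions in D."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, groups) -> float:
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v); `groups` gives the bin
    of feature a that each sample falls into."""
    labels, groups = np.asarray(labels), np.asarray(groups)
    weighted = sum((groups == g).mean() * entropy(labels[groups == g])
                   for g in np.unique(groups))
    return entropy(labels) - weighted

# Hypothetical usage for division scheme 2 of the 'latest loan amount' feature:
# bins = np.where(loan_amount < 500, "<500", ">=500")
# gain = information_gain(passed_review, bins)
```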
S1033 determining the divided data set that yields the largest information gain rate as a target data set;
the partitioned data set that yields the maximum information gain ratio is the target data set.
S1034, constructing a credit review model according to the features of the target data set and the evaluation results.
According to the technical scheme shown in the embodiment of the application, continuous variables appear in the feature candidate set, and in order to enable the model to have better robustness, the continuous features are discretized.
Example 8:
Referring to fig. 9,
Embodiment 8 has steps similar to those of embodiment 7; the only difference is that the step of determining the divided data set that yields the largest information gain rate as the target data set includes:
The statement 'IV is used to measure the predictive power of a variable' can be understood intuitively as follows: the target variable has two classes, Y1 and Y2; for an individual A to be predicted, specific information is needed to judge whether A belongs to Y1 or Y2, and that information is contained in the independent variables C1, C2, C3, ..., Cn. The more of this information a variable Ci contains, the greater its contribution to judging whether A belongs to Y1 or Y2, the greater the Information Value of Ci, the greater the IV of Ci, and the more Ci should be included in the model's variable list.
S10331, determining the divided data set which generates the maximum information gain rate as a data set to be processed;
Likewise, for group i there is a corresponding IVi value, calculated as follows:
IVi = (pyi - pni) × WOEi = (pyi - pni) × ln(pyi / pni)
with the IV values of the groups of variables, we can calculate the IV value of the entire variable simply by adding the IV values of the groups:
IV = Σi IVi
S10332, counting the segmented prediction-result response value and the segmented prediction-result non-response value of the data set to be processed, and the segmented response value and segmented non-response value of each feature of the data set to be processed;
S10333, calculating the IVi value of each feature according to the segmented prediction-result response value, the segmented prediction-result non-response value, the segmented response value and the segmented non-response value;
S10334, calculating an IV value of the data set to be processed according to the IVi value of each feature;
S10335, determining whether the IV value is less than an IV threshold;
S10336, if it is less, determining the data set to be processed that generated the IV value as a useless data set to be processed;
S10337, deleting the useless data set to be processed to obtain the target data set.
Suppose we need to build a credit review model to predict whether each client in the company's client base will respond to a certain marketing campaign. Suppose we have randomly drawn 100000 clients from the company's client list for a marketing campaign test and collected these clients' responses as our historical data, among which there are 10000 responding clients. Suppose further that we have extracted the following features of these clients as the candidate feature set for our model:
whether a purchase was made in the last month;
the amount of the most recent purchase;
the category of the merchandise purchased most recently;
whether the client is a corporate VIP customer.
assuming we have discretized these variables, the statistical results are shown in the tables below.
(1) Whether a purchase was made in the last month:
(table of responding and non-responding counts for this feature not reproduced here)
(2) The amount of the most recent purchase:
(counts consistent with the WOE and IV values computed below)
Amount of most recent purchase   Responding   Non-responding   Total    Response rate
<100 yuan                        2500         47500            50000    5%
[100, 200) yuan                  3000         27000            30000    10%
[200, 500) yuan                  3000         12000            15000    20%
>500 yuan                        1500         3500             5000     30%
Total                            10000        90000            100000   10%
(3) Whether the client is a corporate VIP customer:
(table of responding and non-responding counts for this feature not reproduced here)
Here the amount of the most recent purchase is taken as the data set to be processed;
We discretize this variable into 4 segments: <100 yuan, [100, 200), [200, 500), >500 yuan. First, according to the WOE calculation formula, the WOE values of the four segments are respectively:
<100 yuan: WOE1 = -0.74721;
[100, 200): WOE2 = 0;
[200, 500): WOE3 = 0.81093;
>500 yuan: WOE4 = 1.349927.
From the above calculation we can see the basic characteristics of WOE:
within the current group, the larger the proportion of responders, the larger the WOE value;
the sign of the WOE of the current group is determined by comparing the group's ratio of responding to non-responding clients with the overall ratio in the sample: when the group's ratio is smaller than the overall ratio the WOE is negative, when it is larger the WOE is positive, and when it is equal the WOE is 0;
the range of WOE is the whole set of real numbers.
WOE thus describes the direction and magnitude of the influence of the variable's current group on judging whether an individual will respond (or to which class it belongs): when WOE is positive, the current value of the variable has a positive influence on the judgment; when WOE is negative, the influence is negative. The absolute value of WOE represents the magnitude of this influence.
Similarly, according to the IV calculation formula, the IV values of the four segments are respectively:
<100 yuan: IV1 = 0.20756;
[100, 200): IV2 = 0;
[200, 500): IV3 = 0.135155;
>500 yuan: IV4 = 0.149992.
We have now calculated the WOE and IV values for one of the candidate variables; the calculations for the other variables are not described in detail, and their IV results are given directly:
whether a purchase was made in the last month: 0.250224725;
whether the client is a corporate VIP customer: 1.56550367;
and, as calculated above, the IV of the amount of the most recent purchase is 0.49270645 (the sum of IV1 to IV4).
Ranking these variables by IV gives: whether the client is a corporate VIP customer > the amount of the most recent purchase > whether a purchase was made in the last month.
If the IV of the feature 'whether a purchase was made in the last month' is less than the IV threshold, that feature is deleted;
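A minimal illustrative sketch of the WOE and IV calculation and the IV-threshold filtering, assuming counts consistent with the values quoted above and a hypothetical IV threshold:

```python
import math

# responding / non-responding counts per group of "amount of most recent purchase"
groups = {"<100": (2500, 47500), "[100,200)": (3000, 27000),
          "[200,500)": (3000, 12000), ">=500": (1500, 3500)}
y_total = sum(y for y, _ in groups.values())   # 10000
n_total = sum(n for _, n in groups.values())   # 90000

iv = 0.0
for name, (y_i, n_i) in groups.items():
    py, pn = y_i / y_total, n_i / n_total
    woe = math.log(py / pn)                    # WOE_i = ln(py_i / pn_i)
    iv += (py - pn) * woe                      # IV_i = (py_i - pn_i) * WOE_i
    print(f"{name}: WOE={woe:.5f}  IV={(py - pn) * woe:.6f}")

print(f"IV of the feature = {iv:.8f}")         # ~0.4927
IV_THRESHOLD = 0.1                             # hypothetical threshold
keep_feature = iv >= IV_THRESHOLD
```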
the party shown in the embodiments of the present application uses IV to determine unwanted features; difference between IV and WOEThat (py) in which IV is multiplied on WOE basisi-pni);
When the prediction capability of variables is measured, the index value used by the variable index value is not negative, and the index value is multiplied by pyn, so that the result of each group of the variables is non-negative;
and after the product is multiplied by pyn, the influence of the proportion of the number of individuals in the current group of the variables to the total number of individuals on the variable prediction capability is reflected, if the marketing response model mentioned above has variables A, the values of the variables A are only two: 0,1, and the data is as follows:
A        Responding   Non-responding   Total    Response rate
1 (yes)  90           10               100      90%
0 (no)   9910         89990            99900    10%
Total    10000        90000            100000   10%
We can see from the table that when variable A takes 1 the response rate reaches 90%, which is very high, but the number of clients in that group is very small and its share of the total is very low.
(WOE and IV computed from the counts above)
A        WOE        IV
1        ≈ 4.394    ≈ 0.039
0        ≈ -0.009   ≈ 0.00008
From this table we can see that when the variable takes 1, the response rate reaches 90%, corresponding to a high WOE but a low IV, because the IV is preceded by the coefficient (pyi - pni). Conversely, if the absolute values of WOE were summed directly, a very high indicator would be obtained, which would be unreasonable.
The technical solution shown in the embodiments of the present application uses the IV value as the basis for determining useless features, which effectively avoids the influence of groups with too few samples on the determination of useless features and ensures the accuracy of that determination.
The method determines whether a feature is useless from the size of its IV value; if the feature is determined to be useless, it is deleted directly and does not take part in the construction of the model. The method thus deletes useless features while ensuring the accuracy of the constructed credit review model.
Example 9:
Referring to fig. 10, a third aspect of the embodiments of the present application provides an apparatus for constructing a credit review model, the apparatus comprising:
a segmentation unit 21, configured to acquire historical data and split the historical data into a data set, where the data set comprises a training set and a test set;
a screening unit 22, configured to screen the data set to obtain a feature candidate set, where the feature candidate set comprises a training feature candidate set and a test feature candidate set;
and a construction unit 23, configured to construct a credit review model according to the features in the feature candidate set and the evaluation results.
Optionally, the screening unit comprises:
the deleting unit is used for deleting the useless features in the data set to obtain data to be processed;
and a conversion unit, configured to convert time-format data into day-of-week features to obtain the feature candidate set.
Optionally, the deleting unit includes:
a first statistical unit, configured to count the prediction-result response value, the prediction-result non-response value, and the feature response value and feature non-response value of each feature in the data set;
a WOE value calculating unit, configured to calculate the WOE value of each feature according to the prediction-result response value, the prediction-result non-response value, the feature response value and the feature non-response value;
a first judgment unit, configured to judge whether the WOE value is less than a WOE threshold;
a first useless-feature determination unit, configured to determine, if it is less, the feature that produced the WOE value as a useless feature;
and a first deleting unit, configured to delete the useless features in the data set to obtain the data to be processed.
Optionally, the deleting unit includes:
the second statistical unit is used for counting the characteristic response value and the characteristic non-response value of each characteristic in the data set;
the passing probability calculation unit is used for calculating the actual passing probability according to the characteristic response value and the characteristic non-response value;
a theoretical value calculating unit, configured to calculate theoretical values according to the actual passing probability and the feature response value and feature non-response value of each feature, where the theoretical values include a theoretical response value and a theoretical non-response value;
the chi-square value calculating unit is used for calculating a chi-square value according to the theoretical response value, the theoretical non-response value, the characteristic response value and the characteristic non-response value;
the second judgment unit is used for judging whether the chi-square value is larger than a critical threshold value or not;
a second useless-feature determination unit, configured to determine, if it is greater, the feature that produced the chi-square value as a useless feature;
and the second deleting unit is used for deleting the useless features in the data set to obtain the data to be processed.
Optionally, the screening unit comprises:
a third statistical unit, configured to count the missing rate of the features in each modeling individual in the data set;
and a modeling-individual deleting unit, configured to delete, if the missing rate is greater than a missing-rate threshold, the modeling individuals generating that missing rate to obtain the feature candidate set.
Optionally, the building unit comprises:
the dividing unit is used for generating a divided data set by taking the features in the feature candidate set as dividing bases;
an information gain ratio calculation unit for calculating an information gain ratio of each of the divided data sets, respectively;
a target data set determination unit for determining the divided data set that yields the largest information gain rate as a target data set;
and a first construction unit, configured to construct a credit review model according to the features of the target data set and the evaluation results.
Optionally, the target data set determination unit comprises:
a to-be-processed data set determining unit for determining the divided data set that will produce the largest information gain rate as a to-be-processed data set;
a fourth statistical unit, configured to count a segmented prediction-result response value and a segmented prediction-result non-response value of the data set to be processed, and a segmented response value and a segmented non-response value of each feature of the data set to be processed;
the IVi value calculating unit is used for calculating the IVi value of each feature according to the segmented prediction result response value, the segmented prediction result non-response value, the segmented response value and the segmented non-response value;
an IV value calculation unit, configured to calculate an IV value characterizing the data set to be processed according to the IVi value of each feature;
a third judging unit, configured to judge whether the IV value is smaller than an IV threshold;
a useless to-be-processed data set determining unit, configured to determine, if the IV value is smaller than the IV threshold, the to-be-processed data set that produced the IV value as a useless to-be-processed data set;
and the third deleting unit is used for deleting the useless data set to be processed to obtain a target data set.
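The following Python sketch illustrates the IV-based selection described by these units, under the assumption that each candidate data set carries, for every feature, a list of segmented (response, non-response) count pairs. Summing the per-feature IVi values into a single IV for the data set, the 0.5 smoothing and the function names are assumptions of the sketch.

```python
import math

def feature_iv(segments, total_resp, total_nonresp):
    """IVi of one feature from its segmented (response, non-response) counts."""
    iv = 0.0
    for seg_resp, seg_nonresp in segments:
        p_resp = max(seg_resp, 0.5) / total_resp          # segmented response share
        p_nonresp = max(seg_nonresp, 0.5) / total_nonresp  # segmented non-response share
        iv += (p_resp - p_nonresp) * math.log(p_resp / p_nonresp)
    return iv

def dataset_iv(feature_segments, total_resp, total_nonresp):
    """IV characterizing a candidate data set, aggregated over its features."""
    return sum(feature_iv(segments, total_resp, total_nonresp)
               for segments in feature_segments.values())

# A to-be-processed data set whose IV falls below the IV threshold (for example
# 0.02, an illustrative value) would then be discarded as useless.
```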
Example 10:
In a fourth aspect of the embodiments of the present application, a server is shown; please refer to fig. 11, which includes:
one or more processors 41;
a memory 42 for storing one or more programs;
when the one or more programs are executed by the one or more processors 41, the processors 41 are configured to implement the methods of the embodiments of the present application.
This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the general principles of the invention and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains and as may be applied to the essential features hereinbefore set forth. The description and examples are to be considered as illustrative only, with the true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (15)

1. A method for constructing a confidence and review model, wherein the method comprises:
acquiring historical data, and segmenting the historical data into a data set, wherein the data set comprises: a training set and a testing set;
screening the data set to obtain a feature candidate set, wherein the feature candidate set comprises: a training feature candidate set and a testing feature candidate set;
and constructing a confidence and review model according to the features in the feature candidate set and the evaluation result.
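A minimal sketch of the splitting step in claim 1, assuming the historical data has been loaded into a pandas DataFrame; the column names, the 7:3 split ratio and the random seed are illustrative assumptions only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical historical loan records; every column name here is illustrative.
historical_data = pd.DataFrame({
    "apply_time": ["2018-01-05", "2018-01-12", "2018-02-03", "2018-02-20"],
    "monthly_income": [8000, 12000, None, 6500],
    "label": [1, 0, 1, 0],        # 1 = response (e.g. passed/repaid), 0 = non-response
})

# Segment the historical data into a training set and a testing set.
train_set, test_set = train_test_split(historical_data, test_size=0.3, random_state=42)
```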
2. The method of claim 1, wherein the step of screening the data set to obtain a candidate set of features comprises:
deleting useless features in the data set to obtain data to be processed;
and converting data in a time format into a week feature to obtain a feature candidate set.
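One way to perform the time-to-week conversion of claim 2, sketched with pandas; the column names and the Monday = 0 ... Sunday = 6 encoding are assumptions of the sketch:

```python
import pandas as pd

def add_week_feature(df: pd.DataFrame, time_col: str = "apply_time") -> pd.DataFrame:
    """Convert a time-format column into a day-of-week feature (illustrative)."""
    out = df.copy()
    out["weekday"] = pd.to_datetime(out[time_col]).dt.dayofweek  # 0 = Monday ... 6 = Sunday
    return out
```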
3. The method of claim 2, wherein the step of removing unwanted features from the dataset to obtain the data to be processed comprises:
counting a response value of a prediction result, an unresponsive value of the prediction result, and a characteristic response value and a characteristic unresponsive value of each characteristic in the data set;
calculating the WOE value of each feature according to the prediction result response value, the prediction result non-response value, the feature response value and the feature non-response value;
determining whether the WOE value is less than a WOE threshold;
if so, determining the feature that produced the WOE value as a useless feature;
and deleting useless features in the data set to obtain data to be processed.
4. The method of claim 2, wherein the step of removing unwanted features from the dataset to obtain the data to be processed comprises:
counting a characteristic response value and a characteristic unresponsive value of each characteristic in the data set;
calculating actual passing probability according to the characteristic response value and the characteristic non-response value;
calculating theoretical values according to the actual passing probability, the characteristic response value of each characteristic and the characteristic unresponsive value, wherein the theoretical values comprise a theoretical response value and a theoretical unresponsive value;
calculating a chi-square value according to the theoretical response value, the theoretical unresponsive value, the characteristic response value and the characteristic unresponsive value;
judging whether the chi-square value is larger than a critical threshold value or not;
if so, determining the feature that produced the chi-square value as a useless feature;
and deleting useless features in the data set to obtain data to be processed.
5. The method of claim 1, wherein the step of filtering the data set to obtain a candidate set of features comprises:
counting the loss rate of features in each modeled individual in the dataset;
and if the loss rate is greater than a loss rate threshold, deleting the modeling individuals that produced the loss rate to obtain a feature candidate set.
6. The method of claim 1, wherein the step of constructing a confidence and review model based on the features in the feature candidate set and the evaluation results comprises:
taking the features in the feature candidate set as a dividing basis to generate a divided data set;
respectively calculating the information gain rate of each divided data set;
determining the divided data set generating the maximum information gain rate as a target data set;
and constructing a confidence and review model according to the characteristics of the target data set and the evaluation result.
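As an illustrative reading of the "information gain rate" in claim 6, the sketch below computes a C4.5-style gain ratio for partitioning by each candidate feature; equating the claimed information gain rate with this particular formula is an assumption of the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, feature_index):
    """Information gain rate of partitioning `rows` by the feature at `feature_index`."""
    total = len(rows)
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature_index], []).append(label)
    cond = sum(len(part) / total * entropy(part) for part in partitions.values())
    split_info = -sum(len(part) / total * math.log2(len(part) / total)
                      for part in partitions.values())
    return (base - cond) / split_info if split_info > 0 else 0.0

# The partition (divided data set) with the largest gain ratio would then be
# taken as the target data set, as the claim describes.
```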
7. The method of claim 6, wherein the step of determining the partitioned data set yielding the largest information gain rate as the target data set comprises:
determining the divided data set which generates the maximum information gain rate as a data set to be processed;
counting a segmented prediction result response value and a segmented prediction result non-response value of each feature of the data set to be processed, and a segmented response value and a segmented non-response value of each feature of each processed data set;
calculating the IVi value of each feature according to the segmented prediction result response value, the segmented prediction result non-response value, the segmented response value and the segmented non-response value;
calculating an IV value characterizing the data set to be processed according to the IVi value of each feature;
judging whether the IV value is smaller than an IV threshold value;
if so, determining the to-be-processed data set that produced the IV value as a useless to-be-processed data set;
and deleting the useless data set to be processed to obtain a target data set.
8. An apparatus for constructing a confidence and review model, characterized in that the apparatus comprises:
a segmentation unit, configured to acquire historical data and segment the historical data into a data set, wherein the data set comprises: a training set and a testing set;
a screening unit, configured to screen the data set to obtain a feature candidate set, wherein the feature candidate set comprises: a training feature candidate set and a testing feature candidate set;
and the construction unit is used for constructing a confidence and review model according to the characteristics in the characteristic candidate set and the evaluation result.
9. The apparatus of claim 8, wherein the screening unit comprises:
the deleting unit is used for deleting the useless features in the data set to obtain data to be processed;
and the conversion unit is used for converting the data in the time format into week characteristics to obtain a characteristic candidate set.
10. The apparatus of claim 9, wherein the deleting unit comprises:
a first statistical unit, configured to count the prediction result response value, the prediction result non-response value, and the feature response value and feature non-response value of each feature in the data set;
a WOE value calculating unit, configured to calculate a WOE value of each feature according to the prediction result response value, the prediction result non-response value, the feature response value and the feature non-response value;
a first judgment unit, configured to judge whether the WOE value is less than a WOE threshold;
a first useless feature determination unit, configured to determine, if the WOE value is less than the WOE threshold, the feature that produced the WOE value as a useless feature;
and a first deleting unit, configured to delete the useless features from the data set to obtain the data to be processed.
11. The apparatus of claim 9, wherein the deleting unit comprises:
the second statistical unit is used for counting the characteristic response value and the characteristic non-response value of each characteristic in the data set;
the passing probability calculation unit is used for calculating the actual passing probability according to the characteristic response value and the characteristic non-response value;
a theoretical value calculating unit, configured to calculate a theoretical value according to the actual passing probability, the feature response value and the feature non-response value of each feature, wherein the theoretical value includes a theoretical response value and a theoretical non-response value;
the chi-square value calculating unit is used for calculating a chi-square value according to the theoretical response value, the theoretical non-response value, the characteristic response value and the characteristic non-response value;
the second judgment unit is used for judging whether the chi-square value is larger than a critical threshold value or not;
a second useless feature determination unit, configured to determine, if the chi-square value is greater than the critical threshold, the feature that produced the chi-square value as a useless feature;
and the second deleting unit is used for deleting the useless features in the data set to obtain the data to be processed.
12. The apparatus of claim 8, wherein the screening unit comprises:
the third statistical unit is used for counting the loss rate of the features in each modeling individual in the data set;
and a modeling individual deleting unit, configured to delete, if the loss rate is greater than a loss rate threshold, the modeling individual that produced the loss rate, so as to obtain the feature candidate set.
13. The apparatus of claim 8, wherein the building unit comprises:
the dividing unit is used for generating a divided data set by taking the features in the feature candidate set as dividing bases;
an information gain ratio calculation unit for calculating an information gain ratio of each of the divided data sets, respectively;
a target data set determination unit for determining the divided data set that yields the largest information gain rate as a target data set;
and a construction unit, configured to construct a confidence and review model according to the features of the target data set and the evaluation result.
14. The apparatus of claim 13, wherein the target data set determination unit comprises:
a to-be-processed data set determining unit for determining the divided data set that will produce the largest information gain rate as a to-be-processed data set;
a fourth statistical unit, configured to count a segment prediction result response value and a segment prediction result non-response value of each feature of the to-be-processed data set, and a segment response value and a segment non-response value of each feature of each processed data set;
the IVi value calculating unit is used for calculating the IVi value of each feature according to the segmented prediction result response value, the segmented prediction result non-response value, the segmented response value and the segmented non-response value;
an IV value calculation unit, configured to calculate an IV value characterizing the data set to be processed according to the IVi value of each feature;
a third judging unit, configured to judge whether the IV value is smaller than an IV threshold;
a useless to-be-processed data set determining unit, configured to determine, if the IV value is smaller than the IV threshold, the to-be-processed data set that produced the IV value as a useless to-be-processed data set;
and the third deleting unit is used for deleting the useless data set to be processed to obtain a target data set.
15. A system for constructing a confidence and review model, characterized in that the system comprises:
an application platform server and a data storage server connected with the application platform server, wherein the data storage server is arranged in the application platform server or arranged independently, and the application platform server is connected with a terminal through the Internet;
the terminal is configured to display the evaluation result;
the application platform server is configured to implement the method of any one of claims 1 to 7;
the method is used for data sampling, feature engineering and model integration;
wherein the feature engineering comprises: data cleaning and data preprocessing, feature selection, feature discretization and combination, and model training and evaluation; and the data storage server is configured to store related data.
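To tie the claimed system together, the sketch below shows a hypothetical training routine the application platform server might run on the split data after feature screening. Median imputation and a logistic regression learner are illustrative stand-ins, not components mandated by the claims; the `label` column name is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_credit_review_model(train_set: pd.DataFrame, test_set: pd.DataFrame, label: str = "label"):
    """Illustrative model training and evaluation on the screened feature candidate set."""
    # Keep only numeric candidate features (assumes screening/discretization already ran).
    features = [c for c in train_set.columns
                if c != label and pd.api.types.is_numeric_dtype(train_set[c])]

    # Data cleaning / preprocessing: simple median imputation as a placeholder.
    medians = train_set[features].median()
    x_train = train_set[features].fillna(medians)
    x_test = test_set[features].fillna(medians)

    # Model training and evaluation.
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, train_set[label])
    auc = roc_auc_score(test_set[label], model.predict_proba(x_test)[:, 1])
    return model, auc
```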
CN201810708485.8A 2018-07-02 2018-07-02 Construction method, device and system of confidence and audit models Pending CN110737641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810708485.8A CN110737641A (en) 2018-07-02 2018-07-02 Construction method, device and system of confidence and audit models

Publications (1)

Publication Number Publication Date
CN110737641A true CN110737641A (en) 2020-01-31

Family

ID=69233350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810708485.8A Pending CN110737641A (en) 2018-07-02 2018-07-02 Construction method, device and system of confidence and audit models

Country Status (1)

Country Link
CN (1) CN110737641A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111638948A (en) * 2020-06-03 2020-09-08 重庆银行股份有限公司 Multi-channel high-availability big data real-time decision making system and decision making method
CN111638948B (en) * 2020-06-03 2023-04-07 重庆银行股份有限公司 Multi-channel high-availability big data real-time decision making system and decision making method
CN111709828A (en) * 2020-06-12 2020-09-25 中国建设银行股份有限公司 Resource processing method, device, equipment and system
CN111738331A (en) * 2020-06-19 2020-10-02 北京同邦卓益科技有限公司 User classification method and device, computer-readable storage medium and electronic device
CN112508689A (en) * 2021-02-01 2021-03-16 四川新网银行股份有限公司 Method for realizing decision evaluation based on multiple dimensions
CN113627566A (en) * 2021-08-23 2021-11-09 上海淇玥信息技术有限公司 Early warning method and device for phishing and computer equipment

Similar Documents

Publication Publication Date Title
CN110737641A (en) Construction method, device and system of confidence and audit models
CN107633265B (en) Data processing method and device for optimizing credit evaluation model
Lin et al. Voices of victory: A computational focus group framework for tracking opinion shift in real time
Kapliński Usefulness and credibility of scoring methods in construction industry
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN111583012B (en) Method for evaluating default risk of credit, debt and debt main body by fusing text information
CN114048436A (en) Construction method and construction device for forecasting enterprise financial data model
CN111709826A (en) Target information determination method and device
CN112419029B (en) Similar financial institution risk monitoring method, risk simulation system and storage medium
CN114266455A (en) Knowledge graph-based visual enterprise risk assessment method
CN115545886A (en) Overdue risk identification method, overdue risk identification device, overdue risk identification equipment and storage medium
CN114118793A (en) Local exchange risk early warning method, device and equipment
CN105447117A (en) User clustering method and apparatus
CN112950359A (en) User identification method and device
WO2022143431A1 (en) Method and apparatus for training anti-money laundering model
CN115641198A (en) User operation method, device, electronic equipment and storage medium
CN108629506A (en) Modeling method, device, computer equipment and the storage medium of air control model
CN114626940A (en) Data analysis method and device and electronic equipment
WO2022183019A1 (en) Methods for mitigation of algorithmic bias discrimination, proxy discrimination and disparate impact
CN111652708A (en) Risk assessment method and device applied to house mortgage loan products
CN110852392A (en) User grouping method, device, equipment and medium
CN112907254A (en) Fraud transaction identification and model training method, device, equipment and storage medium
Zeng A comparison study on the era of internet finance China construction of credit scoring system model
CN116127074B (en) Anchor image classification method based on LDA theme model and kmeans clustering algorithm
CN113610638B (en) Rating system and method for matching credit rating with default loss rate based on SMAA-DS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20200525
Address after: Room 323605, building 5, yard 1, Futong East Street, Chaoyang District, Beijing 100102
Applicant after: Youxuan (Beijing) Information Technology Co.,Ltd.
Address before: Room 368, Room 302, No. 211 North Fute Road, China (Shanghai) Free Trade Pilot Area, Pudong New Area, Shanghai, 201315
Applicant before: YOUGU (SHANGHAI) INFORMATION TECHNOLOGY CO.,LTD.
RJ01 Rejection of invention patent application after publication
Application publication date: 20200131