CN111724175A - Citizen credit point evaluation method applying logistic regression modeling - Google Patents

Citizen credit point evaluation method applying logistic regression modeling

Info

Publication number
CN111724175A
CN111724175A CN202010568798.5A
Authority
CN
China
Prior art keywords
host
data
gradient
guest
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010568798.5A
Other languages
Chinese (zh)
Inventor
吴福全
朱全日
张小花
左杨
刘爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Dike Digital Gold Technology Co ltd
Original Assignee
Anhui Dike Digital Gold Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Dike Digital Gold Technology Co ltd filed Critical Anhui Dike Digital Gold Technology Co ltd
Priority to CN202010568798.5A
Publication of CN111724175A
Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/018Certifying business or products
    • G06Q30/0185Product, service or business identity fraud
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0609Buyer or seller confidence or verification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Technology Law (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a citizen credit point evaluation method applying logistic regression modeling, which comprises the following steps: first, independent variable data are obtained, comprising government affairs data and bank data; the government affairs data include identity characteristics, consumption capacity, credit history and qualification honors, and the bank data comprise asset information, including annual income and whether the customer owns a residence. Second, dependent variable data y are obtained, distinguishing good customers from bad customers: customers overdue for more than 60 days are bad customers and customers who are not overdue are good customers. Through federated learning the invention increases the total amount of usable data and effectively solves the data-island problem; for enterprises, federated learning provides a simple, legal and low-cost way to obtain effective external data, quickly resolving difficulties caused by insufficient data volume or data dimensions, without leaking data or business secrets between cooperating enterprises.

Description

Citizen credit point evaluation method applying logistic regression modeling
Technical Field
The invention belongs to the field of citizen credit points, and particularly relates to a citizen credit point evaluation method applying logistic regression modeling.
Background
Data privacy protection is achieved through federated learning. The design goal of federated learning is to perform efficient machine learning among multiple participants or computing nodes while guaranteeing information security during big-data exchange, protecting the privacy of terminal data and personal data, and ensuring legal compliance.
Federated learning is divided into horizontal federated learning and vertical (longitudinal) federated learning; the method herein uses vertical federated learning. Vertical federated learning applies when two data sets share the same sample ID space but differ in feature space. It aggregates these different features and computes training losses and gradients in a privacy-preserving manner, so that a model is built jointly from the data of both parties.
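By way of illustration only, the sample-alignment idea behind vertical federated learning can be sketched in Python as below (in practice a private set intersection protocol would typically be used so that neither party reveals its full id list; the table and column names here are assumptions, not part of the invention):

    import pandas as pd

    # Guest (bank) and host (government) each hold their own features, keyed by a shared sample id.
    guest_df = pd.DataFrame({"id": [1, 2, 3, 5], "annual_income": [80, 55, 30, 90]})
    host_df = pd.DataFrame({"id": [2, 3, 4, 5], "age": [34, 29, 41, 52]})

    # Only users present on both sides become modeling samples; features never leave their party.
    common_ids = sorted(set(guest_df["id"]) & set(host_df["id"]))
    guest_aligned = guest_df[guest_df["id"].isin(common_ids)].sort_values("id").reset_index(drop=True)
    host_aligned = host_df[host_df["id"].isin(common_ids)].sort_values("id").reset_index(drop=True)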
Disclosure of Invention
The invention aims to provide a citizen credit point evaluation method applying logistic regression modeling.
The purpose of the invention can be realized by the following technical scheme:
a citizen credit score evaluation method applying logistic regression modeling specifically comprises the following steps:
the method comprises the following steps: acquiring independent variable data, wherein the independent variable data comprises: government affairs data and bank data;
wherein the government affair data comprises: identity characteristics, consumption capacity, credit history, qualification honor;
the bank data comprise asset information, including annual income and whether the customer owns a residence;
step two: acquiring dependent variable data y, specifically including good clients and bad clients, wherein the clients which are overdue for more than 60 days are the bad clients and the clients which are not overdue are the good clients;
step three: the bank side is called the guest and the government side is called the host; there are multiple hosts on the government side, specifically the government-affairs network, the social security bureau and the housing provident fund center; and a third party called the arbiter is built on the government side;
step four: performing data analysis, specifically:
step 1): firstly, taking intersection of sample ids of a guest party and a host party, finding the same users, and using the users as modeling samples;
step 2): missing-value filling: guest and host each fill in missing values locally;
step 3): binning and calculating woe and IV values for each feature;
s3.1: the method adopts equal-frequency binning, specifically: all the feature values of each variable are divided into n bins so that each bin contains an equal number of samples, obtaining the split points of each bin;
s3.2: the guest and the host respectively carry out equal frequency binning on all continuous variables locally;
s3.3: the guest performs Paillier encryption on y and sends the encrypted y to host;
s3.4: the guest calculates the local IV values: after binning, count the numbers of good and bad samples in each group of every feature, obtaining for example the following format: result_sum = {'x1': [[0,0],[2,1],[0,0],[1,0]], 'x2': [[0,0],[0,0],[0,0],[1,0],[2,1]], 'x3': [[0,0],[0,0],[2,1],[1,0]]}; then the woe value of each bin of each feature is calculated as follows:
woe_i = ln(bad sample rate / good sample rate);
iv_i = (bad sample rate - good sample rate) * woe_i;
IV of a variable = the sum of iv_i over all bins of the variable;
s3.5: calculating the host-side IV values: according to the encrypted y sent by the guest, the host calculates the good/bad sample counts result_sum of all groups of all its binned features and sends the result to the guest;
s3.6: the guest receives the host's encrypted result_sum, first decrypts it, calculates the woe and IV values in the same way, and sends the host's woe_i results to the host;
s3.7: feature value conversion: guest and host each convert every feature value into the woe value of the bin it belongs to; these values replace the original independent variable x values as the new modeling independent variables;
step 4): feature selection: the IV value is used for feature selection; a threshold thr is set by guest and host, the guest compares the IV values of all feature variables of both parties with thr, the feature variables with IV < thr are filtered out, and the remaining variables are used to construct the model (a sketch of the binning, woe/IV and filtering steps is given below);
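As an illustration of steps 3) and 4), a minimal single-party Python sketch of equal-frequency binning, woe/IV computation and IV-threshold filtering follows (plaintext only, no Paillier encryption; names such as n_bins and thr are illustrative assumptions, not part of the invention):

    import numpy as np

    def equal_freq_split_points(x, n_bins=5):
        # Quantile cut points so that each bin holds roughly the same number of samples.
        return np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])

    def woe_iv(x, y, split_points):
        # y: 1 = bad customer, 0 = good customer (local, unencrypted illustration).
        bins = np.digitize(x, split_points)
        total_bad, total_good = y.sum(), (1 - y).sum()
        iv, woes = 0.0, {}
        for b in np.unique(bins):
            bad = y[bins == b].sum()
            good = (1 - y[bins == b]).sum()
            bad_rate = max(bad, 0.5) / total_bad       # 0.5 smoothing avoids log(0)
            good_rate = max(good, 0.5) / total_good
            woes[b] = np.log(bad_rate / good_rate)     # woe_i = ln(bad rate / good rate)
            iv += (bad_rate - good_rate) * woes[b]     # IV = sum of iv_i over all bins
        return woes, iv

    # Feature selection: keep only variables whose IV reaches a chosen threshold thr,
    # e.g. selected = [name for name, (_, iv) in results.items() if iv >= thr]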
step five: the method comprises the following steps of constructing a model by adopting a longitudinal logistic regression federated learning method, and specifically comprising the following steps:
s1: the guest first sets the label y to 1 or -1 and sets x to the woe values;
s2: batch the data; the batching result is denoted batch_info, which includes the data size of each batch and the number of data batches;
the batch_info is sent to the Host and the Arbiter;
sending the sample id of each batch of data, namely index, to the Host;
s3: guest and host each initialize their model locally, i.e. w is initialized with randomly generated numbers uniformly distributed between 0 and 1, and the constant term is 1; the iteration count is initialized to 0;
s4: starting loop training, when the iteration number is less than the set maximum iteration number:
s401: initialize the batch counter to 0;
obtain the corresponding data features according to the ids of each batch of data;
s402: cyclically train on all the batched data, batch by batch:
s40201: calculating a gradient;
s40202: calculating loss;
s40203: updating the weight;
s403: obtain the stop-iteration flag is_converged sent by the Arbiter;
s404: iteration count = iteration count + 1;
s405: if the obtained stop-iteration flag is 'True', exit the outer loop (a schematic sketch of this training loop is given below);
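For orientation, the structure of the training loop in step five can be sketched as an ordinary single-party logistic regression with the same batching and convergence logic (a simplification that ignores the guest/host split and the encryption; hyper-parameter names are assumptions):

    import numpy as np

    def train_lr(X, y, lr=0.1, batch_size=32, max_iter=20, eps=1e-4, alpha=0.01):
        # Single-party analogue of s3-s405: random init in [0, 1), constant term,
        # batched gradient steps, and a convergence flag based on the loss
        # (checked by the arbiter in the federated setting).
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])          # prepend the constant-term column
        w = np.random.uniform(0, 1, d + 1)
        n_iteration, is_converged = 0, False
        while n_iteration < max_iter and not is_converged:
            for start in range(0, n, batch_size):
                xb, yb = Xb[start:start + batch_size], y[start:start + batch_size]
                grad = xb.T @ (1 / (1 + np.exp(-(xb @ w))) - yb) / len(yb) + alpha * w
                w = w - lr * grad                      # weight update
            z = Xb @ w
            loss = np.mean(np.log(1 + np.exp(-(2 * y - 1) * z))) + alpha / 2 * (w @ w)
            is_converged = abs(loss) < eps             # arbiter-style check on |Loss|
            n_iteration += 1
        return w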
step six: carrying out model prediction;
s601: the resulting model is W = [w0, w1, w2, ..., wn];
then the probability that a sample is predicted to be a bad customer is p = 1/(1 + e^(-W*X)), i.e.:
ln(odds) = ln(p/(1-p)) = W*X = w0 + w1*x1 + w2*x2 + ... + wn*xn;
s602: convert the probability into a score (a positive integer);
credit score = parameter A + parameter B*(W*X) = A + B*(w0 + w1*x1 + ... + wn*xn);
s603: solve for A and B:
let the score when the odds x = good/bad be P; then the score at odds 2x should be P - PDO;
substituting into the formula gives:
P = A - B*ln(x);
P - PDO = A - B*ln(2x);
set x = 5%, P = 800, PDO = 50;
calculate A and B and substitute them into the credit score;
credit score = (A + B*w0) + B*w1*x1 + ... + B*wn*xn;
in the formula, A + B*w0 is the base score and B*wi*xi is the score allocated to each variable;
s604: multiply the score corresponding to each variable by the woe_i of each of its bins to obtain the scoring result of each bin.
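The two scorecard equations in s603 can be solved in closed form: subtracting them gives PDO = B*ln(2), so B = PDO/ln(2) and A = P + B*ln(x). A short Python sketch with the example values above (x = 5%, P = 800, PDO = 50; the helper below is illustrative, not from the patent):

    import numpy as np

    x_odds, P, PDO = 0.05, 800, 50
    B = PDO / np.log(2)            # ≈ 72.13
    A = P + B * np.log(x_odds)     # ≈ 583.9

    def credit_score(w, woe_features):
        # credit score = (A + B*w0) + B*w1*x1 + ... + B*wn*xn, with x already WOE-encoded.
        return (A + B * w[0]) + B * np.dot(w[1:], woe_features)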
Further, the detailed process of calculating the gradient is:
step S01: calculate the forwards for aggregation:
the guest computes forwards = w * x (feature values times weights) and encrypts the result;
obtain the host_forwards = w * x sent by the host;
after aggregation, aggregated_forwards = forwards + host_forwards;
step S02: calculate fore_gradient:
fore_gradient = 0.25 * aggregated_forwards - 0.5 * y; in the formula, y is the true sample label value;
the fore_gradient is simultaneously sent to the host;
step S03: calculate the one-sided gradient:
guest and host each calculate:
unilateral_gradient = fore_gradient * X / n, where n is the sample size;
step S04: add the regularized one-sided gradient:
here, using L2 regularization, guest and host each calculate:
unilateral_gradient = unilateral_gradient + alpha * w, where alpha is the regularization coefficient;
step S05: perform the gradient update, specifically:
obtain optim_guest_gradient and optim_host_gradient;
guest and host each send their unilateral_gradient to the arbiter;
the arbiter receives the one-sided gradients of both parties and first decrypts them;
then the decrypted results of the different parties, i.e. the optimal gradients, are sent back to the guest and the host respectively, namely:
host_optim_gradient = separate_optim_gradient[:-1], i.e. all values except the last one;
guest_optim_gradient = separate_optim_gradient[-1], i.e. the last value;
wherein: host_optim_gradient: the optimal gradient of the host side;
guest_optim_gradient: the optimal gradient of the guest side;
separate_optim_gradient: the set of gradients of the guest and the multiple host parties collected by the arbiter.
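To make the exchange concrete, a plaintext Python sketch of steps S01-S05 follows (the Paillier encryption and the arbiter's decryption are omitted, so values that the patent keeps encrypted appear in the clear here; array and function names are illustrative assumptions):

    import numpy as np

    def hetero_lr_gradients(guest_x, guest_w, host_x, host_w, y, alpha=0.01):
        # S01: each party computes its forwards = w * x and the values are aggregated
        # (the patent aggregates Paillier-encrypted forwards instead of plaintext).
        aggregated_forwards = guest_x @ guest_w + host_x @ host_w
        # S02: fore_gradient = 0.25 * aggregated_forwards - 0.5 * y, with y in {1, -1}.
        fore_gradient = 0.25 * aggregated_forwards - 0.5 * y
        n = len(y)
        # S03 + S04: one-sided gradients with L2 regularization on each side.
        guest_gradient = guest_x.T @ fore_gradient / n + alpha * guest_w
        host_gradient = host_x.T @ fore_gradient / n + alpha * host_w
        # S05: in the patent both gradients go to the arbiter, which decrypts and
        # returns the optimal gradients; here they are simply returned.
        return guest_gradient, host_gradient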
Further, the detailed process of calculating the loss is as follows:
SS 01: the guest obtains the information sent by the host; the total Loss of the two parties is calculated by the following formula:
Loss = (1/n) * Σ_i [ln(2) - 0.5*yi*(w*xi) + 0.125*(w*xi)^2]
wherein Loss is total Loss; n is the total sample size of the two parties; yi is the actual label value of the ith sample; w is a feature weight; x is a characteristic value;
SS 02: the guest and host parties respectively calculate the regularization term by using formulas, and the formula of the loss regularization term is as follows:
loss_norm = (alpha/2) * Σ_j (wj^2)
SS 03: the host sends the self-owned regularization term result host _ loss _ regular to the guest;
SS 04: the guest sums the regularization terms of both parties and the total loss calculated in the previous step, obtaining the total loss with the regularization terms added:
Loss=Loss+loss_norm,Loss=Loss+host_loss_regular;
SS 05: the guest then sends the Loss to the arbiter.
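Continuing the plaintext sketch, the total loss with the L2 terms of both parties can be written as below; the per-sample loss used here is the second-order Taylor approximation of the logistic loss consistent with fore_gradient = 0.25*wx - 0.5*y, which is an assumption where the patent shows the formula only as an image:

    import numpy as np

    def hetero_lr_loss(guest_x, guest_w, host_x, host_w, y, alpha=0.01):
        wx = guest_x @ guest_w + host_x @ host_w           # aggregated forwards, y in {1, -1}
        loss = np.mean(np.log(2) - 0.5 * y * wx + 0.125 * wx ** 2)
        loss_norm = alpha / 2 * (guest_w @ guest_w)        # guest regularization term
        host_loss_regular = alpha / 2 * (host_w @ host_w)  # host regularization term
        return loss + loss_norm + host_loss_regular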
Further, the detailed process of weight update is as follows:
guest and host respectively calculate the updated weight:
w=w-guest_optim_gradient,w=w-host_optim_gradient。
Further, the detailed process of obtaining the stop-iteration flag is as follows:
the arbiter uses the absolute value of the total Loss sent by the guest to judge whether training has converged: if the Loss value is less than the set threshold eps, is_converged = True, otherwise is_converged = False, and is_converged is sent to the guest and the host.
Further, the identity characteristics in step one are gender, age, ID card address and student status; consumption capacity: the social security and housing provident fund contribution base, etc.; credit history: the number of applications authorized for credit services in the APP, the number of times of credit-service performance (on-time fulfilment) and the number of times of credit-service default; qualification honor: the number of red-list types and the number of black-list types, specifically the numbers of red/black lists according to the public lists of the Credit China website.
Furthermore, the government affairs data come from a government system on the government side and the bank data come from a bank system on the bank side; the respective data are not shared and each party models locally, achieving the effect of aggregating the data of different parties for joint modeling; the customer samples corresponding to the data of the two sides are customers in the same region.
The invention has the beneficial effects that:
From the perspective of the whole data industry, federated learning can increase the total amount of usable data and effectively solve the data-island problem; for enterprises, federated learning provides a simple, legal and low-cost way to obtain effective external data, quickly resolving difficulties caused by insufficient data volume or data dimensions, without leaking data or business secrets between cooperating enterprises.
Detailed Description
A citizen credit score assessment method applying logistic regression modeling comprises the following steps:
the method comprises the following steps: acquiring independent variable data, wherein the independent variable data comprises: government affairs data and bank data;
wherein the government affair data comprises: identity characteristics, consumption capacity, credit history, qualification honor;
the identity characteristics are gender, age, ID card address, student status and education background; consumption capacity: the social security and housing provident fund contribution base, etc.; credit history: the number of applications authorized for credit services in the APP, the number of times of credit-service performance (on-time fulfilment) and the number of times of credit-service default; qualification honor: the number of red-list types and the number of black-list types, specifically according to public lists such as those of the Credit China website;
wherein the bank data include asset information, such as annual income and whether the customer owns a residence;
the government affairs data come from a government system on the government side and the bank data come from a bank system on the bank side; the respective data are not shared and each party models locally, achieving the effect of aggregating the data of different parties for joint modeling. The customer samples corresponding to the data of the two sides are customers in the same region.
Step two: acquiring dependent variable data y, specifically including good clients and bad clients, wherein the clients which are overdue for more than 60 days are the bad clients and the clients which are not overdue are the good clients;
step three: the bank side is called the guest and the government side is called the host; the government side comprises the government-affairs network, the social security bureau, the housing provident fund center and the like, so there are multiple hosts; and a third party called the arbiter is built on the government side;
step four: performing data analysis, specifically:
step 1): firstly, taking intersection of sample ids of a guest party and a host party, finding the same users, and using the users as modeling samples;
step 2): missing-value filling: guest and host each fill in missing values locally;
step 3): binning and calculating woe and IV values for each feature;
s3.1: the method adopts equal-frequency binning, specifically: all the feature values of each variable are divided into n bins so that each bin contains an equal number of samples, obtaining the split points of each bin, e.g. split_points = {'x1': [0.1, 0.2, 0.3, 0.4, ...], 'x2': [1, 2, 3, 4, ...], ...};
s3.2: the guest and the host respectively carry out equal frequency binning on all continuous variables locally;
s3.3: the guest performs Paillier encryption on y and sends the encrypted y to host;
s3.4: the guest calculates the local IV values: after binning, count the numbers of good and bad samples in each group of every feature, obtaining the following format: result_sum = {'x1': [[0,0],[2,1],[0,0],[1,0]], 'x2': [[0,0],[0,0],[0,0],[1,0],[2,1]], 'x3': [[0,0],[0,0],[2,1],[1,0]]}; then the woe values of each feature are calculated as follows:
s3.5: woe_i = ln(bad sample rate / good sample rate);
s3.6: iv_i = (bad sample rate - good sample rate) * woe_i;
s3.7: IV of a variable = the sum of iv_i over all bins of the variable;
s3.8: calculating the host-side IV values: the host calculates result_sum according to the encrypted y sent by the guest and sends the result to the guest;
s3.9: the guest receives the host's encrypted result_sum, first decrypts it, calculates the woe and IV values in the same way, and sends the host's woe_i results to the host;
s3.10: feature value conversion: guest and host each convert every feature value into the woe value of the bin it belongs to; these values replace the original independent variable x values as the new modeling independent variables; feature value conversion essentially discretizes the data for feature selection;
step 4): feature selection: the IV value is used for feature selection; a threshold thr is set by guest and host, the guest compares the IV values of all feature variables of both parties with thr, the feature variables with IV < thr are filtered out, and the remaining variables are used to construct the model; the specific method for constructing the model comprises the following steps:
step five: the method comprises the following steps of constructing a model by adopting a longitudinal logistic regression federated learning method, and specifically comprising the following steps:
s1: the guest first sets the label y to 1 or -1 and sets x to the woe values;
s2: batch the data to obtain batch_info, which comprises the data size of each batch and the number of data batches;
sending the batch _ info to the Host and the Arbiter;
sending the sample id of each batch of data, namely index, to the Host;
s3: guest and host each initialize their model locally, i.e. w is initialized with randomly generated numbers uniformly distributed between 0 and 1, and the constant term is 1;
s4: starting loop training, when the iteration number is less than the set maximum iteration number:
s401: initialize the batch counter to 0;
obtain the corresponding data features according to the ids of each batch of data;
s402: cyclically train on all the batched data, batch by batch:
s40201: calculating a gradient;
s40202: calculating loss;
s40203: updating the weight;
s403: obtain the stop-iteration flag is_converged sent by the Arbiter;
s404: iteration count = iteration count + 1;
s405: if the obtained stop-iteration flag is 'True', exit the outer loop; wherein, the detailed process of calculating the gradient is as follows:
step S01: calculate the forwards for aggregation:
the guest computes forwards = w * x, and the result is encrypted;
obtain the host_forwards = w * x sent by the host;
aggregated_forwards = forwards + host_forwards;
step S02: calculate fore_gradient:
fore_gradient = 0.25 * aggregated_forwards - 0.5 * y;
the fore_gradient is simultaneously sent to the host;
step S03: calculate the one-sided gradient:
guest and host each calculate: unilateral_gradient = fore_gradient * X / n; step S04: add the regularized one-sided gradient:
here, using L2 regularization, both guest and host calculate:
unilateral_gradient=unilateral_gradient+alpha*w
step S05: perform the gradient update, specifically:
obtain optim_guest_gradient and optim_host_gradient;
guest and host each send their unilateral_gradient to the arbiter;
the arbiter receives the one-sided gradients of both parties and first decrypts them;
then the decrypted results of the different parties, i.e. the optimal gradients, are sent back to the guest and the host respectively, namely:
host_optim_gradient=separate_optim_gradient[:-1];
guest_optim_gradient=separate_optim_gradient[-1];
the detailed process of calculating the loss is as follows:
SS 01: the guest obtains the information sent by the host; the total Loss of the two parties is calculated by the following formula:
Loss = (1/n) * Σ_i [ln(2) - 0.5*yi*(w*xi) + 0.125*(w*xi)^2]
SS 02: the guest and host parties respectively calculate the regularization term by using formulas, and the formula of the loss regularization term is as follows:
loss_norm = (alpha/2) * Σ_j (wj^2)
SS 03: the host sends the self-owned regularization term result host _ loss _ regular to the guest;
SS 04: the guest sums the regularization terms of both parties and the total loss calculated in the previous step, obtaining the total loss with the regularization terms added:
Loss=Loss+loss_norm,Loss=Loss+host_loss_regular;
SS 05: the guest then sends the Loss to the arbiter.
Wherein, the detailed process of the weight update is as follows:
guest and host respectively calculate the updated weight:
w=w-guest_optim_gradient,w=w-host_optim_gradient;
The detailed process of obtaining the stop-iteration flag is:
the arbiter uses the absolute value of the total Loss sent by the guest to judge whether training has converged: if the Loss value is less than the set threshold eps, is_converged = True, otherwise is_converged = False, and is_converged is sent to the guest and the host;
step six: carrying out model prediction; the method specifically comprises the following steps:
s601: the resulting model is W = [w0, w1, w2, ..., wn];
then the probability of predicting a sample as a bad customer is p = 1/(1 + e^(-z)), i.e.:
ln(odds)=ln(p/(1-p))=z=W*X=w0+w1*x1+w2*x2+...+wn*xn;
s602: convert the probability into a score (a positive integer);
total score = A - B*ln(odds) = A + B*(w0 + w1*x1 + ... + wn*xn);
s603: solve for A and B:
let the score when the odds x = good/bad be P; then the score at odds 2x should be P - PDO;
substituting the formula to obtain:
P=A-Bln(x);
P - PDO = A - B*ln(2x);
setting x to 5%, P to 800, PDO to 50;
calculating A and B and substituting the A and B into score;
score=(A+B*w0)+B*w1*x1+...+B*wn*xn;
in the formula, A + B*w0 is the base score and B*wi*xi is the score allocated to each variable;
s604: multiply the score corresponding to each variable by the woe_i of each of its bins to obtain the scoring result of each bin, as follows:
[Table: per-bin scoring results]
Therefore, for a new user, each variable of the user only needs to be mapped to its bin to obtain the corresponding woe value, and the score of the sample under each variable is calculated according to the formula above. Finally, the scores corresponding to all the variables are added up to obtain the final scoring result.
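As a final illustration, scoring a new user as described above can be sketched as follows (split_points, bin_scores and base_score are illustrative data structures produced by the earlier steps, not names from the patent):

    import numpy as np

    def score_new_user(raw_values, split_points, bin_scores, base_score):
        # raw_values:   {feature_name: raw value of the new user}
        # split_points: {feature_name: bin cut points from equal-frequency binning}
        # bin_scores:   {feature_name: per-bin scores, i.e. B * w_i * woe of each bin}
        total = base_score                                  # base score = A + B*w0
        for name, value in raw_values.items():
            bin_index = int(np.digitize(value, split_points[name]))
            total += bin_scores[name][bin_index]
        return total

    # Example with made-up numbers:
    # score_new_user({"annual_income": 72000},
    #                {"annual_income": [30000, 60000, 90000]},
    #                {"annual_income": [-12.0, 3.5, 8.1, 15.4]},
    #                base_score=583.9)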
At present there are many scenarios that call for federated learning, but it has not yet been deployed at large scale. Besides policies and standards that remain to be perfected, the technical requirements on engineers are high: technologies such as privacy-preserving modeling with federated learning still need more knowledge popularization and experience accumulation. However, as market demand and technical solutions gradually become clearer, more and more enterprises are expected to participate, and federated learning will help data flow so that data islands are connected into a network.
The foregoing is merely exemplary and illustrative of the present invention and various modifications, additions and substitutions may be made by those skilled in the art to the specific embodiments described without departing from the scope of the invention as defined in the following claims.

Claims (7)

1. A citizen credit score evaluation method applying logistic regression modeling is characterized by comprising the following steps:
the method comprises the following steps: acquiring independent variable data, wherein the independent variable data comprises: government affairs data and bank data;
wherein the government affair data comprises: identity characteristics, consumption capacity, credit history, qualification honor;
the bank data comprise asset information, including annual income and whether the customer owns a residence;
step two: acquiring dependent variable data y, specifically including good clients and bad clients, wherein the clients which are overdue for more than 60 days are the bad clients and the clients which are not overdue are the good clients;
step three: the bank side is called the guest and the government side is called the host; there are multiple hosts on the government side, specifically the government-affairs network, the social security bureau and the housing provident fund center; and a third party called the arbiter is built on the government side;
step four: performing data analysis, specifically:
step 1): firstly, taking intersection of sample ids of a guest party and a host party, finding the same users, and using the users as modeling samples;
step 2): missing-value filling: guest and host each fill in missing values locally;
step 3): binning and calculating woe and IV values for each feature;
s3.1: the method adopts equal-frequency binning, specifically: all the feature values of each variable are divided into n bins so that each bin contains an equal number of samples, obtaining the split points of each bin;
s3.2: the guest and the host respectively carry out equal frequency binning on all continuous variables locally;
s3.3: the guest performs Paillier encryption on y and sends the encrypted y to host;
s3.4: the guest calculates the local IV values: after binning, count the numbers of good and bad samples in each group of every feature, obtaining for example the following format: result_sum = {'x1': [[0,0],[2,1],[0,0],[1,0]], 'x2': [[0,0],[0,0],[0,0],[1,0],[2,1]], 'x3': [[0,0],[0,0],[2,1],[1,0]]}; then the woe value of each bin of each feature is calculated as follows:
woe_i = ln(bad sample rate / good sample rate);
iv_i = (bad sample rate - good sample rate) * woe_i;
IV of a variable = the sum of iv_i over all bins of the variable;
s3.5: calculating the host-side IV values: according to the encrypted y sent by the guest, the host calculates the good/bad sample counts result_sum of all groups of all its binned features and sends the result to the guest;
s3.6: the guest receives the host's encrypted result_sum, first decrypts it, calculates the woe and IV values in the same way, and sends the host's woe_i results to the host;
s3.7: feature value conversion: guest and host each convert every feature value into the woe value of the bin it belongs to; these values replace the original independent variable x values as the new modeling independent variables;
step 4): feature selection: the IV value is used for feature selection; a threshold thr is set by guest and host, the guest compares the IV values of all feature variables of both parties with thr, the feature variables with IV < thr are filtered out, and the remaining variables are used to construct the model;
step five: the method comprises the following steps of constructing a model by adopting a longitudinal logistic regression federated learning method, and specifically comprising the following steps:
s1: the guest first sets the label y to 1 or -1 and sets x to the woe values;
s2: batch the data; the batching result is denoted batch_info, which includes the data size of each batch and the number of data batches;
the batch_info is sent to the Host and the Arbiter;
sending the sample id of each batch of data, namely index, to the Host;
s3: guest and host each initialize their model locally, i.e. w is initialized with randomly generated numbers uniformly distributed between 0 and 1, and the constant term is 1; the iteration count is initialized to 0;
s4: starting loop training, when the iteration number is less than the set maximum iteration number:
s401: initialize the batch counter to 0;
obtain the corresponding data features according to the ids of each batch of data;
s402: cyclically train on all the batched data, batch by batch:
s40201: calculating a gradient;
s40202: calculating loss;
s40203: updating the weight;
s403: obtain the stop-iteration flag is_converged sent by the Arbiter;
s404: iteration count = iteration count + 1;
s405: if the obtained stop-iteration flag is 'True', exit the outer loop; step six: carrying out model prediction;
s601: the resulting model is W = [w0, w1, w2, ..., wn];
then the probability that a sample is predicted to be a bad customer is p = 1/(1 + e^(-W*X)), i.e.:
ln(odds) = ln(p/(1-p)) = W*X = w0 + w1*x1 + w2*x2 + ... + wn*xn;
s602: convert the probability into a score (a positive integer);
credit score = parameter A + parameter B*(W*X) = A + B*(w0 + w1*x1 + ... + wn*xn);
s603: solve for A and B:
let the score when the odds x = good/bad be P; then the score at odds 2x should be P - PDO;
substituting into the formula gives:
P = A - B*ln(x);
P - PDO = A - B*ln(2x);
set x = 5%, P = 800, PDO = 50;
calculate A and B and substitute them into the credit score;
credit score = (A + B*w0) + B*w1*x1 + ... + B*wn*xn;
in the formula, A + B*w0 is the base score and B*wi*xi is the score allocated to each variable;
s604: multiply the score corresponding to each variable by the woe_i of each of its bins to obtain the scoring result of each bin.
2. The method for citizen credit point assessment using logistic regression modeling as claimed in claim 1, wherein the detailed process of calculating the gradient is:
step S01: calculate the forwards for aggregation:
the guest computes forwards = w * x (feature values times weights) and encrypts the result;
obtain the host_forwards = w * x sent by the host;
after aggregation, aggregated_forwards = forwards + host_forwards;
step S02: calculate fore_gradient:
fore_gradient = 0.25 * aggregated_forwards - 0.5 * y; in the formula, y is the true sample label value;
the fore_gradient is simultaneously sent to the host;
step S03: calculate the one-sided gradient:
guest and host each calculate:
unilateral_gradient = fore_gradient * X / n, where n is the sample size;
step S04: add the regularized one-sided gradient:
here, using L2 regularization, guest and host each calculate:
unilateral_gradient = unilateral_gradient + alpha * w, where alpha is the regularization coefficient;
step S05: perform the gradient update, specifically:
obtain optim_guest_gradient and optim_host_gradient;
guest and host each send their unilateral_gradient to the arbiter;
the arbiter receives the one-sided gradients of both parties and first decrypts them;
then the decrypted results of the different parties, i.e. the optimal gradients, are sent back to the guest and the host respectively, namely:
host_optim_gradient = separate_optim_gradient[:-1], i.e. all values except the last one;
guest_optim_gradient = separate_optim_gradient[-1], i.e. the last value;
wherein: host_optim_gradient: the optimal gradient of the host side;
guest_optim_gradient: the optimal gradient of the guest side;
separate_optim_gradient: the set of gradients of the guest and the multiple host parties collected by the arbiter.
3. The citizen credit score assessment method using logistic regression modeling as claimed in claim 1, wherein the detailed process of calculating loss is:
SS 01: the guest obtains the information sent by the host; the total Loss of the two parties is calculated by the following formula:
Loss = (1/n) * Σ_i [ln(2) - 0.5*yi*(w*xi) + 0.125*(w*xi)^2]
wherein Loss is total Loss; n is the total sample size of the two parties; yi is the actual label value of the ith sample; w is a feature weight; x is a characteristic value;
SS 02: the guest and host parties respectively calculate the regularization term by using formulas, and the formula of the loss regularization term is as follows:
loss_norm = (alpha/2) * Σ_j (wj^2)
SS 03: the host sends the self-owned regularization term result host _ loss _ regular to the guest;
SS 04: the guest sums the regularization terms of both parties and the total loss calculated in the previous step, obtaining the total loss with the regularization terms added:
Loss=Loss+loss_norm,Loss=Loss+host_loss_regular;
SS 05: the guest then sends the Loss to the arbiter.
4. The citizen credit score assessment method using logistic regression modeling according to claim 1, wherein the detailed process of weight update is as follows:
guest and host respectively calculate the updated weight:
w=w-guest_optim_gradient,w=w-host_optim_gradient。
5. The method for citizen credit point assessment using logistic regression modeling as claimed in claim 1, wherein the detailed process of obtaining the stop-iteration flag is:
the arbiter uses the absolute value of the total Loss sent by the guest to judge whether training has converged: if the Loss value is less than the set threshold eps, is_converged = True, otherwise is_converged = False, and is_converged is sent to the guest and the host.
6. The method of claim 1, wherein the identity characteristics in step one are gender, age, ID card address and student status; consumption capacity: the social security and housing provident fund contribution base, etc.; credit history: the number of applications authorized for credit services in the APP, the number of times of credit-service performance (on-time fulfilment) and the number of times of credit-service default; qualification honor: the number of red-list types and the number of black-list types, specifically the numbers of red/black lists according to the public lists of the Credit China website.
7. The citizen credit point assessment method applying logistic regression modeling according to claim 1, wherein the government affairs data come from a government system on the government side and the bank data come from a bank system on the bank side; the respective data are not shared and each party models locally, achieving the effect of aggregating the data of different parties for joint modeling; the customer samples corresponding to the data of the two sides are customers in the same region.
CN202010568798.5A 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling Withdrawn CN111724175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568798.5A CN111724175A (en) 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568798.5A CN111724175A (en) 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling

Publications (1)

Publication Number Publication Date
CN111724175A true CN111724175A (en) 2020-09-29

Family

ID=72568672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568798.5A Withdrawn CN111724175A (en) 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling

Country Status (1)

Country Link
CN (1) CN111724175A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270597A (en) * 2020-11-10 2021-01-26 恒安嘉新(北京)科技股份公司 Business processing and credit evaluation model training method, device, equipment and medium
CN112651170A (en) * 2020-12-14 2021-04-13 德清阿尔法创新研究院 Efficient feature contribution evaluation method in longitudinal federated learning scene
CN112651170B (en) * 2020-12-14 2024-02-27 德清阿尔法创新研究院 Efficient characteristic contribution assessment method in longitudinal federal learning scene
CN113345597A (en) * 2021-07-15 2021-09-03 中国平安人寿保险股份有限公司 Federal learning method and device of infectious disease probability prediction model and related equipment

Similar Documents

Publication Publication Date Title
Lv et al. Analysis of healthcare big data
CN111724175A (en) Citizen credit point evaluation method applying logistic regression modeling
WO2021179720A1 (en) Federated-learning-based user data classification method and apparatus, and device and medium
US20230100679A1 (en) Data processing method, apparatus, and device, computer-readable storage medium, and computer program product
Vu Privacy-preserving Naive Bayes classification in semi-fully distributed data model
CN112183730A (en) Neural network model training method based on shared learning
JP6892454B2 (en) Systems and methods for calculating the data confidentiality-practicality trade-off
CN110414987A (en) Recognition methods, device and the computer system of account aggregation
CN112949760A (en) Model precision control method and device based on federal learning and storage medium
Jatain et al. A contemplative perspective on federated machine learning: Taxonomy, threats & vulnerability assessment and challenges
CN116049909B (en) Feature screening method, device, equipment and storage medium in federal feature engineering
CN110210858A (en) A kind of air control guard system design method based on intelligent terminal identification
CN103870668A (en) Method and device for establishing master patient index oriented to regional medical treatment
CN114372871A (en) Method and device for determining credit score value, electronic device and storage medium
CN116204773A (en) Causal feature screening method, causal feature screening device, causal feature screening equipment and storage medium
CN107070932B (en) Anonymous method for preventing label neighbor attack in social network dynamic release
CN114742239A (en) Financial insurance claim risk model training method and device based on federal learning
Liang et al. A methodology of trusted data sharing across telecom and finance sector under china’s data security policy
WO2019223082A1 (en) Customer category analysis method and apparatus, and computer device and storage medium
CN106685893A (en) Authority control method based on social networking group
CN114422105A (en) Joint modeling method and device, electronic equipment and storage medium
US11630852B1 (en) Machine learning-based clustering model to create auditable entities
CN113591115A (en) Method for batch normalization in logistic regression model for safe federal learning
CN113553612A (en) Privacy protection method based on mobile crowd sensing technology
Fu et al. Data heterogeneous federated learning algorithm for industrial entity extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20200929