CN111724175A - Citizen credit point evaluation method applying logistic regression modeling - Google Patents

Citizen credit point evaluation method applying logistic regression modeling

Info

Publication number
CN111724175A
CN111724175A CN202010568798.5A
Authority
CN
China
Prior art keywords
host
data
gradient
guest
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010568798.5A
Other languages
Chinese (zh)
Inventor
吴福全
朱全日
张小花
左杨
刘爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Dike Digital Gold Technology Co ltd
Original Assignee
Anhui Dike Digital Gold Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Dike Digital Gold Technology Co ltd filed Critical Anhui Dike Digital Gold Technology Co ltd
Priority to CN202010568798.5A
Publication of CN111724175A
Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/018Certifying business or products
    • G06Q30/0185Product, service or business identity fraud
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0609Buyer or seller confidence or verification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Technology Law (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a citizen credit point evaluation method applying logistic regression modeling, which comprises the following steps: first, independent variable data are obtained, comprising government affairs data and bank data; the government affairs data include identity characteristics, consumption capacity, credit history and qualification honors, and the bank data comprise asset information, including annual income and whether the customer owns a residence. Second, dependent variable data y are obtained, distinguishing good customers from bad customers: customers overdue for more than 60 days are bad customers and customers who are not overdue are good customers. Through federated learning the invention increases the total amount of usable data and effectively solves the data-island problem; for enterprises, federated learning provides a simple, legal and low-cost way to obtain effective external data, quickly resolving difficulties caused by insufficient data volume or data dimensions, without leaking data or business secrets between cooperating enterprises.

Description

Citizen credit point evaluation method applying logistic regression modeling
Technical Field
The invention belongs to the field of citizen credit points, and particularly relates to a citizen credit point evaluation method applying logistic regression modeling.
Background
Data privacy protection is achieved through federated learning. The design goal of federated learning is to perform efficient machine learning among multiple participants or computing nodes while guaranteeing information security during big-data exchange, protecting the privacy of terminal data and personal data, and ensuring legal compliance.
Federated learning is divided into horizontal federated learning and vertical (longitudinal) federated learning; the method herein uses vertical federated learning. Vertical federated learning applies when two data sets share the same sample ID space but differ in feature space. It aggregates these different features and computes training losses and gradients in a privacy-preserving manner, so that a model is built jointly from the data of both parties.
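By way of illustration only, the sample-alignment idea behind vertical federated learning can be sketched in Python as below (in practice a private set intersection protocol would typically be used so that neither party reveals its full id list; the table and column names here are assumptions, not part of the invention):

    import pandas as pd

    # Guest (bank) and host (government) each hold their own features, keyed by a shared sample id.
    guest_df = pd.DataFrame({"id": [1, 2, 3, 5], "annual_income": [80, 55, 30, 90]})
    host_df = pd.DataFrame({"id": [2, 3, 4, 5], "age": [34, 29, 41, 52]})

    # Only users present on both sides become modeling samples; features never leave their party.
    common_ids = sorted(set(guest_df["id"]) & set(host_df["id"]))
    guest_aligned = guest_df[guest_df["id"].isin(common_ids)].sort_values("id").reset_index(drop=True)
    host_aligned = host_df[host_df["id"].isin(common_ids)].sort_values("id").reset_index(drop=True)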
Disclosure of Invention
The invention aims to provide a citizen credit point evaluation method applying logistic regression modeling.
The purpose of the invention can be realized by the following technical scheme:
a citizen credit score evaluation method applying logistic regression modeling specifically comprises the following steps:
the method comprises the following steps: acquiring independent variable data, wherein the independent variable data comprises: government affairs data and bank data;
wherein the government affair data comprises: identity characteristics, consumption capacity, credit history, qualification honor;
the bank data comprise asset information, including annual income and whether the customer owns a residence;
step two: acquiring dependent variable data y, specifically including good clients and bad clients, wherein the clients which are overdue for more than 60 days are the bad clients and the clients which are not overdue are the good clients;
step three: the bank side is called the guest and the government side is called the host; there are multiple hosts on the government side, specifically the government-affairs network, the social security bureau and the housing provident fund center; and a third party called the arbiter is built on the government side;
step four: performing data analysis, specifically:
step 1): firstly, taking intersection of sample ids of a guest party and a host party, finding the same users, and using the users as modeling samples;
step 2): missing-value filling: guest and host each fill in missing values locally;
step 3): binning and calculating woe and IV values for each feature;
s3.1: the method adopts equal-frequency binning, specifically: all the feature values of each variable are divided into n bins so that each bin contains an equal number of samples, obtaining the split points of each bin;
s3.2: the guest and the host respectively carry out equal frequency binning on all continuous variables locally;
s3.3: the guest performs Paillier encryption on y and sends the encrypted y to host;
s3.4: the guest calculates the local IV values: after binning, count the numbers of good and bad samples in each group of every feature, obtaining for example the following format: result_sum = {'x1': [[0,0],[2,1],[0,0],[1,0]], 'x2': [[0,0],[0,0],[0,0],[1,0],[2,1]], 'x3': [[0,0],[0,0],[2,1],[1,0]]}; then the woe value of each bin of each feature is calculated as follows:
woe_i = ln(bad sample rate / good sample rate);
iv_i = (bad sample rate - good sample rate) * woe_i;
IV of a variable = the sum of iv_i over all bins of the variable;
s3.5: calculating the host-side IV values: according to the encrypted y sent by the guest, the host calculates the good/bad sample counts result_sum of all groups of all its binned features and sends the result to the guest;
s3.6: the guest receives the host's encrypted result_sum, first decrypts it, calculates the woe and IV values in the same way, and sends the host's woe_i results to the host;
s3.7: feature value conversion: guest and host each convert every feature value into the woe value of the bin it belongs to; these values replace the original independent variable x values as the new modeling independent variables;
step 4): feature selection: the IV value is used for feature selection; a threshold thr is set by guest and host, the guest compares the IV values of all feature variables of both parties with thr, the feature variables with IV < thr are filtered out, and the remaining variables are used to construct the model (a sketch of the binning, woe/IV and filtering steps is given below);
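As an illustration of steps 3) and 4), a minimal single-party Python sketch of equal-frequency binning, woe/IV computation and IV-threshold filtering follows (plaintext only, no Paillier encryption; names such as n_bins and thr are illustrative assumptions, not part of the invention):

    import numpy as np

    def equal_freq_split_points(x, n_bins=5):
        # Quantile cut points so that each bin holds roughly the same number of samples.
        return np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])

    def woe_iv(x, y, split_points):
        # y: 1 = bad customer, 0 = good customer (local, unencrypted illustration).
        bins = np.digitize(x, split_points)
        total_bad, total_good = y.sum(), (1 - y).sum()
        iv, woes = 0.0, {}
        for b in np.unique(bins):
            bad = y[bins == b].sum()
            good = (1 - y[bins == b]).sum()
            bad_rate = max(bad, 0.5) / total_bad       # 0.5 smoothing avoids log(0)
            good_rate = max(good, 0.5) / total_good
            woes[b] = np.log(bad_rate / good_rate)     # woe_i = ln(bad rate / good rate)
            iv += (bad_rate - good_rate) * woes[b]     # IV = sum of iv_i over all bins
        return woes, iv

    # Feature selection: keep only variables whose IV reaches a chosen threshold thr,
    # e.g. selected = [name for name, (_, iv) in results.items() if iv >= thr]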
step five: the method comprises the following steps of constructing a model by adopting a longitudinal logistic regression federated learning method, and specifically comprising the following steps:
s1: the guest first sets the label y to 1 or -1 and sets x to the woe values;
s2: batch the data; the batching result is denoted batch_info, which includes the data size of each batch and the number of data batches;
the batch_info is sent to the Host and the Arbiter;
sending the sample id of each batch of data, namely index, to the Host;
s3: guest and host each initialize their model locally, i.e. w is initialized with randomly generated numbers uniformly distributed between 0 and 1, and the constant term is 1; the iteration count is initialized to 0;
s4: starting loop training, when the iteration number is less than the set maximum iteration number:
s401: initialize the batch counter to 0;
obtain the corresponding data features according to the ids of each batch of data;
s402: cyclically train on all the batched data, batch by batch:
s40201: calculating a gradient;
s40202: calculating loss;
s40203: updating the weight;
s403: obtain the stop-iteration flag is_converged sent by the Arbiter;
s404: iteration count = iteration count + 1;
s405: if the obtained stop-iteration flag is 'True', exit the outer loop (a schematic sketch of this training loop is given below);
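For orientation, the structure of the training loop in step five can be sketched as an ordinary single-party logistic regression with the same batching and convergence logic (a simplification that ignores the guest/host split and the encryption; hyper-parameter names are assumptions):

    import numpy as np

    def train_lr(X, y, lr=0.1, batch_size=32, max_iter=20, eps=1e-4, alpha=0.01):
        # Single-party analogue of s3-s405: random init in [0, 1), constant term,
        # batched gradient steps, and a convergence flag based on the loss
        # (checked by the arbiter in the federated setting).
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])          # prepend the constant-term column
        w = np.random.uniform(0, 1, d + 1)
        n_iteration, is_converged = 0, False
        while n_iteration < max_iter and not is_converged:
            for start in range(0, n, batch_size):
                xb, yb = Xb[start:start + batch_size], y[start:start + batch_size]
                grad = xb.T @ (1 / (1 + np.exp(-(xb @ w))) - yb) / len(yb) + alpha * w
                w = w - lr * grad                      # weight update
            z = Xb @ w
            loss = np.mean(np.log(1 + np.exp(-(2 * y - 1) * z))) + alpha / 2 * (w @ w)
            is_converged = abs(loss) < eps             # arbiter-style check on |Loss|
            n_iteration += 1
        return w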
step six: carrying out model prediction;
s601: the resulting model is W = [w0, w1, w2, ..., wn];
then the probability that a sample is predicted to be a bad customer is p = 1/(1 + e^(-W*X)), i.e.:
ln(odds) = ln(p/(1-p)) = W*X = w0 + w1*x1 + w2*x2 + ... + wn*xn;
s602: convert the probability into a score (a positive integer);
credit score = parameter A + parameter B*(W*X) = A + B*(w0 + w1*x1 + ... + wn*xn);
s603: solve for A and B:
let the score when the odds x = good/bad be P; then the score at odds 2x should be P - PDO;
substituting into the formula gives:
P = A - B*ln(x);
P - PDO = A - B*ln(2x);
set x = 5%, P = 800, PDO = 50;
calculate A and B and substitute them into the credit score;
credit score = (A + B*w0) + B*w1*x1 + ... + B*wn*xn;
in the formula, A + B*w0 is the base score and B*wi*xi is the score allocated to each variable;
s604: multiply the score corresponding to each variable by the woe_i of each of its bins to obtain the scoring result of each bin.
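The two scorecard equations in s603 can be solved in closed form: subtracting them gives PDO = B*ln(2), so B = PDO/ln(2) and A = P + B*ln(x). A short Python sketch with the example values above (x = 5%, P = 800, PDO = 50; the helper below is illustrative, not from the patent):

    import numpy as np

    x_odds, P, PDO = 0.05, 800, 50
    B = PDO / np.log(2)            # ≈ 72.13
    A = P + B * np.log(x_odds)     # ≈ 583.9

    def credit_score(w, woe_features):
        # credit score = (A + B*w0) + B*w1*x1 + ... + B*wn*xn, with x already WOE-encoded.
        return (A + B * w[0]) + B * np.dot(w[1:], woe_features)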
Further, the detailed process of calculating the gradient is:
step S01: calculate the forwards for aggregation:
the guest computes forwards = w * x (feature values times weights) and encrypts the result;
obtain the host_forwards = w * x sent by the host;
after aggregation, aggregated_forwards = forwards + host_forwards;
step S02: calculate fore_gradient:
fore_gradient = 0.25 * aggregated_forwards - 0.5 * y; in the formula, y is the true sample label value;
the fore_gradient is simultaneously sent to the host;
step S03: calculate the one-sided gradient:
guest and host each calculate:
unilateral_gradient = fore_gradient * X / n, where n is the sample size;
step S04: add the regularized one-sided gradient:
here, using L2 regularization, guest and host each calculate:
unilateral_gradient = unilateral_gradient + alpha * w, where alpha is the regularization coefficient;
step S05: perform the gradient update, specifically:
obtain optim_guest_gradient and optim_host_gradient;
guest and host each send their unilateral_gradient to the arbiter;
the arbiter receives the one-sided gradients of both parties and first decrypts them;
then the decrypted results of the different parties, i.e. the optimal gradients, are sent back to the guest and the host respectively, namely:
host_optim_gradient = separate_optim_gradient[:-1], i.e. all values except the last one;
guest_optim_gradient = separate_optim_gradient[-1], i.e. the last value;
wherein: host_optim_gradient: the optimal gradient of the host side;
guest_optim_gradient: the optimal gradient of the guest side;
separate_optim_gradient: the set of gradients of the guest and the multiple host parties collected by the arbiter.
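To make the exchange concrete, a plaintext Python sketch of steps S01-S05 follows (the Paillier encryption and the arbiter's decryption are omitted, so values that the patent keeps encrypted appear in the clear here; array and function names are illustrative assumptions):

    import numpy as np

    def hetero_lr_gradients(guest_x, guest_w, host_x, host_w, y, alpha=0.01):
        # S01: each party computes its forwards = w * x and the values are aggregated
        # (the patent aggregates Paillier-encrypted forwards instead of plaintext).
        aggregated_forwards = guest_x @ guest_w + host_x @ host_w
        # S02: fore_gradient = 0.25 * aggregated_forwards - 0.5 * y, with y in {1, -1}.
        fore_gradient = 0.25 * aggregated_forwards - 0.5 * y
        n = len(y)
        # S03 + S04: one-sided gradients with L2 regularization on each side.
        guest_gradient = guest_x.T @ fore_gradient / n + alpha * guest_w
        host_gradient = host_x.T @ fore_gradient / n + alpha * host_w
        # S05: in the patent both gradients go to the arbiter, which decrypts and
        # returns the optimal gradients; here they are simply returned.
        return guest_gradient, host_gradient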
Further, the detailed process of calculating the loss is as follows:
SS 01: the guest obtains the information sent by the host; the total Loss of the two parties is calculated by the following formula:
Loss = (1/n) * Σ_i [ln(2) - 0.5*yi*(w*xi) + 0.125*(w*xi)^2]
wherein Loss is total Loss; n is the total sample size of the two parties; yi is the actual label value of the ith sample; w is a feature weight; x is a characteristic value;
SS 02: the guest and host parties respectively calculate the regularization term by using formulas, and the formula of the loss regularization term is as follows:
loss_norm = (alpha/2) * Σ_j (wj^2)
SS 03: the host sends the self-owned regularization term result host _ loss _ regular to the guest;
SS 04: the guest sums the regularization terms of both parties and the total loss calculated in the previous step, obtaining the total loss with the regularization terms added:
Loss=Loss+loss_norm,Loss=Loss+host_loss_regular;
SS 05: the guest then sends the Loss to the arbiter.
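Continuing the plaintext sketch, the total loss with the L2 terms of both parties can be written as below; the per-sample loss used here is the second-order Taylor approximation of the logistic loss consistent with fore_gradient = 0.25*wx - 0.5*y, which is an assumption where the patent shows the formula only as an image:

    import numpy as np

    def hetero_lr_loss(guest_x, guest_w, host_x, host_w, y, alpha=0.01):
        wx = guest_x @ guest_w + host_x @ host_w           # aggregated forwards, y in {1, -1}
        loss = np.mean(np.log(2) - 0.5 * y * wx + 0.125 * wx ** 2)
        loss_norm = alpha / 2 * (guest_w @ guest_w)        # guest regularization term
        host_loss_regular = alpha / 2 * (host_w @ host_w)  # host regularization term
        return loss + loss_norm + host_loss_regular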
Further, the detailed process of weight update is as follows:
guest and host respectively calculate the updated weight:
w=w-guest_optim_gradient,w=w-host_optim_gradient。
Further, the detailed process of obtaining the stop-iteration flag is as follows:
the arbiter uses the absolute value of the total Loss sent by the guest to judge whether training has converged: if the Loss value is less than the set threshold eps, is_converged = True, otherwise is_converged = False, and is_converged is sent to the guest and the host.
Further, the identity characteristics in step one are gender, age, ID card address and student status; consumption capacity: the social security and housing provident fund contribution base, etc.; credit history: the number of applications authorized for credit services in the APP, the number of times of credit-service performance (on-time fulfilment) and the number of times of credit-service default; qualification honor: the number of red-list types and the number of black-list types, specifically the numbers of red/black lists according to the public lists of the Credit China website.
Furthermore, the government affairs data come from a government system on the government side and the bank data come from a bank system on the bank side; the respective data are not shared and each party models locally, achieving the effect of aggregating the data of different parties for joint modeling; the customer samples corresponding to the data of the two sides are customers in the same region.
The invention has the beneficial effects that:
From the perspective of the whole data industry, federated learning can increase the total amount of usable data and effectively solve the data-island problem; for enterprises, federated learning provides a simple, legal and low-cost way to obtain effective external data, quickly resolving difficulties caused by insufficient data volume or data dimensions, without leaking data or business secrets between cooperating enterprises.
Detailed Description
A citizen credit score assessment method applying logistic regression modeling comprises the following steps:
the method comprises the following steps: acquiring independent variable data, wherein the independent variable data comprises: government affairs data and bank data;
wherein the government affair data comprises: identity characteristics, consumption capacity, credit history, qualification honor;
the identity characteristics are gender, age, ID card address, student status and education background; consumption capacity: the social security and housing provident fund contribution base, etc.; credit history: the number of applications authorized for credit services in the APP, the number of times of credit-service performance (on-time fulfilment) and the number of times of credit-service default; qualification honor: the number of red-list types and the number of black-list types, specifically according to public lists such as those of the Credit China website;
wherein the bank data include asset information, such as annual income and whether the customer owns a residence;
the government affairs data come from a government system on the government side and the bank data come from a bank system on the bank side; the respective data are not shared and each party models locally, achieving the effect of aggregating the data of different parties for joint modeling. The customer samples corresponding to the data of the two sides are customers in the same region.
Step two: acquiring dependent variable data y, specifically including good clients and bad clients, wherein the clients which are overdue for more than 60 days are the bad clients and the clients which are not overdue are the good clients;
step three: the bank side is called the guest and the government side is called the host; the government side comprises the government-affairs network, the social security bureau, the housing provident fund center and the like, so there are multiple hosts; and a third party called the arbiter is built on the government side;
step four: performing data analysis, specifically:
step 1): firstly, taking intersection of sample ids of a guest party and a host party, finding the same users, and using the users as modeling samples;
step 2): missing-value filling: guest and host each fill in missing values locally;
step 3): binning and calculating woe and IV values for each feature;
s3.1: the method adopts equal-frequency binning, specifically: all the feature values of each variable are divided into n bins so that each bin contains an equal number of samples, obtaining the split points of each bin, e.g. split_points = {'x1': [0.1, 0.2, 0.3, 0.4, ...], 'x2': [1, 2, 3, 4, ...], ...};
s3.2: the guest and the host respectively carry out equal frequency binning on all continuous variables locally;
s3.3: the guest performs Paillier encryption on y and sends the encrypted y to host;
s3.4: the guest calculates the local IV values: after binning, count the numbers of good and bad samples in each group of every feature, obtaining the following format: result_sum = {'x1': [[0,0],[2,1],[0,0],[1,0]], 'x2': [[0,0],[0,0],[0,0],[1,0],[2,1]], 'x3': [[0,0],[0,0],[2,1],[1,0]]}; then the woe values of each feature are calculated as follows:
s3.5: woe_i = ln(bad sample rate / good sample rate);
s3.6: iv_i = (bad sample rate - good sample rate) * woe_i;
s3.7: IV of a variable = the sum of iv_i over all bins of the variable;
s3.8: calculating the host-side IV values: the host calculates result_sum according to the encrypted y sent by the guest and sends the result to the guest;
s3.9: the guest receives the host's encrypted result_sum, first decrypts it, calculates the woe and IV values in the same way, and sends the host's woe_i results to the host;
s3.10: feature value conversion: guest and host each convert every feature value into the woe value of the bin it belongs to; these values replace the original independent variable x values as the new modeling independent variables; feature value conversion essentially discretizes the data for feature selection;
step 4): feature selection: the IV value is used for feature selection; a threshold thr is set by guest and host, the guest compares the IV values of all feature variables of both parties with thr, the feature variables with IV < thr are filtered out, and the remaining variables are used to construct the model; the specific method for constructing the model comprises the following steps:
step five: the method comprises the following steps of constructing a model by adopting a longitudinal logistic regression federated learning method, and specifically comprising the following steps:
s1: the guest first sets the label y to 1 or -1 and sets x to the woe values;
s2: batch the data to obtain batch_info, which comprises the data size of each batch and the number of data batches;
sending the batch _ info to the Host and the Arbiter;
sending the sample id of each batch of data, namely index, to the Host;
s3: guest and host each initialize their model locally, i.e. w is initialized with randomly generated numbers uniformly distributed between 0 and 1, and the constant term is 1;
s4: starting loop training, when the iteration number is less than the set maximum iteration number:
s401: initialize the batch counter to 0;
obtain the corresponding data features according to the ids of each batch of data;
s402: cyclically train on all the batched data, batch by batch:
s40201: calculating a gradient;
s40202: calculating loss;
s40203: updating the weight;
s403: obtain the stop-iteration flag is_converged sent by the Arbiter;
s404: iteration count = iteration count + 1;
s405: if the obtained stop-iteration flag is 'True', exit the outer loop; wherein, the detailed process of calculating the gradient is as follows:
step S01: calculate the forwards for aggregation:
the guest computes forwards = w * x, and the result is encrypted;
obtain the host_forwards = w * x sent by the host;
aggregated_forwards = forwards + host_forwards;
step S02: calculate fore_gradient:
fore_gradient = 0.25 * aggregated_forwards - 0.5 * y;
the fore_gradient is simultaneously sent to the host;
step S03: calculate the one-sided gradient:
guest and host each calculate: unilateral_gradient = fore_gradient * X / n; step S04: add the regularized one-sided gradient:
here, using L2 regularization, both guest and host calculate:
unilateral_gradient=unilateral_gradient+alpha*w
step S05: perform the gradient update, specifically:
obtain optim_guest_gradient and optim_host_gradient;
guest and host each send their unilateral_gradient to the arbiter;
the arbiter receives the one-sided gradients of both parties and first decrypts them;
then the decrypted results of the different parties, i.e. the optimal gradients, are sent back to the guest and the host respectively, namely:
host_optim_gradient=separate_optim_gradient[:-1];
guest_optim_gradient=separate_optim_gradient[-1];
the detailed process of calculating the loss is as follows:
SS 01: the guest obtains the information sent by the host; the total Loss of the two parties is calculated by the following formula:
Loss = (1/n) * Σ_i [ln(2) - 0.5*yi*(w*xi) + 0.125*(w*xi)^2]
SS 02: the guest and host parties respectively calculate the regularization term by using formulas, and the formula of the loss regularization term is as follows:
loss_norm = (alpha/2) * Σ_j (wj^2)
SS 03: the host sends the self-owned regularization term result host _ loss _ regular to the guest;
SS 04: the guest sums the regularization terms of both parties and the total loss calculated in the previous step, obtaining the total loss with the regularization terms added:
Loss=Loss+loss_norm,Loss=Loss+host_loss_regular;
SS 05: the guest then sends the Loss to the arbiter.
Wherein, the detailed process of the weight update is as follows:
guest and host respectively calculate the updated weight:
w=w-guest_optim_gradient,w=w-host_optim_gradient;
The detailed process of obtaining the stop-iteration flag is:
the arbiter uses the absolute value of the total Loss sent by the guest to judge whether training has converged: if the Loss value is less than the set threshold eps, is_converged = True, otherwise is_converged = False, and is_converged is sent to the guest and the host;
step six: carrying out model prediction; the method specifically comprises the following steps:
s601: the resulting model is W = [w0, w1, w2, ..., wn];
then the probability of predicting a sample as a bad customer is p = 1/(1 + e^(-z)), i.e.:
ln(odds)=ln(p/(1-p))=z=W*X=w0+w1*x1+w2*x2+...+wn*xn;
s602: convert the probability into a score (a positive integer);
total score = A - B*ln(odds) = A + B*(w0 + w1*x1 + ... + wn*xn);
s603: solve for A and B:
let the score when the odds x = good/bad be P; then the score at odds 2x should be P - PDO;
substituting the formula to obtain:
P=A-Bln(x);
P - PDO = A - B*ln(2x);
setting x to 5%, P to 800, PDO to 50;
calculating A and B and substituting the A and B into score;
score=(A+B*w0)+B*w1*x1+...+B*wn*xn;
in the formula, A + B*w0 is the base score and B*wi*xi is the score allocated to each variable;
s604: multiply the score corresponding to each variable by the woe_i of each of its bins to obtain the scoring result of each bin, as follows:
[Table: per-bin scoring results]
Therefore, for a new user, each variable of the user only needs to be mapped to its bin to obtain the corresponding woe value, and the score of the sample under each variable is calculated according to the formula above. Finally, the scores corresponding to all the variables are added up to obtain the final scoring result.
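As a final illustration, scoring a new user as described above can be sketched as follows (split_points, bin_scores and base_score are illustrative data structures produced by the earlier steps, not names from the patent):

    import numpy as np

    def score_new_user(raw_values, split_points, bin_scores, base_score):
        # raw_values:   {feature_name: raw value of the new user}
        # split_points: {feature_name: bin cut points from equal-frequency binning}
        # bin_scores:   {feature_name: per-bin scores, i.e. B * w_i * woe of each bin}
        total = base_score                                  # base score = A + B*w0
        for name, value in raw_values.items():
            bin_index = int(np.digitize(value, split_points[name]))
            total += bin_scores[name][bin_index]
        return total

    # Example with made-up numbers:
    # score_new_user({"annual_income": 72000},
    #                {"annual_income": [30000, 60000, 90000]},
    #                {"annual_income": [-12.0, 3.5, 8.1, 15.4]},
    #                base_score=583.9)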
At present there are many scenarios that call for federated learning, but it has not yet been deployed at large scale. Besides policies and standards that remain to be perfected, the technical requirements on engineers are high: technologies such as privacy-preserving modeling with federated learning still need more knowledge popularization and experience accumulation. However, as market demand and technical solutions gradually become clearer, more and more enterprises are expected to participate, and federated learning will help data flow so that data islands are connected into a network.
The foregoing is merely exemplary and illustrative of the present invention and various modifications, additions and substitutions may be made by those skilled in the art to the specific embodiments described without departing from the scope of the invention as defined in the following claims.

Claims (7)

1. A citizen credit score evaluation method applying logistic regression modeling is characterized by comprising the following steps:
the method comprises the following steps: acquiring independent variable data, wherein the independent variable data comprises: government affairs data and bank data;
wherein the government affair data comprises: identity characteristics, consumption capacity, credit history, qualification honor;
the bank data comprise asset information, including annual income and whether the customer owns a residence;
step two: acquiring dependent variable data y, specifically including good clients and bad clients, wherein the clients which are overdue for more than 60 days are the bad clients and the clients which are not overdue are the good clients;
step three: the bank side is called the guest and the government side is called the host; there are multiple hosts on the government side, specifically the government-affairs network, the social security bureau and the housing provident fund center; and a third party called the arbiter is built on the government side;
step four: performing data analysis, specifically:
step 1): firstly, taking intersection of sample ids of a guest party and a host party, finding the same users, and using the users as modeling samples;
step 2): missing-value filling: guest and host each fill in missing values locally;
step 3): binning and calculating woe and IV values for each feature;
s3.1: the method adopts equal-frequency binning, specifically: all the feature values of each variable are divided into n bins so that each bin contains an equal number of samples, obtaining the split points of each bin;
s3.2: the guest and the host respectively carry out equal frequency binning on all continuous variables locally;
s3.3: the guest performs Paillier encryption on y and sends the encrypted y to host;
s3.4: the guest calculates the local IV values: after binning, count the numbers of good and bad samples in each group of every feature, obtaining for example the following format: result_sum = {'x1': [[0,0],[2,1],[0,0],[1,0]], 'x2': [[0,0],[0,0],[0,0],[1,0],[2,1]], 'x3': [[0,0],[0,0],[2,1],[1,0]]}; then the woe value of each bin of each feature is calculated as follows:
woe_i = ln(bad sample rate / good sample rate);
iv_i = (bad sample rate - good sample rate) * woe_i;
IV of a variable = the sum of iv_i over all bins of the variable;
s3.5: calculating the host-side IV values: according to the encrypted y sent by the guest, the host calculates the good/bad sample counts result_sum of all groups of all its binned features and sends the result to the guest;
s3.6: the guest receives the host's encrypted result_sum, first decrypts it, calculates the woe and IV values in the same way, and sends the host's woe_i results to the host;
s3.7: feature value conversion: guest and host each convert every feature value into the woe value of the bin it belongs to; these values replace the original independent variable x values as the new modeling independent variables;
step 4): feature selection: the IV value is used for feature selection; a threshold thr is set by guest and host, the guest compares the IV values of all feature variables of both parties with thr, the feature variables with IV < thr are filtered out, and the remaining variables are used to construct the model;
step five: the method comprises the following steps of constructing a model by adopting a longitudinal logistic regression federated learning method, and specifically comprising the following steps:
s1: the guest first sets the label y to 1 or -1 and sets x to the woe values;
s2: batch the data; the batching result is denoted batch_info, which includes the data size of each batch and the number of data batches;
the batch_info is sent to the Host and the Arbiter;
sending the sample id of each batch of data, namely index, to the Host;
s3: guest and host each initialize their model locally, i.e. w is initialized with randomly generated numbers uniformly distributed between 0 and 1, and the constant term is 1; the iteration count is initialized to 0;
s4: starting loop training, when the iteration number is less than the set maximum iteration number:
s401: initialize the batch counter to 0;
obtain the corresponding data features according to the ids of each batch of data;
s402: cyclically train on all the batched data, batch by batch:
s40201: calculating a gradient;
s40202: calculating loss;
s40203: updating the weight;
s403: obtain the stop-iteration flag is_converged sent by the Arbiter;
s404: iteration count = iteration count + 1;
s405: if the obtained stop-iteration flag is 'True', exit the outer loop; step six: carrying out model prediction;
s601: the resulting model is W = [w0, w1, w2, ..., wn];
then the probability that a sample is predicted to be a bad customer is p = 1/(1 + e^(-W*X)), i.e.:
ln(odds) = ln(p/(1-p)) = W*X = w0 + w1*x1 + w2*x2 + ... + wn*xn;
s602: convert the probability into a score (a positive integer);
credit score = parameter A + parameter B*(W*X) = A + B*(w0 + w1*x1 + ... + wn*xn);
s603: solve for A and B:
let the score when the odds x = good/bad be P; then the score at odds 2x should be P - PDO;
substituting into the formula gives:
P = A - B*ln(x);
P - PDO = A - B*ln(2x);
set x = 5%, P = 800, PDO = 50;
calculate A and B and substitute them into the credit score;
credit score = (A + B*w0) + B*w1*x1 + ... + B*wn*xn;
in the formula, A + B*w0 is the base score and B*wi*xi is the score allocated to each variable;
s604: multiply the score corresponding to each variable by the woe_i of each of its bins to obtain the scoring result of each bin.
2. The method for citizen credit point assessment using logistic regression modeling as claimed in claim 1, wherein the detailed process of calculating the gradient is:
step S01: calculate the forwards for aggregation:
the guest computes forwards = w * x (feature values times weights) and encrypts the result;
obtain the host_forwards = w * x sent by the host;
after aggregation, aggregated_forwards = forwards + host_forwards;
step S02: calculate fore_gradient:
fore_gradient = 0.25 * aggregated_forwards - 0.5 * y; in the formula, y is the true sample label value;
the fore_gradient is simultaneously sent to the host;
step S03: calculate the one-sided gradient:
guest and host each calculate:
unilateral_gradient = fore_gradient * X / n, where n is the sample size;
step S04: add the regularized one-sided gradient:
here, using L2 regularization, guest and host each calculate:
unilateral_gradient = unilateral_gradient + alpha * w, where alpha is the regularization coefficient;
step S05: perform the gradient update, specifically:
obtain optim_guest_gradient and optim_host_gradient;
guest and host each send their unilateral_gradient to the arbiter;
the arbiter receives the one-sided gradients of both parties and first decrypts them;
then the decrypted results of the different parties, i.e. the optimal gradients, are sent back to the guest and the host respectively, namely:
host_optim_gradient = separate_optim_gradient[:-1], i.e. all values except the last one;
guest_optim_gradient = separate_optim_gradient[-1], i.e. the last value;
wherein: host_optim_gradient: the optimal gradient of the host side;
guest_optim_gradient: the optimal gradient of the guest side;
separate_optim_gradient: the set of gradients of the guest and the multiple host parties collected by the arbiter.
3. The citizen credit score assessment method using logistic regression modeling as claimed in claim 1, wherein the detailed process of calculating loss is:
SS 01: the guest obtains the information sent by the host; the total Loss of the two parties is calculated by the following formula:
Loss = (1/n) * Σ_i [ln(2) - 0.5*yi*(w*xi) + 0.125*(w*xi)^2]
wherein Loss is total Loss; n is the total sample size of the two parties; yi is the actual label value of the ith sample; w is a feature weight; x is a characteristic value;
SS 02: the guest and host parties respectively calculate the regularization term by using formulas, and the formula of the loss regularization term is as follows:
loss_norm = (alpha/2) * Σ_j (wj^2)
SS 03: the host sends the self-owned regularization term result host _ loss _ regular to the guest;
SS 04: the guest sums the regularization terms of both parties and the total loss calculated in the previous step, obtaining the total loss with the regularization terms added:
Loss=Loss+loss_norm,Loss=Loss+host_loss_regular;
SS 05: the guest then sends the Loss to the arbiter.
4. The citizen credit score assessment method using logistic regression modeling according to claim 1, wherein the detailed process of weight update is as follows:
guest and host respectively calculate the updated weight:
w=w-guest_optim_gradient,w=w-host_optim_gradient。
5. The method for citizen credit point assessment using logistic regression modeling as claimed in claim 1, wherein the detailed process of obtaining the stop-iteration flag is:
the arbiter uses the absolute value of the total Loss sent by the guest to judge whether training has converged: if the Loss value is less than the set threshold eps, is_converged = True, otherwise is_converged = False, and is_converged is sent to the guest and the host.
6. The method of claim 1, wherein the identity characteristics in step one are gender, age, ID card address and student status; consumption capacity: the social security and housing provident fund contribution base, etc.; credit history: the number of applications authorized for credit services in the APP, the number of times of credit-service performance (on-time fulfilment) and the number of times of credit-service default; qualification honor: the number of red-list types and the number of black-list types, specifically the numbers of red/black lists according to the public lists of the Credit China website.
7. The citizen credit point assessment method applying logistic regression modeling according to claim 1, wherein the government affairs data come from a government system on the government side and the bank data come from a bank system on the bank side; the respective data are not shared and each party models locally, achieving the effect of aggregating the data of different parties for joint modeling; the customer samples corresponding to the data of the two sides are customers in the same region.
CN202010568798.5A 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling Withdrawn CN111724175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568798.5A CN111724175A (en) 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568798.5A CN111724175A (en) 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling

Publications (1)

Publication Number Publication Date
CN111724175A true CN111724175A (en) 2020-09-29

Family

ID=72568672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568798.5A Withdrawn CN111724175A (en) 2020-06-19 2020-06-19 Citizen credit point evaluation method applying logistic regression modeling

Country Status (1)

Country Link
CN (1) CN111724175A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270597A (en) * 2020-11-10 2021-01-26 恒安嘉新(北京)科技股份公司 Business processing and credit evaluation model training method, device, equipment and medium
CN112651170A (en) * 2020-12-14 2021-04-13 德清阿尔法创新研究院 Efficient feature contribution evaluation method in longitudinal federated learning scene
CN112651170B (en) * 2020-12-14 2024-02-27 德清阿尔法创新研究院 Efficient characteristic contribution assessment method in longitudinal federal learning scene
CN113345597A (en) * 2021-07-15 2021-09-03 中国平安人寿保险股份有限公司 Federal learning method and device of infectious disease probability prediction model and related equipment

Similar Documents

Publication Publication Date Title
Lv et al. Analysis of healthcare big data
CN111724175A (en) Citizen credit point evaluation method applying logistic regression modeling
WO2021179720A1 (en) Federated-learning-based user data classification method and apparatus, and device and medium
US20230100679A1 (en) Data processing method, apparatus, and device, computer-readable storage medium, and computer program product
Vu Privacy-preserving Naive Bayes classification in semi-fully distributed data model
CN112183730A (en) Neural network model training method based on shared learning
JP6892454B2 (en) Systems and methods for calculating the data confidentiality-practicality trade-off
CN110414987A (en) Recognition methods, device and the computer system of account aggregation
CN112949760A (en) Model precision control method and device based on federal learning and storage medium
Jatain et al. A contemplative perspective on federated machine learning: Taxonomy, threats & vulnerability assessment and challenges
CN116049909B (en) Feature screening method, device, equipment and storage medium in federal feature engineering
CN110210858A (en) A kind of air control guard system design method based on intelligent terminal identification
CN103870668A (en) Method and device for establishing master patient index oriented to regional medical treatment
CN114372871A (en) Method and device for determining credit score value, electronic device and storage medium
CN116204773A (en) Causal feature screening method, causal feature screening device, causal feature screening equipment and storage medium
CN107070932B (en) Anonymous method for preventing label neighbor attack in social network dynamic release
CN114742239A (en) Financial insurance claim risk model training method and device based on federal learning
Liang et al. A methodology of trusted data sharing across telecom and finance sector under china’s data security policy
WO2019223082A1 (en) Customer category analysis method and apparatus, and computer device and storage medium
CN106685893A (en) Authority control method based on social networking group
CN114422105A (en) Joint modeling method and device, electronic equipment and storage medium
US11630852B1 (en) Machine learning-based clustering model to create auditable entities
CN113591115A (en) Method for batch normalization in logistic regression model for safe federal learning
CN113553612A (en) Privacy protection method based on mobile crowd sensing technology
Fu et al. Data heterogeneous federated learning algorithm for industrial entity extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20200929