CN115204934A

CN115204934A - Scorecard-based fraud risk prediction method for State Grid electricity customers

Info

Publication number: CN115204934A
Application number: CN202210524254.8A
Authority: CN
Inventors: 徐家宁; 楼斐; 蒋颖; 吴懿臻; 张维; 徐宏伟; 俞佳莉; 陈齐瑞; 陈昱伶; 张一池; 罗欣
Original assignee: Zhejiang Huayun Information Technology Co Ltd; Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Current assignee: Zhejiang Huayun Information Technology Co Ltd; Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date: 2022-05-13
Filing date: 2022-05-13
Publication date: 2022-10-18

Abstract

The invention relates to a scorecard-based fraud risk prediction method for State Grid electricity customers. The technical scheme comprises: S1, data acquisition: collating data such as basic user account information, user behavior information and activity participation information, verifying the integrity and accuracy of the data, and establishing an electricity customer feature library; S2, data cleaning; S3, feature engineering; S4, building a model with a scorecard algorithm, and training and validating it; and S5, dividing users into risk levels according to the scorecard output and configuring a different handling strategy for each level.

Description

Scorecard-based fraud risk prediction method for State Grid electricity customers

Technical Field

The invention belongs to the field of methods for effectively intercepting risky customers and ensuring that activity rewards are issued effectively, and relates to a scorecard-based fraud risk prediction method for State Grid electricity customers.

Background

With the growing popularity of Internet apps, the number of online electricity customers has risen steadily, and the volume of electric power services handled online has grown with it. Online customer operations, such as running promotional activities and issuing rights and benefits, are becoming increasingly important. At present, risk control measures against fraud rings are lacking: a traditional risk control rule engine cannot effectively identify and intercept fraud rings, so the only available responses are measures such as taking the activity offline, which gives customers of operational activities a poor experience.

Disclosure of Invention

The invention addresses the problems in the prior art that risk control measures against fraud rings are lacking, that a traditional risk control rule engine cannot effectively identify and intercept fraud rings, that the only available responses are measures such as taking the activity offline, and that customers of operational activities therefore have a poor experience, and provides a scorecard-based fraud risk prediction method for State Grid electricity customers.

The technical scheme adopted by the invention to solve the technical problem is as follows: a scorecard-based fraud risk prediction method for State Grid electricity customers, comprising the following steps:

S1, data acquisition: collating basic user account information, user behavior information and activity participation information, verifying the integrity and accuracy of the data, and establishing an electricity customer feature library;

S2, data cleaning;

S3, feature engineering;

S4, building a model with a scorecard algorithm, and training and validating it;

S5, dividing users into risk levels according to the scorecard output, and configuring different handling strategies;

In step S3, the feature engineering comprises the following sub-steps:

A1, for missing values and outliers, matching a processing strategy to the missing ratio;

A2, adopting an optimized binning strategy to reduce the risk of model overfitting;

A3, calculating the WOE (weight of evidence) and IV (information value) values of each bin of each variable according to the binning result, for use in variable screening and model training.

The invention takes the account information, activity data, behavior data and the like of the "Online State Grid" app of State Grid Zhejiang Electric Power Co Ltd as basic data, establishes an electricity customer feature library, and predicts the fraud risk of electricity customers based on a scorecard algorithm, so that risky customers are effectively intercepted and activity rewards are issued effectively.

Preferably, in A1, numerical features are filled using the median, the mean or linear filling, categorical features are filled with the mode, and outliers are detected and handled by an extreme value method or the interquartile range method.

Preferably, in A2, the sample proportion and the positive-to-negative label proportion of the feature intervals are used to merge adjacent intervals that satisfy the merging condition, until a stopping criterion is satisfied.

Preferably, in A2, the following sub-steps are performed:

A21, sorting and initial binning: the numerical features are sorted, and the number of initial bins is min(100, n × 10%), where n is the sample size;

A22, calculating $W_i$: a binning index $W_i$ is constructed for each interval, where

$y_i$: number of negative samples in interval i

$y_T$: total number of negative samples

$n_i$: number of positive samples in interval i

$n_T$: total number of positive samples

A23, merging intervals: the merge gain of adjacent intervals is calculated as

$E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}$

$E_{i,i-1}$ and $E_{i,i+1}$ are calculated respectively, and the n adjacent pairs with the largest gain are merged;

A24, ending binning: the binning operation ends when the number of intervals meets the expected setting. The stopping condition is that the desired number of bins is reached or the iterations are complete.
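Sub-steps A21 to A24 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the formula for the index W is not reproduced in this text, so the sketch assumes an IV-style contribution with a small smoothing constant `eps`; it merges one adjacent pair per iteration rather than the top n; and it ignores tied values that straddle a bin boundary. The function names `bin_index`, `initial_bins` and `merge_bins` are hypothetical.

```python
import math


def bin_index(neg, pos, neg_total, pos_total, eps=0.5):
    """Assumed W for one interval: IV-style contribution, smoothed by eps."""
    neg_share = (neg + eps) / (neg_total + eps)
    pos_share = (pos + eps) / (pos_total + eps)
    return (neg_share - pos_share) * math.log(neg_share / pos_share)


def initial_bins(values, labels, max_bins=100, fraction=0.10):
    """A21: sort, then cut into min(100, n * 10%) equal-frequency bins.

    Each bin is [upper_value, negatives, positives]; label 1 means positive.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    k = max(1, min(max_bins, int(n * fraction)))
    bins = []
    for b in range(k):
        chunk = pairs[b * n // k:(b + 1) * n // k]
        pos = sum(label for _, label in chunk)
        bins.append([chunk[-1][0], len(chunk) - pos, pos])
    return bins


def merge_bins(bins, target_bins):
    """A23/A24: merge the adjacent pair with the largest gain until done."""
    bins = [list(b) for b in bins]
    neg_total = sum(b[1] for b in bins)
    pos_total = sum(b[2] for b in bins)

    def w(neg, pos):
        return bin_index(neg, pos, neg_total, pos_total)

    while len(bins) > target_bins:
        best_i, best_gain = 0, -math.inf
        for i in range(len(bins) - 1):
            left, right = bins[i], bins[i + 1]
            # E = W(merged) - W(left) - W(right)
            gain = (w(left[1] + right[1], left[2] + right[2])
                    - w(left[1], left[2]) - w(right[1], right[2]))
            if gain > best_gain:
                best_i, best_gain = i, gain
        right = bins.pop(best_i + 1)
        left = bins[best_i]
        bins[best_i] = [right[0], left[1] + right[1], left[2] + right[2]]
    return bins
```

With 100 sorted samples whose labels switch from 0 to 1 at value 80, `initial_bins` yields 10 bins and `merge_bins(..., target_bins=2)` keeps the single boundary between the two label groups.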

Preferably, the WOE (weight of evidence) and IV (information value) values of each bin of each variable are calculated according to the binning result, for use in variable screening and model training, wherein

WOE calculation formula:

IV calculation formula:

According to the calculation results, features with IV less than 0.1 are removed, and the remaining features enter model training.
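The two formulas are not reproduced in this text. For reference, the definitions commonly used in scorecard modeling can be written with the symbols of A22 ($y_i$, $y_T$ for negative samples and $n_i$, $n_T$ for positive samples, with $k$ bins) as below. The orientation of the ratio inside the logarithm is a convention and is an assumption here; the patent may use the reciprocal, which flips the sign of WOE but leaves IV unchanged.

```latex
\mathrm{WOE}_i = \ln\!\left(\frac{y_i / y_T}{n_i / n_T}\right),
\qquad
\mathrm{IV} = \sum_{i=1}^{k}\left(\frac{y_i}{y_T} - \frac{n_i}{n_T}\right)\mathrm{WOE}_i
```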

Preferably, in step S4, a scorecard algorithm is used for model training and validation, with 70% of the sample data used as the training set and 30% as the validation set.

The substantial effects of the invention are as follows: the invention takes the account information, activity data, behavior data and the like of the "Online State Grid" app of State Grid Zhejiang Electric Power Co Ltd as basic data, establishes an electricity customer feature library, and predicts the fraud risk of electricity customers based on a scorecard algorithm, so that risky customers are effectively intercepted and activity rewards are issued effectively.

Drawings

FIG. 1 is a schematic overall flow chart of the present invention;

FIG. 2 is a comparison graph of the model effect of the training set and the test set in the present invention;

FIG. 3 is a schematic flow diagram of the feature binning method of the present invention;

FIG. 4 is a flow chart illustrating the improved optimal binning strategy of the present invention.

Detailed Description

The technical solution of the present invention will be further specifically described below by way of specific examples.

Example 1:

A scorecard-based fraud risk prediction method for State Grid electricity customers (see FIG. 1) comprises the following steps:

s1, data acquisition

Data such as basic user account information, user behavior information and activity participation information are collated, the integrity and accuracy of the data are verified, and an electricity customer feature library is established. For the modeling data time range, sample data are extracted using June 2020 to May 2021 as the observation period, and sample labels are extracted from the period June 2021 to September 2021.

S2, data cleaning. Data cleaning is prior art: data are selected according to a preset format, and it is not described in detail in this embodiment.

S3, feature engineering

A1, processing missing values and outliers

In this method, processing strategies are matched to the missing ratio: features with a high missing ratio are removed, numerical features are filled using methods such as the median, the mean and linear filling, and categorical features are filled with the mode. Outliers are detected and handled by an extreme value method (for example, capping at the 1% or 99% quantile), the interquartile range method, or the like.
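The A1 strategies can be illustrated with a short Python sketch using only the standard library. It is a hedged example rather than the embodiment's code: the helper names are hypothetical, the choice between median and mean is left as a parameter because the text does not say which applies when, the IQR multiplier `k=1.5` is the conventional default and is an assumption, and no threshold for a "high" missing ratio is stated in the text, so none is hard-coded.

```python
import statistics


def missing_ratio(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def fill_numeric(values, strategy="median"):
    """Fill None in a numerical feature with the median or the mean."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def fill_category(values):
    """Fill None in a categorical feature with the mode."""
    fill = statistics.mode([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def cap_by_quantile(values):
    """Extreme value method: clip to the 1% and 99% quantiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[0], cuts[-1]  # 1st and 99th percentiles
    return [min(max(v, low), high) for v in values]


def iqr_bounds(values, k=1.5):
    """Interquartile range method: values outside the bounds are outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

A feature whose `missing_ratio` exceeds whatever cut-off the modeler chooses would be dropped before any filling is applied.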

A2, feature binning

In this embodiment, a custom optimal binning strategy is used to bin the categorical features and the numerical features separately, which makes the features robust to abnormal data and reduces the risk of model overfitting.

A3, WOE and IV value calculation

The WOE (weight of evidence) and IV (information value) values of each bin of each variable are calculated according to the binning result, for use in variable screening and model training.

WOE calculation formula:

IV calculation formula:

According to the calculation results, features with IV less than 0.1 are removed; the features used for model training are shown in Table 1:

Feature	IV value
Number of accounts logged in on the device	0.31
Number of accounts associated with the IP in the last 7 days	0.26
Number of activity participations in the last 7 days	0.16
Red envelope amount used in the last 7 days	0.13
Number of account bindings in the last 30 days	0.19
Total points in the last 30 days	0.22
Total number of people referred	0.11
Account age	0.18
Real-name authentication	0.19
Number of login days in the last 180 days	0.12
Number of account unbindings in the last 30 days	0.16
Red envelope deduction ratio in the last 7 days	0.25
Points deduction ratio in the last 7 days	0.13
Cumulative number of login devices	0.14
Number of payments in the last 30 days	0.11
Total amount paid for households in the last 30 days	0.13
Cumulative number of households paid for	0.12
Number of bound households per day	0.23

Table 1. Features used in final model training
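The WOE/IV calculation and the IV < 0.1 screening rule can be sketched as follows. The sketch assumes the common definition WOE = ln(negative share / positive share) and IV = sum of (negative share - positive share) × WOE, which this text does not confirm; it also requires every bin to contain both classes, leaving smoothing or further merging to the caller. The names `woe_iv` and `screen_by_iv` are hypothetical.

```python
import math


def woe_iv(bins):
    """bins: list of (negatives, positives) counts, one pair per bin.

    Returns (list of WOE values, IV of the feature) under the assumed
    convention WOE = ln(negative share / positive share).
    """
    neg_total = sum(neg for neg, _ in bins)
    pos_total = sum(pos for _, pos in bins)
    woes, iv = [], 0.0
    for neg, pos in bins:
        if neg == 0 or pos == 0:
            raise ValueError("each bin needs both classes; merge or smooth first")
        neg_share = neg / neg_total
        pos_share = pos / pos_total
        woe = math.log(neg_share / pos_share)
        woes.append(woe)
        iv += (neg_share - pos_share) * woe
    return woes, iv


def screen_by_iv(iv_by_feature, threshold=0.1):
    """Remove features whose IV is below the threshold (0.1 in the text)."""
    return {name: iv for name, iv in iv_by_feature.items() if iv >= threshold}
```

For example, `screen_by_iv({"Account age": 0.18, "hypothetical weak feature": 0.05})` keeps only "Account age"; the second entry is invented for illustration and does not appear in Table 1.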

S4, model training and validation (see FIG. 2)

The modeling uses a scorecard algorithm. A total of 16.58 million samples were collected, of which 70% are used as the training set and 30% as the validation set.
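The 70/30 division can be sketched as a seeded random split. Random shuffling is an assumption; the text gives only the proportions and does not say whether the split is random, stratified or time-based. The function name `split_train_validation` is hypothetical.

```python
import random


def split_train_validation(samples, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed, then cut at train_fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed keeps the split reproducible between the training run and later validation runs.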

S5, model result application

Users are divided into risk levels according to the scorecard output, and different handling strategies are configured. The levels and handling strategies are detailed in Table 2:

Score interval	Risk level	Handling strategy
(80,100]	Low risk	No action
(60,80]	Medium risk	Monitor
(0,60]	High risk	Intercept and blacklist

Table 2. Risk levels and handling strategies
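The mapping in Table 2 amounts to a lookup on half-open intervals, sketched below. The function name `risk_level` is hypothetical, and rejecting scores outside (0, 100] is an assumption, since the table does not say how such scores are treated.

```python
def risk_level(score):
    """Map a scorecard score in (0, 100] to (risk level, handling strategy)."""
    if not 0 < score <= 100:
        raise ValueError("score must lie in (0, 100]")
    if score > 80:
        return ("Low risk", "No action")
    if score > 60:
        return ("Medium risk", "Monitor")
    return ("High risk", "Intercept and blacklist")
```

Because the intervals are closed on the right, a score of exactly 80 is medium risk and a score of exactly 60 is high risk.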

The custom binning method for numerical features in this embodiment proceeds as follows (see FIG. 3):

In A2, the following sub-steps are performed:

A21, sorting and initial binning: the numerical features are sorted, and the number of initial bins is min(100, n × 10%), where n is the sample size;

A22, calculating $W_i$: a binning index $W_i$ is constructed for each interval, where

$y_i$: number of negative samples in interval i

$y_T$: total number of negative samples

$n_i$: number of positive samples in interval i

$n_T$: total number of positive samples

$E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}$

In this embodiment, the improved optimal binning strategy (see FIG. 4) is applied as follows:

First, data acquisition and data cleaning are carried out: data such as basic account information, user behavior information and user activity information are collected from the marketing system, the "Online State Grid" app platform and the like.

Next, feature engineering is performed: missing values are handled by methods such as median, mode, mean and linear filling; once missing values have been handled, optimal binning is applied to the features; the WOE and IV values are calculated after binning; and the features are preliminarily screened according to their IV values and correlation coefficients.

Then model training and validation are performed: the sample set is divided into a training set and a test set, the scorecard model is trained on the training set, and the weight of each feature is output. The trained model is then validated on the validation set to evaluate its actual effect. Next, feature scores are calculated: the score of each interval is computed from the WOE and the weight of the feature, forming the final scorecard. Finally, the scorecard is deployed: the scorecard model is deployed on a system server, and differentiated strategies are applied to users with different scores, improving the risk control quality of "Online State Grid".
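The step of turning WOE values and feature weights into interval scores can be sketched with the scaling commonly used for scorecards. Everything numeric here is an assumption: the text does not give a scaling rule, so `base_score`, `base_odds` and `pdo` (points to double the odds) are hypothetical parameters, and the model is assumed to output the log-odds of fraud as intercept + sum of weight × WOE, with a higher score meaning lower risk as in Table 2.

```python
import math


def scaling(base_score=60.0, base_odds=1.0, pdo=10.0):
    """Return (factor, offset) for score = offset - factor * ln(fraud odds).

    The score equals base_score at base_odds and drops by pdo points
    each time the fraud odds double.
    """
    factor = pdo / math.log(2)
    offset = base_score + factor * math.log(base_odds)
    return factor, offset


def build_scorecard(weights, woe_tables, intercept, factor, offset):
    """Points per bin: -factor * weight * WOE; the intercept goes to the base."""
    base_points = offset - factor * intercept
    card = {
        feature: {label: -factor * weights[feature] * woe
                  for label, woe in bins.items()}
        for feature, bins in woe_tables.items()
    }
    return base_points, card


def total_score(base_points, card, binned_record):
    """Sum the base points and the points of the bin each feature falls in."""
    return base_points + sum(card[f][b] for f, b in binned_record.items())
```

For instance, with one feature of weight 1.0 whose "low" bin has WOE ln 2, a zero intercept and the defaults above, a record in the "low" bin scores 50, which is 10 points below the base score of 60.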

This embodiment takes the account information, activity data, behavior data and the like of the "Online State Grid" app of State Grid Zhejiang Electric Power Co Ltd as basic data, establishes an electricity customer feature library, and predicts the fraud risk of electricity customers based on a scorecard algorithm, so that risky customers are effectively intercepted and activity rewards are issued effectively.

The above-described embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention in any way, and other variations and modifications may be made without departing from the spirit of the invention as set forth in the claims.

Claims

1. A scorecard-based fraud risk prediction method for State Grid electricity customers, characterized by comprising the following steps:

S2, data cleaning;

S3, feature engineering;

and A3, calculating the WOE and IV values of each bin of each variable according to the binning result, for use in variable screening and model training.

2. The scorecard-based fraud risk prediction method for State Grid electricity customers as claimed in claim 1, wherein in A1, numerical features are filled using the median, the mean or linear filling, categorical features are filled with the mode, and outliers are detected and handled by an extreme value method or the interquartile range method.

3. The scorecard-based fraud risk prediction method for State Grid electricity customers as claimed in claim 1, wherein in A2, adjacent intervals satisfying the merging condition are merged using the sample proportion and the positive-to-negative label proportion of the feature intervals, until a stopping criterion is satisfied.

4. The scorecard-based fraud risk prediction method for State Grid electricity customers according to claim 1 or 3, characterized in that in A2, the following sub-steps are performed:

A21, sorting and initial binning: the numerical features are sorted, and the number of initial bins is min(100, n × 10%), where n is the sample size;

A22, calculating $W_i$: a binning index $W_i$ is constructed for each interval, where

$y_i$: number of negative samples in interval i

$y_T$: total number of negative samples

$n_i$: number of positive samples in interval i

$n_T$: total number of positive samples

$E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}$

5. The method as claimed in claim 1, wherein the WOE and IV values of each bin of each variable are calculated according to the binning result, for use in variable screening and model training, wherein

WOE calculation formula:

IV calculation formula:

According to the calculation results, features with IV less than 0.1 are removed, and the remaining features enter model training.

6. The scorecard-based fraud risk prediction method for State Grid electricity customers according to claim 1, wherein in step S4, a scorecard algorithm is used for model training and validation, with 70% of the sample data used as the training set and 30% as the validation set.