CN115204934A - National grid electricity consumption customer fraud risk prediction method based on scoring card - Google Patents

Info

Publication number
CN115204934A
Authority
CN
China
Prior art keywords
data
fraud risk
national grid
prediction method
scoring card
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210524254.8A
Other languages
Chinese (zh)
Inventor
徐家宁
楼斐
蒋颖
吴懿臻
张维
徐宏伟
俞佳莉
陈齐瑞
陈昱伶
张一池
罗欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Huayun Information Technology Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Zhejiang Huayun Information Technology Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Huayun Information Technology Co Ltd, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202210524254.8A
Publication of CN115204934A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207 Discounts or incentives, e.g. coupons or rebates
    • G06Q30/0225 Avoiding frauds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a scorecard-based method for predicting the fraud risk of national grid electricity customers. The technical solution comprises: S1, data acquisition, in which data such as the user's basic account information, user behavior information, and campaign participation information are collated, the completeness and accuracy of the data are verified, and an electricity customer feature library is built; S2, data cleaning; S3, feature engineering; S4, building a model using a scorecard algorithm, then training and validating it; and S5, dividing users into risk levels according to the scorecard output and configuring a different handling strategy for each level.

Description

National grid electricity consumption customer fraud risk prediction method based on scoring card
Technical Field
The invention belongs to the field of methods for intercepting risky customers and ensuring that campaign rewards are issued effectively, and relates to a scorecard-based method for predicting the fraud risk of national grid electricity customers.
Background
With the growing adoption of the internet app, the number of online electricity customers has increased steadily, and the volume of electric power services handled online has grown with it. Operations aimed at online customers, such as campaign operations and benefit distribution, are becoming increasingly important. At present, risk-control measures against fraud rings are lacking: the traditional risk-control rule engine cannot effectively identify and intercept fraudsters, so the only recourse is to take a campaign offline, which gives campaign participants a poor experience.
Disclosure of Invention
The invention addresses the problems of the prior art, namely that risk-control measures against fraud rings are lacking, that the traditional risk-control rule engine cannot effectively identify and intercept fraudsters, that the only recourse is to take campaigns offline, and that campaign participants therefore have a poor experience, and provides a scorecard-based method for predicting the fraud risk of national grid electricity customers.
The technical solution adopted by the invention to solve this technical problem is as follows: a scorecard-based method for predicting the fraud risk of national grid electricity customers, comprising the following steps:
S1, data acquisition: collate the user's basic account information, user behavior information, and campaign participation information, verify the completeness and accuracy of the data, and build an electricity customer feature library;
S2, data cleaning;
S3, feature engineering;
S4, build a model using a scorecard algorithm, then train and validate it;
S5, divide users into risk levels according to the scorecard output and configure a different handling strategy for each level.
In step S3, the feature engineering comprises the following sub-steps:
A1, for missing values and outliers, apply the processing strategy that matches the missing ratio;
A2, apply an optimized binning strategy to reduce the risk of model overfitting;
A3, from the binning result, calculate the WOE (weight of evidence) and IV (information value) of each bin of each variable, for use in variable screening and model training.
The method uses account information, campaign data, behavior data, and other data from the "Online State Grid" app of State Grid Zhejiang Electric Power as its base data, builds an electricity customer feature library, and predicts the fraud risk of electricity customers with a scorecard algorithm, so that risky customers are intercepted and campaign rewards are issued effectively.
Preferably, in A1, numerical features are filled using the median, the mean, or linear interpolation; categorical features are filled using the mode; and outliers are detected and handled using an extreme-value method or the interquartile range method.
Preferably, in A2, the sample proportion and the positive-to-negative label proportion of the feature intervals are used to merge adjacent intervals that satisfy the merging condition, until a stopping criterion is met.
Preferably, in said A2, the following sub-steps are performed,
A21, sort and initialize the bins: sort the numerical feature, with the initial number of bins set to min(100, n × 10%), where n is the sample size;
A22, calculate $W_i$, constructing the index on which binning is based:
Figure BDA0003643422640000021
$y_i$: number of negative samples in interval $i$
$y_T$: total number of negative samples
$n_i$: number of positive samples in interval $i$
$n_T$: total number of positive samples
A23, merge intervals: calculate the merging gain of adjacent intervals,
$E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}$
calculate $E_{i,i-1}$ and $E_{i,i+1}$ respectively, and merge the top n combinations with the largest gain;
A24, end binning: the binning operation ends when the number of intervals meets the expected setting; the stopping condition is that the desired number of bins is reached or the iterations are complete.
Preferably, the WOE (weight of evidence) and IV (information value) of each bin of each variable are calculated from the binning result for use in variable screening and model training, where
WOE calculation formula:
Figure BDA0003643422640000022
IV calculation formula:
Figure BDA0003643422640000023
According to the calculation results, features with IV less than 0.1 are removed, and the remaining features enter model training.
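As a rough illustration of the WOE and IV step, the sketch below uses the common textbook definitions, with y counting negative samples and n counting positive samples per bin as in the symbol list for A22. The formulas themselves appear only as images in the source, so the orientation of the ratio (negative share over positive share) is an assumption, and the function names are hypothetical.

```python
import math

def woe_iv(neg_counts, pos_counts):
    """Return (woe_per_bin, iv) for one binned variable.

    neg_counts[i] and pos_counts[i] are the numbers of negative and
    positive samples in bin i. Assumed definitions (not confirmed by
    the source, whose formulas are images):
        WOE_i = ln((y_i / y_T) / (n_i / n_T))
        IV    = sum_i (y_i / y_T - n_i / n_T) * WOE_i
    """
    if min(neg_counts) <= 0 or min(pos_counts) <= 0:
        raise ValueError("every bin needs at least one sample of each class")
    y_total = sum(neg_counts)
    n_total = sum(pos_counts)
    woes, iv = [], 0.0
    for y_i, n_i in zip(neg_counts, pos_counts):
        neg_share = y_i / y_total
        pos_share = n_i / n_total
        woe = math.log(neg_share / pos_share)
        woes.append(woe)
        iv += (neg_share - pos_share) * woe
    return woes, iv

def select_features(iv_by_feature, threshold=0.1):
    """Drop features with IV below the threshold (0.1 in the text)."""
    return [name for name, iv in iv_by_feature.items() if iv >= threshold]
```

A bin in which both classes have the same share contributes a WOE of zero and adds nothing to IV, which is why a feature whose bins all look alike falls below the 0.1 cut-off.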
Preferably, in step S4, a scorecard algorithm is used for model training and validation, with 70% of the sample data used as the training set and 30% as the validation set.
The substantial effects of the invention are as follows: the method uses account information, campaign data, behavior data, and other data from the "Online State Grid" app of State Grid Zhejiang Electric Power as its base data, builds an electricity customer feature library, and predicts the fraud risk of electricity customers with a scorecard algorithm, so that risky customers are intercepted and campaign rewards are issued effectively.
Drawings
FIG. 1 is a schematic overall flow chart of the present invention;
FIG. 2 is a comparison graph of the model effect of the training set and the test set in the present invention;
FIG. 3 is a schematic flow diagram of the feature binning method of the present invention;
FIG. 4 is a flow chart of the improved optimal binning strategy of the present invention.
Detailed Description
The technical solution of the present invention will be further specifically described below by way of specific examples.
Example 1:
A scorecard-based method for predicting the fraud risk of national grid electricity customers (see FIG. 1) comprises the following steps:
S1, data acquisition
Data such as the user's basic account information, user behavior information, and campaign participation information are collated, the completeness and accuracy of the data are verified, and an electricity customer feature library is built. For the modeling time range, sample data are extracted using June 2020 to May 2021 as the observation period, and sample labels are extracted from the period June 2021 to September 2021.
S2, data cleaning. Data cleaning is prior art; data are selected according to a preset format, and the step is not described in detail in this embodiment.
S3, feature engineering
A1, handling of missing values and outliers
Different processing strategies are matched to different missing ratios: features with a high missing ratio are removed, numerical features are filled using methods such as the median, the mean, or linear interpolation, and categorical features are filled using the mode. Outliers are detected and handled using an extreme-value method (for example, capping at the 1% or 99% quantile), the interquartile range method, or similar.
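The handling described in A1 can be sketched with the Python standard library alone. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, missing values are represented as None, and the IQR multiplier of 1.5 is the conventional default, not a value given in the source.

```python
import statistics

def fill_numeric(values, strategy="median"):
    """Fill None entries of a numerical feature with its median or mean."""
    present = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(present)
    else:
        fill = statistics.fmean(present)
    return [fill if v is None else v for v in values]

def fill_category(values):
    """Fill None entries of a categorical feature with its mode."""
    present = [v for v in values if v is not None]
    fill = statistics.mode(present)
    return [fill if v is None else v for v in values]

def cap_by_percentile(values):
    """Extreme-value capping: clip to the 1% and 99% quantiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[0], cuts[98]  # 1st and 99th percentiles
    return [min(max(v, lo), hi) for v in values]

def iqr_bounds(values, k=1.5):
    """Interquartile range rule: values outside the bounds are outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

Capping keeps every sample in the data set while limiting the influence of extreme values, whereas the IQR bounds only flag candidates and leave the choice of treatment open.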
A2, feature binning
In this embodiment, a custom optimal binning strategy is applied separately to the categorical features and the numerical features, so that the features are robust to abnormal data and the risk of model overfitting is reduced.
A3, WOE and IV calculation
From the binning result, the WOE (weight of evidence) and IV (information value) of each bin of each variable are calculated for use in variable screening and model training.
WOE calculation formula:
Figure BDA0003643422640000041
IV calculation formula:
Figure BDA0003643422640000042
According to the calculation results, features with IV less than 0.1 are removed; the features used for final model training are shown in Table 1:
Feature                                                       IV value
Number of accounts logged in on the device                    0.31
Number of accounts associated with the IP in the last 7 days  0.26
Number of campaign participations in the last 7 days          0.16
Red packet amount used in the last 7 days                     0.13
Number of account bindings in the last 30 days                0.19
Total points in the last 30 days                              0.22
Total number of users referred                                0.11
Account age                                                   0.18
Real-name authentication                                      0.19
Number of login days in the last 180 days                     0.12
Number of account unbindings in the last 30 days              0.16
Red packet deduction ratio in the last 7 days                 0.25
Points deduction ratio in the last 7 days                     0.13
Cumulative number of login devices                            0.14
Number of payments in the last 30 days                        0.11
Total amount paid for households in the last 30 days          0.13
Cumulative number of households paid for                      0.12
Number of households bound per day                            0.23
Table 1. Features used for final model training
S4, model training and validation (see FIG. 2)
The model uses a scorecard algorithm. A total of 16.58 million samples were collected, of which 70% are used as the training set and 30% as the validation set.
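A 70/30 split of this kind might look like the sketch below. The source does not say how samples are assigned to the two sets, so the random shuffle, the fixed seed, and the function name are all assumptions made for illustration.

```python
import random

def split_train_validation(samples, train_ratio=0.7, seed=42):
    """Shuffle the samples and split them into training and validation sets.

    The random assignment and the seed are illustrative assumptions;
    the source only states the 70% / 30% proportions.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

For example, `split_train_validation(range(1000))` returns 700 training samples and 300 validation samples with no overlap between the two sets.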
S5, model result application
Users are divided into risk levels according to the scorecard output, and a different handling strategy is configured for each level. The levels and handling strategies are detailed in Table 2:
Score interval    Risk level     Handling strategy
(80, 100]         Low risk       No action
(60, 80]          Medium risk    Monitor
(0, 60]           High risk      Intercept and blacklist
Table 2. Risk levels and handling strategies
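The mapping in Table 2 is a simple lookup on half-open score intervals. The sketch below encodes it directly; the function name is hypothetical and the label strings are English renderings of the table entries.

```python
def risk_level(score):
    """Map a scorecard score in (0, 100] to (risk level, handling strategy)."""
    if not 0 < score <= 100:
        raise ValueError("score must lie in (0, 100]")
    if score > 80:
        return "Low risk", "No action"
    if score > 60:
        return "Medium risk", "Monitor"
    return "High risk", "Intercept and blacklist"
```

Because each interval is open on the left and closed on the right, a score of exactly 80 is medium risk and a score of exactly 60 is high risk.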
The custom binning method for numerical features in this embodiment proceeds as follows (see FIG. 3):
In A2, the following sub-steps are performed:
A21, sort and initialize the bins: sort the numerical feature, with the initial number of bins set to min(100, n × 10%), where n is the sample size;
A22, calculate $W_i$, constructing the index on which binning is based:
Figure BDA0003643422640000051
$y_i$: number of negative samples in interval $i$
$y_T$: total number of negative samples
$n_i$: number of positive samples in interval $i$
$n_T$: total number of positive samples
A23, merge intervals: calculate the merging gain of adjacent intervals,
$E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}$
calculate $E_{i,i-1}$ and $E_{i,i+1}$ respectively, and merge the top n combinations with the largest gain;
A24, end binning: the binning operation ends when the number of intervals meets the expected setting; the stopping condition is that the desired number of bins is reached or the iterations are complete.
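Steps A21 to A24 can be sketched as a greedy bottom-up merge. The definition of the index W is given only as an image in the source, so the code below assumes it is the IV contribution of an interval, W = (y/y_T - n/n_T) · ln((y/y_T)/(n/n_T)). It also merges one pair per pass rather than the top n pairs, floors n × 10% to an integer, and clamps empty shares with a small constant; all of these choices and all names are illustrative assumptions.

```python
import math

def initial_bin_count(n):
    """A21: initial number of bins, min(100, n x 10%), floored to an integer."""
    return min(100, n // 10)

def w_index(y_i, n_i, y_total, n_total, eps=1e-6):
    """Assumed form of W for one interval (its IV contribution)."""
    neg_share = max(y_i / y_total, eps)  # eps guards against log(0)
    pos_share = max(n_i / n_total, eps)
    return (neg_share - pos_share) * math.log(neg_share / pos_share)

def merge_bins(bins, target_bins):
    """A23/A24: merge adjacent (y_i, n_i) intervals until target_bins remain."""
    bins = [tuple(b) for b in bins]
    y_total = sum(y for y, _ in bins)
    n_total = sum(n for _, n in bins)
    while len(bins) > target_bins:
        best_idx, best_gain, best_merged = None, None, None
        for i in range(len(bins) - 1):
            merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
            # E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}
            gain = (w_index(*merged, y_total, n_total)
                    - w_index(*bins[i], y_total, n_total)
                    - w_index(*bins[i + 1], y_total, n_total))
            if best_gain is None or gain > best_gain:
                best_idx, best_gain, best_merged = i, gain, merged
        bins[best_idx:best_idx + 2] = [best_merged]
    return bins
```

Under this assumed index, merging never increases the total, so the pair with the largest gain is the one whose merge loses the least information; neighbouring intervals with similar class mixes are therefore merged first.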
In this embodiment, when the improved optimal binning strategy is adopted (see FIG. 4), the following steps are performed:
First, data acquisition and data cleaning are carried out: data such as basic account information, user behavior information, and user campaign information are collected from the marketing system, the "Online State Grid" app platform, and other sources.
Feature engineering is then performed: missing values are handled using methods such as the median, the mode, the mean, or linear interpolation; once missing-value handling is complete, optimal binning is applied to the features; the WOE (weight of evidence) and IV (information value) are calculated after binning; and features are preliminarily screened according to their IV values and correlation coefficients.
Next, model training and validation are performed. The sample set is divided into a training set and a test set, the scorecard model is trained on the training set, and the weight of each feature is output. The trained model is then verified by simulation on the validation set to evaluate its actual effect. Feature scores are calculated next: the score of each interval is computed from the WOE and the weight of the feature, forming the final scorecard. Finally, the scorecard is deployed: the scorecard model is deployed on a system server, and differentiated strategies are applied to users with different scores, improving the risk-control quality of Online State Grid.
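The text says only that each interval's score is computed from the WOE and the feature weight. One common way to do this, shown here as a hedged sketch, multiplies the two and rescales so that a fixed number of points corresponds to doubling the odds. The scaling constant `pdo`, the `base_score`, and all function names are assumptions and do not come from the source.

```python
import math

def bin_points(weight, woe, pdo=20.0):
    """Points for one interval: feature weight times interval WOE, rescaled
    so that `pdo` points correspond to doubling the odds (assumed convention)."""
    factor = pdo / math.log(2)
    return factor * weight * woe

def build_scorecard(weights, woe_tables, pdo=20.0):
    """weights: {feature: weight}; woe_tables: {feature: {interval: woe}}."""
    return {
        feature: {label: bin_points(weights[feature], woe, pdo)
                  for label, woe in intervals.items()}
        for feature, intervals in woe_tables.items()
    }

def total_score(scorecard, user_intervals, base_score=60.0):
    """Sum the points of the intervals a user falls into, plus a base score."""
    return base_score + sum(scorecard[f][label]
                            for f, label in user_intervals.items())
```

If WOE is oriented so that intervals dominated by negative (non-fraud) samples have positive WOE and the feature weights are positive, a higher total score means lower risk, which matches the direction of the score intervals in Table 2.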
This embodiment uses account information, campaign data, behavior data, and other data from the "Online State Grid" app of State Grid Zhejiang Electric Power as its base data, builds an electricity customer feature library, and predicts the fraud risk of electricity customers with a scorecard algorithm, so that risky customers are intercepted and campaign rewards are issued effectively.
The above-described embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention in any way, and other variations and modifications may be made without departing from the spirit of the invention as set forth in the claims.

Claims (6)

1. A scorecard-based method for predicting the fraud risk of national grid electricity customers, characterized by comprising the following steps:
S1, data acquisition: collate the user's basic account information, user behavior information, and campaign participation information, verify the completeness and accuracy of the data, and build an electricity customer feature library;
S2, data cleaning;
S3, feature engineering;
S4, build a model using a scorecard algorithm, then train and validate it;
S5, divide users into risk levels according to the scorecard output and configure a different handling strategy for each level;
in step S3, the feature engineering comprises the following sub-steps:
A1, for missing values and outliers, apply the processing strategy that matches the missing ratio;
A2, apply an optimized binning strategy to reduce the risk of model overfitting;
A3, from the binning result, calculate the WOE and IV values of each bin of each variable, for use in variable screening and model training.
2. The scorecard-based method for predicting the fraud risk of national grid electricity customers according to claim 1, wherein in A1, numerical features are filled using the median, the mean, or linear interpolation, categorical features are filled using the mode, and outliers are detected and handled using an extreme-value method or the interquartile range method.
3. The scorecard-based method for predicting the fraud risk of national grid electricity customers according to claim 1, wherein in A2, the sample proportion and the positive-to-negative label proportion of the feature intervals are used to merge adjacent intervals that satisfy the merging condition, until a stopping criterion is met.
4. The scorecard-based method for predicting the fraud risk of national grid electricity customers according to claim 1 or 3, characterized in that in A2, the following sub-steps are performed:
A21, sort and initialize the bins: sort the numerical feature, with the initial number of bins set to min(100, n × 10%), where n is the sample size;
A22, calculate $W_i$, constructing the index on which binning is based:
Figure FDA0003643422630000011
$y_i$: number of negative samples in interval $i$
$y_T$: total number of negative samples
$n_i$: number of positive samples in interval $i$
$n_T$: total number of positive samples
A23, merge intervals: calculate the merging gain of adjacent intervals,
$E_{i,i+1} = W_{i,i+1} - W_i - W_{i+1}$
calculate $E_{i,i-1}$ and $E_{i,i+1}$ respectively, and merge the top n combinations with the largest gain;
A24, end binning: the binning operation ends when the number of intervals meets the expected setting; the stopping condition is that the desired number of bins is reached or the iterations are complete.
5. The method as claimed in claim 1, wherein the WOE and IV values of each bin of each variable are calculated from the binning result for use in variable screening and model training, where
WOE calculation formula:
Figure FDA0003643422630000021
IV calculation formula:
Figure FDA0003643422630000022
and according to the calculation results, features with IV less than 0.1 are removed, and the remaining features enter model training.
6. The scorecard-based method for predicting the fraud risk of national grid electricity customers according to claim 1, wherein in step S4, a scorecard algorithm is used for model training and validation, with 70% of the sample data used as the training set and 30% as the validation set.
CN202210524254.8A 2022-05-13 2022-05-13 National grid electricity consumption customer fraud risk prediction method based on scoring card Pending CN115204934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210524254.8A CN115204934A (en) 2022-05-13 2022-05-13 National grid electricity consumption customer fraud risk prediction method based on scoring card

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210524254.8A CN115204934A (en) 2022-05-13 2022-05-13 National grid electricity consumption customer fraud risk prediction method based on scoring card

Publications (1)

Publication Number Publication Date
CN115204934A (en) 2022-10-18

Family

ID=83575283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210524254.8A Pending CN115204934A (en) 2022-05-13 2022-05-13 National grid electricity consumption customer fraud risk prediction method based on scoring card

Country Status (1)

Country Link
CN (1) CN115204934A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination