CN108364191A - Top-tier customer optimized identification method and device based on random forest and logistic regression
- Publication number
- CN108364191A (application CN201810027580.1A)
- Authority
- CN
- China
- Prior art keywords
- tier customer
- customer
- sample
- tier
- logistic regression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention discloses a top-tier customer optimized identification method and device based on random forest and logistic regression. The method comprises the following steps: obtaining the value features of sample customers and labeling each sample customer as top-tier or not; building a top-tier customer identification model from the sample customer data using the random forest and logistic regression algorithms; performing a validity analysis of the identification model's judgments using an expert supervision method, and training an optimized top-tier customer identification model based on the analysis results; and taking the value features of a customer to be identified as input and judging, with the top-tier customer identification model, whether that customer is a top-tier customer. The invention achieves precise targeting of top-tier customers based on big data.
Description
Technical field
The invention belongs to the field of machine learning, and in particular relates to a top-tier customer optimized identification method and device based on random forest and logistic regression.
Background
As electric power reform deepens and the electricity retail market is fully opened, the power companies at all levels of State Grid Corporation of China face market competition pressure. To improve the profitability and competitiveness of power grid enterprises and to increase the loyalty, satisfaction and stickiness of top-tier customers, providing good service to top-tier customers, on the basis of universal service for the whole of society, will be the main means and strategy by which electricity retailers compete for them. It is therefore necessary to formulate targeted, competitive service strategies, concentrate limited service resources on top-tier customers, and establish stable power supply and consumption relationships with them. This is the inevitable choice for power grid enterprises seeking long-term sustainable development.
With the explosive growth of data volume and continually rising business requirements, traditional service system architectures find it increasingly difficult to meet operational requirements. Big data technology is recognized worldwide as an important strategic resource, and data, as a basic strategic resource, provides support for analyzing customer demand and delivering targeted services.
Therefore, how to achieve precise targeting of top-tier customers based on big data is a technical problem that urgently needs to be solved.
Summary of the invention
To overcome the above deficiencies of the prior art, the invention provides a top-tier customer optimized identification method and device for the electricity retail side based on random forest and logistic regression. Based on large volumes of grid company customer data such as electricity consumption attributes, consumption behavior and consumption characteristics, the method establishes a multi-dimensional customer evaluation index system and scores customers comprehensively using a customer evaluation model built through data analysis, thereby achieving precise targeting of top-tier customers.
To achieve the above object, the invention adopts the following technical solution:
A top-tier customer optimized identification method based on random forest and logistic regression comprises the following steps:
Step 1: obtain the value features of sample customers and label each sample customer as top-tier or not;
Step 2: using the sample customer data, build a top-tier customer identification model based on the random forest and logistic regression algorithms;
Step 3: perform a validity analysis of the judgments of the top-tier customer identification model using an expert supervision method, and train an optimized top-tier customer identification model based on the analysis results;
Step 4: take the value features of a customer to be identified as input and, based on the top-tier customer identification model, judge whether that customer is a top-tier customer.
Further, step 1 comprises:
Step 1.1: build a customer value evaluation feature index system from the collected electricity consumption information of users;
Step 1.2: compile the value features of the sample users according to the index system, and label each sample user as top-tier or not.
Further, the value features in step 1 comprise each user's basic attributes, economic value, load value, development value, credit value and industry value data.
Further, step 2 comprises:
Step 2.1: preprocess the sample user data;
Step 2.2: train a top-tier customer judgment model based on the random forest method;
Step 2.3: build a top-tier customer grade judgment model using the logistic regression algorithm;
Step 2.4: combine the top-tier customer judgment model and the top-tier customer grade judgment model to obtain the top-tier customer identification model.
Further, step 2.1 comprises: data cleaning, feature factor quantification, feature expansion, feature selection and outlier handling.
Further, step 2.2 comprises:
Full-feature training: the sample is all sample user data, and the model inputs are all business indices;
Important-feature training: the sample is all sample user data, and the model inputs are the top 40% of indices by importance;
Full-feature cross-training: the sample user data are split evenly into 10 parts; in each round 9 parts serve as training samples and the remaining part as prediction samples, iterating 10 times; the model inputs are all business indices;
Important-feature cross-training: the sample user data are split evenly into 10 parts; in each round 9 parts serve as training samples and the remaining part as prediction samples, iterating 10 times; the model inputs are the top 40% of indices by importance.
As a further preferred solution, before model training the method further comprises: selecting importance indices by combining the MDA and MDG methods, with the index importance analysis results obtained through model training.
As a further preferred solution, the method further comprises: establishing a long-term mechanism for upgrading and optimizing the top-tier customer identification model, in which the validity of the model's judgments is analyzed from time to time using the expert supervision method and the optimized top-tier customer identification model is retrained based on the analysis results.
Further, step 2.3 comprises: scoring the top-tier customers identified by the top-tier customer judgment model with the logistic regression model to obtain comprehensive scores; and setting multiple comprehensive score intervals to obtain the top-tier customer grade judgment model.
Further, the method further comprises: integrating the trained models, collecting user feature data through a data interface, and periodically judging the quality grade of customers.
According to a second object of the invention, the invention also provides a top-tier customer optimized identification device based on random forest and logistic regression, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method when executing the program.
According to a third object of the invention, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the top-tier customer optimized identification method based on random forest and logistic regression.
Beneficial effects of the invention
1. Based on large volumes of grid company customer data such as electricity consumption attributes, consumption behavior and consumption characteristics, the invention uses machine learning to identify top-tier customers, which provides a basis for delivering good service to them and helps improve the competitiveness of power grid enterprises.
2. The invention trains the customer identification model by combining random forest and logistic regression. Beyond identifying whether a customer is top-tier, the model can also judge the customer's quality grade, further refining the targeting of top-tier customers.
3. The invention establishes a long-term mechanism for upgrading and optimizing the top-tier customer identification model: the validity of the model's judgments is analyzed from time to time using the expert supervision method, and the optimized identification model is retrained based on the analysis results, so that the model version is upgraded and optimized through retraining.
Description of the drawings
The accompanying drawings, which form a part of this application, provide a further understanding of the application. The illustrative embodiments of the application and their descriptions serve to explain the application and do not unduly limit it.
Fig. 1 is a flow chart of the top-tier customer optimized identification method based on random forest and logistic regression according to the invention;
Fig. 2 is a flow chart of building the top-tier customer identification model according to the invention;
Fig. 3 is a schematic diagram of the customer grade trend formed from logistic regression according to the invention.
Detailed description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this application belongs.
It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
In the absence of conflict, the embodiments of the application and the features in the embodiments may be combined with one another.
Embodiment one
This embodiment discloses a top-tier customer optimized identification method based on random forest and logistic regression which, as shown in Fig. 1, comprises the following steps:
(1) Data preparation stage
1. Establish the customer value evaluation feature index system:
User archive information, economic value information, load value information, development value information, credit value information and industry value information are collected, the factors that influence overall customer value are analyzed comprehensively, and a customer value evaluation feature index system is established. Through focused discussions with customers and customer surveys, the sample users of each prefecture-level city are labeled as top-tier or not, providing the data basis for model training.
Based on the various kinds of value that top-tier customers bring to the grid company, the electricity consumption indices of customers are sorted out and classified from the perspective of customer value, and a customer evaluation index system is built. The indices are standardized and summarized across multiple dimensions, providing the data basis for judging whether a customer is top-tier.
2. Determine the model training samples:
Using the top-tier customer index system determined in discussions with prefecture-level experts, and drawing on the marketing business application system and the electricity consumption information acquisition system, the basic attributes, economic value, load value, development value, credit value and industry value data of the sample users are compiled and used as the model training samples. In this embodiment, experts judged the electricity consumption behavior feature data of 474,000 sample customers and labeled each as top-tier or not.
User attributes: account number, account name, industry classification, whether the user is a high-energy-consumption user, and electricity consumption category.
Economic value: the profit that the customer's electricity consumption generates for the power supply enterprise; for example, customers with a higher average electricity price, larger consumption and larger electricity fees. Includes: current-period average electricity price, current-period electricity fee, current-period electricity consumption, cumulative average electricity price, cumulative electricity fee, cumulative electricity consumption, contract capacity and operating capacity.
Load value: the load value the customer exhibits during consumption; for example, customers with a larger power factor, a higher average load rate and a better valley (off-peak) consumption rate. Includes: average daily load rate, peak consumption rate, valley consumption rate and power factor adjustment coefficient.
Development value: customers whose own electricity consumption is developing well and who will contribute more in the future, bringing the company sustained profit. Includes: current-period consumption growth rate, consumption growth rates over the last 3 months, the last 6 months and the last year, number of capacity increases and number of capacity reductions.
Credit value: credit is the basic guarantee for completing transactions between the supplier and the consumer; this covers customers who use electricity lawfully and pay their fees on time. Includes: advance carry-over rate of electricity fees, overdue days of fee payment, number of overdue fee payments, fee payment period, number of bounced cheques, and number of contract violations and electricity thefts.
Industry value: reflects the development prospects of the customer's industry, that is, whether the industry's overall consumption level is developing well. Includes: industry consumption growth rate, industry major-class consumption growth rate and industry group consumption growth rate.
During the data preparation stage, criteria for supervision sources are also formulated, that is, which business conditions an effective supervision source must satisfy. Only supervision output produced under those business conditions is considered valid and can be used for supervised learning.
(2) Data processing stage
Databases today are easily affected by noise, missing data and inconsistent data, because they are very large and mostly draw on multiple heterogeneous sources. The resulting low data quality would make the results of data analysis inaccurate, so data preprocessing is required before model training. The preprocessing in this scheme mainly covers feature factor quantification, outlier handling and continuous variable processing.
1. Data cleaning
The data are cleaned through an out-of-range check, a feature validity check and a null check.
Out-of-range check: records in which both the electricity consumption and the electricity fee and price are 0 are identified and deleted. Both being 0 indicates that the user consumed no electricity, that is, was not in production, so the other related features carry no information either.
Feature validity check: features whose values are overly uniform are identified, such as the user importance information, for which only a minority of users are important users.
Null check: records are checked for a suspension-days field that is entirely empty and an overdue-payment-days field with severe missingness. An entirely empty suspension-days field means the suspension days are missing for all users. For the overdue-payment-days field, an empty record was found to mean, in business terms, that the payment is not overdue.
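The cleaning rules above can be sketched in a few lines of Python. This is a minimal illustration only: the record layout and the field names (`usage_kwh`, `fee`, `overdue_days`) are hypothetical, and filling an empty overdue-days field with 0 is an assumption drawn from the statement that an empty value means "not overdue".

```python
# Hypothetical record layout; the field names are illustrative and do not
# come from the patent.
records = [
    {"user_id": "A001", "usage_kwh": 1200.0, "fee": 650.0, "overdue_days": None},
    {"user_id": "A002", "usage_kwh": 0.0, "fee": 0.0, "overdue_days": None},
    {"user_id": "A003", "usage_kwh": 300.0, "fee": 180.0, "overdue_days": 12},
]


def clean(rows):
    """Drop zero-usage/zero-fee records and fill empty overdue days with 0."""
    cleaned = []
    for row in rows:
        # Out-of-range check: no usage and no fee means the user was not
        # in production, so the record is removed.
        if row["usage_kwh"] == 0 and row["fee"] == 0:
            continue
        fixed = dict(row)
        # Null check (assumption): an empty overdue-days field is treated
        # as "not overdue" and stored as 0.
        if fixed["overdue_days"] is None:
            fixed["overdue_days"] = 0
        cleaned.append(fixed)
    return cleaned


cleaned_records = clean(records)
```

Here `cleaned_records` keeps users A001 and A003; A002 is dropped by the out-of-range rule.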
2. Feature factor quantification
Information such as customer archives, holidays and weather collected from the marketing system or other systems is expressed as text or codes, and such variables must be converted to numeric form.
There are 42 field features, such as account name, account number, industry, industry group, industry major class, high-energy-consumption industry and importance level. They are classified as: 1) customer attribute information; 2) economic value; 3) load value; 4) development value; 5) credit value; 6) industry value.
Factorization conversion (encoding as the integers 0/1/2/3...): industry, industry group, industry major class, high-energy-consumption industry, importance level, electricity consumption category, voltage level, region, investment scale, capacity size and load nature.
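As an illustration of the 0/1/2/3... factorization conversion, the sketch below assigns integer codes in order of first appearance. The ordering rule and the sample voltage levels are assumptions; the patent does not say how codes are assigned.

```python
def factorize(values):
    """Map each distinct category to an integer code (0, 1, 2, ...).

    Codes are assigned in order of first appearance, which is an
    assumption made for this sketch.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical voltage-level column.
voltage_levels = ["10kV", "35kV", "10kV", "110kV", "35kV"]
encoded_levels, level_codes = factorize(voltage_levels)
```

The returned mapping (`level_codes`) should be stored alongside the model so that new customers are encoded with the same codes.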
3. Feature expansion:
1) Normalization expansion (rescaling the user data values into [0, 1] as a feature): electricity fee, contract capacity, average electricity sales over the last year, average electricity sales over the last 6 months, average electricity sales over the last 3 months, operating capacity;
2) Discretization expansion (binning the user data values by size as a feature): electricity fee, contract capacity, average electricity sales over the last year, average electricity sales over the last 6 months, average electricity sales over the last 3 months, operating capacity;
3) Ranking feature expansion (ranking the user data values by size as a feature): electricity fee, contract capacity, average electricity sales over the last year, average electricity sales over the last 6 months, average electricity sales over the last 3 months, operating capacity;
4) Low-cardinality data encoding expansion (one-hot 0/1 encoding): number of capacity increases, number of capacity reductions, long-standing electricity fee arrears, proportion of long-standing electricity fee arrears, number of contract violations and electricity thefts.
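The four expansions can be sketched with plain Python as follows. The bin edges, the tie-breaking rule in the ranking and the sample values are all assumptions chosen for illustration; the patent does not specify them.

```python
from bisect import bisect_right


def min_max(values):
    """Normalization expansion: rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges):
    """Discretization expansion: index of the bin each value falls into.

    `edges` are ascending bin boundaries (hypothetical thresholds).
    """
    return [bisect_right(edges, v) for v in values]


def rank_desc(values):
    """Ranking expansion: rank 1 is the largest value.

    Ties keep their order of appearance (an assumption of this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def one_hot(values):
    """One-hot 0/1 encoding for low-cardinality values."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical electricity fees and capacity-increase counts.
fees = [100.0, 500.0, 300.0, 900.0]
capacity_increases = [0, 1, 0, 2]

fee_normalized = min_max(fees)
fee_bins = discretize(fees, edges=[200.0, 600.0])
fee_ranks = rank_desc(fees)
increase_one_hot = one_hot(capacity_increases)
```

Each expansion adds a new column derived from an existing one, so the original value can still be kept as a feature alongside it.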
4. Feature selection:
For the user attribute features, the distribution balance of the data is examined to make a preliminary analysis of how these dimensions influence whether a customer is top-tier and whether they are necessary.
For the 5 classes of value features, the distribution balance of the data is examined in the same way, and the features are also checked for correlation.
Comprehensive dimensionality reduction: several methods are tried, and the results of the various methods are combined to reduce dimensionality.
5. Outlier handling
The collected data contain records that were not collected or are abnormal, and the archive data also have missing entries. Missing-value handling is required for these data, with the method chosen according to the business rules:
Default value substitution: for archive fields such as load nature and voltage level, a default value is set according to general business rules and used in calculation.
Case deletion: if the proportion of missing values is small and the attribute concerned is important, the record is removed by case deletion. For example, if the user id is missing from the user archive information, the record is removed directly.
Mean substitution: if the missing value is numeric, it is filled with the average of the preceding and following data. If the missing value is non-numeric, it is filled with the mode of the attribute.
Hot-deck imputation: the object in the data set most similar to the object with missing data is selected, and its value is used in place of the missing value.
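The mean substitution, mode substitution and hot-deck rules can be sketched as below. The similarity measure used for hot-deck imputation (squared Euclidean distance over chosen numeric fields) and all field names are assumptions; the patent only says the "most similar" object is used.

```python
from collections import Counter


def fill_with_neighbor_mean(series):
    """Fill None in a numeric series with the mean of the nearest known
    values before and after it."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        before = next((series[j] for j in range(i - 1, -1, -1)
                       if series[j] is not None), None)
        after = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        known = [x for x in (before, after) if x is not None]
        if known:
            filled[i] = sum(known) / len(known)
    return filled


def fill_with_mode(values):
    """Fill None in a non-numeric column with the most frequent value."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def hot_deck(target, donors, keys):
    """Copy missing fields from the donor most similar to `target`.

    Similarity is squared Euclidean distance over `keys`, which is an
    assumption of this sketch.
    """
    def distance(donor):
        return sum((donor[k] - target[k]) ** 2 for k in keys)

    donor = min(donors, key=distance)
    return {k: (donor[k] if v is None else v) for k, v in target.items()}


# Hypothetical data.
monthly_usage = fill_with_neighbor_mean([100.0, None, 140.0])
load_nature = fill_with_mode(["industrial", None, "commercial", "industrial"])
completed = hot_deck(
    {"fee": 500.0, "capacity": 100.0, "load_rate": None},
    donors=[
        {"fee": 480.0, "capacity": 110.0, "load_rate": 0.6},
        {"fee": 2000.0, "capacity": 800.0, "load_rate": 0.9},
    ],
    keys=["fee", "capacity"],
)
```

In practice the distance keys would usually be normalized first, so that a field with a large numeric range does not dominate the similarity.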
(3) Model training stage
This embodiment uses random forest and logistic regression for model training, as shown in Fig. 2.
1. Train the top-tier customer judgment model based on the random forest method
Importance index selection
Importance indices are selected using two methods: one based on the OOB (out-of-bag) error, called MDA (Mean Decrease Accuracy), and one based on Gini impurity, called MDG (Mean Decrease Gini). For both, a larger value indicates a more important variable. Index importance analysis results are obtained through model training, and the importance indices produced by the two methods are compared in the following table:
Table 1
Ranking | MDA | MDG |
1 | Cumulative electricity consumption | Cumulative electricity consumption |
2 | Cumulative electricity fee | Cumulative electricity fee |
3 | Current-period electricity fee | Current-period electricity fee |
4 | Current-period electricity consumption | Current-period electricity consumption |
5 | Operating capacity | Operating capacity |
6 | Number of bounced cheques | Power factor adjustment coefficient |
7 | Cumulative average electricity price | Industry major-class consumption growth rate |
8 | Industry major-class consumption growth rate | Annual daily load rate |
9 | Power factor adjustment coefficient | Electricity fee payment period |
10 | Cumulative electricity price growth rate | Industry group consumption growth rate |
Combining the importance indices above, 13 indices are determined to be the importance indices, as follows:
Table 2
Serial number | Importance index | Corresponding data column |
1 | Cumulative electricity consumption | 7 |
2 | Cumulative electricity fee | 8 |
3 | Current-period electricity fee | 5 |
4 | Current-period electricity consumption | 4 |
5 | Operating capacity | 10 |
6 | Power factor adjustment coefficient | 15 |
7 | Number of bounced cheques | 35 |
8 | Cumulative average electricity price | 9 |
9 | Industry major-class consumption growth rate | 39 |
10 | Cumulative electricity price growth rate | 24 |
11 | Annual daily load rate | 11 |
12 | Electricity fee payment period | 34 |
13 | Industry group consumption growth rate | 38 |
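The patent does not state the rule used to combine the two rankings, but the 13 indices in Table 2 are exactly the union of the two top-10 lists in Table 1. The sketch below checks this under that assumption; the Python identifiers are illustrative English names for the indices.

```python
# Top-10 lists from Table 1 (identifiers are illustrative names).
mda_top10 = [
    "cumulative_consumption", "cumulative_fee", "current_fee",
    "current_consumption", "operating_capacity", "bounced_cheques",
    "cumulative_avg_price", "industry_major_class_growth",
    "power_factor_adjustment", "cumulative_price_growth",
]
mdg_top10 = [
    "cumulative_consumption", "cumulative_fee", "current_fee",
    "current_consumption", "operating_capacity", "power_factor_adjustment",
    "industry_major_class_growth", "annual_daily_load_rate",
    "fee_payment_period", "industry_group_growth",
]


def merge_rankings(first, second):
    """Union of two ranked lists, keeping the order of first appearance."""
    merged = list(first)
    for name in second:
        if name not in merged:
            merged.append(name)
    return merged


importance_indices = merge_rankings(mda_top10, mdg_top10)
```

The merged list has 13 entries: the 10 MDA indices plus the annual daily load rate, the electricity fee payment period and the industry group consumption growth rate, which appear only in the MDG list.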
Training data is trained and is optimized by random forest method, it is whether excellent with user to find out electricity consumption behavioural characteristic value
Correspondence between matter, generation judge the whether good model of client.
Preferably, using following training process, implementation model gradually adjusts, from two dimensions of model stability and accuracy
Carry out model validation analysis:
Full feature training:Sample chooses all 47.4 ten thousand families, and model enters ginseng for whole operational indicators;
Important feature is trained:Sample chooses all 47.4 ten thousand families, and it is high preceding 40% index of importance that model, which enters ginseng,;
Full characteristic crossover training:Whole sample means are split into 10 parts, select every time wherein 9 parts as training sample,
Remaining 1 part is used as forecast sample, loop iteration 10 times, model to enter ginseng for whole operational indicators;
Important feature cross-training:Whole sample means are split into 10 parts, select every time wherein 9 parts as trained sample
This, remaining 1 part is used as forecast sample, loop iteration 10 times, and it is high preceding 40% index of importance that model, which enters ginseng,.
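The 10-part split used by the two cross-validation schemes can be sketched in plain Python as follows. The function name, the shuffling seed, and the sample size of 100 are illustrative assumptions; the choice of model inputs (all indicators or the top 40%) is made separately from the split.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Each of the 10 iterations trains on 9 parts and predicts on the remaining part.
splits = list(k_fold_splits(n_samples=100, k=10))
```

Every sample is used for prediction exactly once across the 10 iterations, which is what makes the averaged result a reasonable indicator of model stability.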
Noise is identified by analyzing the significance coefficient p of each model input variable; noise variables are not included in the model.
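A minimal sketch of this filtering step, assuming the p-values have already been computed by some statistical routine. The 0.05 cut-off, the function name, and the example variables are assumptions; the source does not state a threshold.

```python
def drop_noise_variables(p_values, alpha=0.05):
    """Keep only the input variables whose p-value is below alpha."""
    return [name for name, p in p_values.items() if p < alpha]

# Hypothetical p-values for three input variables.
p_values = {
    "cumulative_electricity": 0.001,
    "operating_capacity": 0.03,
    "meter_colour": 0.62,  # not significant, treated as noise
}
kept = drop_noise_variables(p_values)
```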
In this embodiment, 474,000 records were collected in total, and data cleaning removed 39,400 sample users. Model training used about 435,000 samples in total, of which 100,600 households are top-tier customers and 333,900 households are non-top-tier customers, a ratio of top-tier to non-top-tier samples of about 0.3 to 1.
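The sample counts can be checked with a few lines of arithmetic. The figures are taken from the text, where they are given in units of ten thousand and rounded, so the two totals below differ slightly.

```python
collected = 474_000      # 47.4 x 10,000 records collected
removed = 39_400         # 3.94 x 10,000 removed by data cleaning
top_tier = 100_600       # 10.06 x 10,000 top-tier households
non_top_tier = 333_900   # 33.39 x 10,000 non-top-tier households

remaining = collected - removed           # 434,600
labelled_total = top_tier + non_top_tier  # 434,500
ratio = top_tier / non_top_tier           # about 0.301
```

Both totals round to about 435,000, and the class ratio rounds to 0.3 to 1.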
2. Build the top-tier customer grade judgment model using the logistic regression algorithm
The logistic regression algorithm is used to obtain the probability P that a user is a top-tier customer and the comprehensive score Y, where P = 1 / (1 + exp(-Y)) is a nonlinear function of Y. The comprehensive score Y is a continuous variable; setting different score intervals provides a numerical basis for further subdividing customers' quality grades. All top-tier customers are scored by the logistic regression model, and the scores Y are sorted from high to low to form a customer grade trend chart. The top-tier customers are divided by the quartile method to determine four grade score intervals (see Fig. 3), forming the top-tier customer grading standard. The logistic regression model computes the Y value of each existing top-tier customer, and the customer's quality grade is judged from that value.
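The relation between Y and P and the quartile-based grading can be sketched with the Python standard library. The example scores, the function names, and the use of the "inclusive" quantile method are assumptions made for illustration; the actual score intervals are those shown in Fig. 3.

```python
import math
import statistics

def probability_from_score(y):
    """Logistic function P = 1 / (1 + exp(-Y))."""
    return 1.0 / (1.0 + math.exp(-y))

def quartile_cut_points(scores):
    """Return the three cut points that split the scores into four groups."""
    return statistics.quantiles(scores, n=4, method="inclusive")

def grade(y, cut_points):
    """Grade 1 (lowest) to 4 (highest), by the quartile the score falls in."""
    return 1 + sum(y > c for c in cut_points)

scores = [0.2, 0.5, 0.9, 1.4, 2.0, 2.6, 3.1, 4.0]  # hypothetical Y values
cuts = quartile_cut_points(scores)
grades = [grade(y, cuts) for y in scores]
```

Because P increases monotonically with Y, ranking customers by Y gives the same order as ranking them by P, which is why the score can be used directly for grading.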
The top-tier customer identification model divides all high-voltage customers into five categories: non-top-tier customers, level-1 top-tier customers (low grade), level-2 top-tier customers (relatively low grade), level-3 top-tier customers (relatively high grade), and level-4 top-tier customers (high grade).
Among the current 474,000 training samples, customers with probability P greater than 0.5 are classified as top-tier customers and those with P less than or equal to 0.5 as non-top-tier customers; the classification accuracy of the model based on important features reaches 99.1%. Because P = 1 / (1 + exp(-Y)) is a nonlinear function of the comprehensive score Y, Y can serve as the numerical basis for further subdividing customers' quality grades. The scores Y are sorted from high to low to form a customer grade trend chart, and the top-tier customers are divided by the quartile method to determine four grade score intervals (see Fig. 3), forming the top-tier customer grading standard. The logistic regression model computes the Y value of each existing top-tier customer, and the customer's quality grade is judged from that value.
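The 0.5 threshold rule can be illustrated as follows. The scores are hypothetical; the sketch only shows that, for this logistic function, the rule P > 0.5 is the same as Y > 0.

```python
import math

def is_top_tier(y, threshold=0.5):
    """Classify as top-tier when P = 1 / (1 + exp(-Y)) exceeds the threshold."""
    p = 1.0 / (1.0 + math.exp(-y))
    return p > threshold

scores = [-2.0, -0.1, 0.0, 0.1, 3.0]  # hypothetical Y values
labels = [is_top_tier(y) for y in scores]
# With a 0.5 threshold the rule reduces to Y > 0, because P(0) is exactly 0.5.
same_as_sign_rule = labels == [y > 0 for y in scores]
```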
Logistic regression is also used for the quality evaluation of a single customer:
For an individual top-tier customer, logistic regression serves as an aid for interpreting that user's quality judgment result. Analysis of the sample data yields the model coefficient K of each indicator. The size of the product Hi of the K value and the feature value represents the indicator's contribution to the customer's quality level, which makes it possible to identify the main factors that make the customer top-tier, that is, the user's quality traits.
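The per-indicator contribution Hi = K × feature value can be sketched as follows. The indicator names, coefficients, and feature values are hypothetical and stand in for a fitted logistic regression model and one customer's data.

```python
def contributions(coefficients, features):
    """Return {indicator: K_i * x_i} for one customer."""
    return {name: coefficients[name] * features[name] for name in coefficients}

# Hypothetical coefficients K and indicator values x for a single customer.
K = {"cumulative_electricity": 0.8, "operating_capacity": 0.3, "bounced_cheques": -1.2}
x = {"cumulative_electricity": 1.5, "operating_capacity": 2.0, "bounced_cheques": 0.0}

H = contributions(K, x)
# Indicators sorted by contribution, largest first: the customer's main strengths.
ranked = sorted(H, key=H.get, reverse=True)
```

Sorting by Hi gives a simple per-customer explanation: the indicators at the top of the list are the ones that contribute most to the quality score.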
(4) Model iterative optimization
A long-term mechanism for model version upgrading and optimization is established. Model judgment results are corrected through expert supervision; the effectiveness of the model's judgment results is analyzed from time to time, and the model is retrained on the basis of the analysis results, achieving model version upgrading and optimization.
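One way to picture the correction step is shown below. This is only a sketch under assumptions: the function names, the customer identifiers, and the idea of measuring how often experts overturn the model are illustrative and are not specified in the source.

```python
def apply_expert_corrections(predicted_labels, corrections):
    """Return labels in which expert-reviewed customers override the model."""
    corrected = dict(predicted_labels)
    corrected.update(corrections)
    return corrected

def disagreement_rate(predicted_labels, corrections):
    """Share of expert-reviewed customers whose model judgment was overturned."""
    if not corrections:
        return 0.0
    changed = sum(predicted_labels[c] != label for c, label in corrections.items())
    return changed / len(corrections)

# Hypothetical model output and expert review for three customers.
predicted = {"A": True, "B": False, "C": True}
expert_review = {"B": True, "C": True}

training_labels = apply_expert_corrections(predicted, expert_review)
rate = disagreement_rate(predicted, expert_review)
```

The corrected labels would then form the training set for the next model version, and a high disagreement rate could serve as one signal that retraining is due.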
(5) Model effect evaluation
Using expert-scored data, the accuracy and recall of the best model are tested to evaluate the model's effect.
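Accuracy and recall against expert labels can be computed as in the sketch below. The label lists are hypothetical, with 1 meaning top-tier and 0 meaning non-top-tier.

```python
def accuracy_and_recall(expert_labels, model_labels):
    """Compare model judgments with expert labels (1 = top-tier, 0 = not)."""
    pairs = list(zip(expert_labels, model_labels))
    correct = sum(e == m for e, m in pairs)
    true_positives = sum(1 for e, m in pairs if e == 1 and m == 1)
    actual_positives = sum(expert_labels)
    accuracy = correct / len(pairs)
    recall = true_positives / actual_positives if actual_positives else 0.0
    return accuracy, recall

expert = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical expert judgments
model = [1, 1, 0, 0, 0, 1, 0, 1]   # hypothetical model judgments
accuracy, recall = accuracy_and_recall(expert, model)
```

Recall is reported alongside accuracy because the classes are imbalanced: a model that labels few customers as top-tier can still score well on accuracy while missing many real top-tier customers.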
(6) Model deployment and application
The trained model is integrated, user feature data is collected through a data interface, and customers' quality grades are judged periodically.
Embodiment two
The purpose of this embodiment is to provide a computing device.
A top-tier customer optimized identification device based on random forest and logistic regression comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the following steps:
Step 1: obtain the value features of sample customers and distinguish top-tier from non-top-tier samples;
Step 2: using the sample customer data, build a top-tier customer identification model based on the random forest and logistic regression algorithms;
Step 3: analyze the effectiveness of the judgment results of the top-tier customer identification model based on expert supervision, and train the top-tier customer optimized identification model on the basis of the analysis results;
Step 4: taking the value features of a customer to be identified as input, judge whether the customer is a top-tier customer based on the top-tier customer identification model.
Embodiment three
The purpose of this embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium stores a computer program that, when executed by a processor, performs the following steps:
Step 1: obtain the value features of sample customers and distinguish top-tier from non-top-tier samples;
Step 2: using the sample customer data, build a top-tier customer identification model based on the random forest and logistic regression algorithms;
Step 3: analyze the effectiveness of the judgment results of the top-tier customer identification model based on expert supervision, and train the top-tier customer optimized identification model on the basis of the analysis results;
Step 4: taking the value features of a customer to be identified as input, judge whether the customer is a top-tier customer based on the top-tier customer identification model.
The steps involved in the devices of Embodiments two and three correspond to method Embodiment one; for specific implementations, see the relevant description of Embodiment one. The term "computer-readable storage medium" should be understood to include a single medium or multiple media containing one or more instruction sets; it should also be understood to include any medium capable of storing, encoding, or carrying an instruction set for execution by a processor that causes the processor to perform any of the methods of the present invention.
Beneficial effects of the present invention
1. Based on large volumes of data about grid company customers, such as electricity-use attributes, electricity-use behavior, and electricity-use characteristics, the present invention uses machine learning to identify top-tier customers, which provides a basis for delivering good service to top-tier customers and helps improve the competitiveness of power grid enterprises.
2. The present invention trains the customer identification model by combining random forest and logistic regression. In addition to identifying whether a customer is top-tier, the identification model can judge the customer's quality grade, thereby further achieving precise positioning of top-tier customers.
3. The present invention establishes a long-term mechanism for upgrading and optimizing the top-tier customer identification model: based on expert supervision, the effectiveness of the model's judgment results is analyzed from time to time, and the top-tier customer optimized identification model is retrained on the basis of the analysis results, achieving model version upgrading and optimization.
Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the present invention without creative effort remain within its scope of protection.
Claims (10)
1. A top-tier customer optimized identification method based on random forest and logistic regression, characterized by comprising the following steps:
Step 1: obtain the value features of sample customers and distinguish top-tier from non-top-tier samples;
Step 2: using the sample customer data, build a top-tier customer identification model based on the random forest and logistic regression algorithms;
Step 3: analyze the effectiveness of the judgment results of the top-tier customer identification model based on expert supervision, and train the top-tier customer optimized identification model on the basis of the analysis results;
Step 4: taking the value features of a customer to be identified as input, judge whether the customer is a top-tier customer based on the top-tier customer identification model.
2. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 1, characterized in that step 1 comprises:
Step 1.1: build a customer value evaluation feature indicator system from the collected electricity-use information of users;
Step 1.2: compile the value features of the sample users according to the indicator system, and distinguish top-tier from non-top-tier sample users.
3. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 1 or 2, characterized in that the value features in step 1 include data on the user's basic attributes, economic value, load value, development value, credit standing, and industry value.
4. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 1, characterized in that step 2 comprises:
Step 2.1: preprocess the sample user data;
Step 2.2: train a top-tier customer judgment model based on the random forest method;
Step 2.3: build a top-tier customer grade judgment model using the logistic regression algorithm;
Step 2.4: obtain the top-tier customer identification model by combining the top-tier customer judgment model and the top-tier customer grade judgment model.
5. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 4, characterized in that step 2.1 comprises: data cleaning, feature factor quantization, feature expansion, feature selection, and outlier handling.
6. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 4, characterized in that step 2.2 comprises:
Full-feature training: the sample is all sample user data, and the model inputs are all business indicators;
Important-feature training: the sample is all sample user data, and the model inputs are the indicators ranked in the top 40% by importance;
Full-feature cross-validation training: the sample user data is split evenly into 10 parts; each time, 9 parts are used as the training sample and the remaining part as the prediction sample, iterating 10 times; the model inputs are all business indicators;
Important-feature cross-validation training: the sample user data is split evenly into 10 parts; each time, 9 parts are used as the training sample and the remaining part as the prediction sample, iterating 10 times; the model inputs are the indicators ranked in the top 40% by importance.
7. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 4, characterized in that step 2.3 comprises: scoring the top-tier customers obtained from the top-tier customer judgment model with the logistic regression model to obtain comprehensive scores; and setting multiple comprehensive score intervals to obtain the top-tier customer grade judgment model.
8. The top-tier customer optimized identification method based on random forest and logistic regression according to claim 1, characterized in that the method further comprises: integrating the trained model, collecting user feature data through a data interface, and periodically judging customers' quality grades.
9. A top-tier customer optimized identification device based on random forest and logistic regression, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1-8.
10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, performs the top-tier customer optimized identification method based on random forest and logistic regression according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810027580.1A CN108364191A (en) | 2018-01-11 | 2018-01-11 | Top-tier customer Optimum Identification Method and device based on random forest and logistic regression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108364191A true CN108364191A (en) | 2018-08-03 |
Family
ID=63011006
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810027580.1A Pending CN108364191A (en) | 2018-01-11 | 2018-01-11 | Top-tier customer Optimum Identification Method and device based on random forest and logistic regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108364191A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180803 |