CN107862347A

CN107862347A - A method for detecting electricity theft based on random forest

Info

Publication number: CN107862347A
Application number: CN201711260280.XA
Authority: CN
Inventors: 刘晓; 施亚林; 张同乔; 张若冰
Original assignee: Jinan Power Supply Co of State Grid Shandong Electric Power Co Ltd
Current assignee: Jinan Power Supply Co of State Grid Shandong Electric Power Co Ltd
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2018-03-30

Abstract

The invention discloses a method for detecting electricity theft based on random forest, comprising the following steps: obtaining power system user data, extracting from the marketing system the user data that needs to be judged, and screening it to reject data for which electricity theft is not possible; preprocessing the screened raw data and extracting features, where feature extraction includes extracting a variance feature and a zero-percentage feature; and testing the preprocessed data with the random forest algorithm to compute the final experimental result. The invention largely eliminates the heavy consumption of manpower and material resources inherent in existing manual anti-theft methods, reduces the cost of anti-theft work, and improves its efficiency. Using big-data tools to assist anti-theft work also helps improve its accuracy and follows the general trend of the power industry.

Description

A method for detecting electricity theft based on random forest

Technical field

The present invention relates to the technical field of power systems, and in particular to a method for detecting electricity theft based on random forest.

Background art

Electric energy has long been an indispensable energy source and an important driving force of modern social life and development. A shortage of electric energy leaves residents' normal life unprotected and prevents industrial production from proceeding normally. Nevertheless, some offenders, seeking private gain, use electricity illegally by stealing it and thereby evade payment of electricity charges. This behavior seriously harms the interests of the country and the people and may constitute a crime. Anti-theft work has therefore always been one of the important tasks of electric power companies.

Traditional anti-theft methods include coarse-grained measures such as regular inspections and periodic verification of electricity meters. These methods have drawbacks: combating electricity theft in this way consumes substantial manpower and material resources. A data mining approach instead collects customers' electricity consumption data, uses intelligent algorithms to extract and analyze feature quantities from the data, and judges on that basis whether electricity theft has occurred at the customer side. This effectively avoids the excessive workload and low efficiency of traditional anti-theft methods.

In summary, the prior art still lacks an effective solution to the problem of combating electricity theft efficiently and accurately.

Summary of the invention

To remedy the deficiencies of the prior art, the invention provides an anti-theft method based on the random forest algorithm. The invention introduces the random forest algorithm in aspects such as system architecture and data processing, so that the algorithm plays an important role in improving the efficiency of anti-theft work.

A method for detecting electricity theft based on random forest comprises the following steps:

Obtaining power system user data, extracting from the marketing system the user data that needs to be judged, and screening it to reject data for which electricity theft is not possible;

Preprocessing the screened raw data, including: comparing the data of electricity theft users with the data of normal users, comparing the differences in their electricity consumption features, and extracting the features that differ clearly and are distinctive; then building an expert sample set and performing feature extraction, where feature extraction includes extracting a variance feature and extracting a zero-percentage feature;

Testing the preprocessed data with the random forest algorithm and computing the final experimental result, specifically: the random forest algorithm performs decision tree classification on the user data, and the final classification result is decided by a vote of the trained decision trees, thereby judging whether a user has committed electricity theft.

Further, the screening of the data includes: the user data extracted from the marketing system covers all electricity usage types; with reference to the usage type, the information of large users for whom electricity theft is not possible is rejected, and the information of users whose theft has already been verified or whose electricity terminal has raised an alarm is also removed.

Further, the specific formula for extracting the variance feature is:

$$V_i = \frac{\sum_{k} \left(X_{i_k} - \bar{X}_i\right)^2}{k}$$

where V_i is the variance of the user's electricity consumption; X_{i_k} is the electricity consumption of the i-th user on day k; \bar{X}_i is the user's average electricity consumption; and k is the size of the user's data set;

The variance mainly reflects the fluctuation of the data. When a user's electricity consumption data fluctuates markedly, that is, the consumption fluctuates over a long period and the variance is large, the user has a higher likelihood of electricity theft.

Further, the specific formula for extracting the zero-percentage feature is:

$$P_{Zero_i} = \frac{X_j}{X_i} \times 100\%$$

where P_{Zero_i} is the zero percentage; X_j is the number of zero-valued data points of the i-th user; and X_i is the total number of data points of the i-th user;

Except in extremely special circumstances, if a user's daily electricity consumption is always zero, the likelihood of theft is high; if a user's consumption is zero most of the time, apart from a small number of days, there is a considerable likelihood of theft; and if a user's consumption is intermittently zero, there is some likelihood of theft.

Further, the random forest is an ensemble classifier composed of a group of decision tree classifiers {h(X, θ_k), k = 1, 2, ..., K}, where {θ_k} are independent and identically distributed random vectors and K denotes the number of decision trees in the random forest. Given the independent variable X, the decision tree classifiers determine the optimal classification result by voting.

Further, the decision tree classifiers use the CART decision tree method, specifically: for each feature, the CART algorithm calculates the Gini(t) index value of every possible split on that feature and takes the split with the smallest Gini(t) value as the optimal split for that feature; it then compares the Gini(t) values of the optimal splits of all candidate features. The feature with the smallest Gini(t) value is selected as the splitting feature at the node, and branches are created according to its feature values. This process is repeated, further dividing the samples at each non-leaf node, until a stopping criterion is reached.

Further, the specific steps for generating the random forest are as follows:

Assuming the forest to be built has size k, k new bootstrap sample sets are generated from the training sample set by the Bagging algorithm;

Each bootstrap sample set is used to build one classification tree, so that k new classification trees are produced in total;

Given n features, m_try features (m_try ≤ n) are randomly selected at each node of every tree; the information content of each selected feature is calculated, and, following the principle of minimum node impurity, the feature with the greatest classification ability among the m_try features is chosen to split the node;

Each tree grows to the maximum extent, until the impurity of every leaf node reaches its minimum, and no pruning is performed;

New unknown samples are predicted with the multiple CART tree classifiers that have been built, and the classification result of an unknown sample is determined by the number of votes cast by the tree classifiers.

Further, the Bagging algorithm is an ensemble learning algorithm. Given a weak learning algorithm and a training sample set T = {(x₁, y₁), (x₂, y₂), ..., (x_n, y_n)}, samples are drawn from it with replacement, so that each base classifier receives a training subset that has the same size as the original training set but differs from the others in content; different base classifiers can then be trained on these subsets.

Further, the specific content of the Bagging algorithm is as follows:

Assuming the original data set contains n samples in total, m data points (m ≤ n) are drawn from it randomly, independently, and with replacement to form a new bootstrap training data set;

The above process is repeated to form multiple mutually independent bootstrap training data sets;

The same number of mutually independent sub-classifiers are trained, one on each independent bootstrap training data set;

The final decision of the algorithm is determined by a vote over the individual decisions of these independent sub-classifiers.
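As a non-limiting illustration, the Bagging steps above could be sketched in Python as follows. The helper names (`bootstrap_sample`, `majority_vote`, `train_majority_stub`) and the toy data are hypothetical and do not appear in this application; the weak learner is a deliberately trivial stand-in for the CART trees described elsewhere in this description.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, m, rng):
    """Draw m items randomly, independently, and with replacement."""
    return [rng.choice(dataset) for _ in range(m)]


def majority_vote(predictions):
    """Return the label predicted by the largest number of sub-classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def train_majority_stub(sample):
    """Hypothetical weak learner: always predicts its sample's most common label."""
    label = majority_vote([y for _, y in sample])
    return lambda features: label


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
dataset = [
    ([0.1], "normal"),
    ([0.2], "normal"),
    ([0.9], "theft"),
    ([0.3], "normal"),
]

# One bootstrap training set and one sub-classifier per round.
classifiers = [
    train_majority_stub(bootstrap_sample(dataset, len(dataset), rng))
    for _ in range(5)
]

# The final decision is the vote over the individual decisions.
votes = [classify([0.5]) for classify in classifiers]
print(majority_vote(votes))
```

Because each bootstrap set is drawn independently, the sub-classifiers can be trained in any order or in parallel.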

Compared with prior art, the beneficial effects of the invention are as follows：

The invention uses the random forest algorithm to judge electricity theft, which largely eliminates the heavy consumption of manpower and material resources inherent in existing manual anti-theft methods, reduces the cost of anti-theft work, and improves its efficiency. Using big-data tools to assist anti-theft work also helps improve its accuracy and follows the general trend of the power industry.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of it.

Fig. 1 shows the user electricity theft identification flow of the present invention;

Fig. 2 shows the electricity consumption data processing and feature extraction flow of the present invention;

Fig. 3 is a simplified structural diagram of a decision tree in the random forest algorithm used by the present invention.

Detailed description of the embodiments

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which this application belongs.

It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments according to this application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

Explanation of terms: a random forest is a classifier comprising multiple decision trees. Sub-datasets are built by sampling with replacement from a large amount of raw data, and a sub-decision tree is then built from each sub-dataset. Each subtree branches on candidate features, the data are encoded and classified via those features, and the outcome for each batch of data is finally determined from the results of the many classification runs performed in the algorithm.

As described in the background, the prior art leaves open the problem of how to combat electricity theft accurately. To solve this technical problem, the application proposes a method for detecting electricity theft based on random forest. The application obtains power system user data, rejects data for which theft is not possible, preprocesses the data by extracting the variance feature and the zero-percentage feature, and uses the random forest algorithm to test the data and obtain the final detection result. Analyzing power system user data with an intelligent algorithm effectively improves the efficiency of anti-theft work and greatly reduces the consumption of manpower and material resources.

In a typical embodiment of the application, as shown in Fig. 1, a method for detecting electricity theft based on random forest is provided, comprising the following steps:

Step 1: obtain power system user data and reject data for which electricity theft is not possible. Specifically, the electricity consumption data of power system users, acquired through techniques such as remote meter reading and automatic meter reading, are managed and stored in a unified manner. The user data that needs to be judged is extracted from the marketing system and screened: reputable users in non-residential usage categories are selectively rejected, and the data of all terminal-alarm users and all known theft users are rejected.

Regarding the screening of the data: the user data extracted from the marketing system covers all electricity usage types, such as residential electricity use, large industrial electricity use, and general industrial and commercial electricity use. During screening it should be recognized that electricity theft is the behavior of only a very small number of users. To improve efficiency and reduce the amount of computation, the information of large users such as schools and banks is rejected as appropriate; the information of users whose theft has already been verified or whose electricity terminal has raised an alarm is also removed.
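By way of a minimal sketch only, the screening of step 1 could be expressed in Python as follows. The record layout (`UserRecord` and its fields, including the `low_risk_large_user` flag) is an assumption made for illustration; the application does not specify how the marketing system represents these attributes.

```python
from dataclasses import dataclass


@dataclass
class UserRecord:
    user_id: str
    usage_type: str            # e.g. "residential", "general_commercial"
    low_risk_large_user: bool  # hypothetical flag, e.g. a school or a bank
    confirmed_theft: bool      # theft already verified
    terminal_alarm: bool       # electricity terminal has raised an alarm


def screen_users(records):
    """Keep only the users that still need to be judged by the classifier."""
    return [
        r for r in records
        if not r.low_risk_large_user
        and not r.confirmed_theft
        and not r.terminal_alarm
    ]


records = [
    UserRecord("u1", "residential", False, False, False),
    UserRecord("u2", "large_industrial", True, False, False),
    UserRecord("u3", "residential", False, True, False),
    UserRecord("u4", "general_commercial", False, False, True),
]

kept = screen_users(records)
print([r.user_id for r in kept])  # ['u1']
```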

Step 2: preprocess the simplified raw data, identify the important electricity consumption features for scrutinizing theft users, build an expert sample set, and perform feature extraction.

Initial analysis of the data: the data of theft users are compared with the data of normal users, the differences in their electricity consumption features are compared, and the features that differ clearly and are distinctive are extracted; an expert sample set is built, and feature extraction is performed.

In the random forest algorithm, the main feature values extracted are the variance and the zero percentage, and the algorithm operates with the variance feature and the zero-percentage feature as its principal features. The specific method is as follows:

(1) Extracting the variance feature

The variance mainly reflects the fluctuation of the data. When a user's electricity consumption data fluctuates markedly, that is, the consumption fluctuates over a long period and the variance is large, the user has a higher likelihood of electricity theft. The formula for extracting the variance feature is as follows:

$$V_i = \frac{\sum_{k} \left(X_{i_k} - \bar{X}_i\right)^2}{k}$$

where V_i is the variance of the user's electricity consumption; X_{i_k} is the electricity consumption of the i-th user on day k; \bar{X}_i is the user's average electricity consumption; and k is the size of the user's data set.
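For illustration, the variance feature could be computed as in the following Python sketch, which divides the sum of squared deviations by the number of data points (a population variance). The function name `usage_variance` and the sample series are hypothetical.

```python
def usage_variance(daily_usage):
    """Variance of one user's daily consumption: sum((x - mean)^2) / k."""
    k = len(daily_usage)
    if k == 0:
        raise ValueError("daily_usage must not be empty")
    mean = sum(daily_usage) / k
    return sum((x - mean) ** 2 for x in daily_usage) / k


steady = [10.0, 10.0, 10.0, 10.0]   # no fluctuation
erratic = [0.0, 20.0, 0.0, 20.0]    # strong fluctuation around the same mean

print(usage_variance(steady))   # 0.0
print(usage_variance(erratic))  # 100.0
```

Both series have the same mean consumption, so only the variance separates the steady user from the erratic one.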

(2) Extracting the zero-percentage feature

Analysis of user electricity consumption data leads to the following conclusions:

(1) Except in extremely special circumstances, if a user's daily electricity consumption is always zero, the likelihood of theft is high;

(2) If a user's consumption is zero most of the time, apart from a small number of days, there is a considerable likelihood of theft;

(3) If a user's consumption is intermittently zero, there is some likelihood of theft.

The formula for extracting the zero-percentage feature is as follows:

$$P_{Zero_i} = \frac{X_j}{X_i} \times 100\%$$

where P_{Zero_i} is the zero percentage; X_j is the number of zero-valued data points of the i-th user; and X_i is the total number of data points of the i-th user.
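As a hedged example, the zero-percentage feature could be computed as follows in Python. The function name `zero_percentage` and the sample readings are hypothetical, and treating a reading as zero only when it equals exactly 0 is an assumption; a practical system might apply a small tolerance.

```python
def zero_percentage(daily_usage):
    """Share of zero readings in one user's data, expressed as a percentage."""
    total = len(daily_usage)
    if total == 0:
        raise ValueError("daily_usage must not be empty")
    zeros = sum(1 for x in daily_usage if x == 0)
    return zeros / total * 100.0


readings = [0, 0, 5.2, 0, 3.1, 0, 0, 0]  # six zero readings out of eight
print(zero_percentage(readings))  # 75.0
```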

Step 3: test the sample data with the random forest algorithm and obtain the final experimental prediction result.

As shown in Fig. 2, the electricity consumption data feature extraction and classification flow comprises obtaining user data, data preprocessing, data feature extraction, and prediction with the random forest algorithm to obtain the result.

Use of the random forest algorithm: the random forest algorithm performs decision tree classification on the user data, and the final classification result is decided by a vote of the trained decision trees, thereby judging whether a user has committed electricity theft.

The random forest is an ensemble classifier composed of a group of decision tree classifiers {h(X, θ_k), k = 1, 2, ..., K}, where {θ_k} are independent and identically distributed random vectors and K denotes the number of decision trees in the random forest. Given the independent variable X, the decision tree classifiers determine the optimal classification result by voting.

A random forest is a classifier that integrates many decision trees. If a decision tree is regarded as an expert in a classification task, a random forest is many experts classifying the task together.

As shown in Fig. 3, the specific steps for generating the random forest are as follows:

Assume the forest to be built has size k. From the training sample set, k new bootstrap sample sets are generated by the Bagging method.

Each bootstrap sample set is used to build one classification tree, so that k new classification trees are produced in total.

Given n features, m_try features (m_try ≤ n) are randomly selected at each node of every tree. The information content of each selected feature is calculated, and, following the principle of minimum node impurity, the feature with the greatest classification ability among the m_try features is chosen to split the node.

Each tree grows to the maximum extent, until the impurity of every leaf node reaches its minimum, and no pruning is performed.
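The per-node feature selection in the steps above could be sketched as follows. This is a simplified illustration: the function `choose_split_feature`, the feature name `mean_usage`, and the impurity values are hypothetical, and the impurity of each candidate is assumed to have been computed already.

```python
import random


def choose_split_feature(feature_names, m_try, impurity_of, rng):
    """Randomly pick m_try candidate features, then keep the least impure one."""
    if not 1 <= m_try <= len(feature_names):
        raise ValueError("m_try must satisfy 1 <= m_try <= n")
    candidates = rng.sample(feature_names, m_try)
    return min(candidates, key=impurity_of)


# Hypothetical impurity of the best split on each feature at one node.
impurities = {"variance": 0.18, "zero_percentage": 0.12, "mean_usage": 0.40}

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
best = choose_split_feature(list(impurities), 2, impurities.get, rng)
print(best)
```

Restricting each node to a random subset of features is what decorrelates the trees; with m_try equal to n, every tree would consider the same candidates at every node.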

The Bagging algorithm is an ensemble learning algorithm. Given a weak learning algorithm and a training sample set T = {(x₁, y₁), (x₂, y₂), ..., (x_n, y_n)}, samples are drawn from it with replacement, so that each base classifier receives a training subset that has the same size as the original training set but differs from the others in content; different base classifiers can then be trained on these subsets. The specific content of the algorithm is as follows:

Among ensemble algorithms that manipulate the training set, Bagging is one of the most intuitive and simplest. For unstable learning algorithms (such as decision trees and artificial neural networks), Bagging can effectively improve the generalization ability of the algorithm.

The Boosting algorithm is quite similar to the Bagging algorithm, and both use classifiers of the same type. Their most important difference is that Bagging draws the training set for each base classifier at random, so the trained base classifiers are mutually independent and have no obvious correlation; in Boosting, the classifiers are obtained by serial training, and the training set used in each round depends on the previous learning results. In addition, the classifiers trained by Bagging all carry the same weight, whereas the classifiers obtained by Boosting carry different weights.

During training, Boosting generates different sub-classifiers by re-weighting the samples in the training data set. Its core idea is to redistribute the sample weights so that misclassified samples receive more attention as the classifiers are trained one by one. The algorithm proceeds as follows: first, Boosting assigns equal weights to every sample in the training data set, uses them to train the first sub-classifier, and tests it on the training samples to obtain a prediction for each sample.

Based on the predictions, the weights of misclassified samples are increased to give them more attention, while the weights of correctly classified samples are decreased. After the weights are adjusted, the next sub-classifier is trained on this new training set, and the above steps are repeated until the error rate falls below a preset threshold.
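The re-weighting round described above can be illustrated with the following Python sketch. The application does not state a specific update rule; the AdaBoost-style formula used here is an assumption chosen only because it is a common instance of this idea, and the function name `reweight` is hypothetical.

```python
import math


def reweight(weights, misclassified):
    """One AdaBoost-style round: raise misclassified weights, lower the rest."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must lie in (0, 0.5) for this update")
    alpha = 0.5 * math.log((1 - error) / error)
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated]  # renormalize so the weights sum to 1


weights = [0.25, 0.25, 0.25, 0.25]          # equal initial weights
mistakes = [False, False, False, True]      # only the last sample was misclassified
new_weights = reweight(weights, mistakes)
print(new_weights)
```

Under this particular rule, the misclassified samples together hold half of the total weight after the update, so the next sub-classifier is pushed toward the cases the previous one got wrong.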

The CART algorithm uses recursive binary partitioning. During node splitting it strictly follows the principle of minimizing the Gini index, recursively dividing the samples at a node into two subsets; partitioning stops when a preset stopping criterion is reached. It follows that every non-leaf node of a CART tree has two branches.

When splitting a node, the CART tree uses the Gini index as the splitting criterion: the attribute with the smallest Gini index value is chosen as the node's splitting attribute.

Assuming the data set T{X, Y} contains samples of k classes, the Gini index is defined as follows:

$$\mathrm{Gini}(t) = 1 - \sum_{j=1}^{k} \left[p(j \mid t)\right]^2$$

where p(j | t) is the probability of class j at node t.
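As a brief illustration, the Gini index of a node, taken in its standard form as one minus the sum of squared class probabilities, could be computed as follows. The function name `gini` and the labels are hypothetical, and p(j | t) is estimated as the fraction of samples of class j at the node.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum over classes of p(j | t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty node in this sketch
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = ["theft", "theft", "theft", "theft"]
mixed = ["theft", "normal", "theft", "normal"]

print(gini(pure))   # 0.0
print(gini(mixed))  # 0.5
```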

When all samples at node t belong to the same class, Gini(t) is zero, indicating that the samples at the node are pure; when the samples at node t are uniformly distributed over the classes, Gini(t) reaches its maximum, indicating that the samples at the node are most impure. If the sample set is divided into m parts, the Gini index of the split is:

$$\mathrm{Gini}_{\mathrm{split}}(t) = \sum_{i=1}^{m} \frac{n_i}{n}\,\mathrm{Gini}(i)$$

where m is the number of child nodes, n_i is the number of samples at child node i, and n is the number of samples at the parent node.
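A minimal sketch of the split index, assuming it is the sample-weighted average of the child nodes' Gini indices, is given below. The function names and the toy partition are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum over classes of p(j | t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(children):
    """Weighted Gini of a split: sum over child nodes of (n_i / n) * Gini(i)."""
    n = sum(len(child) for child in children)
    if n == 0:
        raise ValueError("a split needs at least one sample")
    return sum(len(child) / n * gini(child) for child in children)


left = ["normal", "normal", "normal"]    # pure child, Gini = 0
right = ["theft", "theft", "normal"]     # mixed child, Gini = 4/9

print(gini_split([left, right]))  # approximately 0.2222 (= 2/9)
```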

At each node split, the CART algorithm calculates the Gini(t) index value corresponding to every candidate split; the smaller the value, the more reasonable the split. For each feature in the candidate feature set, the CART algorithm calculates the Gini(t) index value of every possible split on that feature and takes the split with the smallest Gini(t) value as the optimal split for that feature; it then compares the Gini(t) values of the optimal splits of all candidate features. The feature with the smallest Gini(t) value is selected as the splitting feature at the node, and branches are created according to its feature values. This process is repeated, further dividing the samples at each non-leaf node, until a stopping criterion is reached.
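The search for the splitting feature at a single node could be sketched as follows. This is an illustrative simplification rather than a full CART implementation: it assumes numeric features, uses midpoints between adjacent distinct values as candidate thresholds for a binary split, and all function names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 - sum over classes of p(j | t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split, or None."""
    n = len(values)
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


def best_feature(columns, labels):
    """Compare the optimal split of every candidate feature; keep the smallest."""
    results = []
    for name, values in columns.items():
        split = best_threshold(values, labels)
        if split is not None:  # a constant feature offers no split
            results.append((split[1], name, split[0]))
    score, name, threshold = min(results)
    return name, threshold, score


columns = {
    "variance": [1.0, 2.0, 9.0, 8.0],
    "zero_percentage": [0.0, 50.0, 10.0, 60.0],
}
labels = ["normal", "normal", "theft", "theft"]

print(best_feature(columns, labels))  # ('variance', 5.0, 0.0)
```

In this toy data the variance feature separates the two classes perfectly at a threshold of 5.0, so it is chosen over the zero-percentage feature.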

In addition, the above example of the anti-theft method based on the random forest algorithm is given only by way of illustration. In practical applications, the above functions may be allocated to different functional modules as needed, for example in view of the configuration requirements of the corresponding hardware or the convenience of software implementation; that is, the internal structure of the system may be divided into different functional modules to complete all or part of the functions described above. Each functional module may be implemented in the form of hardware or in the form of a software functional module.

A person of ordinary skill in the art will understand that all or part of the flow of the methods in the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and sold or used as an independent product. When executed, the program may perform all or part of the steps of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.

The foregoing describes only preferred embodiments of the application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the application shall fall within the scope of protection of the application.

Claims

1. A method for detecting electricity theft based on random forest, characterized in that it comprises the following steps:

Obtaining power system user data, extracting from the marketing system the user data that needs to be judged, and screening it to reject data for which electricity theft is not possible;

Preprocessing the screened raw data, including: comparing the data of electricity theft users with the data of normal users, comparing the differences in their electricity consumption features, and extracting the features that differ clearly and are distinctive; then building an expert sample set and performing feature extraction, where feature extraction includes extracting a variance feature and extracting a zero-percentage feature;

Testing the preprocessed data with the random forest algorithm and computing the final experimental result, specifically: the random forest algorithm performs decision tree classification on the user data, and the final classification result is decided by a vote of the trained decision trees, thereby judging whether a user has committed electricity theft.

2. The method for detecting electricity theft based on random forest as claimed in claim 1, characterized in that the screening of the data includes: the user data extracted from the marketing system covers all electricity usage types; with reference to the usage type, the information of large users for whom electricity theft is not possible is rejected, and the information of users whose theft has already been verified or whose electricity terminal has raised an alarm is also removed.

3. The method for detecting electricity theft based on random forest as claimed in claim 1, characterized in that the specific formula for extracting the variance feature is:

$$V_i = \frac{\sum_{k} \left(X_{i_k} - \bar{X}_i\right)^2}{k}$$

where V_i is the variance of the user's electricity consumption; X_{i_k} is the electricity consumption of the i-th user on day k; \bar{X}_i is the user's average electricity consumption; and k is the size of the user's data set;

The variance mainly reflects the fluctuation of the data. When a user's electricity consumption data fluctuates markedly, that is, the consumption fluctuates over a long period and the variance is large, the user has a higher likelihood of electricity theft.

4. The method for detecting electricity theft based on random forest as claimed in claim 1, characterized in that the specific formula for extracting the zero-percentage feature is:

$$P_{Zero_i} = \frac{X_j}{X_i} \times 100\%$$

where P_{Zero_i} is the zero percentage; X_j is the number of zero-valued data points of the i-th user; and X_i is the total number of data points of the i-th user;

Except in extremely special circumstances, if a user's daily electricity consumption is always zero, the likelihood of theft is high; if a user's consumption is zero most of the time, apart from a small number of days, there is a considerable likelihood of theft; and if a user's consumption is intermittently zero, there is some likelihood of theft.

5. The method for detecting electricity theft based on random forest as claimed in claim 1, characterized in that the random forest is an ensemble classifier composed of a group of decision tree classifiers {h(X, θ_k), k = 1, 2, ..., K}, where {θ_k} are independent and identically distributed random vectors and K denotes the number of decision trees in the random forest; given the independent variable X, the decision tree classifiers determine the optimal classification result by voting.

6. The method for detecting electricity theft based on random forest as claimed in claim 5, characterized in that the decision tree classifiers use the CART decision tree method, specifically: for each feature, the CART algorithm calculates the Gini(t) index value of every possible split on that feature and takes the split with the smallest Gini(t) value as the optimal split for that feature; it then compares the Gini(t) values of the optimal splits of all candidate features; the feature with the smallest Gini(t) value is selected as the splitting feature at the node, and branches are created according to its feature values; this process is repeated, further dividing the samples at each non-leaf node, until a stopping criterion is reached.

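7. The method for detecting electricity theft based on random forest as claimed in claim 5, characterized in that the specific steps for generating the random forest are as follows:

Assuming the forest to be built has size k, k new bootstrap sample sets are generated from the training sample set by the Bagging algorithm;

Each bootstrap sample set is used to build one classification tree, so that k new classification trees are produced in total;

Given n features, m_try features (m_try ≤ n) are randomly selected at each node of every tree; the information content of each selected feature is calculated, and, following the principle of minimum node impurity, the feature with the greatest classification ability among the m_try features is chosen to split the node;

Each tree grows to the maximum extent, until the impurity of every leaf node reaches its minimum, and no pruning is performed;

New unknown samples are predicted with the multiple CART tree classifiers that have been built, and the classification result of an unknown sample is determined by the number of votes cast by the tree classifiers.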

8. The method for detecting electricity theft based on random forest as claimed in claim 7, characterized in that the Bagging algorithm is an ensemble learning algorithm: given a weak learning algorithm and a training sample set T = {(x₁, y₁), (x₂, y₂), ..., (x_n, y_n)}, samples are drawn from it with replacement, so that each base classifier receives a training subset that has the same size as the original training set but differs from the others in content; different base classifiers can then be trained on these subsets.

9. The method for detecting electricity theft based on random forest as claimed in claim 8, characterized in that the specific content of the Bagging algorithm is as follows:

Assuming the original data set contains n samples in total, m data points (m ≤ n) are drawn from it randomly, independently, and with replacement to form a new bootstrap training data set;

The above process is repeated to form multiple mutually independent bootstrap training data sets;

The same number of mutually independent sub-classifiers are trained, one on each independent bootstrap training data set;

The final decision of the algorithm is determined by a vote over the individual decisions of these independent sub-classifiers.