CN112836926B - Enterprise operation condition evaluation method based on electric power big data - Google Patents

Publication number
CN112836926B
Authority
CN
China
Prior art keywords
enterprise
operation condition
abnormal
data
enterprises
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011571639.7A
Other languages
Chinese (zh)
Other versions
CN112836926A (en
Inventor
王茂宁
邹开欣
钟羽中
邓霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011571639.7A priority Critical patent/CN112836926B/en
Publication of CN112836926A publication Critical patent/CN112836926A/en
Application granted granted Critical
Publication of CN112836926B publication Critical patent/CN112836926B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an enterprise operation condition evaluation method based on electric power big data. By extracting features from the original data hierarchically, the invention can better represent the enterprise operation condition from multiple aspects; the method mines the enterprise operation condition information contained in the electric power big data with as little influence from subjective factors and experience factors as possible, so as to ensure the accuracy of the enterprise operation condition evaluation.

Description

Enterprise operation condition evaluation method based on electric power big data
Technical Field
The invention belongs to the technical field of electric power big data application, and relates to an enterprise operation condition evaluation technology based on electric power big data.
Background
Big data technology can promote the upgrading and transformation of information technology platforms, supplement the ability to analyze and use unstructured data, and enhance the ability to mine value from massive data resources. Electric power big data is a new type of asset for power companies and can drive their business management toward finer-grained and more efficient operation. Electric power big data stores abundant information about power-consuming enterprises; analyzing it makes it possible to mine the enterprise operation conditions hidden in it.
With the recent rapid development of data availability, computing power, and new algorithms, machine learning has gradually become one of the key methods for implementing artificial intelligence (AI). Machine learning is a subset of artificial intelligence within the broader field of computer science. It uses computers and algorithms to learn and discover "patterns and insights" from "data", because in many cases the patterns and insights are hidden within the data. Over time, the data accumulated from business processes can become too complex for humans to understand, whereas algorithms can mine patterns and insights from data more quickly and accurately than humans.
The Lanzhou power supply company of the State Grid Gansu electric power company has provided a power user credit evaluation method and system that integrates the payment indexes and industry disclosure indexes of power users. The system constructs the payment credit evaluation of a power user from indexes such as registration time, fulfillment rate, deviation share, average power factor, payment proportion, average payment days, arrearage percentage, and prestoring percentage. It establishes the industry credit of the power user from indexes such as industry contribution rate and industry public credit investigation. After the indexes are normalized, a K-means clustering method is used for classification to obtain the credit evaluation grade of the enterprise. The main problems of this power user credit evaluation method and system are as follows: (1) because the power data mostly reflect the production and operation conditions of enterprises, and the enterprise credit reflected in public credit investigation data is limited, the enterprise credit evaluated from these data is not highly credible; (2) because the number of cluster centers must be set manually, subjective factors and experience factors are introduced, and because the credit degree of an enterprise is not known in advance, the resulting credit grade is not reliable; (3) the credit grade is a discrete variable, so expressing credit through grades cannot reflect the difference in credit between enterprises of the same grade.
Therefore, most existing methods that use electric power big data to analyze power-consuming enterprises and evaluate their credit have difficulty reflecting the production and operation conditions of enterprises objectively and effectively, and consequently have difficulty providing effective data support for enterprise development.
Disclosure of Invention
In view of the current state of the art, in which enterprise user evaluation based on electric power big data has poor reliability and has difficulty reflecting the degree of difference between the operation conditions of individual enterprises, the invention aims to provide an enterprise operation condition evaluation method based on electric power big data.
The invention provides an enterprise operation condition evaluation method based on electric power big data, which mainly comprises the following steps:
S1 data preprocessing
According to a plurality of power-consumption-related data sets of the enterprises, enterprise samples lacking any of the data sets are filtered out, and missing values and zero values of the samples in the data sets are processed.
S2 hierarchical feature extraction
A plurality of secondary dimension features representing the enterprise electricity utilization information are extracted from the preprocessed data sets, and primary dimension features representing the degree of abnormality of the enterprise electricity utilization information are then obtained from the secondary dimension features through an isolated forest anomaly detection algorithm; this comprises the following sub-steps:
S21, extracting secondary dimension feature values from the preprocessed data sets according to the secondary dimension feature calculation logic, and normalizing the extracted secondary dimension feature values;
S22, obtaining the corresponding primary dimension feature values from the normalized secondary dimension feature values through the isolated forest anomaly detection algorithm;
S3, adding all the primary dimension feature values of each enterprise to obtain the total abnormal score of the enterprise, then judging according to a given criterion whether the operation condition of the enterprise is abnormal; if so, proceeding to step S4; if not, the enterprise operation condition is normal;
S4, clustering all enterprises with abnormal operation conditions to obtain enterprises with good operation conditions and enterprises with poor operation conditions.
In the above enterprise operation condition evaluation method based on electric power big data, in step S1, the power-consumption-related data sets include an enterprise safety basic electricity consumption information data set, an enterprise electricity consumption data set, an enterprise settlement electricity quantity and electricity charge data set, and an enterprise receivable electricity charge data set. Some abnormal data have a large impact on the accuracy of the analysis. Therefore, in the data preprocessing stage, missing data sets and missing values within the data sets are handled separately, as follows:
(1) Missing data set handling
Enterprise samples lacking a certain data set are filtered out directly. When a data set is absent for an enterprise sample, the evaluation of that enterprise's operation condition is directly affected; therefore, to improve the accuracy of the evaluation, such enterprise samples are filtered out directly.
(2) Missing value and zero value handling in the data sets
When a sample value is missing from a data set, linear interpolation can be used to fill it in.
Since the calculations involved in the invention use relative quantities, zero-valued samples are replaced by a given very small value to avoid generating infinite or non-numerical (NaN) values.
In the above enterprise operation condition evaluation method based on electric power big data, in step S2, in order to better represent the enterprise operation condition from multiple aspects, the invention extracts features from the original data hierarchically, in two levels: secondary dimension features representing the enterprise electricity utilization information, and primary dimension features representing the degree of abnormality of the enterprise electricity utilization information. The secondary dimension features include, but are not limited to, safety power utilization grade classification, power utilization duration, power consumption analysis in a first given time period of the enterprise, monthly average electric quantity industry level, enterprise power consumption fluctuation condition, power consumption difference degree, periodic fluctuation, average total electric quantity growth trend in a second given time period of the enterprise, current accumulated late arrearage, and number of late arrearage payments in a third given time period. The primary dimension features include, but are not limited to, basic electricity utilization information, electric quantity level, electric quantity fluctuation, electric quantity trend, and default electricity utilization information.
In step S21, the secondary dimension feature calculation logic is as follows:
(1) for the safety power utilization grade classification, according to the safety power utilization grade of the enterprise;
(2) for the power utilization duration, the duration is counted in years, and a duration of less than 1 year is counted as 1 year;
(3) for the power consumption analysis in a first given time period of the enterprise, according to the average electricity consumption of the enterprise in the first given time period;
(4) for the monthly average electric quantity industry level, according to the ratio of the average electric quantity of the enterprise in the first given time period to the average electric quantity of the industry in the first given time period;
(5) for the enterprise power consumption fluctuation condition, according to the electric quantity standard value of the enterprise in the first given time period and the average electric quantity of the enterprise in the first given time period;
(6) for the power consumption difference degree, according to the ratio of (the maximum electricity consumption of the enterprise in the first given time period minus the minimum electricity consumption of the enterprise) to the average monthly electricity consumption of the enterprise in the first given time period;
(7) for the periodic fluctuation, three aspects are involved: (i) the ratio of the electric quantity standard value of the enterprise in the last 3 months to the electric quantity standard value of the industry in the last 3 months; (ii) the ratio of the electric quantity standard value of the enterprise in the last 6 months to the electric quantity standard value of the industry in the last 6 months; (iii) the ratio of the electric quantity standard value of the enterprise in the last 9 months to the electric quantity standard value of the industry in the last 9 months;
(8) for the average total electric quantity growth trend of the enterprise in a second given time period, according to the sum of the monthly growth rates of total electricity consumption in the second given time period and the length of the second given time period;
(9) for the current accumulated late arrearage, according to the counted cumulative amount of arrears;
(10) for the number of late arrearage payments in the third given time period, according to the counted number of late payments within the third given time period.
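As an illustration of the calculation logic above, the following minimal Python sketch computes three of the secondary dimension features: items (2), (4), and (6). The function names and the representation of electricity consumption as a plain list of monthly values are assumptions made for this example only; they are not part of the described method.

```python
from statistics import mean


def usage_years(years_connected):
    """Item (2): duration counted in years; less than 1 year counts as 1."""
    return max(1, years_connected)


def industry_level(enterprise_monthly, industry_monthly):
    """Item (4): enterprise average over industry average, same period."""
    return mean(enterprise_monthly) / mean(industry_monthly)


def difference_degree(enterprise_monthly):
    """Item (6): (maximum - minimum) over the monthly average."""
    return (max(enterprise_monthly) - min(enterprise_monthly)) / mean(enterprise_monthly)


# Hypothetical monthly consumption values for one enterprise and its industry.
enterprise = [100, 200, 300]
industry = [400, 400, 400]
print(usage_years(0), industry_level(enterprise, industry), difference_degree(enterprise))
```

Because these features are ratios, a zero in the denominator would produce an error or an infinite value, which is why the preprocessing step replaces zero values with a small constant.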
In step S22, the primary dimension features are obtained from the associated secondary dimension features by an isolated forest anomaly detection algorithm. The basic electricity utilization information is associated with safety electricity utilization grade classification and electricity utilization duration; the electricity level is associated with an analysis of electricity usage and a monthly average electricity industry level for a first given period of time for the enterprise; the electric quantity fluctuation is associated with the electric quantity fluctuation condition, the electric quantity difference degree and the periodic fluctuation of an enterprise; the power trend is associated with an average total power increase trend of the enterprise in a second given time period; the default electricity utilization information is associated with the current accumulated late arrearage and the number of late payment times in a third given time period.
In step S22, the corresponding primary dimension feature values are obtained through the isolated forest anomaly detection algorithm according to the following sub-steps:
S221, constructing a training set from all the normalized secondary dimension feature values of every enterprise, and then training on the constructed training set to obtain an isolated forest anomaly detection model consisting of a plurality of isolated trees (isolation trees);
S222, traversing each enterprise, and inputting the normalized secondary dimension feature values of the enterprise associated with each primary dimension feature into the trained isolated forest anomaly detection model to obtain the primary dimension feature values of the enterprise.
In step S221, the process of constructing an isolated tree includes the following sub-steps:
S2211, randomly sampling over the secondary dimension features in the training set, drawing a total of
Figure BDA0002862910210000041
data samples to form an isolated tree training subset;
S2212, randomly selecting a secondary dimension feature from the isolated tree training subset, randomly selecting a value within the value range of that feature, and performing a binary split of the samples: samples whose value is smaller than the selected value go to the left of the node, and samples whose value is greater than or equal to it go to the right, yielding a splitting condition and the left and right data sets; the process is repeated on the left and right data sets until a termination condition is reached; the termination conditions are the following two:
1) the data cannot be divided further (only one sample remains, or all samples are identical);
2) the height of the isolated tree reaches
Figure BDA0002862910210000042
Steps S2211 to S2212 are repeated until the number of isolated trees reaches a set value, and all the constructed isolated trees form the isolated forest anomaly detection model.
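The tree construction in steps S2211 to S2212 can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the subsample size and the height limit appear only as formula images in the source, so the sketch assumes the values commonly used for isolation forests (a subsample of up to 256 samples and a height limit of ceil(log2(subsample size))). All function names are hypothetical, and skipping constant features when choosing the split feature is also an assumption.

```python
import math
import random


def build_isolated_tree(samples, height, height_limit, rng):
    """Recursively split `samples`, a list of equal-length feature tuples."""
    # Termination: height limit reached, or the data cannot be divided
    # further (at most one sample, or all samples identical).
    if (height >= height_limit or len(samples) <= 1
            or all(s == samples[0] for s in samples)):
        return {"size": len(samples)}
    # Randomly pick a feature. Constant features are skipped (assumption)
    # so that the chosen value range is never empty.
    candidates = [f for f in range(len(samples[0]))
                  if min(s[f] for s in samples) < max(s[f] for s in samples)]
    feature = rng.choice(candidates)
    low = min(s[feature] for s in samples)
    high = max(s[feature] for s in samples)
    split = rng.uniform(low, high)
    # Smaller values go left; greater-or-equal values go right.
    left = [s for s in samples if s[feature] < split]
    right = [s for s in samples if s[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_isolated_tree(left, height + 1, height_limit, rng),
        "right": build_isolated_tree(right, height + 1, height_limit, rng),
    }


def build_isolated_forest(data, n_trees=100, subsample_size=256, seed=0):
    """Build `n_trees` trees, each on a random subsample of `data` (a list)."""
    rng = random.Random(seed)
    psi = min(subsample_size, len(data))
    # Assumed height limit: ceil(log2(psi)), as in the usual formulation.
    height_limit = math.ceil(math.log2(psi)) if psi > 1 else 0
    return [build_isolated_tree(rng.sample(data, psi), 0, height_limit, rng)
            for _ in range(n_trees)]


def tree_height(node):
    """Number of edges on the longest root-to-leaf path."""
    if "size" in node:
        return 0
    return 1 + max(tree_height(node["left"]), tree_height(node["right"]))
```

Each leaf records how many samples ended up in it; samples that are isolated after few splits sit close to the root, which is what the anomaly score later exploits.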
In step S222, the abnormality score corresponding to the first-level dimensional characteristic parameter of the enterprise sample is obtained according to the following formula:
s(x,n)=2^(-E(h(x))/c(n));
in the formula, x represents the set of normalized secondary dimension feature parameters associated with a primary dimension feature parameter of an enterprise sample; h(x) represents the height of the enterprise sample x in an isolated tree, that is, the number of edges traversed from the root node of the tree to reach the leaf node containing x; and E(h(x)) represents the average height of x over all isolated trees. The lower the height, the higher the anomaly score. c(n) represents the average path length of a binary search tree and is calculated as follows:
c(n)=2H(n-1)-(2(n-1)/n);
n represents the number of enterprises, and H(n-1) represents the harmonic number, estimated as:
H(n-1)=ln(n-1)+ξ;
where ξ represents the Euler constant, whose value is 0.5772156649.
The anomaly score corresponding to each primary dimension feature parameter of the enterprise sample is further normalized to obtain the normalized anomaly score of the enterprise sample, which is the primary dimension feature value.
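The quantities c(n) and H(n-1) defined above can be evaluated directly. The sketch below assumes that the anomaly score takes the usual isolation forest form 2^(-E(h(x))/c(n)), which matches the statement that a lower height gives a higher score; the function names are hypothetical, and the final normalization of the scores is not shown because the text does not specify how it is done.

```python
import math

EULER_CONSTANT = 0.5772156649


def average_path_length(n):
    """c(n) = 2H(n-1) - 2(n-1)/n, with H(n-1) = ln(n-1) + Euler constant."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n-1) is defined")
    harmonic = math.log(n - 1) + EULER_CONSTANT
    return 2 * harmonic - 2 * (n - 1) / n


def anomaly_score(mean_height, n):
    """Assumed score form: 2 ** (-E(h(x)) / c(n))."""
    return 2 ** (-mean_height / average_path_length(n))


# A sample isolated after 2 edges on average scores higher (more
# anomalous) than one that needs 8 edges, for the same n.
print(anomaly_score(2.0, 256), anomaly_score(8.0, 256))
```

When the average height equals c(n), the score is exactly 0.5; heights well below c(n) push the score toward 1.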
In the above enterprise operation condition evaluation method based on electric power big data, in step S3, the criterion given in the invention for judging whether the enterprise operation condition is abnormal is: an enterprise whose total abnormal score is greater than or equal to 0.6 times the maximum total abnormal score of all enterprises (that is, total abnormal score of the enterprise ≥ maximum total abnormal score of all enterprises × 0.6) is determined to be an enterprise with an abnormal operation condition. Enterprises with abnormal operation conditions can still be divided into two categories: enterprises with good operation conditions and enterprises with poor operation conditions. This division is achieved by step S4 of the invention.
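The criterion of step S3 can be written in a few lines. In this minimal sketch the enterprise names and score values are invented for illustration, and the function name is hypothetical; only the rule itself (sum the primary dimension feature values, then compare with 0.6 times the maximum total) comes from the text.

```python
def flag_abnormal(primary_values, factor=0.6):
    """Map each enterprise to True if its total abnormal score is at
    least `factor` times the maximum total score over all enterprises."""
    totals = {name: sum(values) for name, values in primary_values.items()}
    threshold = factor * max(totals.values())
    return {name: total >= threshold for name, total in totals.items()}


# Hypothetical primary dimension feature values for three enterprises.
example = {
    "A": [0.9, 0.8],
    "B": [0.2, 0.3],
    "C": [0.6, 0.5],
}
print(flag_abnormal(example))
```

Because the threshold is relative to the maximum total score, the enterprise with the highest total is always flagged as abnormal.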
In the above enterprise operation condition evaluation method based on electric power big data, in step S4, a K-means clustering method is used to cluster the enterprises with abnormal operation conditions, which specifically includes the following sub-steps:
S41, randomly selecting 2 samples from the enterprises with abnormal operation conditions as the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions;
S42, calculating the distances from the remaining samples of the enterprises with abnormal operation conditions to the two class centers (good operation condition and poor operation condition);
these distances can be calculated according to the following formula:
Figure BDA0002862910210000061
in the formula, y_i represents the primary dimension feature parameters of the i-th enterprise sample; u_k represents the k-th cluster center, where k = 1, 2; y_i and u_k are both p-dimensional vectors, where p represents the number of primary dimension parameters, y_i = {y_i1, y_i2, …, y_ip}, u_k = {u_k1, u_k2, …, u_kp}.
S43, assigning each remaining sample of the enterprises with abnormal operation conditions to the nearer of the two class centers (good operation condition or poor operation condition), completing one round of clustering;
S44, recalculating the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions according to the clustering result of step S43;
the two class centers are recalculated according to the following formula:
u_k=(1/|c_k|)·Σ_{y_i∈c_k} y_i;
in the formula, y_i represents the primary dimension features of the i-th enterprise sample; u_k represents the k-th cluster center; c_k represents the cluster of the k-th category, where k = 1, 2; |c_k| represents the number of enterprise samples in the k-th category;
S45, judging whether the clustering termination condition is met; if so, the final clustering of the enterprises with abnormal operation conditions is complete and the method proceeds to the next step; otherwise, returning to step S42;
in this step, the clustering termination condition is that the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions no longer change, or that the set upper limit on the number of iterations is reached; satisfying either one is sufficient;
S46, taking the ratio of the distance from an enterprise sample to the class center of enterprises with poor operation conditions to its distance from the class center of enterprises with good operation conditions as the enterprise operation condition score, and using it to evaluate the operation condition of each abnormal enterprise;
in this step, K-means clustering yields one class center for good operation conditions and one for poor operation conditions among the enterprises (representing the best and the worst enterprise operation conditions); the class center that lies far from the origin in the positive feature space (features positively correlated with the electric power performance of the enterprise) represents enterprises with excellent performance. The ratio of each enterprise sample's distance to the bad class center to its distance to the good class center in the original feature space is then used as the enterprise operation condition score. The larger the ratio, the closer the enterprise sample is to the good class center and the farther it is from the bad class center, and the better the enterprise operation condition; conversely, the smaller the ratio, the closer the enterprise sample is to the bad class center and the farther it is from the good class center, and the worse the enterprise operation condition. The ratio therefore reflects the differences in operation conditions among enterprises.
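Steps S41 to S46 can be illustrated with a small two-cluster K-means written in plain Python. This is a sketch under stated assumptions: the distance formula is an image in the source, so Euclidean distance is assumed; the good class center is taken to be the one farther from the origin, following the description above; and a sample that coincides with the good center is given an infinite score. All function names and sample values are hypothetical.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two p-dimensional points (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_means(samples, max_iter=100, seed=0):
    """K-means with k = 2: random initial centers, then assign and update."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(samples, 2)]
    for _ in range(max_iter):
        clusters = [[], []]
        for s in samples:
            k = 0 if distance(s, centers[0]) <= distance(s, centers[1]) else 1
            clusters[k].append(s)
        # New center = mean of the cluster; an empty cluster keeps its center.
        new_centers = [
            [sum(col) / len(c) for col in zip(*c)] if c else centers[k]
            for k, c in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return centers


def operation_scores(samples, centers):
    """Score = distance to bad center / distance to good center."""
    origin = [0.0] * len(centers[0])
    good, bad = sorted(centers, key=lambda c: distance(c, origin), reverse=True)
    scores = []
    for s in samples:
        d_good = distance(s, good)
        scores.append(distance(s, bad) / d_good if d_good > 0 else math.inf)
    return scores


# Hypothetical primary dimension feature vectors of four abnormal enterprises.
samples = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
centers = two_means(samples)
print(operation_scores(samples, centers))
```

Because the score is a continuous ratio rather than a discrete grade, two enterprises in the same cluster still receive different scores.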
Compared with the prior art, the enterprise operation condition evaluation method based on the electric power big data has the following outstanding advantages and beneficial technical effects:
1. The invention extracts secondary dimension features from the original enterprise power data and then extracts primary dimension features from the secondary dimension features; extracting features from the original data hierarchically in this way represents the enterprise operation condition better from multiple aspects and provides effective data for accurately evaluating it.
2. The method combines the isolated forest anomaly detection algorithm with the K-means clustering algorithm and mines the enterprise operation condition information in the electric power big data with as little influence from subjective factors and experience factors as possible, so as to ensure the accuracy of the enterprise operation condition evaluation.
3. The invention can reflect the difference of the operation conditions among enterprises.
Drawings
Fig. 1 is a schematic flow chart of an enterprise operation condition evaluation method based on electric power big data according to the present invention.
FIG. 2 is a graph of the normalized anomaly score obtained by the isolated forest algorithm versus E(h(x)).
Detailed Description
The embodiments of the present invention will be given below with reference to the accompanying drawings, and the technical solutions of the present invention will be further clearly and completely described by the embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the disclosure of the invention without any inventive step, are within the scope of the invention.
Example 1
In this embodiment, based on the big data of the power consumption of multiple enterprises, the enterprise operation condition evaluation method provided by the present invention is used to evaluate and analyze the operation conditions of multiple enterprises, so as to explain the enterprise operation condition evaluation method provided by the present invention.
The enterprise operation condition evaluation method based on electric power big data provided by this embodiment, as shown in FIG. 1, mainly includes the following steps:
S1 data preprocessing
According to a plurality of power-consumption-related data sets of the enterprises, enterprise samples lacking any of the data sets are filtered out, and missing values and zero values of the samples in the data sets are processed.
In the electric power big data, the data sets related to enterprise electricity consumption comprise an enterprise safety basic electricity consumption information data set, an enterprise electricity consumption data set, an enterprise settlement electricity quantity and electricity charge data set, and an enterprise receivable electricity charge data set. Some abnormal data have a large impact on the accuracy of the analysis. Therefore, in the data preprocessing stage, missing data sets and missing values within the data sets are handled separately, as follows:
(1) Missing data set handling
Enterprise samples lacking a certain data set are filtered out directly.
(2) Missing value and zero value handling in the data sets
When a sample value is missing from a data set, conventional linear interpolation can be used to fill it in.
Since the calculations involved in this embodiment use relative quantities, zero-valued samples are replaced in this embodiment by a given very small value of 0.001 to avoid generating infinite or non-numerical (NaN) values.
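The preprocessing described in step S1 can be sketched as follows. This is an illustrative example only: the data layout (a dictionary of monthly series per enterprise, with None marking a missing value) and the data set names are assumptions, and filling gaps at the start or end of a series with the nearest known value is also an assumption, since the text only specifies linear interpolation.

```python
EPSILON = 0.001  # the small replacement value given in this embodiment


def linear_interpolate(series):
    """Fill None gaps in a series by linear interpolation between the
    nearest known neighbours; edge gaps take the nearest known value."""
    known = [(i, v) for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = [(j, w) for j, w in known if j < i]
        right = [(j, w) for j, w in known if j > i]
        if left and right:
            (j0, v0), (j1, v1) = left[-1], right[0]
            filled[i] = v0 + (v1 - v0) * (i - j0) / (j1 - j0)
        else:
            filled[i] = (left[-1] if left else right[0])[1]
    return filled


def replace_zeros(series, eps=EPSILON):
    """Replace zero values so that later ratios stay finite."""
    return [eps if v == 0 else v for v in series]


def preprocess(enterprises, required):
    """Drop enterprises lacking any required data set, then clean the rest."""
    cleaned = {}
    for name, datasets in enterprises.items():
        if any(key not in datasets for key in required):
            continue  # enterprise sample lacks a data set: filter it out
        cleaned[name] = {key: replace_zeros(linear_interpolate(datasets[key]))
                         for key in required}
    return cleaned


# Hypothetical input with two data sets per enterprise.
raw = {
    "A": {"consumption": [10.0, None, 30.0], "receivable": [0, 5.0, 6.0]},
    "B": {"consumption": [1.0, 2.0, 3.0]},  # lacks "receivable"
}
print(preprocess(raw, ["consumption", "receivable"]))
```

Interpolation is applied before zero replacement so that interpolated values are computed from the measured data rather than from the substituted constant.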
S2 hierarchical feature extraction
A plurality of secondary dimension features representing the enterprise electricity utilization information are extracted from the preprocessed data sets, and the primary dimension features representing the degree of abnormality of the enterprise electricity utilization information are then obtained from the secondary dimension features through an isolated forest anomaly detection algorithm.
The secondary dimension features involved in this embodiment and the corresponding feature extraction logic are shown in Table 1.
TABLE 1 Secondary dimension features and corresponding feature extraction logic
Figure BDA0002862910210000081
Note: the electric quantity standard value of the enterprise in the last N (N = 3, 6, 9, 12) months is the actual electric quantity of the enterprise in each month of the last N months.
The association between the primary dimension features and the secondary dimension features is shown in Table 2.
TABLE 2 Association between primary and secondary dimension features
Figure BDA0002862910210000082
Figure BDA0002862910210000091
According to the second-level dimensional features, the corresponding feature extraction logic and the association relationship between the first-level dimensional features and the second-level dimensional features, the step S2 includes the following sub-steps:
S21, secondary dimension feature values are extracted from the preprocessed data set according to the secondary dimension feature calculation logic, and the extracted values are normalized.
First, secondary dimension features are extracted for each enterprise according to the secondary dimension features and corresponding feature extraction logic given in Table 1.
Then, for each secondary dimension feature, the corresponding enterprise sample data are normalized according to the following formula:
$$x_{ij} = \frac{\tilde{x}_{ij} - x_{\min,j}}{x_{\max,j} - x_{\min,j}} \qquad (1)$$
where $\tilde{x}_{ij}$ denotes the j-th secondary dimension feature of the i-th original enterprise sample; $x_{\max,j}$ denotes the maximum of the j-th secondary dimension feature over the original enterprise samples; $x_{\min,j}$ denotes the minimum of the j-th secondary dimension feature over the original enterprise samples; $x_{ij}$ denotes the j-th secondary dimension feature of the i-th enterprise sample after normalization; i = 1, 2, …, n, where n is the number of enterprise samples; and j = 1, 2, …, d, where d is the number of secondary dimension features.
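The normalization step is ordinary min-max scaling applied feature by feature. A minimal Python sketch follows; the function name, the sample matrix, and the choice to map a constant feature to 0.0 (where the denominator would be zero) are assumptions made for illustration and are not taken from the text.

```python
def min_max_normalize(rows):
    """Scale every column of a row-major matrix to the range [0, 1].

    rows[i][j] is the j-th secondary dimension feature of the i-th enterprise.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            # A constant column has zero span; map it to 0.0 (assumption).
            out.append(0.0 if span == 0 else (value - lows[j]) / span)
        normalized.append(out)
    return normalized


# Three hypothetical enterprises, two hypothetical secondary features each.
raw = [[10.0, 0.2], [20.0, 0.8], [40.0, 0.5]]
norm = min_max_normalize(raw)
print(norm)
```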
Table 3 shows some secondary dimensional features of some of the enterprise samples after normalization processing.
TABLE 3 Normalized secondary dimension features extracted for the enterprises
Figure BDA0002862910210000094
Figure BDA0002862910210000101
Note: periodic fluctuation of-95, last 9 months in the periodic fluctuation, 5 months in the fluctuation situation
S22, according to the normalized secondary dimension characteristic value, obtaining a corresponding primary dimension characteristic value through an isolated forest anomaly detection algorithm.
Obtaining the primary dimension feature values through the isolated forest anomaly detection algorithm comprises the following sub-steps:
S221, a training set is constructed from all normalized secondary dimension feature values of every enterprise, and the isolated forest anomaly detection model is then trained on this training set, yielding a model composed of a plurality of isolated trees.
In this embodiment, the training set built from all normalized secondary dimension feature values of the enterprises is $X = (X_1, X_2, \ldots, X_n)$, where n is the number of data items (the number of enterprises); for the i-th enterprise sample, $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, where d is the data dimension (the number of secondary dimension features). The number of isolated trees is 100.
The construction of one isolated tree (isolation tree) comprises the following sub-steps:
S2211, data samples are drawn at random from the training set, across every secondary dimension feature; in total
Figure BDA0002862910210000102
data samples are drawn to form the training subset of the isolated tree;
S2212, a secondary dimension feature is selected at random from the isolated tree training subset, and a value is selected at random within the range of values that the feature takes; the samples are then split in two, with samples smaller than the value placed on the left of the node and samples greater than or equal to the value placed on the right, which gives a splitting condition and a left and a right data set. The process is repeated on the left and right data sets until a termination condition is reached. There are two termination conditions:
1) the data cannot be divided further (the set contains only one sample, or all samples are identical);
2) the height of the isolated tree reaches
Figure BDA0002862910210000112
The construction of the isolated tree ends as soon as either of the two conditions is satisfied.
Steps S2211 to S2212 are repeated until the number of isolated trees reaches the set value, and all the constructed isolated trees together form the isolated forest anomaly detection model.
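The tree construction in S2211 and S2212 can be sketched as below. Several details are assumptions, because the subsample size and the height limit appear only as images in the source: the parameter name subsample_size, its value of 8, and the height limit of ceil(log2(subsample_size)) follow the usual isolation forest convention and are not confirmed by the text. Restricting the random feature choice to features that still vary within the node is a small practical refinement, and the dictionary layout of a tree and the demo data are hypothetical.

```python
import math
import random


def build_tree(samples, height, max_height, rng):
    """Recursively split samples until one of the two termination conditions holds."""
    indivisible = len(samples) <= 1 or all(s == samples[0] for s in samples)
    if indivisible or height >= max_height:
        return {"size": len(samples)}  # leaf node
    # Choose among features that still vary in this node (assumption).
    varying = [j for j in range(len(samples[0]))
               if min(s[j] for s in samples) < max(s[j] for s in samples)]
    j = rng.choice(varying)
    low = min(s[j] for s in samples)
    high = max(s[j] for s in samples)
    split = rng.uniform(low, high)
    left = [s for s in samples if s[j] < split]    # smaller than the value
    right = [s for s in samples if s[j] >= split]  # greater than or equal
    return {"feature": j, "split": split,
            "left": build_tree(left, height + 1, max_height, rng),
            "right": build_tree(right, height + 1, max_height, rng)}


def build_forest(data, n_trees=100, subsample_size=8, seed=0):
    """Build n_trees isolated trees, each on its own random subsample."""
    rng = random.Random(seed)
    psi = min(subsample_size, len(data))
    max_height = math.ceil(math.log2(psi))  # assumed height limit
    return [build_tree(rng.sample(data, psi), 0, max_height, rng)
            for _ in range(n_trees)]


def path_length(tree, x):
    """Number of edges from the root to the leaf that x falls into."""
    edges = 0
    while "feature" in tree:
        side = "left" if x[tree["feature"]] < tree["split"] else "right"
        tree = tree[side]
        edges += 1
    return edges


# Hypothetical normalized data: seven similar samples and one clear outlier.
data = [[0.10, 0.20], [0.15, 0.25], [0.12, 0.22], [0.11, 0.19],
        [0.14, 0.21], [0.13, 0.24], [0.16, 0.20], [0.90, 0.95]]
forest = build_forest(data, n_trees=100, subsample_size=8, seed=0)
outlier_depth = sum(path_length(t, data[-1]) for t in forest) / len(forest)
inlier_depth = sum(path_length(t, data[5]) for t in forest) / len(forest)
print(outlier_depth, inlier_depth)
```

The outlier is separated after fewer splits on average than a typical sample, which is the property the anomaly score relies on.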
S222, traversing each enterprise, and inputting the normalized secondary dimension characteristic value of each enterprise associated with each primary dimension characteristic into the trained isolated forest anomaly detection model to obtain the primary dimension characteristic value of the enterprise.
In this step, before the tree height is normalized, the anomaly score of an enterprise sample is defined as:
$$s(x) = 2^{-E(h(x))} \qquad (2)$$
where x denotes the set of normalized secondary dimension feature parameters corresponding to one primary dimension feature parameter of an enterprise sample; h(x) denotes the height of the enterprise sample x, that is, the number of edges traversed from the root node of a tree to reach the leaf node containing x; and E(h(x)) denotes the average height of x over all isolated trees.
The above anomaly score is normalized for tree height using c(n), the average path length of a binary search tree.
The normalized anomaly score is:
Figure BDA0002862910210000111
This score is taken as the anomaly score corresponding to the primary dimension feature parameter of the enterprise sample.
c(n) is calculated as follows:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \qquad (4)$$
where n denotes the number of enterprises and H(n-1) denotes the harmonic number:
$$H(n-1) = \ln(n-1) + \xi \qquad (5)$$
where $\xi$ denotes the Euler constant, whose value is 0.5772156649.
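Formulas (2), (4) and (5) are simple enough to evaluate directly. The sketch below is illustrative: the function names and the sample path lengths are hypothetical, and returning 0.0 for n <= 1 is an assumption added only because ln(n-1) is undefined there. The height-normalized score of formula (3) is left out because that formula is available only as an image in the source.

```python
import math

EULER_CONSTANT = 0.5772156649  # value given in the description


def harmonic(m):
    """Harmonic number H(m) using the logarithmic form of formula (5)."""
    return math.log(m) + EULER_CONSTANT


def average_path_length(n):
    """c(n) from formula (4): average path length of a binary search tree."""
    if n <= 1:
        return 0.0  # assumption: ln(n - 1) is undefined for n <= 1
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def raw_anomaly_score(path_lengths):
    """s(x) from formula (2), before normalization by c(n)."""
    mean_height = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_height)


c_256 = average_path_length(256)
score_short = raw_anomaly_score([1, 1, 1, 1])  # isolated after one split
score_long = raw_anomaly_score([2, 3, 3, 4])   # isolated after three splits on average
print(c_256, score_short, score_long)
```

A sample that is isolated after fewer splits receives the larger raw score, which is why short average heights indicate abnormality.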
The relationship between s(x) and E(h(x)) is shown in FIG. 2. As the figure shows, the closer the score s(x, n) is to -0.5, the more likely the sample is an outlier; if the scores obtained are all greater than 0, the data can essentially be regarded as normal; and if all scores are around 0, the data contain no obvious outlier samples.
All normalized secondary dimension feature parameters corresponding to the l-th primary dimension feature parameter of the i-th sample are input into the trained isolated forest anomaly detection model, and the anomaly score $s(X'_i, n)_l$ corresponding to the l-th primary dimension feature parameter of the i-th sample is obtained according to formula (3), where $X'_i$ denotes the set of all normalized secondary dimension feature parameters corresponding to the l-th primary dimension feature parameter of the i-th sample.
The anomaly score $s(X'_i, n)_l$ is then further normalized according to the following formula to obtain the normalized anomaly score of the l-th primary dimension feature parameter of the i-th sample, that is, the primary dimension feature value $y_{il}$:
Figure BDA0002862910210000121
where $s(X', n)_{\max,l}$ denotes the maximum anomaly score of the l-th primary dimension feature over all enterprise samples; $s(X', n)_{\min,l}$ denotes the minimum anomaly score of the l-th primary dimension feature over all enterprise samples; $y_{il}$ denotes the l-th primary dimension feature value of the i-th sample; i = 1, 2, …, n, where n is the number of enterprise samples; and l = 1, 2, …, p, where p is the number of primary dimension features.
The primary dimension feature values obtained for some of the enterprises in step S22 are shown in Table 4.
TABLE 4 Primary dimension feature values extracted for the enterprises
| Enterprise | Basic electricity consumption information | Level of electric quantity | Default electricity consumption information | Trend of electric quantity | Fluctuation of electric quantity |
| --- | --- | --- | --- | --- | --- |
| Enterprise 1 | 0 | 24.57 | 0 | 100 | 80.87 |
| Enterprise 2 | 0 | 50.91 | 0 | 19.92 | 13.49 |
| Enterprise 3 | 0 | 24.38 | 0 | 46.95 | 20.22 |
| Enterprise 4 | 0 | 6.72 | 0 | 0 | 8.84 |
| Enterprise 5 | 0 | 34.70 | 0 | 53.30 | 44.48 |
| Enterprise 6 | 0 | 5.95 | 0 | 0.85 | 13.51 |
| Enterprise 7 | 0 | 5.47 | 0 | 82.81 | 68.78 |
| Enterprise 8 | 0 | 14.21 | 0 | 2.90 | 75.49 |
| Enterprise 9 | 0 | 46.73 | 88.92 | 18.27 | 75.57 |
| Enterprise 10 | 0 | 2.89 | 0 | 5.51 | 37.90 |
| Enterprise 11 | 0 | 10.59 | 0 | 4.57 | 15.99 |
| Enterprise 12 | 0 | 19.41 | 0 | 8.20 | 17.44 |
| Enterprise 13 | 0 | 22.30 | 0 | 15.00 | 40.17 |
| Enterprise 14 | 0 | 17.52 | 0 | 5.80 | 12.71 |
| Enterprise 15 | 0 | 1.67 | 0 | 24.19 | 57.76 |
S3, all primary dimension feature values of each enterprise are added to obtain the enterprise's total anomaly score, and whether the enterprise's operation condition is abnormal is then judged against a given criterion; if it is abnormal, the method proceeds to step S4; if not, the enterprise's operation condition is normal.
In this step, the given criterion for judging whether an enterprise's operation condition is abnormal is as follows: an enterprise whose total anomaly score is greater than or equal to 0.6 times the maximum total anomaly score of all enterprises (that is, total anomaly score ≥ maximum total anomaly score of all enterprises × 0.6) is determined to be an enterprise with an abnormal operation condition. The enterprises with abnormal operation conditions can still be divided into two categories, those performing well and those performing poorly, and this division is accomplished by step S4 of the present invention.
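Step S3 can be illustrated with the fifteen enterprises listed in Table 4. The sketch below sums each row and applies the 0.6 threshold. Note the assumption: Table 4 shows only some of the enterprises, so the maximum used here is the maximum over the listed rows, which may differ from the maximum over the full data set. The variable names are hypothetical.

```python
THRESHOLD_FACTOR = 0.6  # factor stated in the description

# Primary dimension feature values copied from Table 4 (partial list).
primary_features = {
    "Enterprise 1": [0, 24.57, 0, 100, 80.87],
    "Enterprise 2": [0, 50.91, 0, 19.92, 13.49],
    "Enterprise 3": [0, 24.38, 0, 46.95, 20.22],
    "Enterprise 4": [0, 6.72, 0, 0, 8.84],
    "Enterprise 5": [0, 34.70, 0, 53.30, 44.48],
    "Enterprise 6": [0, 5.95, 0, 0.85, 13.51],
    "Enterprise 7": [0, 5.47, 0, 82.81, 68.78],
    "Enterprise 8": [0, 14.21, 0, 2.90, 75.49],
    "Enterprise 9": [0, 46.73, 88.92, 18.27, 75.57],
    "Enterprise 10": [0, 2.89, 0, 5.51, 37.90],
    "Enterprise 11": [0, 10.59, 0, 4.57, 15.99],
    "Enterprise 12": [0, 19.41, 0, 8.20, 17.44],
    "Enterprise 13": [0, 22.30, 0, 15.00, 40.17],
    "Enterprise 14": [0, 17.52, 0, 5.80, 12.71],
    "Enterprise 15": [0, 1.67, 0, 24.19, 57.76],
}

# Total anomaly score: sum of all primary dimension feature values.
totals = {name: sum(values) for name, values in primary_features.items()}

# Abnormal if the total is at least 0.6 times the largest total (listed rows only).
cutoff = THRESHOLD_FACTOR * max(totals.values())
abnormal = [name for name, total in totals.items() if total >= cutoff]
print(round(cutoff, 3), abnormal)
```

On these fifteen rows the largest total belongs to Enterprise 9, and Enterprises 1, 7 and 9 meet the criterion.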
S4, all enterprises with abnormal operation conditions are clustered to obtain enterprises with good operation conditions and enterprises with poor operation conditions.
In this step, the enterprises with abnormal operation conditions are clustered by the K-means clustering method, which specifically comprises the following sub-steps:
S41, 2 samples are randomly selected from the enterprises with abnormal operation conditions as the class center of the enterprises with good operation conditions and the class center of the enterprises with poor operation conditions.
S42, taking the Euclidean distance as the distance measure, the distances from each remaining sample of the enterprises with abnormal operation conditions to the two class centers (good operation condition and poor operation condition) are calculated:
$$d(y_i, u_k) = \sqrt{\sum_{l=1}^{p} (y_{il} - u_{kl})^2} \qquad (7)$$
where $y_i$ denotes the primary dimension feature parameters of the i-th enterprise sample; $u_k$ denotes the k-th cluster center, k = 1, 2; $y_i$ and $u_k$ are both p-dimensional vectors, p being the number of primary dimension parameters, with $y_i = \{y_{i1}, y_{i2}, \ldots, y_{ip}\}$ and $u_k = \{u_{k1}, u_{k2}, \ldots, u_{kp}\}$.
S43, each remaining sample of the enterprises with abnormal operation conditions is assigned to whichever of the two class centers (good operation condition or poor operation condition) is nearest to it, which completes the clustering.
S44, according to the clustering result of step S43, the class center of the enterprises with good operation conditions and the class center of the enterprises with poor operation conditions are recalculated according to the following formula:
$$u_k = \frac{1}{|c_k|} \sum_{y_i \in c_k} y_i \qquad (8)$$
where $y_i$ denotes the primary dimension features of the i-th enterprise sample; $u_k$ denotes the k-th cluster center; $c_k$ denotes the cluster of the k-th category, k = 1, 2; and $|c_k|$ denotes the number of enterprise samples in the k-th category.
S45, whether the clustering termination condition is met is judged; if so, the clustering of the enterprises with abnormal operation conditions is complete and the method proceeds to the next step; otherwise, the method returns to step S42.
In this step, the clustering terminates when either of two conditions holds: the class centers of the enterprises with good and with poor operation conditions no longer change, or the set upper limit on the number of iterations is reached.
S46, for each enterprise with an abnormal operation condition, the ratio of its distance from the class center of the enterprises with poor operation conditions to its distance from the class center of the enterprises with good operation conditions is taken as the enterprise operation condition score, which is used to evaluate the operation condition of the abnormal enterprise.
In this step, K-means clustering yields one class center for good operation conditions and one for poor operation conditions among the enterprises (that is, the best and the worst operation conditions). In the positive feature space (features positively correlated with the enterprise's electric power performance), the class center far from the origin represents the enterprises with excellent performance.
To reflect the differences in operation condition among enterprises, this step further calculates, for each enterprise sample in the original feature space, the ratio of its distance from the bad class center to its distance from the good class center, and defines this ratio as the operation condition score, as shown in Table 5. As Table 5 shows, the larger the ratio, the closer the enterprise sample is to the good class center and the farther it is from the bad class center, and the better the enterprise's operation condition; conversely, the smaller the ratio, the closer the enterprise sample is to the bad class center and the farther it is from the good class center, and the worse the enterprise's operation condition.
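Steps S41 to S46 can be sketched as a two-cluster K-means followed by the distance-ratio score. The sketch rests on several assumptions: the function names, the iteration cap of 100, the six two-dimensional sample points and the handling of an empty cluster are hypothetical; every sample, including the two chosen as initial centers, is assigned to its nearest center; and the good class center is taken to be the one farther from the origin, following the description of the positive feature space above. A sample lying exactly on the good class center is given an infinite score, which the text does not address.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_two(samples, max_iter=100, seed=0):
    """Two-cluster K-means following S41 to S45."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(samples, 2)]       # S41
    clusters = [[], []]
    for _ in range(max_iter):                                 # iteration cap (S45)
        clusters = [[], []]
        for y in samples:                                     # S42 and S43
            nearest = 0 if euclidean(y, centers[0]) <= euclidean(y, centers[1]) else 1
            clusters[nearest].append(y)
        new_centers = [                                       # S44: mean of each cluster
            [sum(col) / len(members) for col in zip(*members)] if members else centers[k]
            for k, members in enumerate(clusters)
        ]
        if new_centers == centers:                            # S45: centers unchanged
            break
        centers = new_centers
    return centers, clusters


def operation_scores(samples, bad_center, good_center):
    """S46: distance to the bad center divided by distance to the good center."""
    scores = []
    for y in samples:
        d_good = euclidean(y, good_center)
        d_bad = euclidean(y, bad_center)
        scores.append(math.inf if d_good == 0 else d_bad / d_good)
    return scores


# Hypothetical primary dimension feature vectors of six abnormal enterprises.
points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
          [9.0, 9.0], [10.0, 9.0], [9.0, 10.0]]
centers, clusters = kmeans_two(points, seed=1)

origin = [0.0, 0.0]
good_center = max(centers, key=lambda c: euclidean(c, origin))
bad_center = min(centers, key=lambda c: euclidean(c, origin))
scores = operation_scores(points, bad_center, good_center)
print(scores)
```

Samples in the group near the good class center receive scores above 1 and samples near the bad class center receive scores below 1, matching the interpretation given for Table 5.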
TABLE 5 Enterprise operation condition scores
Figure BDA0002862910210000141

Claims (5)

1. An enterprise operation condition evaluation method based on electric power big data is characterized by comprising the following steps:
s1 data preprocessing
Filtering enterprise samples lacking data sets according to a plurality of data sets related to power utilization of an enterprise, and processing missing values and zero values of the samples in the data sets; the data sets related to the electricity consumption comprise an enterprise safety basic electricity consumption information data set, an enterprise electricity consumption data set, an enterprise settlement electricity quantity and electricity charge data set and an enterprise receivable electricity charge data set;
s2 hierarchical feature extraction
Extracting a plurality of secondary dimensional features for representing the enterprise electricity utilization information from the preprocessed data set, and then obtaining a primary dimensional feature value for representing the abnormal degree of the enterprise electricity utilization information through an isolated forest abnormality detection model according to the secondary dimensional features; the method comprises the following steps:
s21, extracting a secondary dimension characteristic value from the preprocessed data set according to the secondary dimension characteristic calculation logic, and carrying out normalization processing on the extracted secondary dimension characteristic value;
s22, according to the normalized secondary dimension characteristic value, obtaining a corresponding primary dimension characteristic value through an isolated forest anomaly detection model; the method comprises the following steps:
s221, a training set is constructed by utilizing all secondary dimension characteristic values after normalization processing of each enterprise, and then the constructed training set is utilized to train the isolated forest anomaly detection model to obtain an isolated forest anomaly detection model consisting of a plurality of isolated trees;
s222, traversing each enterprise, and inputting the normalized secondary dimension characteristic value associated with each primary dimension characteristic value of the enterprise into a trained isolated forest anomaly detection model to obtain the primary dimension characteristic value of the enterprise; in the step, the abnormal score corresponding to the first-level dimensional characteristic parameter of the enterprise sample is obtained according to the following formula:
Figure 560873DEST_PATH_IMAGE001
in the formula, x represents the set of normalized secondary dimension characteristic parameters corresponding to the primary dimension characteristic parameter of the enterprise sample; h(x) represents the height of the enterprise sample x, that is, the number of edges that need to be traversed from the root node of a tree to reach the leaf node; E(h(x)) represents the average height of x in all isolated trees;
c(n) represents the average path length of a binary search tree and is calculated by the following formula:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
n represents the number of enterprises, and H(n-1) represents the harmonic number:
$$H(n-1) = \ln(n-1) + \xi$$
in the formula, $\xi$ represents the Euler constant;
further normalizing the abnormal score corresponding to the primary dimension characteristic parameter of the enterprise sample to obtain a normalized abnormal score of the enterprise sample, namely a primary dimension characteristic value;
S3, adding all the primary dimension characteristic values of each enterprise to obtain the total abnormal score of the enterprise, then judging whether the operation condition of the enterprise is abnormal according to a given criterion for abnormal operation conditions, and if so, entering step S4; if not, the enterprise operation condition is normal;
S4, clustering all the enterprises with abnormal operation conditions by the K-means clustering method to obtain the enterprises with good operation conditions and the enterprises with poor operation conditions, which specifically comprises the following sub-steps:
s41, randomly selecting 2 samples from the enterprises with abnormal operation conditions as the class center of the enterprise with good operation conditions and the class center of the enterprise with poor operation conditions;
s42, calculating the distance between the rest samples of the enterprise with abnormal operation condition and two enterprise centers with good operation condition and poor operation condition;
s43, returning the rest samples of the enterprises with abnormal operation conditions to the enterprise class center with good operation conditions and the enterprise class center with poor operation conditions which are closest to the samples, and completely clustering;
s44 recalculating the class center of the enterprise with good operation condition and the class center of the enterprise with poor operation condition according to the clustering result of the step S43;
s45, judging whether the clustering termination condition is met, if so, completely clustering the enterprises with abnormal operation conditions, and entering the next step; otherwise, returning to the step S42;
S46, for each enterprise with an abnormal operation condition, taking the ratio of its distance from the class center of the enterprises with poor operation conditions to its distance from the class center of the enterprises with good operation conditions as the enterprise operation condition score, so as to evaluate the operation condition of the abnormal enterprise.
2. The method for evaluating business conditions of enterprises based on big electric power data as claimed in claim 1, wherein the construction process of an isolated tree comprises the following sub-steps:
s2211 randomly extracts from each two-level dimension characteristic in the training set, and extracts together
Figure 788121DEST_PATH_IMAGE007
Constructing an isolated tree training subset by using the data samples;
s2212 randomly selects a secondary dimension characteristic from the isolated tree training subset, randomly selects a value in all value ranges of the characteristic, performs binary division on the sample, divides the sample which is smaller than the value to the left of the node, divides the sample which is larger than or equal to the value to the right of the node, and obtains a splitting condition and data sets of the left side and the right side; repeating the above process on the data sets on the left side and the right side respectively until a termination condition is reached; the termination conditions include the following two items:
1) the data itself is not re-divisible, containing only one sample, or all samples are the same;
2) height of isolated tree reaches log2(
Figure 885390DEST_PATH_IMAGE008
)。
3. The method for evaluating business conditions of enterprises based on big electric power data as claimed in claim 1, wherein in step S3, the given criterion for judging whether the operation condition of an enterprise is abnormal is: an enterprise whose total abnormal score is greater than or equal to 0.6 times the maximum value of the total abnormal scores of all the enterprises is determined to be an enterprise with an abnormal operation condition.
4. The method for evaluating business operations based on big data of electricity according to claim 1, wherein in step S42, the distances between the remaining samples of the enterprises with abnormal operation conditions and the two class centers, namely the class center of the enterprises with good operation conditions and the class center of the enterprises with poor operation conditions, are calculated according to the following formula:
$$d(y_i, u_k) = \sqrt{\sum_{l=1}^{p} (y_{il} - u_{kl})^2}$$
in the formula, $y_i$ represents the primary dimension characteristic parameters of the i-th enterprise sample; $u_k$ represents the k-th cluster center, where k = 1, 2; $y_i$ and $u_k$ are both p-dimensional vectors, and p represents the number of primary dimension parameters.
5. The method for evaluating business operations based on big electric power data according to claim 4, wherein in step S44, the class center of the enterprises with good operation conditions and the class center of the enterprises with poor operation conditions are recalculated according to the following formula:
$$u_k = \frac{1}{|c_k|} \sum_{y_i \in c_k} y_i$$
in the formula, $y_i$ represents the primary dimension characteristics of the i-th enterprise sample; $u_k$ represents the k-th cluster center; $c_k$ represents the cluster of the k-th category, where k = 1, 2; and $|c_k|$ represents the number of enterprise samples in the k-th category.
CN202011571639.7A 2020-12-27 2020-12-27 Enterprise operation condition evaluation method based on electric power big data Active CN112836926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011571639.7A CN112836926B (en) 2020-12-27 2020-12-27 Enterprise operation condition evaluation method based on electric power big data

Publications (2)

Publication Number Publication Date
CN112836926A CN112836926A (en) 2021-05-25
CN112836926B true CN112836926B (en) 2022-03-11

Family

ID=75924914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011571639.7A Active CN112836926B (en) 2020-12-27 2020-12-27 Enterprise operation condition evaluation method based on electric power big data

Country Status (1)

Country Link
CN (1) CN112836926B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018031693A (en) * 2016-08-25 2018-03-01 一般財団法人電力中央研究所 Isolation evaluation method, isolation evaluation device and isolation evaluation program for aerial power transmission line, and method for displaying isolation evaluation data
CN108985632A (en) * 2018-07-16 2018-12-11 国网上海市电力公司 A kind of electricity consumption data abnormality detection model based on isolated forest algorithm
CN110717535A (en) * 2019-09-30 2020-01-21 北京九章云极科技有限公司 Automatic modeling method and system based on data analysis processing system
CN111338897A (en) * 2020-02-24 2020-06-26 京东数字科技控股有限公司 Identification method of abnormal node in application host, monitoring equipment and electronic equipment
CN111612323A (en) * 2020-05-15 2020-09-01 国网河北省电力有限公司电力科学研究院 Electric power credit investigation evaluation method based on big data model
CN111695639A (en) * 2020-06-17 2020-09-22 浙江经贸职业技术学院 Power consumer power consumption abnormity detection method based on machine learning

Also Published As

Publication number Publication date
CN112836926A (en) 2021-05-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant