CN112836926B

CN112836926B - Enterprise operation condition evaluation method based on electric power big data

Info

Publication number: CN112836926B
Application number: CN202011571639.7A
Authority: CN
Inventors: 王茂宁; 邹开欣; 钟羽中; 邓霖
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-12-27
Filing date: 2020-12-27
Publication date: 2022-03-11
Anticipated expiration: 2040-12-27
Also published as: CN112836926A

Abstract

The invention discloses an enterprise operation condition evaluation method based on electric power big data. By extracting features from the original data hierarchically, the invention can better represent the enterprise operation condition from all aspects; the method mines the enterprise operation condition information in the electric power big data with as little influence from subjective factors and experience factors as possible, so as to ensure the accuracy of the enterprise operation condition evaluation.

Description

Enterprise operation condition evaluation method based on electric power big data

Technical Field

The invention belongs to the technical field of electric power big data application, and relates to an enterprise operation condition evaluation technology based on electric power big data.

Background

Big data technology can promote the upgrading and transformation of information technology platforms, supplement the capability of analyzing and utilizing unstructured data, and enhance the capability of mining value from massive data resources. Electric power big data is a new kind of asset for an electric power company and can drive its business management in a more refined and more efficient direction. Electric power big data stores abundant information about electricity-consuming enterprises. By analyzing the electric power big data, the enterprise operation conditions hidden in it can be mined.

With the recent rapid development of data availability, computing power, and new algorithms, machine learning has gradually become one of the key methods for implementing Artificial Intelligence (AI). Machine learning is a subset of artificial intelligence within the wider field of computer science. It uses computers and algorithms to learn and discover "patterns and insights" from "data", because in many cases "patterns and insights" are hidden within "data". Over time, the data accumulated from business processes can become too complex for humans to understand. Algorithms, however, are able to mine "patterns and insights" from data more quickly and accurately than humans.

A power user credit evaluation method and system integrating the payment indexes and industry disclosure indexes of power users was proposed by the Lanzhou Power Supply Company of State Grid Gansu Electric Power Company. The system constructs the payment credit evaluation of a power user from indexes such as registration time, fulfillment rate, deviation share, average power factor, payment proportion, average payment days, arrearage percentage and prestoring percentage, and establishes the industry credit of the power user from indexes such as industry contribution rate and industry public credit investigation. After the indexes are normalized, they are classified with the K-means clustering method to obtain the credit evaluation grade of the enterprise. The main problems of this method and system are as follows: (1) power data mostly reflect the production and operation conditions of enterprises, and the enterprise credit reflected in public credit investigation data is limited, so the credibility of the enterprise credit evaluated from these data is not high; (2) the number of cluster centers has to be set manually, which introduces subjective factors and experience factors, and since the credit level of an enterprise is not known in advance, the resulting credit grade is not reliable; (3) the credit grade is a discrete variable, so expressing credit through grades cannot reflect the differences in credit between enterprises of the same grade.

Therefore, most of the existing methods for analyzing the power enterprises and evaluating the credit of the power enterprises based on the utilization of the large power data are difficult to objectively and effectively reflect the production and operation conditions of the enterprises, and further difficult to provide effective data support for enterprise development.

Disclosure of Invention

In view of the current state of the art, in which enterprise user evaluation based on electric power big data has poor reliability and can hardly reflect the degree of difference between the operation conditions of enterprises, the invention aims to provide an enterprise operation condition evaluation method based on electric power big data.

The invention provides an enterprise operation condition evaluation method based on electric power big data, which mainly comprises the following steps:

S1 data preprocessing

According to a plurality of data sets of an enterprise related to power utilization, enterprise samples lacking the data sets are filtered, and meanwhile missing values and zero values of the samples in the data sets are processed.

S2 hierarchical feature extraction

Extracting a plurality of secondary dimensional features for representing the enterprise electricity utilization information from the preprocessed data set, and then obtaining primary dimensional features for representing the abnormal degree of the enterprise electricity utilization information through an isolated forest abnormality detection algorithm according to the secondary dimensional features; the method comprises the following steps:

S21, extracting secondary dimension characteristic values from the preprocessed data set according to the secondary dimension characteristic calculation logic, and normalizing the extracted secondary dimension characteristic values;

S22, obtaining the corresponding primary dimension characteristic values through the isolated forest anomaly detection algorithm according to the normalized secondary dimension characteristic values;

S3, adding all the primary dimension characteristic values of each enterprise to obtain the total anomaly score of the enterprise, then judging whether the operation condition of the enterprise is abnormal according to a given standard; if so, proceeding to step S4; if not, the enterprise operation condition is normal;

S4, clustering all enterprises with abnormal operation conditions to obtain enterprises with good operation conditions and enterprises with poor operation conditions.

In the above method for evaluating the enterprise operation condition based on electric power big data, in step S1, the data sets related to power consumption include an enterprise safety basic electricity consumption information data set, an enterprise electricity consumption data set, an enterprise settlement electricity quantity and electricity charge data set, and an enterprise receivable electricity charge data set. Some abnormal data have a large impact on the accuracy of the analysis. Therefore, in the data preprocessing stage, missing data sets and missing values within a data set are handled separately, specifically as follows:

(1) Data set missing handling

Enterprise samples lacking any one of the data sets are filtered out directly. When a data set is absent for an enterprise sample, the evaluation of that enterprise's operation condition is directly affected; therefore, to improve the accuracy of the evaluation, such enterprise samples are filtered out.

(2) Missing value and zero value processing in data set

When a value is missing from a sample in a data set, linear interpolation can be used for completion.

Since the logic operations involved in the invention use relative quantities, zero-valued samples are replaced with a very small given value to avoid generating infinite or non-numerical values.
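The preprocessing described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the invention: the data set keys in `REQUIRED_DATASETS` and all function names are hypothetical, the replacement value 0.001 is the one used in the embodiment, and copying the nearest known value at the ends of a series is an assumption, since the text does not say how leading or trailing gaps are handled.

```python
from typing import Dict, List, Optional

# Hypothetical keys for the four electricity-related data sets.
REQUIRED_DATASETS = ("basic_info", "consumption", "settlement", "receivable")
EPSILON = 0.001  # small replacement value used in the embodiment


def has_all_datasets(enterprise: Dict[str, object]) -> bool:
    """Return True only if no required data set is absent for the enterprise."""
    return all(enterprise.get(name) is not None for name in REQUIRED_DATASETS)


def interpolate_missing(series: List[Optional[float]]) -> List[float]:
    """Fill None entries by linear interpolation between known neighbours.

    Assumption: gaps at either end copy the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled


def replace_zeros(series: List[float], eps: float = EPSILON) -> List[float]:
    """Replace zero values so later ratios do not produce inf or NaN."""
    return [eps if v == 0 else v for v in series]
```

Filtering is applied first, so that interpolation and zero replacement only run on enterprises that have every data set.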

In the above method for evaluating the enterprise operation condition based on electric power big data, in step S2, in order to better represent the enterprise operation condition from various aspects, the invention extracts the original data features as hierarchical features divided into two levels: secondary dimension features, which represent the enterprise electricity utilization information, and primary dimension features, which represent the degree of abnormality of the enterprise electricity utilization information. The secondary dimension features include, but are not limited to, safety power utilization grade classification, power utilization duration, power consumption analysis in a first given time period of the enterprise, monthly average electric quantity industry level, enterprise electricity fluctuation condition, power consumption difference degree, periodic fluctuation, average total electric quantity growth trend in a second given time period of the enterprise, current accumulated late arrearage, and number of late arrearage payments in a third given time period. The primary dimension features include, but are not limited to, basic electricity consumption information, electric quantity level, electric quantity fluctuation, electric quantity trend, and default electricity consumption information.

In step S21, the secondary dimension feature calculation logic is as follows:

(1) for the safety power utilization grade classification, according to the safety power utilization grade of the enterprise;

(2) for the power utilization duration, calculated in years, with a duration of less than 1 year counted as 1 year;

(3) for the power consumption analysis in the first given time period of the enterprise, according to the average electric quantity of the enterprise in the first given time period;

(4) for the monthly average electric quantity industry level, according to the ratio of the average electric quantity of the enterprise in the first given time period to the average electric quantity of the industry in the first given time period;

(5) for the enterprise electricity fluctuation condition, according to the electric quantity standard value of the enterprise in the first given time period and the average electric quantity of the enterprise in the first given time period;

(6) for the power consumption difference degree, according to the ratio of (the maximum electric quantity of the enterprise in the first given time period minus the minimum electric quantity of the enterprise in the first given time period) to the average monthly electric quantity of the enterprise in the first given time period;

(7) for the periodic fluctuation, three aspects are involved: (i) the ratio of the electric quantity standard value of the enterprise in the last 3 months to the electric quantity standard value of the industry in the last 3 months; (ii) the ratio of the electric quantity standard value of the enterprise in the last 6 months to the electric quantity standard value of the industry in the last 6 months; (iii) the ratio of the electric quantity standard value of the enterprise in the last 9 months to the electric quantity standard value of the industry in the last 9 months;

(8) for the average total electric quantity growth trend of the enterprise in the second given time period, according to the sum of the monthly growth rates of total electricity consumption in the second given time period divided by the length of the second given time period;

(9) for the current accumulated late arrearage, according to the counted accumulated amount of arrears;

(10) for the number of late arrearage payments in the third given time period, according to the counted number of late payments within the third given time period.
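Three of the calculation rules above, items (3), (4) and (6), can be sketched as below. The function names and the representation of usage as a list of monthly electric quantities are assumptions made for illustration; zero values are assumed to have been replaced during preprocessing so that the divisions are safe.

```python
from statistics import mean
from typing import List


def average_usage(monthly_usage: List[float]) -> float:
    """Item (3): average electric quantity over the first given time period."""
    return mean(monthly_usage)


def industry_level(monthly_usage: List[float],
                   industry_monthly_usage: List[float]) -> float:
    """Item (4): enterprise average divided by the industry average."""
    return mean(monthly_usage) / mean(industry_monthly_usage)


def difference_degree(monthly_usage: List[float]) -> float:
    """Item (6): (maximum - minimum) divided by the monthly average."""
    return (max(monthly_usage) - min(monthly_usage)) / mean(monthly_usage)


usage = [100.0, 200.0, 300.0]
industry = [400.0, 400.0, 400.0]
print(average_usage(usage))             # 200.0
print(industry_level(usage, industry))  # 0.5
print(difference_degree(usage))         # 1.0
```

Because every rule is a ratio or an average, the features are relative quantities, which is why zero values have to be replaced beforehand.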

In step S22, the primary dimension features are obtained from the associated secondary dimension features by an isolated forest anomaly detection algorithm. The basic electricity utilization information is associated with safety electricity utilization grade classification and electricity utilization duration; the electricity level is associated with an analysis of electricity usage and a monthly average electricity industry level for a first given period of time for the enterprise; the electric quantity fluctuation is associated with the electric quantity fluctuation condition, the electric quantity difference degree and the periodic fluctuation of an enterprise; the power trend is associated with an average total power increase trend of the enterprise in a second given time period; the default electricity utilization information is associated with the current accumulated late arrearage and the number of late payment times in a third given time period.

In step S22, a corresponding first-order dimension characteristic value is obtained through an isolated forest anomaly detection algorithm according to the following sub-steps:

s221, a training set is constructed by utilizing all secondary dimension characteristic values after normalization processing of each enterprise, and then the constructed training set is utilized to train the isolated forest anomaly detection model to obtain an isolated forest anomaly detection model consisting of a plurality of isolated trees (isolation trees);

s222, traversing each enterprise, and inputting the normalized secondary dimension characteristic value of each enterprise associated with each primary dimension characteristic into the trained isolated forest anomaly detection model to obtain the primary dimension characteristic value of the enterprise.

In the step S221, the process of constructing an isolated tree includes the following sub-steps:

S2211, randomly drawing a given number of data samples from the training set, over all secondary dimension features, to construct an isolated tree training subset;

S2212, randomly selecting one secondary dimension feature from the isolated tree training subset and randomly selecting a value within the value range of that feature; performing a binary split of the samples, placing samples smaller than the value in the left branch of the node and samples greater than or equal to the value in the right branch, thereby obtaining a splitting condition and the left and right data sets; repeating the above process on the left and right data sets respectively until a termination condition is reached; the termination conditions include the following two items:

1) the data itself cannot be divided further (it contains only one sample, or all samples are the same);

2) the height of the isolated tree reaches the set height limit.

Steps S2211 to S2212 are repeated until the number of isolated trees reaches a set value, and all the constructed isolated trees form the isolated forest anomaly detection model.
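A minimal sketch of the tree construction in steps S2211 and S2212 is given below. The subsample size and the height limit are not legible in the text, so the sketch takes the subsample size as a parameter and assumes the height limit ceil(log2(subsample size)) that is common in isolation forest formulations. Restricting the random feature choice to features that still vary is also a design assumption, and the names `Node`, `build_tree`, `build_forest` and `tree_height` are hypothetical.

```python
import math
import random


class Node:
    """A node of an isolated tree; leaves keep only the sample count."""

    def __init__(self, size, feature=None, threshold=None, left=None, right=None):
        self.size = size
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right


def build_tree(samples, height, height_limit, rng):
    # Termination: height limit reached, or the data cannot be divided further.
    if (height >= height_limit or len(samples) <= 1
            or all(s == samples[0] for s in samples)):
        return Node(size=len(samples))
    # Only features that still vary can split the data (assumption).
    splittable = [j for j in range(len(samples[0]))
                  if min(s[j] for s in samples) < max(s[j] for s in samples)]
    feature = rng.choice(splittable)
    low = min(s[feature] for s in samples)
    high = max(s[feature] for s in samples)
    threshold = rng.uniform(low, high)
    left = [s for s in samples if s[feature] < threshold]
    right = [s for s in samples if s[feature] >= threshold]
    return Node(len(samples), feature, threshold,
                build_tree(left, height + 1, height_limit, rng),
                build_tree(right, height + 1, height_limit, rng))


def build_forest(data, n_trees, subsample_size, seed=0):
    """Repeat subsampling and tree construction until n_trees trees exist."""
    rng = random.Random(seed)
    size = min(subsample_size, len(data))
    # Assumed height limit: ceil(log2(subsample size)).
    height_limit = math.ceil(math.log2(size)) if size > 1 else 0
    return [build_tree(rng.sample(data, size), 0, height_limit, rng)
            for _ in range(n_trees)]


def tree_height(node):
    """Number of edges on the longest root-to-leaf path."""
    if node.left is None:
        return 0
    return 1 + max(tree_height(node.left), tree_height(node.right))
```

With a subsample of 8 samples the assumed height limit is 3, so no tree built by `build_forest(data, n_trees=5, subsample_size=8)` is deeper than three edges.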

In step S222, the anomaly score corresponding to a primary dimension characteristic parameter of an enterprise sample is obtained according to the following formula:

s(x,n)＝2^(-E(h(x))/c(n))；

In the formula, x represents the set of normalized secondary dimension characteristic parameters corresponding to a primary dimension characteristic parameter of an enterprise sample; h(x) represents the height of the enterprise sample x, i.e., the number of edges traversed from the root node of a tree to reach the leaf node containing x; and E(h(x)) represents the average height of x over all isolated trees. The lower the height, the higher the anomaly score. c(n) represents the average path length of a binary search tree and is calculated as follows:

c(n)＝2H(n-1)-(2(n-1)/n)；

where n represents the number of enterprises and H(n-1) represents the harmonic number, approximated as:

H(n-1)＝ln(n-1)+ξ；

where ξ represents the Euler constant, whose value is 0.5772156649.

The anomaly score corresponding to the primary dimension characteristic parameter of the enterprise sample is then further normalized to obtain the normalized anomaly score of the enterprise sample, namely the primary dimension characteristic value.
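The quantities c(n) and H(n-1) defined above can be evaluated directly. The sketch below assumes that the anomaly score takes the height-normalized form s(x,n) = 2^(-E(h(x))/c(n)), which is what normalizing 2^(-E(h(x))) by c(n) gives; the function names are hypothetical.

```python
import math

EULER_CONSTANT = 0.5772156649  # value of the Euler constant given in the text


def harmonic(m: int) -> float:
    """Approximate harmonic number H(m) = ln(m) + Euler constant."""
    return math.log(m) + EULER_CONSTANT


def average_path_length(n: int) -> float:
    """c(n) = 2*H(n-1) - 2*(n-1)/n, defined here for n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_height: float, n: int) -> float:
    """Assumed normalized score: 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-mean_height / average_path_length(n))


n = 256
print(round(average_path_length(n), 4))           # 10.2448
print(anomaly_score(average_path_length(n), n))   # 0.5
print(anomaly_score(2.0, n) > anomaly_score(8.0, n))  # True
```

Under this form a sample whose average height equals c(n) scores exactly 0.5, and shorter average heights give higher scores, in line with the statement that a lower height means a higher anomaly score.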

In the above method for evaluating the enterprise operation condition based on electric power big data, in step S3, the criterion given in the invention for judging whether the enterprise operation condition is abnormal is: an enterprise whose total anomaly score is greater than or equal to 0.6 times the maximum total anomaly score of all enterprises is determined to be an enterprise with an abnormal operation condition. These enterprises with abnormal operation conditions can further be divided into two categories: enterprises with good operation conditions and enterprises with poor operation conditions. This is achieved by step S4 of the invention.

In the above method for evaluating the business operation condition based on the big power data, in step S4, a K-means clustering method is used to cluster the business with abnormal operation condition, and the method specifically includes the following steps:

S41, randomly selecting 2 samples from the enterprises with abnormal operation conditions as the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions;

S42, calculating the distance between each remaining sample of the enterprises with abnormal operation conditions and the two class centers (good operation condition and poor operation condition);

The distance between each remaining sample of the enterprises with abnormal operation conditions and the two class centers can be calculated according to the following formula:

d(y_i,u_k)＝sqrt((y_i1-u_k1)^2+(y_i2-u_k2)^2+…+(y_ip-u_kp)^2)；

In the formula, y_i represents the primary dimension characteristic parameters of the ith enterprise sample; u_k represents the kth cluster center, where k＝1,2; y_i and u_k are both p-dimensional vectors, where p represents the number of primary dimension parameters, y_i＝{y_i1,y_i2,…,y_ip}, u_k＝{u_k1,u_k2,…,u_kp}.

S43, assigning each remaining sample of the enterprises with abnormal operation conditions to the nearer of the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions, completing one round of clustering;

S44, recalculating the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions according to the clustering result of step S43;

The two class centers are recalculated according to the following formula:

u_k＝(1/|c_k|)·Σ_{y_i∈c_k} y_i；

In the formula, y_i represents the primary dimension features of the ith enterprise sample; u_k represents the kth cluster center; c_k represents the cluster of the kth category, where k＝1,2; |c_k| represents the number of enterprise samples in the kth category;

S45, judging whether the clustering termination condition is met; if so, the final clustering of the enterprises with abnormal operation conditions is complete and the method proceeds to the next step; otherwise, returning to step S42;

In this step, the clustering termination condition is that the two class centers no longer change, or that the set upper limit on the number of iterations is reached; satisfying either one is sufficient;

S46, taking the ratio of the distance between an enterprise sample and the class center of enterprises with poor operation conditions to the distance between that sample and the class center of enterprises with good operation conditions as the enterprise operation condition score, and using it to evaluate the operation condition of each abnormal enterprise;

In this step, K-means clustering yields the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions among the clustered enterprises (i.e., representatives of the best and the worst operation conditions). The class center that lies far from the origin in the positive feature space (the features positively correlated with the electric power performance of the enterprise) represents enterprises with excellent performance. The ratio of the distance from each enterprise sample to the bad class center to its distance to the good class center in the original feature space is then used as the enterprise operation condition score: the larger the ratio, the closer the enterprise sample is to the good class center and the farther it is from the bad class center, and the better the enterprise operation condition; conversely, the smaller the ratio, the closer the enterprise sample is to the bad class center and the farther it is from the good class center, and the worse the enterprise operation condition. The ratio can therefore reflect the differences in operation conditions between enterprises.
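Steps S41 to S46 can be sketched in plain Python as follows. The sketch assumes Euclidean distance and assumes that all features are positively oriented, so that the class center farther from the origin is taken as the good one; the function names and the iteration cap of 100 are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(p[j] for p in points) / len(points)
            for j in range(len(points[0]))]


def two_means(samples, max_iter=100, seed=0):
    """K-means with k = 2: returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(samples, 2)]            # S41
    labels = [0] * len(samples)
    for _ in range(max_iter):                                       # S45: cap
        labels = [min((0, 1), key=lambda k: euclidean(s, centers[k]))
                  for s in samples]                                 # S42, S43
        new_centers = []
        for k in (0, 1):                                            # S44
            members = [s for s, lab in zip(samples, labels) if lab == k]
            new_centers.append(mean_vector(members) if members else centers[k])
        if new_centers == centers:                                  # S45: stable
            break
        centers = new_centers
    return centers, labels


def operation_scores(samples, centers):
    """S46: distance to the bad center divided by distance to the good center."""
    origin = [0.0] * len(centers[0])
    # Assumption: the center farther from the origin is the good one.
    good, bad = sorted(centers, key=lambda c: euclidean(c, origin), reverse=True)
    scores = []
    for s in samples:
        d_good = euclidean(s, good)
        scores.append(euclidean(s, bad) / d_good if d_good > 0 else math.inf)
    return scores
```

A score above 1 means the sample lies closer to the good class center than to the bad one, and because the score is continuous it distinguishes enterprises that fall in the same cluster.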

Compared with the prior art, the enterprise operation condition evaluation method based on the electric power big data has the following outstanding advantages and beneficial technical effects:

1. According to the invention, secondary dimension features are extracted from the original electric power data of the enterprise, and primary dimension features are extracted from the secondary dimension features; extracting the original data hierarchically in this way represents the enterprise operation condition better from all aspects and provides effective data for accurately evaluating the enterprise operation condition.

2. The method combines the isolated forest anomaly detection algorithm with the K-means clustering algorithm, and mines the enterprise operation condition information in the electric power big data with as little influence from subjective factors and experience factors as possible, so as to ensure the accuracy of the enterprise operation condition evaluation.

3. The invention can reflect the difference of the operation conditions among enterprises.

Drawings

Fig. 1 is a schematic flow chart of an enterprise operation condition evaluation method based on electric power big data according to the present invention.

Fig. 2 is a graph of the normalized anomaly score obtained by the isolated forest algorithm versus E(h(x)).

Detailed Description

The embodiments of the present invention will be given below with reference to the accompanying drawings, and the technical solutions of the present invention will be further clearly and completely described by the embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the disclosure of the invention without any inventive step, are within the scope of the invention.

Example 1

In this embodiment, based on the big data of the power consumption of multiple enterprises, the enterprise operation condition evaluation method provided by the present invention is used to evaluate and analyze the operation conditions of multiple enterprises, so as to explain the enterprise operation condition evaluation method provided by the present invention.

The enterprise operation condition evaluation method based on the large electric power data provided by the embodiment, as shown in fig. 1, mainly includes the following steps:

S1 data preprocessing

In the large electric power data, a plurality of data sets related to enterprise electricity consumption comprise an enterprise safety basic electricity consumption information data set, an enterprise electricity consumption data set, an enterprise settlement electricity quantity and electricity charge data set and an enterprise receivable electricity charge data set. Some outlier data has a large impact on the accuracy of the analysis. Therefore, in the data preprocessing stage, the abnormal data set and the missing values in the data set are respectively processed, which specifically includes the following steps:

(1) data set miss handling

Enterprise samples lacking any one of the data sets are filtered out directly.

(2) Missing value and zero value processing in data set

When a value is missing from a sample in a data set, it can be completed using conventional linear interpolation.

Since the logic operations involved in this embodiment use relative quantities, zero-valued samples are replaced in this embodiment with a very small given value, 0.001, to avoid generating infinite or non-numerical values.

S2 hierarchical feature extraction

And extracting a plurality of secondary dimensional features for representing the enterprise electricity utilization information from the preprocessed data set, and then obtaining the primary dimensional features for representing the abnormal degree of the enterprise electricity utilization information through an isolated forest abnormality detection algorithm according to the secondary dimensional features.

The two-level dimensional features and the corresponding feature extraction logic related to the present embodiment are shown in table 1.

TABLE 1 two-level dimensional features and corresponding feature extraction logic

Note: the electric quantity standard value of the enterprise in the last N (N is 3,6,9,12) months, namely the actual electric quantity of the enterprise in each month in the last N months;

the association relationship between the primary dimension features and the secondary dimension features is shown in table 2.

TABLE 2 Association of first-level and second-level dimensional features

According to the second-level dimensional features, the corresponding feature extraction logic and the association relationship between the first-level dimensional features and the second-level dimensional features, the step S2 includes the following sub-steps:

s21, extracting a secondary dimension characteristic value from the preprocessed data set according to the secondary dimension characteristic calculation logic, and normalizing the extracted secondary dimension characteristic value.

Firstly, according to the secondary dimension characteristics and corresponding characteristic extraction logics given in the table 1, secondary dimension characteristic extraction is carried out on each enterprise.

Then, for each secondary dimension feature, the corresponding enterprise sample data are normalized according to the following formula:

In the formula, the raw (unnormalized) value is the jth secondary dimension feature of the ith original enterprise sample; x_max,j represents the maximum value of the jth secondary dimension feature over the original enterprise samples; x_min,j represents the minimum value of the jth secondary dimension feature over the original enterprise samples; x_ij represents the normalized jth secondary dimension feature of the ith enterprise sample; i＝1,2,…,n, where n represents the number of enterprise samples; j＝1,2,…,d, where d represents the number of secondary dimension features.
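The normalization formula itself is not legible in the text. The definitions of x_max,j and x_min,j suggest a per-feature min-max normalization, and the sketch below assumes that form. The `scale` parameter, the handling of constant columns and the function name are hypothetical; a scale of 100 would give the 0 to 100 range seen in the primary dimension values of Table 4.

```python
from typing import List


def min_max_normalize(matrix: List[List[float]],
                      scale: float = 1.0) -> List[List[float]]:
    """Normalize each column j to scale * (x - min_j) / (max_j - min_j).

    Rows are enterprise samples, columns are secondary dimension features.
    Assumption: a constant column maps to 0.0 instead of dividing by zero.
    """
    n_features = len(matrix[0])
    mins = [min(row[j] for row in matrix) for j in range(n_features)]
    maxs = [max(row[j] for row in matrix) for j in range(n_features)]
    normalized = []
    for row in matrix:
        normalized.append([
            scale * (row[j] - mins[j]) / (maxs[j] - mins[j])
            if maxs[j] > mins[j] else 0.0
            for j in range(n_features)
        ])
    return normalized


raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(min_max_normalize(raw))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Normalizing column by column puts features with different units, such as years of use and electric quantities, on a common scale before they are passed to the anomaly detection model.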

Table 3 shows some secondary dimensional features of some of the enterprise samples after normalization processing.

TABLE 3 two-level dimensional feature normalization results extracted by enterprises

Note: periodic fluctuation of-95, last 9 months in the periodic fluctuation, 5 months in the fluctuation situation

S22, according to the normalized secondary dimension characteristic value, obtaining a corresponding primary dimension characteristic value through an isolated forest anomaly detection algorithm.

The method further comprises the following steps of obtaining a corresponding first-level dimensional characteristic value through an isolated forest anomaly detection algorithm:

s221, a training set is constructed by utilizing all secondary dimension characteristic values after normalization processing of each enterprise, and then the constructed training set is utilized to train the isolated forest anomaly detection model to obtain the isolated forest anomaly detection model consisting of a plurality of isolated trees.

In this embodiment, the training set constructed from all the normalized secondary dimension characteristic values of each enterprise is X＝(X_1,X_2,…,X_n), where the number of data items is n (the number of enterprises); for the ith enterprise sample, X_i＝(x_i1,x_i2,…,x_id), where d is the data dimension (i.e., the number of secondary dimension features); the number of isolated trees is 100.

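The construction process of an isolated tree (isolation tree) comprises the following sub-steps:

S2211, randomly drawing a given number of data samples from the training set to construct an isolated tree training subset;

S2212, randomly selecting one secondary dimension feature from the isolated tree training subset and randomly selecting a value within the value range of that feature; performing a binary split of the samples, placing samples smaller than the value in the left branch of the node and samples greater than or equal to the value in the right branch; repeating the above process on the left and right data sets respectively until a termination condition is reached; the termination conditions include the following two items:

1) the data itself cannot be divided further (it contains only one sample, or all samples are the same);

2) the height of the isolated tree reaches the set height limit.

As long as either of the above two conditions is satisfied, the training of the isolated tree ends.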

In this step, the abnormality score of the business sample is defined as follows, without considering the normalization of the tree height:

s(x)＝2^-E(h(x)) (2)；

in the formula, x represents a normalized secondary dimension characteristic parameter set corresponding to a primary dimension characteristic parameter of an enterprise sample, h (x) represents the height of the enterprise sample x, which means that the leaf nodes can be reached only after several edges are needed from a root node of a tree, and E (h (x)) represents the average height of x in all isolated trees.

The above anomaly score is normalized for tree height using c(n) (i.e., the average path length of a binary search tree).

The normalized anomaly score is:

s(x,n)＝2^(-E(h(x))/c(n)) (3)；

This score is used as the anomaly score corresponding to the primary dimension characteristic parameter of the enterprise sample.

c(n) is calculated as follows:

c(n)＝2H(n-1)-(2(n-1)/n) (4)；

n represents the number of businesses, H (n-1) represents the harmonic number:

H(n-1)＝ln(n-1)+ξ (5)；

where ξ represents the euler constant, its value is 0.5772156649.

The relationship between s (x) and E (h (x)) is shown in FIG. 2. As can be seen from the figure, the closer the s (x, n) score is to-0.5, the higher the probability that it is an outlier; if the obtained data are all larger than 0, the data can be basically determined as normal data; if all scores are around 0, then the data contains no significant outlier samples.

Inputting all normalized secondary dimension characteristic parameters corresponding to the l-th primary dimension characteristic parameter of the ith sample into a trained isolated forest anomaly detection model, and obtaining an anomaly score s (X 'corresponding to the l-th primary dimension characteristic parameter of the ith sample according to a formula (3)'_i,n)_l，X′_iAnd representing a set formed by all normalized secondary dimension characteristic parameters corresponding to the ith primary dimension characteristic parameter of the ith sample.

Then, the abnormal score s (X ') is obtained according to the following formula'_i,n)_lFurther normalization processing is carried out to obtain the normalization abnormal score of the ith first-level dimension characteristic parameter of the ith sample, namely the first-level dimension characteristic value y_il：

In the formula, s(X′, n)_max,l represents the maximum anomaly score of the l-th primary dimension characteristic over all enterprise samples; s(X′, n)_min,l represents the minimum anomaly score of the l-th primary dimension characteristic over all enterprise samples; y_il represents the characteristic value of the l-th primary dimension characteristic parameter of the i-th sample; i = 1, 2, …, n, where n represents the number of enterprise samples; l = 1, 2, …, p, where p represents the number of primary dimension characteristics.
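A min-max normalization over the maximum and minimum scores defined above might look like the following sketch. The normalization formula itself is not reproduced in this text, so two points are assumptions: the 0 to 100 scale (suggested by the values in Table 4) and the direction, in which the lowest, most anomalous score maps to the largest characteristic value. The function name `normalize_scores` is illustrative.

```python
def normalize_scores(scores: list[float], scale: float = 100.0) -> list[float]:
    """Min-max normalize the anomaly scores of one primary dimension feature.

    ASSUMPTION: the lowest (most anomalous) score maps to `scale` and the
    highest score maps to 0; the patent's exact formula is not shown here.
    """
    s_max, s_min = max(scores), min(scores)
    if s_max == s_min:
        # No spread between enterprises, so nothing stands out (sketch's choice).
        return [0.0 for _ in scores]
    return [scale * (s_max - s) / (s_max - s_min) for s in scores]


# Hypothetical anomaly scores of three enterprises for one primary feature.
y = normalize_scores([-0.4, 0.1, 0.35])
```

Applying the function once per primary dimension characteristic yields one column of values per characteristic.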

The primary dimension characteristic values obtained in step S22 for some of the enterprises are shown in Table 4.

TABLE 4 Extracted primary dimension characteristic values of enterprises

Enterprise	Basic electricity consumption information	Level of electric quantity	Default electricity consumption information	Trend of electric quantity	Fluctuation of electric quantity
Enterprise 1	0	24.57	0	100	80.87
Enterprise 2	0	50.91	0	19.92	13.49
Enterprise 3	0	24.38	0	46.95	20.22
Enterprise 4	0	6.72	0	0	8.84
Enterprise 5	0	34.70	0	53.30	44.48
Enterprise 6	0	5.95	0	0.85	13.51
Enterprise 7	0	5.47	0	82.81	68.78
Enterprise 8	0	14.21	0	2.90	75.49
Enterprise 9	0	46.73	88.92	18.27	75.57
Enterprise 10	0	2.89	0	5.51	37.90
Enterprise 11	0	10.59	0	4.57	15.99
Enterprise 12	0	19.41	0	8.20	17.44
Enterprise 13	0	22.30	0	15.00	40.17
Enterprise 14	0	17.52	0	5.80	12.71
Enterprise 15	0	1.67	0	24.19	57.76
…	…	…	…	…	…

S3, adding all the primary dimension characteristic values of each enterprise to obtain the total abnormal score of the enterprise, then judging whether the operation condition of the enterprise is abnormal or not according to a given standard, and if so, entering the step S4; if not, the enterprise operation condition is normal.

In this step, the given criterion for judging whether the operation condition of an enterprise is abnormal is as follows: an enterprise whose total abnormal score is greater than or equal to 0.6 times the maximum total abnormal score of all enterprises is determined to be an enterprise with an abnormal operation condition. Enterprises with abnormal operation conditions can be further divided into two categories, enterprises with good operation conditions and enterprises with poor operation conditions; this division is accomplished by step S4 of the present invention.
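The summation and threshold of step S3 can be sketched as follows. The function name `flag_abnormal` is illustrative. The four rows are taken from Table 4 (Enterprises 1, 2, 7 and 9); the maximum is computed over these four rows only, which is a simplification, since the method takes the maximum over all enterprises.

```python
def flag_abnormal(features: list[list[float]], ratio: float = 0.6) -> list[int]:
    """Return the indices of enterprises whose total abnormal score is at
    least `ratio` times the maximum total abnormal score."""
    totals = [sum(row) for row in features]  # total abnormal score per enterprise
    threshold = ratio * max(totals)
    return [i for i, total in enumerate(totals) if total >= threshold]


rows = [
    [0, 24.57, 0, 100, 80.87],        # Enterprise 1
    [0, 50.91, 0, 19.92, 13.49],      # Enterprise 2
    [0, 5.47, 0, 82.81, 68.78],       # Enterprise 7
    [0, 46.73, 88.92, 18.27, 75.57],  # Enterprise 9
]
abnormal = flag_abnormal(rows)
```

With these four rows the totals are about 205.44, 84.32, 157.06 and 229.49, so the threshold is about 137.69 and Enterprises 1, 7 and 9 are flagged.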

In step S4, a K-means clustering method is adopted to cluster the enterprises with abnormal operation conditions, which specifically comprises the following sub-steps:

S41, randomly selecting 2 samples from the enterprises with abnormal operation conditions as the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions, respectively.

S42, calculating, with the Euclidean distance as the distance measure, the distance between each remaining sample of the enterprises with abnormal operation conditions and each of the two class centers (good operation condition and poor operation condition):

S43, assigning each remaining sample of the enterprises with abnormal operation conditions to whichever of the two class centers is closest to it, thereby completing one round of clustering.

S44, recalculating the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions from the clustering result of step S43 according to the following formula:

In the formula, y_i represents the primary dimension characteristics of the i-th enterprise sample; u_k represents the k-th cluster center; c_k represents the cluster of the k-th category, where k = 1, 2; |c_k| represents the number of enterprise samples in the k-th category.

S45, judging whether the clustering termination condition is met; if so, the clustering of the enterprises with abnormal operation conditions is complete and the method proceeds to the next step; otherwise, the method returns to step S42.

In this step, the clustering termination condition is that the two class centers no longer change, or that the set upper limit on the number of iterations is reached; satisfying either one is sufficient.
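Sub-steps S41 to S45 can be sketched as a two-cluster K-means loop. This is a minimal illustration, not the patented implementation: the names `euclidean` and `two_means`, the `seed` parameter, and the handling of an empty cluster are assumptions of the sketch. For simplicity, every sample (including the two chosen as initial centers) is assigned in each round.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_means(samples, max_iter=100, seed=0):
    """Cluster samples into two groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # S41: pick two samples at random as the initial class centers.
    centers = [list(c) for c in rng.sample(samples, 2)]
    labels = []
    for _ in range(max_iter):  # upper limit on the number of iterations
        # S42 and S43: assign each sample to its nearest class center.
        labels = [min((0, 1), key=lambda k: euclidean(y, centers[k]))
                  for y in samples]
        # S44: recompute each center as the mean of its assigned samples.
        new_centers = []
        for k in (0, 1):
            members = [y for y, lab in zip(samples, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                # Assumption: keep the old center if a cluster becomes empty.
                new_centers.append(centers[k])
        # S45: stop when the class centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


centers, labels = two_means([[0, 0], [0, 1], [10, 10], [10, 11]])
```

The sketch does not decide which of the two returned centers is the "good" one; that labeling has to follow from the feature space, as discussed for step S46.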

S46, for each enterprise with an abnormal operation condition, the ratio of its distance to the class center of enterprises with poor operation conditions to its distance to the class center of enterprises with good operation conditions is used as the enterprise operation condition score to evaluate the operation condition of the enterprise.

In this step, the class centers of good and poor operation conditions among the enterprises (i.e., the best and the worst enterprise operation conditions) are obtained through K-means clustering. The class center that is farther from the origin in the positive feature space (features positively correlated with the electric power performance of the enterprise) represents enterprises with excellent performance.

In order to reflect the differences in operation condition among enterprises, this step further calculates, for each enterprise sample in the original feature space, the ratio of its distance to the bad class center to its distance to the good class center, and defines this ratio as the operation condition score, as shown in Table 5. As can be seen from Table 5, the larger the ratio, the closer the enterprise sample is to the good class center and the farther it is from the bad class center, and the better the operation condition of the enterprise; conversely, the smaller the ratio, the closer the enterprise sample is to the bad class center and the farther it is from the good class center, and the worse the operation condition of the enterprise.
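The score defined above can be sketched as a ratio of two Euclidean distances. The function name `operation_score` is illustrative, and returning infinity when a sample coincides with the good class center is an assumption of this sketch; the text does not say how that case is handled.

```python
import math


def operation_score(sample, good_center, poor_center):
    """Distance to the poor class center divided by distance to the good one."""
    d_poor = math.dist(sample, poor_center)
    d_good = math.dist(sample, good_center)
    if d_good == 0.0:
        # Assumption: a sample lying exactly on the good center gets the
        # highest possible score instead of raising a division error.
        return math.inf
    return d_poor / d_good


# Hypothetical two-dimensional example: the sample is 1 unit from the good
# center and 3 units from the poor center.
score = operation_score([1, 0], good_center=[0, 0], poor_center=[4, 0])
```

A score above 1 means the sample is closer to the good class center than to the poor one, and a score below 1 means the opposite.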

TABLE 5 Enterprise operation condition scores

Claims

1. An enterprise operation condition evaluation method based on electric power big data is characterized by comprising the following steps:

S1, data preprocessing

Filtering enterprise samples lacking data sets according to a plurality of data sets related to power utilization of an enterprise, and processing missing values and zero values of the samples in the data sets; the data sets related to the electricity consumption comprise an enterprise safety basic electricity consumption information data set, an enterprise electricity consumption data set, an enterprise settlement electricity quantity and electricity charge data set and an enterprise receivable electricity charge data set;

S2, hierarchical feature extraction

Extracting a plurality of secondary dimensional features for representing the enterprise electricity utilization information from the preprocessed data set, and then obtaining a primary dimensional feature value for representing the abnormal degree of the enterprise electricity utilization information through an isolated forest abnormality detection model according to the secondary dimensional features; the method comprises the following steps:

S22, according to the normalized secondary dimension characteristic values, obtaining the corresponding primary dimension characteristic values through the isolated forest anomaly detection model; the method comprises the following steps:

S221, constructing a training set from all normalized secondary dimension characteristic values of each enterprise, and then training the isolated forest anomaly detection model with the constructed training set to obtain an isolated forest anomaly detection model consisting of a plurality of isolated trees;

S222, traversing each enterprise, and inputting the normalized secondary dimension characteristic values associated with each primary dimension characteristic value of the enterprise into the trained isolated forest anomaly detection model to obtain the primary dimension characteristic values of the enterprise; in this step, the abnormal score corresponding to the primary dimension characteristic parameter of the enterprise sample is obtained according to the following formula:

in the formula, x represents the normalized secondary dimension characteristic parameter set corresponding to the primary dimension characteristic parameter of the enterprise sample; h(x) represents the height of the enterprise sample x, i.e., the number of edges that must be traversed from the root node of a tree to reach the leaf node; E(h(x)) represents the average height of x in all isolated trees;

c(n) represents the average path length of the binary search tree, and is calculated as follows:

c(n) = 2H(n-1) - (2(n-1)/n);

n represents the number of enterprises, and H(n-1) represents the harmonic number:

H(n-1) = ln(n-1) + ξ;

in the formula, ξ represents the Euler constant;

further normalizing the abnormal score corresponding to the primary dimension characteristic parameter of the enterprise sample to obtain a normalized abnormal score of the enterprise sample, namely a primary dimension characteristic value;

S3, adding all the primary dimension characteristic values of each enterprise to obtain the total abnormal score of the enterprise, then judging whether the operation condition of the enterprise is abnormal according to a given criterion for whether the operation condition of an enterprise is abnormal, and if so, entering step S4; if not, the enterprise operation condition is normal;

S4, for all enterprises with abnormal operation conditions, obtaining the enterprises with good operation conditions and the enterprises with poor operation conditions by the K-means clustering method; the method comprises the following steps:

in step S4, clustering of the enterprises with abnormal operation conditions is implemented by the K-means clustering method, which specifically includes the following sub-steps:

S43, assigning each remaining sample of the enterprises with abnormal operation conditions to whichever of the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions is closest to the sample, thereby completing the clustering;

S45, judging whether the clustering termination condition is met; if so, the clustering of the enterprises with abnormal operation conditions is complete and the method proceeds to the next step; otherwise, returning to step S42;

2. The enterprise operation condition evaluation method based on electric power big data according to claim 1, wherein the construction process of an isolated tree comprises the following sub-steps:

Constructing an isolated tree training subset by using the data samples;

1) the data itself is not re-divisible, containing only one sample, or all samples are the same;

2) height of isolated tree reaches log₂(

)。

3. The enterprise operation condition evaluation method based on electric power big data according to claim 1, wherein in step S3, the given criterion for judging whether the operation condition of an enterprise is abnormal is: an enterprise whose total abnormal score is greater than 0.6 times the maximum total abnormal score of all enterprises is determined to be an enterprise with an abnormal operation condition.

4. The enterprise operation condition evaluation method based on electric power big data according to claim 1, wherein in step S42, the distances between the remaining samples of the enterprises with abnormal operation conditions and the two class centers, namely the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions, are calculated according to the following formula:

in the formula, y_i represents the primary dimension characteristic parameters of the i-th enterprise sample; u_k represents the k-th cluster center, where k = 1, 2; y_i and u_k are both p-dimensional vectors, and p represents the number of primary dimension parameters.

5. The enterprise operation condition evaluation method based on electric power big data according to claim 4, wherein in step S44, the class center of enterprises with good operation conditions and the class center of enterprises with poor operation conditions are recalculated according to the following formula:

in the formula, y_i represents the primary dimension characteristics of the i-th enterprise sample; u_k represents the k-th cluster center; c_k represents the cluster of the k-th category, where k = 1, 2; |c_k| represents the number of enterprise samples in the k-th category.