CN113159137A

CN113159137A - Gas load clustering method and device

Info

Publication number: CN113159137A
Application number: CN202110354433.7A
Authority: CN
Inventors: 黄冬虹; 刘丹; 王亮; 董妍; 赵兴昊
Original assignee: Beijing Gas Group Co Ltd
Current assignee: Beijing Gas Group Co Ltd
Priority date: 2021-04-01
Filing date: 2021-04-01
Publication date: 2021-07-23

Abstract

The invention provides a gas load clustering method and device. The method comprises the following steps: clustering the gas load data; and taking the cluster center of the minority class as a base point, and interpolating on the line connecting the base point and each data point of the class, or on the extension of that line. By improving the SMOTE algorithm so that the interpolation base point is the cluster center rather than a minority-class data point, the invention addresses the tendency of the SMOTE algorithm to produce distribution marginalization and improves the clustering effect. The clustering method provided by the invention is suitable not only for gas load data but also for other unbalanced data sets.

Description

Gas load clustering method and device

Technical Field

The invention relates to the technical field of gas load data clustering classification, in particular to a gas load clustering method and device.

Background

Load clustering research for gas boiler systems aims to divide gas loads scientifically and effectively; user load clustering uses cluster analysis to uncover the relationships among, and the composition of, loads of different types and different regions. When planning and designing a gas boiler system, whether estimating the economy of a project or determining the construction scale of the system, the gas boiler load is important basic data. Load cluster analysis combines data mining technology with the application of a gas boiler system: it analyzes gas load characteristics, mines the load patterns hidden in a large amount of unordered and irregular load data, classifies them, and uses the resulting typical load curves to solve problems in the gas boiler system, such as gas load prediction and demand-side response analysis. Different types of users, such as residential, commercial, industrial and agricultural users, differ widely in their gas consumption patterns, and even users of the same type may have different patterns. Mining the gas consumption patterns of different users through load data classification can support a gas company in carrying out market strategies such as ordered heating, peak-load shifting and time-of-use gas consumption, and in providing more personalized heating services; it also improves the understanding of different users' gas consumption patterns and enables more efficient demand-side management. In addition, users can adjust their consumption strategies more economically according to the problems revealed by load classification, thereby reducing cost and improving energy utilization efficiency.

The actual gas load data is an unbalanced data set. In an unbalanced data set, the amounts of data in the two classes differ greatly: the minority class has far fewer samples than the majority class, and minority-class samples are harder to identify than majority-class samples. Unbalanced gas load data can be clustered after oversampling is used to increase the number of minority-class samples. The SMOTE (Synthetic Minority Oversampling Technique) algorithm is a commonly used oversampling method that largely avoids the excessive randomness of random oversampling. However, the algorithm is prone to distribution marginalization, which blurs the class boundary and degrades the clustering effect.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a gas load clustering method and a gas load clustering device, which are used for interpolating clustered data points based on an improved SMOTE algorithm and improving the clustering effect of unbalanced gas load data.

In order to achieve the above object, the present invention adopts the following technical solutions.

In a first aspect, the present invention provides a gas load clustering method, including:

clustering the gas load data;

and taking the cluster center of the minority class as a base point, and interpolating on the line connecting the base point and each data point of the class, or on the extension of that line.

Further, the method further comprises the step of performing dimensionality reduction on the gas load data by adopting a Principal Component Analysis (PCA) before clustering.

Further, the method clusters the gas load data by using an FCM algorithm.

Further, before interpolation, the method also comprises the steps of identifying and eliminating dangerous points:

determining minority-class boundary points: for each minority-class data point, finding its K nearest neighbors, wherein if the K neighbors include a majority-class data point, the minority-class data point is a minority-class boundary point;

counting the minority-class boundary points, and if their number is greater than 1, calculating, for each minority-class boundary point, the Euclidean distance d1 between the boundary point and the clustering center, and the Euclidean distance d2 between each majority-class data point among its K neighbors and the clustering center;

if d1 of a minority-class boundary point is larger than the minimum value of d2, and the K neighbors of that boundary point are all majority-class data points, the minority-class boundary point is a dangerous point;

all the dangerous points are deleted and clustering is carried out again.
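The identification steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_danger_points` and its parameters are hypothetical, the K neighbors are searched among all data points, and the "clustering center" is assumed to be the center of the minority class, which the text does not state explicitly.

```python
import numpy as np

def find_danger_points(X, is_minority, center, k=5):
    """Return indices of minority points flagged as dangerous points."""
    X = np.asarray(X, dtype=float)
    is_minority = np.asarray(is_minority, dtype=bool)
    center = np.asarray(center, dtype=float)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Distance of every point to the (assumed minority-class) center.
    to_center = np.linalg.norm(X - center, axis=1)

    # Boundary points: minority points with at least one majority neighbor.
    boundary = [i for i in np.flatnonzero(is_minority)
                if np.any(~is_minority[neighbors[i]])]
    if len(boundary) <= 1:
        return []  # the check only applies when there is more than one boundary point

    danger = []
    for i in boundary:
        nb = neighbors[i]
        all_majority = np.all(~is_minority[nb])
        d1 = to_center[i]                               # boundary point to center
        d2_min = to_center[nb][~is_minority[nb]].min()  # closest majority neighbor to center
        if all_majority and d1 > d2_min:
            danger.append(int(i))
    return danger
```

The returned indices would then be removed from the data set before clustering is run again.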

Further, the interpolation method specifically includes:

calculating the ratio of the number of the majority class data points to the number of the minority class data points and rounding to obtain an interpolation multiplying power n;

calculating the maximum value D of Euclidean distances from the minority class center u to all data points of the class;

calculating the Euclidean distance d between u and each data point x of the class, and rounding D/d to obtain H;

calculating n-1 interpolation points x_new for each data point x as follows:

x_new＝u+rand(0,H)×(x-u)

In the formula, rand(0, H) is a random number between 0 and H; the formula is evaluated n-1 times to obtain n-1 interpolation points.
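As an illustration of the interpolation steps above, the following is a minimal NumPy sketch. The function name `center_based_oversample` and its arguments are hypothetical; skipping a point that coincides with the center and using Python's built-in rounding are assumptions, since the text does not specify either.

```python
import numpy as np

def center_based_oversample(minority, center, n_majority, rng=None):
    """Generate synthetic minority points along rays from the cluster center."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    center = np.asarray(center, dtype=float)

    # Interpolation rate n: rounded ratio of majority to minority counts.
    n = int(round(n_majority / len(minority)))

    # d: distance from the center u to each minority point x; D: the largest d.
    d = np.linalg.norm(minority - center, axis=1)
    D = d.max()

    synthetic = []
    for x, dist in zip(minority, d):
        if dist == 0.0:
            continue  # assumption: a point on the center defines no direction
        # Note: round() sends exact halves to the nearest even integer;
        # the source does not say which rounding rule is intended.
        H = int(round(D / dist))
        for _ in range(max(n - 1, 0)):
            # x_new = u + rand(0, H) * (x - u)
            synthetic.append(center + rng.uniform(0.0, H) * (x - center))
    return np.array(synthetic).reshape(-1, minority.shape[1])
```

For example, with minority points at distances 1, 2 and 4 from the center, H is 4, 2 and 1 respectively, so points close to the center can be pushed outward further than points already at the edge.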

In a second aspect, the present invention provides a gas load clustering device, including:

the clustering module is used for clustering the gas load data;

and the interpolation module is used for taking the cluster center of the minority class as a base point and interpolating on the line connecting the base point and each data point of the class, or on the extension of that line.

Further, the device further comprises a dimensionality reduction module which is used for carrying out dimensionality reduction on the gas load data by adopting a Principal Component Analysis (PCA).

Further, the clustering module clusters the gas load data by using an FCM algorithm.

Further, the device also comprises a dangerous point eliminating module which is used for identifying and eliminating the dangerous points according to the following method before interpolation is carried out:

all the dangerous points are deleted and clustering is carried out again.

Further, the interpolation module performs interpolation according to the following method:

calculating the ratio of the number of the majority type data points to the number of the minority type data points, and obtaining an interpolation multiplying power n after rounding;

x_new＝u+rand(0,H)×(x-u)

Compared with the prior art, the invention has the following beneficial effects.

The method and the device cluster an unbalanced data set by clustering the gas load data, taking the cluster center of the minority class as a base point, and interpolating on the line connecting the base point and each data point of the class, or on the extension of that line. By improving the SMOTE algorithm so that the interpolation base point is the cluster center rather than a minority-class data point, the invention addresses the tendency of the SMOTE algorithm to produce distribution marginalization and improves the clustering effect. In addition, the clustering method provided by the invention is suitable not only for gas load data but also for other unbalanced data sets.

Drawings

Fig. 1 is a flowchart of a gas load clustering method according to an embodiment of the present invention.

Fig. 2 is a diagram of an unbalanced sample data set, in which the circles are minority-class data points and the five-pointed stars are majority-class data points.

Fig. 3 is a schematic diagram of the clustered data points after oversampling, in which the hollow circles are minority-class data points, the solid circles are interpolation points, and the asterisk is the clustering center.

Fig. 4 is a block diagram of a gas load clustering device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a gas load clustering method according to an embodiment of the present invention, including the following steps:

step 101, clustering gas load data;

and step 102, taking the cluster center of the minority class as a base point, and interpolating on the line connecting the base point and each data point of the class, or on the extension of that line.

In this embodiment, step 101 is mainly used for clustering the gas load data. Clustering is a machine learning technique that groups data points: given a set of data points, a clustering algorithm assigns each data point to a particular group. In theory, data points in the same group should have similar attributes and/or characteristics, while data points in different groups should have different attributes and/or characteristics. Clustering is an unsupervised learning method and a statistical data analysis technique commonly used in many fields. There are many clustering algorithms, which fall mainly into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods. Examples include the k-means clustering algorithm among partitioning methods, agglomerative hierarchical clustering among hierarchical methods, and neural-network clustering among model-based methods. The clustering algorithm is not specifically limited in this embodiment.

In this embodiment, step 102 is mainly used to interpolate on the minority-class data points. As mentioned above, the gas load data is an unbalanced data set: the amounts of data in the two classes differ greatly, and the minority class has far less data than the majority class. As shown in Fig. 2, the circles are minority-class data points and the five-pointed stars are majority-class data points, and the number of circles is significantly smaller than the number of stars. Most machine learning algorithms for classification are designed for balanced data sets, so a model performs poorly on unbalanced data, especially for the class with fewer samples; more importantly, minority-class samples are harder to identify than majority-class samples. For this reason, this embodiment mainly processes the minority-class data, specifically by oversampling (interpolating) the minority-class data points, so that the unbalanced data set becomes closer to balanced.

In the prior art, interpolation is mostly performed with the SMOTE algorithm, whose principle is as follows. The primary purpose of the SMOTE algorithm is to turn an unbalanced data set into a balanced one by increasing the amount of minority-class data. Given an unbalanced data set, for each gas load data point x in the minority class, its K nearest minority-class neighbors are found. Assuming that the up-sampling rate of the data set is n (approximately equal to the ratio of the amount of majority-class data to the amount of minority-class data), n data points y_1, y_2, ..., y_n are randomly drawn from the K nearest neighbors. The n interpolation points are then calculated according to the following formula:

p_i＝x+rand(0,1)×(y_i-x)

In the formula, rand(0,1) represents a random number in the interval (0,1), and i = 1, 2, ..., n.
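For comparison with the improved method, the conventional SMOTE step for a single minority point can be sketched as below. The function name `smote_points` is hypothetical, and drawing the n neighbors with replacement is an assumption made so that n may exceed K; the text only says that n neighbors are drawn at random.

```python
import numpy as np

def smote_points(minority, x_index, k, n, rng=None):
    """Generate n SMOTE points for the minority sample at x_index."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    x = minority[x_index]

    # K nearest minority-class neighbors of x (excluding x itself).
    dist = np.linalg.norm(minority - x, axis=1)
    dist[x_index] = np.inf
    knn = np.argsort(dist)[:k]

    # Draw n neighbors y_1..y_n (with replacement: an assumption).
    chosen = rng.choice(knn, size=n, replace=True)

    # p_i = x + rand(0, 1) * (y_i - x)
    gaps = rng.uniform(0.0, 1.0, size=(n, 1))
    return x + gaps * (minority[chosen] - x)
```

Because every synthetic point lies between x and one of its neighbors, a point x at the edge of the class produces new points that also sit near the edge.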

Through the interpolation operation, the numbers of majority-class and minority-class data points can be balanced, which improves the accuracy of classifying the unbalanced data set. However, because the algorithm interpolates from minority-class data points, the distribution of the minority-class data determines the neighbors available to each point. If a minority-class data point lies at the edge of the class distribution, the points interpolated between it and its neighbors also lie near the edge, and repeated interpolation pushes the distribution further toward the margin, blurring the boundary between the minority class and the majority class. Although the data set becomes more balanced, the blurred boundary makes clustering harder and degrades the clustering effect. For this reason, this embodiment improves the SMOTE algorithm: interpolation is carried out between a base point and each data point, with the cluster center of the minority class as the base point, so that the interpolation points generated for minority-class data points located on the boundary stay close to the boundary instead of drifting further outward, which addresses the boundary blurring problem.

Fig. 3 is a schematic diagram of an interpolation result obtained with the improved interpolation method, in which the hollow circles are minority-class data points, the solid circles are interpolation points, and the asterisk is the clustering center. As can be seen, almost all of the interpolation points fall within the boundary, and no boundary blurring is produced.

As an optional embodiment, the method further comprises performing dimensionality reduction on the gas load data by using Principal Component Analysis (PCA) before clustering.

In this embodiment, the gas load data is subjected to dimensionality reduction before clustering, so as to reduce the imbalance degree of the data set and the calculation amount of clustering operation by reducing the data dimensionality and the number of data, thereby improving the clustering speed and the clustering effect. There are many dimension reduction algorithms, and the commonly used ones are PCA, Sammon mapping, and feature index dimension reduction.

PCA is often used in machine learning as a data preprocessing step. Reducing the dimensionality of data with PCA is motivated mainly by two factors: first, a high-dimensional feature space contains much redundant information, and the features are correlated with one another; second, high-dimensional data is computationally expensive. The goal of PCA is to simplify the high-dimensional data while preserving as much of the information in the original data set as possible, accepting a small loss. PCA does so by selecting the few components that contribute most to the variance, that is, the information content, of the data samples. The retained principal components have the following characteristics: the number of retained principal components is smaller than the dimensionality of the original data set; each principal component is a linear combination of the original variables, and the principal components are uncorrelated with one another; as much information as possible of the original data samples is retained with as little loss as possible.
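The properties listed above can be illustrated with a short NumPy sketch that computes PCA through a singular value decomposition. The function name `pca_reduce` is hypothetical, and this is only one common way to compute PCA, not necessarily the one used in the embodiment.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its first n_components principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)

    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)

    # Share of the total variance carried by each component.
    explained = (s ** 2) / np.sum(s ** 2)

    # Each score column is a linear combination of the original variables.
    scores = centered @ vt[:n_components].T
    return scores, explained[:n_components]
```

The explained-variance ratios give a simple way to judge how many components to keep: components are added until the retained share of variance is considered sufficient.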

Sammon mapping is a distance-preserving technique with one specific purpose: reducing the dimensionality of a finite set of points. It is sometimes regarded as a variant of PCA. The goal of the Sammon algorithm is to minimize an error function that measures how much the distances between points change under the mapping. The technique is computationally simple and also yields results for non-linear data sets, provided that the data set is not too complex.
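The text does not write out the error function. The sketch below evaluates the standard Sammon stress, E = (1 / Σ d*_ij) Σ (d*_ij - d_ij)² / d*_ij, where d*_ij and d_ij are the pairwise distances before and after the mapping. The function name `sammon_stress` is hypothetical, and the code assumes no two original points coincide.

```python
import numpy as np

def sammon_stress(X_high, X_low):
    """Sammon stress between original points and their low-dimensional images."""
    X_high = np.asarray(X_high, dtype=float)
    X_low = np.asarray(X_low, dtype=float)

    # Use each unordered pair (i < j) once.
    iu = np.triu_indices(len(X_high), k=1)
    d_high = np.linalg.norm(X_high[:, None, :] - X_high[None, :, :], axis=2)[iu]
    d_low = np.linalg.norm(X_low[:, None, :] - X_low[None, :, :], axis=2)[iu]

    # Errors on small original distances are weighted more heavily.
    return float(np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high))
```

A stress of 0 means all pairwise distances are preserved exactly; a Sammon mapping searches for low-dimensional coordinates that make this value as small as possible.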

Characteristic-index dimension reduction is another way of processing high-dimensional load data. Common load characteristic indexes include the average load, the peak-valley difference and the load rate. A load curve is influenced by time, temperature, season, living habits, regional production modes and so on, which makes it difficult to study. For this reason, the following 6 load characteristic indexes are selected: the load rate, the maximum-load utilization hour rate, the daily peak-valley difference rate, the peak-period load rate, the flat-period load rate and the valley-period load rate. Together these indexes describe a user's load pattern more comprehensively, across different periods of the day and different load rates.
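The text names the indexes without defining them. As a hedged illustration, the sketch below computes a few of them from one day of hourly loads using commonly used definitions (load rate = average load / peak load; peak-valley difference rate = (peak - valley) / peak). These definitions and the function name `basic_load_indexes` are assumptions, and the period-based indexes are omitted because the peak, flat and valley periods are not specified.

```python
import numpy as np

def basic_load_indexes(hourly_load):
    """Compute simple daily load characteristic indexes from hourly loads."""
    load = np.asarray(hourly_load, dtype=float)
    peak, valley, mean = load.max(), load.min(), load.mean()
    return {
        "average_load": float(mean),
        "peak_valley_difference": float(peak - valley),
        "load_rate": float(mean / peak),                           # assumed definition
        "peak_valley_difference_rate": float((peak - valley) / peak),  # assumed definition
    }
```

Replacing each daily load curve with a handful of such indexes is what reduces the dimensionality in this approach.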

Experiments show that, for data dimension reduction, the PCA algorithm runs efficiently and reduces dimensionality well, and the clustering result obtained with it is basically consistent with the clustering result obtained without dimension reduction; the disadvantage of the Sammon algorithm is its long running time; and the clustering result obtained with characteristic-index dimension reduction deviates considerably. Therefore, the PCA algorithm is adopted in this embodiment to reduce the dimensionality of the gas load data.

As an alternative embodiment, the method uses FCM algorithm to cluster the gas load data.

This embodiment provides a specific clustering algorithm, namely the fuzzy C-means (FCM) clustering algorithm. FCM is one of the most widely used objective-function-based clustering algorithms. It is an improvement on traditional hard clustering, in which the membership degree of a sample in a class is either 0 or 1; FCM instead allows membership degrees between 0 and 1, formulates a constrained nonlinear programming problem using a mean-square approximation, and solves the clustering problem by optimizing an objective function. The FCM algorithm is an unsupervised learning method. First, c objects in the sample are randomly selected as initial clustering centers, and the fuzzy partition matrix is initialized with values in [0, 1]; next, the fuzzy partition matrix and the class centers are calculated, and the objective function value is computed; these steps are repeated until the objective function value reaches its minimum or the number of iterations exceeds the maximum, and each sample is then assigned to a class according to the values in the fuzzy partition matrix. FCM measures the distance between a sample and a clustering center by the Euclidean distance; the smaller the distance, the higher the similarity, and the more likely the objects are to be assigned to the same class. FCM is mature prior art, and a detailed algorithm flow is not given here.
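Since the detailed flow is omitted, the following is a minimal sketch of the textbook FCM iteration for orientation only. The function name `fcm`, the fuzzifier m = 2, the tolerance and the stopping rule based on the change in the objective value are assumptions, not details taken from the embodiment.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-6, rng=None):
    """Textbook fuzzy C-means; returns centers, memberships and hard labels."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)

    # Initial centers: c randomly selected samples.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    prev_obj = np.inf
    for _ in range(max_iter):
        # Euclidean distance of every sample to every center (clamped away from 0).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)

        # Membership update: u_ij proportional to d_ij^(-2/(m-1)), rows sum to 1.
        inv = dist ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)

        # Center update: mean of the samples weighted by u_ij^m.
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]

        # Objective evaluated with the distances used for this membership update.
        obj = float(np.sum(Um * dist ** 2))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    # Each sample is assigned to the class with the largest membership.
    return centers, U, U.argmax(axis=1)
```

The returned centers are the quantities that the interpolation step would use as base points once the minority class has been identified.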

As an alternative embodiment, before the interpolation, the method further comprises the steps of identifying and eliminating the dangerous points:

all the dangerous points are deleted and clustering is carried out again.

This embodiment provides a technical scheme for eliminating dangerous points. "Dangerous point" is a figurative name; such a point may also be called an "interference point", because its presence can strongly disturb the clustering algorithm. Therefore, to improve the clustering effect, the dangerous points are identified and deleted, and clustering is then performed again. A dangerous point in this embodiment satisfies the following conditions: it is a minority-class boundary point; its K neighbors are all majority-class data points; and its distance d1 to the clustering center is larger than the distance d2 to the clustering center of at least one majority-class data point among its K neighbors. In other words, a dangerous point is almost completely surrounded by majority-class data points, so its presence disturbs the clustering algorithm, greatly increases the difficulty of clustering and degrades the clustering effect. Based on these features, dangerous points are identified as follows: first the minority-class boundary points are determined, and then each boundary point is checked against the latter two conditions. The number of minority-class boundary points is also constrained: it must be greater than 1, because if there were only one boundary point and it were judged to be a dangerous point, no boundary point would remain after deletion (or the overall distribution of the minority class would be greatly affected). Boundary points are determined as follows: the K neighbors of each minority-class data point are found, and if the K neighbors include a majority-class data point, that minority-class data point is a minority-class boundary point.

As an alternative embodiment, the interpolation method specifically includes:

x_new＝u+rand(0,H)×(x-u)

This embodiment provides a technical solution of an improved SMOTE algorithm. The main improvement is that interpolation takes the clustering center as the base point, instead of taking a minority-class data point as the base point as in the conventional algorithm. The interpolation points are calculated with the above formula; compared with the interpolation formula of the SMOTE algorithm, the base point is changed from x to u, and the upper limit 1 is changed to H. H is the value obtained by rounding the ratio D/d, where D is the maximum Euclidean distance between the clustering center u and all minority-class data points, and d is the Euclidean distance between u and the minority-class data point x. Because H is greater than or equal to 1, the range of the interpolation points is widened, which avoids the overfitting caused by a narrow interpolation range. This processing may cause some interpolation points generated from boundary points to fall "out of bounds", but because such points are few, they have little effect on clustering. The interpolation formula can be further improved: when x is a boundary point, the upper limit is kept at 1, and H is used only when x is a non-boundary point. This makes "out of bounds" points less likely.
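The further refinement mentioned at the end of this paragraph can be sketched as a small helper that chooses the upper limit of the random factor for one minority-class point. The function name `interpolation_range` and the handling of a point that coincides with the center are assumptions made for illustration.

```python
import numpy as np

def interpolation_range(x, center, max_dist, is_boundary):
    """Upper limit of the random factor in x_new = u + rand(0, limit) * (x - u)."""
    if is_boundary:
        return 1  # boundary points: stay on the segment between u and x
    d = float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(center, dtype=float)))
    if d == 0.0:
        return 1  # assumption: x coincides with u, so the limit has no effect
    # Non-boundary points: H = round(D / d), where D is the largest distance to u.
    return int(round(max_dist / d))
```

With D = 4, a non-boundary point at distance 1 from the center gets a limit of 4, whereas the same point flagged as a boundary point gets a limit of 1, so its interpolation points cannot extend beyond it.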

Fig. 4 is a schematic composition diagram of a gas load clustering device according to an embodiment of the present invention, where the device includes:

the clustering module 11 is used for clustering the gas load data;

and the interpolation module 22 is configured to take the cluster center of the minority class as a base point, and to interpolate on the line connecting the base point and each data point of the class, or on the extension of that line.

The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.

As an optional embodiment, the device further comprises a dimension reduction module for performing dimension reduction processing on the gas load data by using Principal Component Analysis (PCA).

As an alternative embodiment, the clustering module 11 clusters the gas load data by using an FCM algorithm.

As an optional embodiment, the apparatus further includes a risk point elimination module, configured to identify and eliminate a risk point according to the following method before performing interpolation:

all the dangerous points are deleted and clustering is carried out again.

As an alternative embodiment, the interpolation module 22 performs interpolation according to the following method:

x_new＝u+rand(0,H)×(x-u)

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A gas load clustering method is characterized by comprising the following steps:

clustering the gas load data;

2. The gas load clustering method of claim 1, further comprising performing dimensionality reduction on the gas load data using Principal Component Analysis (PCA) prior to clustering.

3. The gas load clustering method of claim 1, wherein the method clusters gas load data using an FCM algorithm.

4. The gas load clustering method according to claim 1, further comprising the steps of identifying and eliminating dangerous points before interpolation:

all the dangerous points are deleted and clustering is carried out again.

5. The gas load clustering method according to claim 4, wherein the interpolation method specifically comprises:

x_new＝u+rand(0,H)×(x-u)

6. A gas load clustering device, comprising:

the clustering module is used for clustering the gas load data;

7. The gas load clustering device of claim 6, wherein the device further comprises a dimensionality reduction module for performing dimensionality reduction on the gas load data by using Principal Component Analysis (PCA).

8. The gas load clustering device of claim 6, wherein the clustering module clusters the gas load data using an FCM algorithm.

9. The gas load clustering device according to claim 6, wherein the device further comprises a dangerous point eliminating module for identifying and eliminating the dangerous points before interpolation according to the following method:

all the dangerous points are deleted and clustering is carried out again.

10. The gas load clustering device of claim 9, wherein the interpolation module interpolates according to the following method:

x_new＝u+rand(0,H)×(x-u)