CN113159137A - Gas load clustering method and device - Google Patents

Gas load clustering method and device

Info

Publication number
CN113159137A
Authority
CN
China
Prior art keywords
data
point
class
minority
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110354433.7A
Other languages
Chinese (zh)
Inventor
黄冬虹
刘丹
王亮
董妍
赵兴昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gas Group Co Ltd
Original Assignee
Beijing Gas Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gas Group Co Ltd filed Critical Beijing Gas Group Co Ltd
Priority to CN202110354433.7A priority Critical patent/CN113159137A/en
Publication of CN113159137A publication Critical patent/CN113159137A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a gas load clustering method and device. The method comprises: clustering the gas load data; and taking the cluster center of a minority class as a base point and interpolating on the line connecting the base point to each data point of the class, or on the extension of that line. By improving the SMOTE algorithm so that the interpolation base point is the cluster center rather than a minority-class data point, the invention alleviates the distribution marginalization to which the SMOTE algorithm is prone and improves the clustering effect. The clustering method is applicable not only to gas load data but also to other unbalanced data sets.

Description

Gas load clustering method and device
Technical Field
The invention relates to the technical field of gas load data clustering classification, in particular to a gas load clustering method and device.
Background
Load clustering research for gas boiler systems aims to divide gas loads scientifically and effectively; user load clustering uses cluster analysis to mine the relationships among, and the composition of, loads of different types and from different regions. When a gas boiler system is planned and designed, the gas boiler load is important basic data, whether for estimating the economics of the project or for determining the construction scale of the system. Load cluster analysis combines data mining technology with gas boiler system applications: it analyzes gas load characteristics through data mining, extracts the load patterns hidden in a large volume of unordered and irregular loads, classifies them, and uses the resulting typical load curves to address problems in the gas boiler system such as gas load prediction and demand-side response analysis. Different types of users, such as residential, commercial, industrial and agricultural users, differ widely in their gas consumption patterns, and even users of the same type may have different patterns. Mining the gas consumption patterns of different users through load data classification can support a gas company's market strategies, such as orderly heating, peak load shifting and time-of-use gas supply, enable more personalized heating services, improve understanding of how different users consume gas, and allow more efficient demand-side management. In addition, users can adjust their consumption strategies more economically according to the problems revealed by load classification, reducing cost and improving energy efficiency.
Actual gas load data is an unbalanced data set. In an unbalanced data set, the number of samples in the two classes differs greatly: the minority class has far fewer samples than the majority class, and minority-class samples are harder to identify than majority-class samples. Oversampling, which increases the number of minority-class samples, makes it possible to cluster unbalanced gas load data. The SMOTE (Synthetic Minority Oversampling Technique) algorithm is currently a commonly used oversampling method and effectively reduces the excessive randomness of random upsampling. However, the algorithm is prone to distribution marginalization, which blurs the class boundary and degrades the clustering effect.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a gas load clustering method and a gas load clustering device, which are used for interpolating clustered data points based on an improved SMOTE algorithm and improving the clustering effect of unbalanced gas load data.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides a gas load clustering method, including:
clustering the gas load data;
and taking the cluster center of a minority class as a base point, and interpolating on the line connecting the base point to each data point of the class or on the extension of that line.
Further, the method further comprises the step of performing dimensionality reduction on the gas load data by adopting a Principal Component Analysis (PCA) before clustering.
Further, the method clusters the gas load data by using an FCM algorithm.
Further, before interpolation, the method also comprises the steps of identifying and eliminating dangerous points:
determining minority-class boundary points: for each minority-class data point, finding its K nearest neighbors, wherein if the K neighbors include a majority-class data point, the minority-class data point is a minority-class boundary point;
counting the minority-class boundary points and, if there is more than one, calculating for each minority-class boundary point the Euclidean distance d1 between that boundary point and the clustering center, and the Euclidean distances d2 between each majority-class data point among its K neighbors and the clustering center;
if d1 of a minority-class boundary point is greater than the minimum value of d2, and the K neighbors of that boundary point are all majority-class data points, the boundary point is a dangerous point;
deleting all dangerous points and clustering again.
Further, the interpolation method specifically includes:
calculating the ratio of the number of majority-class data points to the number of minority-class data points and rounding it to obtain the interpolation multiplier n;
calculating the maximum value D of the Euclidean distances from the minority-class center u to all data points of the class;
calculating the Euclidean distance d between u and each data point x of the class, and rounding D/d to obtain H;
calculating n-1 interpolation points $x_{\text{new}}$ for each data point x according to the following formula:
$$x_{\text{new}} = u + \operatorname{rand}(0, H) \times (x - u)$$
where rand(0, H) is a random number between 0 and H, and the formula is evaluated n-1 times to obtain the n-1 interpolation points.
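As an illustrative worked example of the steps above, assume purely hypothetical values that do not come from the source: 90 majority-class points, 30 minority-class points, a maximum center distance D = 4, and a data point x at distance d = 2 from u. The symbols $N_{\text{maj}}$, $N_{\text{min}}$ and r are introduced here only for illustration.

```latex
\begin{aligned}
n &= \operatorname{round}\!\left(\frac{N_{\text{maj}}}{N_{\text{min}}}\right)
   = \operatorname{round}\!\left(\frac{90}{30}\right) = 3, \\
H &= \operatorname{round}\!\left(\frac{D}{d}\right)
   = \operatorname{round}\!\left(\frac{4}{2}\right) = 2, \\
x_{\text{new}} &= u + r\,(x - u), \qquad r = \operatorname{rand}(0, 2), \\
\lVert x_{\text{new}} - u \rVert &= r\,d \in [0, 4] = [0, D].
\end{aligned}
```

Under these assumed values, n-1 = 2 interpolation points are generated for x, and both lie on the line from u through x at a distance from u no greater than D.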
In a second aspect, the present invention provides a gas load clustering device, including:
the clustering module is used for clustering the gas load data;
and the interpolation module is used for taking the cluster center of a minority class as a base point and interpolating on the line connecting the base point to each data point of the class or on the extension of that line.
Further, the device further comprises a dimensionality reduction module which is used for carrying out dimensionality reduction on the gas load data by adopting a Principal Component Analysis (PCA).
Further, the clustering module clusters the gas load data by using an FCM algorithm.
Further, the device also comprises a dangerous point eliminating module which is used for identifying and eliminating the dangerous points according to the following method before interpolation is carried out:
determining minority-class boundary points: for each minority-class data point, finding its K nearest neighbors, wherein if the K neighbors include a majority-class data point, the minority-class data point is a minority-class boundary point;
counting the minority-class boundary points and, if there is more than one, calculating for each minority-class boundary point the Euclidean distance d1 between that boundary point and the clustering center, and the Euclidean distances d2 between each majority-class data point among its K neighbors and the clustering center;
if d1 of a minority-class boundary point is greater than the minimum value of d2, and the K neighbors of that boundary point are all majority-class data points, the boundary point is a dangerous point;
deleting all dangerous points and clustering again.
Further, the interpolation module performs interpolation according to the following method:
calculating the ratio of the number of majority-class data points to the number of minority-class data points and rounding it to obtain the interpolation multiplier n;
calculating the maximum value D of the Euclidean distances from the minority-class center u to all data points of the class;
calculating the Euclidean distance d between u and each data point x of the class, and rounding D/d to obtain H;
calculating n-1 interpolation points $x_{\text{new}}$ for each data point x according to the following formula:
$$x_{\text{new}} = u + \operatorname{rand}(0, H) \times (x - u)$$
where rand(0, H) is a random number between 0 and H, and the formula is evaluated n-1 times to obtain the n-1 interpolation points.
Compared with the prior art, the invention has the following beneficial effects.
The method and the device cluster an unbalanced data set by clustering the gas load data, taking the cluster center of a minority class as a base point, and interpolating on the line connecting the base point to each data point of the class or on the extension of that line. By improving the SMOTE algorithm so that the interpolation base point is the cluster center rather than a minority-class data point, the invention alleviates the distribution marginalization to which the SMOTE algorithm is prone and improves the clustering effect. In addition, the clustering method is applicable not only to gas load data but also to other unbalanced data sets.
Drawings
Fig. 1 is a flowchart of a gas load clustering method according to an embodiment of the present invention.
FIG. 2 is a diagram of an unbalanced sample data set, where the circles are minority-class data points and the five-pointed stars are majority-class data points.
Fig. 3 is a schematic diagram of the clustered data points after oversampling, in which hollow circles are minority-class data points, solid circles are interpolation points, and the asterisk is the cluster center.
Fig. 4 is a block diagram of a gas load clustering device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a gas load clustering method according to an embodiment of the present invention, including the following steps:
step 101, clustering gas load data;
and step 102, taking the cluster center of a minority class as a base point, and interpolating on the line connecting the base point to each data point of the class or on the extension of that line.
In this embodiment, step 101 clusters the gas load data. Clustering is a machine learning technique that groups data points: given a set of data points, a clustering algorithm assigns each data point to a particular group. In theory, data points in the same group should have similar attributes and/or characteristics, while data points in different groups should have different attributes and/or characteristics. Clustering is an unsupervised learning method and a statistical data analysis technique commonly used in many fields. There are many clustering algorithms, which fall mainly into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Examples include the k-means algorithm among partitioning methods, agglomerative hierarchical clustering among hierarchical methods, and neural network clustering among model-based methods. This embodiment does not limit the clustering algorithm to a specific one.
In this embodiment, step 102 interpolates the minority-class data points. As mentioned above, gas load data is an unbalanced data set: the number of samples in the two classes differs greatly, and the minority class has far fewer samples than the majority class. As shown in FIG. 2, the circles are minority-class data points and the five-pointed stars are majority-class data points, and the circles are significantly fewer than the stars. Most machine learning algorithms for classification are designed for balanced data sets, so a model predicts poorly on unbalanced data, especially for the class with fewer samples; more importantly, minority-class samples are harder to identify than majority-class samples. For this reason, this embodiment mainly processes the minority-class data, specifically by oversampling (interpolating) the minority-class data points so that the unbalanced data set tends toward balance.
In the prior art, interpolation is mostly performed with the SMOTE algorithm, whose principle is described below. The primary purpose of the SMOTE algorithm is to turn an unbalanced data set into a balanced one by increasing the amount of minority-class data. Given an unbalanced data set, for each gas load data point x in the minority class, its K nearest neighbors among the minority-class data are found. Assuming that the up-sampling rate of the data set is n (approximately equal to the ratio of the amount of majority-class data to the amount of minority-class data), n data points $y_1, y_2, \ldots, y_n$ are randomly selected from the K nearest neighbors, and n interpolation points are obtained according to the following formula:
$$p_i = x + \operatorname{rand}(0, 1) \times (y_i - x)$$
where rand(0, 1) denotes a random number in the interval (0, 1), and i = 1, 2, ..., n.
This interpolation balances the numbers of majority-class and minority-class data points and thereby improves classification accuracy on the unbalanced data set. However, because the algorithm interpolates between minority-class data points, the distribution of the minority-class data determines which neighbors can be selected. If a minority-class data point lies at the edge of the class distribution, the points interpolated between it and its neighbors also lie near the edge and become increasingly marginalized, which progressively blurs the boundary between the minority class and the majority class. Although the data set becomes more balanced, the blurred boundary makes clustering harder and degrades the clustering effect. For this reason, this embodiment improves the SMOTE algorithm: the cluster center of the minority class is taken as the base point, and interpolation is performed between the base point and each data point, so that the interpolation points of minority-class data points on the boundary still fall within or near the boundary, which alleviates the boundary blurring.
Fig. 3 is a schematic diagram of an interpolation result obtained with the improved interpolation method, in which hollow circles are minority-class data points, solid circles are interpolation points, and the asterisk is the cluster center. As can be seen, almost all interpolation points fall within the boundary, and no boundary blurring is produced.
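For comparison with the improved method, the conventional SMOTE interpolation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `smote`, the use of plain tuples for data points, and drawing the n neighbors with replacement are all assumptions made here.

```python
import math
import random


def smote(minority, k, n, rng=None):
    """Conventional SMOTE sketch: p_i = x + rand(0, 1) * (y_i - x).

    minority: list of equal-length tuples (minority-class data points)
    k: number of nearest minority-class neighbours considered per point
    n: number of interpolation points generated per point
    """
    rng = rng or random.Random()
    synthetic = []
    for idx, x in enumerate(minority):
        # Rank the other minority-class points by Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != idx]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n):
            # Assumption: neighbours are drawn with replacement.
            y = rng.choice(neighbours)
            r = rng.random()
            synthetic.append(tuple(a + r * (b - a) for a, b in zip(x, y)))
    return synthetic
```

Because every base point x is itself a minority-class data point, a point on the edge of the class produces interpolation points that also sit near the edge, which is the marginalization effect discussed above.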
As an optional embodiment, the method further comprises performing dimensionality reduction on the gas load data by using Principal Component Analysis (PCA) before clustering.
In this embodiment, the gas load data is subjected to dimensionality reduction before clustering, so as to reduce the imbalance degree of the data set and the calculation amount of clustering operation by reducing the data dimensionality and the number of data, thereby improving the clustering speed and the clustering effect. There are many dimension reduction algorithms, and the commonly used ones are PCA, Sammon mapping, and feature index dimension reduction.
PCA is often used in machine learning as a data preprocessing step. Simplifying and reducing the dimensionality of data with PCA is motivated mainly by two factors: first, a high-dimensional feature space contains much redundant information, and the features are correlated with one another; second, high-dimensional data is computationally expensive. The goal of PCA is to simplify high-dimensional data while preserving as much of the information in the original data set as possible. PCA does this by retaining the few components that account for most of the variance of the data samples, which keeps the information lost in the reduction to a minimum. The retained principal components have the following characteristics: the number of retained principal components is smaller than the dimensionality of the original data set; each principal component is a linear combination of the original variables, and the principal components are uncorrelated with one another; and as much information of the original data samples as possible is retained with as little loss as possible.
Sammon mapping is a distance preserving technique. The Sammon mapping has only one specific purpose, namely to reduce the dimensionality of a limited number of points. This technique is considered a variant of PCA. The goal of the Sammon algorithm is to minimize the error function. The advantage of using this technique is that it is computationally simple, and results can be obtained even for non-linear datasets, provided that the dataset is less complex. The Sammon algorithm is also suitable for non-linear datasets.
Characteristic-index dimension reduction is another way of processing high-dimensional load data. Common load characteristic indexes include the average load, the peak-to-valley difference, and the load rate. Because the load curve is influenced by time, temperature, season, living habits, regional production modes and the like, it is difficult to study. For this purpose, the following 6 load characteristic indexes are selected: the highest load rate is the hour rate of utilization, the daily peak-valley difference rate, the peak load rate, the flat load rate and the valley load rate. Together these indexes describe the user's load pattern more comprehensively across the different periods of the day and their load rates.
Experiments show that in the aspect of data dimension reduction, the PCA algorithm has high operation efficiency and good dimension reduction effect, and the obtained clustering result is basically consistent with the clustering result without dimension reduction; the disadvantage of the Sammon algorithm is that the operation time is long; the clustering effect obtained by the feature selection dimensionality reduction algorithm has larger deviation. Therefore, the PCA algorithm is adopted in the embodiment to perform the dimension reduction processing on the gas load data.
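The idea behind the PCA step can be illustrated with a small standard-library sketch. It extracts only the first principal component and uses power iteration on the covariance matrix, which is a simplification chosen here for brevity and not the procedure stated in the source; the function names `first_principal_component` and `project` are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, direction) of the first principal component of rows."""
    dim = len(rows[0])
    count = len(rows)
    mean = [sum(r[j] for r in rows) / count for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (count - 1)
            for b in range(dim)] for a in range(dim)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate case: the data has no variance
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Reduce each row to its 1-D score along the principal direction."""
    return [sum((x - m) * d for x, m, d in zip(r, mean, direction))
            for r in rows]
```

A practical implementation would normally keep several components, for example as many as are needed to reach a chosen share of the total variance, and would rely on a numerical library for the eigendecomposition.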
As an alternative embodiment, the method uses FCM algorithm to cluster the gas load data.
This embodiment provides a specific clustering algorithm, namely the fuzzy C-means (FCM) clustering algorithm. FCM is one of the most widely used objective-function-based clustering algorithms. FCM is an improvement on traditional hard clustering, in which the membership degree is either 0 or 1: it constructs a constrained nonlinear programming problem using a mean-square approximation method and solves the clustering problem by optimizing an objective function. FCM is an unsupervised learning method. First, c objects in the sample are randomly selected as the initial cluster centers, and the fuzzy partition matrix is initialized with values in [0, 1]; then the fuzzy partition matrix and the class centers are calculated, and the objective function value is computed. These steps are repeated until the objective function value reaches its minimum or the number of iterations exceeds the maximum, after which each sample is assigned to a class according to its values in the fuzzy partition matrix. FCM measures the distance between a sample and a cluster center by the Euclidean distance; the smaller the distance, the higher the similarity and the more readily the objects are assigned to the same class. FCM is mature prior art, and the detailed algorithm flow is not given here.
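Since the source omits the FCM flow, the following is a minimal sketch of the standard fuzzy C-means iteration, given only to make the loop described above concrete. The fuzzifier `m = 2.0`, the convergence tolerance `tol`, the optional `init` argument and the function name `fcm` are assumptions introduced here, not values from the source.

```python
import math
import random


def fcm(points, c, m=2.0, max_iter=100, tol=1e-6, init=None, rng=None):
    """Fuzzy C-means sketch. Returns (centers, memberships, hard labels)."""
    rng = rng or random.Random()
    dim = len(points[0])
    # Initial centers: caller-supplied, or c randomly chosen samples.
    start = init if init is not None else rng.sample(points, c)
    centers = [list(p) for p in start]
    prev_obj = float("inf")
    u = []
    for _ in range(max_iter):
        # Membership update: closer centers receive larger membership.
        u = []
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centers]
            if 0.0 in dists:
                hit = dists.index(0.0)  # point sits exactly on a center
                u.append([1.0 if j == hit else 0.0 for j in range(c)])
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                u.append([v / total for v in inv])
        # Center update: mean of all points weighted by membership ** m.
        for j in range(c):
            weights = [row[j] ** m for row in u]
            wsum = sum(weights)
            centers[j] = [sum(w * p[k] for w, p in zip(weights, points)) / wsum
                          for k in range(dim)]
        # Objective: membership-weighted sum of squared distances.
        obj = sum(u[i][j] ** m * math.dist(points[i], centers[j]) ** 2
                  for i in range(len(points)) for j in range(c))
        if abs(prev_obj - obj) < tol:
            break
        prev_obj = obj
    # Hard assignment: each sample goes to its highest-membership class.
    labels = [row.index(max(row)) for row in u]
    return centers, u, labels
```

The sketch stops when the objective changes by less than `tol` between iterations, which is one common way of deciding that the objective has stopped decreasing; the source only states that iteration ends at the minimum or at the iteration limit.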
As an alternative embodiment, before the interpolation, the method further comprises the steps of identifying and eliminating the dangerous points:
determining minority-class boundary points: for each minority-class data point, finding its K nearest neighbors, wherein if the K neighbors include a majority-class data point, the minority-class data point is a minority-class boundary point;
counting the minority-class boundary points and, if there is more than one, calculating for each minority-class boundary point the Euclidean distance d1 between that boundary point and the clustering center, and the Euclidean distances d2 between each majority-class data point among its K neighbors and the clustering center;
if d1 of a minority-class boundary point is greater than the minimum value of d2, and the K neighbors of that boundary point are all majority-class data points, the boundary point is a dangerous point;
deleting all dangerous points and clustering again.
This embodiment provides a technical scheme for eliminating dangerous points. "Dangerous point" is a figurative name; such a point may also be called an "interference point", because its presence can strongly affect or interfere with the clustering algorithm. Therefore, to improve the clustering effect, the dangerous points are identified and deleted, and clustering is then performed again. A dangerous point in this embodiment satisfies the following conditions: it is a minority-class boundary point; its K neighbors are all majority-class data points; and its distance d1 to the cluster center is greater than the distance d2 to the cluster center of at least one majority-class data point among its K neighbors. In other words, a dangerous point is almost entirely surrounded by majority-class data points, and its presence interferes with the clustering algorithm, greatly increases the difficulty of clustering, and degrades the clustering effect. Accordingly, the method identifies dangerous points by first determining the minority-class boundary points and then checking whether each boundary point satisfies the latter two conditions. The number of minority-class boundary points is also restricted: it must be greater than 1, because if there were only one boundary point and it were judged to be a dangerous point, no boundary point would remain after deletion (or the overall distribution of the minority class would be greatly affected). A boundary point is determined as follows: the K neighbors of each minority-class data point are found, and if they include a majority-class data point, the minority-class data point is a minority-class boundary point.
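The identification steps above can be sketched in a few lines of Python. This is an illustrative reading of the text rather than a definitive implementation: the function name `find_danger_points` is hypothetical, and the sketch assumes that the "cluster center" in the distance comparison is the center of the minority class, which the source does not state explicitly.

```python
import math


def find_danger_points(minority, majority, center, k):
    """Return the minority-class points judged to be dangerous points.

    center: cluster center used for d1 and d2 (assumed here to be the
            minority-class center; the source does not specify which).
    """
    labelled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    boundary = []
    for idx, x in enumerate(minority):
        # K nearest neighbours of x among all other points of both classes.
        others = [item for j, item in enumerate(labelled) if j != idx]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:k]
        maj_neighbours = [p for p, label in nearest if label == "maj"]
        if maj_neighbours:  # at least one majority neighbour: boundary point
            all_majority = len(maj_neighbours) == len(nearest)
            boundary.append((x, maj_neighbours, all_majority))
    # The check only applies when there is more than one boundary point.
    if len(boundary) <= 1:
        return []
    danger = []
    for x, maj_neighbours, all_majority in boundary:
        d1 = math.dist(x, center)
        d2_min = min(math.dist(p, center) for p in maj_neighbours)
        if all_majority and d1 > d2_min:
            danger.append(x)
    return danger
```

The returned points would then be removed from the data set before clustering is repeated.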
As an alternative embodiment, the interpolation method specifically includes:
calculating the ratio of the number of majority-class data points to the number of minority-class data points and rounding it to obtain the interpolation multiplier n;
calculating the maximum value D of the Euclidean distances from the minority-class center u to all data points of the class;
calculating the Euclidean distance d between u and each data point x of the class, and rounding D/d to obtain H;
calculating n-1 interpolation points $x_{\text{new}}$ for each data point x according to the following formula:
$$x_{\text{new}} = u + \operatorname{rand}(0, H) \times (x - u)$$
where rand(0, H) is a random number between 0 and H, and the formula is evaluated n-1 times to obtain the n-1 interpolation points.
This embodiment provides a technical solution of an improved SMOTE algorithm. The main improvement is that interpolation takes the cluster center as the base point, instead of taking a minority-class data point as the base point as in the conventional algorithm. The interpolation points are calculated by the formula above; compared with the interpolation formula of the SMOTE algorithm, the base point is changed from x to u, and the upper limit 1 of the random number is changed to H. H is obtained by rounding the ratio D/d, where D is the maximum Euclidean distance between the minority-class data points and the cluster center u, and d is the Euclidean distance between the minority-class data point x and u. Because H is greater than or equal to 1, the range of the interpolation points is widened, which avoids the overfitting caused by a small interpolation range. This may place some interpolation points of boundary points "out of bounds", but because such points are few, they have little effect on clustering. The interpolation formula can of course be further improved: when x is a boundary point, the upper limit is kept at 1, and H is used only when x is a non-boundary point. This lowers the probability of going "out of bounds".
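A minimal Python sketch of this center-based interpolation is given below. The function name `center_based_oversample` is hypothetical, and two details are assumptions made here because the source does not settle them: rounding uses Python's built-in `round`, and a data point that coincides with the center is skipped to avoid dividing by zero.

```python
import math
import random


def center_based_oversample(minority, center, n_majority, rng=None):
    """Generate n-1 points per minority point via x_new = u + rand(0, H)(x - u).

    minority: list of equal-length tuples (minority-class data points)
    center: minority-class cluster center u
    n_majority: number of majority-class data points
    """
    rng = rng or random.Random()
    # Interpolation multiplier n: rounded majority/minority ratio.
    n = round(n_majority / len(minority))
    # D: largest distance from the center to any minority-class point.
    big_d = max(math.dist(center, x) for x in minority)
    synthetic = []
    for x in minority:
        d = math.dist(center, x)
        if d == 0.0:
            continue  # assumption: skip a point lying exactly on the center
        h = round(big_d / d)  # H >= 1 because D >= d
        for _ in range(n - 1):
            r = rng.uniform(0.0, h)
            synthetic.append(tuple(u + r * (xi - u) for u, xi in zip(center, x)))
    return synthetic
```

Every generated point lies on the ray from u through x. When D/d is not an integer, rounding up can place a point slightly farther from u than D, which corresponds to the "out of bounds" case mentioned in the text.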
Fig. 4 is a schematic composition diagram of a gas load clustering device according to an embodiment of the present invention, where the device includes:
the clustering module 11 is used for clustering the gas load data;
and the interpolation module 22 is configured to take the cluster center of a minority class as a base point and interpolate on the line connecting the base point to each data point of the class or on the extension of that line.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an optional embodiment, the device further comprises a dimension reduction module for performing dimension reduction processing on the gas load data by using Principal Component Analysis (PCA).
As an alternative embodiment, the clustering module 11 clusters the gas load data by using an FCM algorithm.
As an optional embodiment, the device further includes a danger point elimination module, configured to identify and eliminate danger points before interpolation according to the following method:
determining the minority class boundary points: for each minority class data point, finding its K nearest neighbors; if the K neighbors include a majority class data point, the minority class data point is a minority class boundary point;
counting the minority class boundary points; if there is more than one, calculating, for each minority class boundary point, the Euclidean distance d1 from that boundary point to the cluster center and the Euclidean distances d2 from the majority class data points among its K neighbors to the cluster center;
if d1 of a minority class boundary point is larger than the minimum value of d2, and all of its K neighbors are majority class data points, that minority class boundary point is a danger point;
deleting all danger points and clustering again.
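The danger point test above can be sketched as follows. This is an illustration under stated assumptions rather than the module itself: the cluster center is taken to be the center of the minority class, K defaults to 3, and the function name `find_danger_points` is hypothetical. The re-clustering that follows deletion is not shown.

```python
import math


def find_danger_points(minority, majority, center, k=3):
    """Return the indices (into `minority`) of the danger points.

    Assumption: `center` is the cluster center of the minority class.
    """
    # Label 0 marks minority points, label 1 marks majority points.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    boundary = {}  # minority index -> (majority neighbours, all-majority flag)
    for i, p in enumerate(minority):
        # Minority points come first in `labelled`, so index i is p itself.
        others = [item for j, item in enumerate(labelled) if j != i]
        others.sort(key=lambda item: math.dist(p, item[0]))
        neighbours = others[:k]
        maj_neighbours = [q for q, label in neighbours if label == 1]
        if maj_neighbours:  # at least one majority neighbour: boundary point
            all_majority = len(maj_neighbours) == len(neighbours)
            boundary[i] = (maj_neighbours, all_majority)
    # The distance test is applied only when there is more than one boundary point.
    if len(boundary) <= 1:
        return []
    danger = []
    for i, (maj_neighbours, all_majority) in boundary.items():
        d1 = math.dist(minority[i], center)
        d2_min = min(math.dist(q, center) for q in maj_neighbours)
        if all_majority and d1 > d2_min:
            danger.append(i)
    return danger
```

The returned indices identify the minority class points to delete before the data are clustered again.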
As an alternative embodiment, the interpolation module 22 performs interpolation according to the following method:
calculating the ratio of the number of majority class data points to the number of minority class data points, and rounding it to obtain the interpolation multiple n;
calculating the maximum value D of the Euclidean distances from the minority class center u to all data points of that class;
calculating the Euclidean distance d between u and each data point x of that class, and rounding D/d to obtain H;
calculating the n-1 interpolation points $x_{new}$ for each data point x as follows:
$x_{new} = u + \mathrm{rand}(0, H) \times (x - u)$
where rand(0, H) is a random number between 0 and H, and the formula is evaluated n-1 times to obtain the n-1 interpolation points.
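The interpolation steps above can be sketched as follows, using only the Python standard library. The function name `interpolate_minority`, the `seed` parameter and the use of Python's built-in `round` (the text does not state a rounding mode) are assumptions for illustration. Skipping a point that coincides with the center is also an assumption, since D/d is undefined there.

```python
import math
import random


def interpolate_minority(minority, center, n_majority, seed=None):
    """Generate n-1 synthetic points per minority data point.

    Each new point is u + rand(0, H) * (x - u), where u is the minority
    class center, so it lies on the line from u through x or its extension.
    """
    rng = random.Random(seed)
    # Interpolation multiple n: rounded majority/minority count ratio.
    n = round(n_majority / len(minority))
    # D: largest distance from the center to any minority point.
    big_d = max(math.dist(center, x) for x in minority)
    new_points = []
    for x in minority:
        d = math.dist(center, x)
        if d == 0.0:
            continue  # x coincides with the center: no direction to follow
        h = round(big_d / d)  # H, the upper limit of the random factor
        for _ in range(n - 1):
            t = rng.uniform(0, h)
            new_points.append([u + t * (xi - u) for u, xi in zip(center, x)])
    return new_points
```

Because H grows as x gets closer to the center, points near the center can be pushed outward toward the radius D, while the farthest point (H = 1) is only interpolated between the center and itself. Note that rounding D/d upward can place a new point slightly beyond D.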
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A gas load clustering method is characterized by comprising the following steps:
clustering the gas load data;
and taking the cluster center of a minority class as a base point, and interpolating on the line connecting the base point to each data point of that class, or on the extension of that line.
2. The gas load clustering method of claim 1, further comprising performing dimensionality reduction on the gas load data using Principal Component Analysis (PCA) prior to clustering.
3. The gas load clustering method of claim 1, wherein the method clusters gas load data using an FCM algorithm.
4. The gas load clustering method according to claim 1, further comprising the steps of identifying and eliminating danger points before interpolation:
determining the minority class boundary points: for each minority class data point, finding its K nearest neighbors; if the K neighbors include a majority class data point, the minority class data point is a minority class boundary point;
counting the minority class boundary points; if there is more than one, calculating, for each minority class boundary point, the Euclidean distance d1 from that boundary point to the cluster center and the Euclidean distances d2 from the majority class data points among its K neighbors to the cluster center;
if d1 of a minority class boundary point is larger than the minimum value of d2, and all of its K neighbors are majority class data points, that minority class boundary point is a danger point;
deleting all danger points and clustering again.
5. The gas load clustering method according to claim 4, wherein the interpolation method specifically comprises:
calculating the ratio of the number of majority class data points to the number of minority class data points, and rounding it to obtain the interpolation multiple n;
calculating the maximum value D of the Euclidean distances from the minority class center u to all data points of that class;
calculating the Euclidean distance d between u and each data point x of that class, and rounding D/d to obtain H;
calculating the n-1 interpolation points $x_{new}$ for each data point x as follows:
$x_{new} = u + \mathrm{rand}(0, H) \times (x - u)$
where rand(0, H) is a random number between 0 and H, and the formula is evaluated n-1 times to obtain the n-1 interpolation points.
6. A gas load clustering device, comprising:
a clustering module, configured to cluster the gas load data; and
an interpolation module, configured to take the cluster center of a minority class as a base point and to interpolate on the line connecting the base point to each data point of that class, or on the extension of that line.
7. The gas load clustering device of claim 6, wherein the device further comprises a dimensionality reduction module for performing dimensionality reduction on the gas load data by using Principal Component Analysis (PCA).
8. The gas load clustering device of claim 6, wherein the clustering module clusters the gas load data using an FCM algorithm.
9. The gas load clustering device according to claim 6, wherein the device further comprises a danger point elimination module for identifying and eliminating danger points before interpolation according to the following method:
determining the minority class boundary points: for each minority class data point, finding its K nearest neighbors; if the K neighbors include a majority class data point, the minority class data point is a minority class boundary point;
counting the minority class boundary points; if there is more than one, calculating, for each minority class boundary point, the Euclidean distance d1 from that boundary point to the cluster center and the Euclidean distances d2 from the majority class data points among its K neighbors to the cluster center;
if d1 of a minority class boundary point is larger than the minimum value of d2, and all of its K neighbors are majority class data points, that minority class boundary point is a danger point;
deleting all danger points and clustering again.
10. The gas load clustering device of claim 9, wherein the interpolation module interpolates according to the following method:
calculating the ratio of the number of majority class data points to the number of minority class data points, and rounding it to obtain the interpolation multiple n;
calculating the maximum value D of the Euclidean distances from the minority class center u to all data points of that class;
calculating the Euclidean distance d between u and each data point x of that class, and rounding D/d to obtain H;
calculating the n-1 interpolation points $x_{new}$ for each data point x as follows:
$x_{new} = u + \mathrm{rand}(0, H) \times (x - u)$
where rand(0, H) is a random number between 0 and H, and the formula is evaluated n-1 times to obtain the n-1 interpolation points.
CN202110354433.7A 2021-04-01 2021-04-01 Gas load clustering method and device Pending CN113159137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110354433.7A CN113159137A (en) 2021-04-01 2021-04-01 Gas load clustering method and device

Publications (1)

Publication Number Publication Date
CN113159137A true CN113159137A (en) 2021-07-23

Family

ID=76885963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110354433.7A Pending CN113159137A (en) 2021-04-01 2021-04-01 Gas load clustering method and device

Country Status (1)

Country Link
CN (1) CN113159137A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456315A (en) * 2022-11-11 2022-12-09 成都秦川物联网科技股份有限公司 Gas pipe network preset management method for intelligent gas and Internet of things system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930856A (en) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on improved DBSCAN-SMOTE algorithm
CN107194803A (en) * 2017-05-19 2017-09-22 南京工业大学 P2P net loan borrower credit risk assessment device
KR20190048119A (en) * 2017-10-30 2019-05-09 부산대학교 산학협력단 System and Method for Solutioning Class Imbalance Problem by Using FCM and SMOTE
CN110275910A (en) * 2019-06-20 2019-09-24 东北大学 A kind of oversampler method of unbalanced dataset
CN110852388A (en) * 2019-11-13 2020-02-28 吉林大学 Improved SMOTE algorithm based on K-means
CN111782904A (en) * 2019-12-10 2020-10-16 国网天津市电力公司电力科学研究院 Improved SMOTE algorithm-based unbalanced data set processing method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
周大镯: "《对变量时间序列研究》", 31 December 2012, 石家庄:河北人民出版社 *
王海: "过采样与集成学习方法在软件缺陷预测中的对比研究", 《计算机与现代化》 *
邱静 等: "基于优化模糊C均值聚类选取相似日的燃气负荷预测", 《上海师范大学学报(自然科学版)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115456315A (en) * 2022-11-11 2022-12-09 成都秦川物联网科技股份有限公司 Gas pipe network preset management method for intelligent gas and Internet of things system
US11803173B2 (en) 2022-11-11 2023-10-31 Chengdu Qinchuan Iot Technology Co., Ltd. Preset management methods of gas pipeline network for smart gas and internet of things systems thereof
US11982994B2 (en) 2022-11-11 2024-05-14 Chengdu Qinchuan Iot Technology Co., Ltd. Determining a regional gas pipeline operating scheme

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210723