CN111191687A

CN111191687A - Power communication data clustering method based on improved K-means algorithm

Info

Publication number: CN111191687A
Application number: CN201911286973.5A
Authority: CN
Inventors: 刘晴; 刘旭; 汤玮; 金海�; 姜海; 董武
Original assignee: Guizhou Power Grid Co Ltd
Current assignee: Guizhou Power Grid Co Ltd
Priority date: 2019-12-14
Filing date: 2019-12-14
Publication date: 2020-05-22
Anticipated expiration: 2039-12-14
Also published as: CN111191687B

Abstract

The invention discloses a power communication data clustering method based on an improved K-means algorithm, comprising the following steps: S101, normalizing the power communication data; S102, manually selecting an initial classification number K for the normalized data, determining an element distance matrix according to the K value, and determining K initial clustering centers; S103, selecting an element and determining the classification group to which it belongs by calculating the distance between the element and each initial clustering center; S104, updating the clustering centers of the classification groups and determining their actual clustering centers; S105, repeating step S103 until the classification groups no longer change, thereby obtaining the classification of the power communication data. Building on the traditional K-means clustering algorithm, the method can dynamically adjust the initial classification number K according to the clustering result, and can select the initial elements more reasonably from the element distance matrix. This improves both the clustering result and the rationality of the classification, and gives the method strong practicability.

Description

Power communication data clustering method based on improved K-means algorithm

Technical Field

The invention belongs to the technical field of power communication, and particularly relates to a power communication data clustering method based on an improved K-means algorithm.

Background

The power communication network carries a huge amount of redundant data, and processing this redundant data is an important part of power communication data management. Data clustering is a preliminary step of redundant data processing: the massive power communication data are first classified, the type of redundancy is then analyzed from the actual data in each class, and a processing method suited to each class is adopted.

The K-means algorithm is currently the main method for clustering power communication network data. The implementation flow of the traditional K-means algorithm is shown in FIG. 1 and mainly comprises the following steps:

(1) giving a K value and randomly selecting initial elements; the K value is the number of classes obtained by clustering. In the traditional K-means algorithm, the classification number K is given manually, and the initial element of each initial class is selected manually from the elements to be clustered;

(2) judging element classification; the membership of each element is judged one by one according to the distance between the element and the center position of each class;

(3) updating the classification center positions; each time an element has been judged, the center positions of the classes are updated to take the newly added element into account.

The K value and the initial elements are the key factors for element clustering in the K-means algorithm. In the traditional K-means algorithm both are given manually, so they lack a scientific basis and make it difficult to guarantee the clustering result.
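For orientation, the traditional flow in steps (1) to (3) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses the common batch variant, in which the centers are recomputed after all elements have been judged, rather than after each individual element. The names `euclidean` and `traditional_kmeans`, the `seed` and the `max_iter` cap are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def traditional_kmeans(data, k, seed=0, max_iter=100):
    """Plain K-means: K is given by hand, initial centers are drawn at random."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(data, k)]  # step (1)
    labels = None
    for _ in range(max_iter):
        # Step (2): assign every element to its nearest center.
        new_labels = [
            min(range(k), key=lambda t: euclidean(x, centers[t])) for x in data
        ]
        if new_labels == labels:  # no membership changed, so stop
            break
        labels = new_labels
        # Step (3): move each center to the mean of its members.
        for t in range(k):
            members = [x for x, lab in zip(data, labels) if lab == t]
            if members:  # keep the old center if a class is empty
                centers[t] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Because both K and the starting centers are arbitrary here, different seeds can give different partitions on less well-separated data, which is the weakness the invention addresses.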

Disclosure of Invention

The invention overcomes the shortcomings of the prior art. The technical problem to be solved is to provide a power communication data clustering method based on an improved K-means algorithm that is capable of adjusting the initial classification number K and the initial clustering centers.

In order to solve the above technical problem, the invention adopts the following technical scheme: a power communication data clustering method based on an improved K-means algorithm, comprising the following steps: S101, normalizing the power communication data; S102, manually selecting an initial classification number K for the normalized data, determining an element distance matrix according to the K value, and determining K initial clustering centers; S103, selecting an element and determining the classification group to which it belongs by calculating the distance between the element and each initial clustering center; S104, updating the clustering centers of the classification groups and determining their actual clustering centers; and S105, repeating step S103 until the classification groups no longer change, thereby obtaining the classification of the power communication data.

Further, the method also includes:

S106, judging whether the initial classification number K meets the optimal classification value.

Preferably, normalizing the power communication data specifically means converting the power communication data into easy-to-process character-type, continuous-type and discrete-type numerical values;

the character-type value conversion process comprises: the character-type values in the power communication data are divided evenly over the value range according to conversion formula (1);

in formula (1), $x_i$ and $x_i'$ are the values of character-type attribute i of the power communication data before and after processing, respectively, and $\mathrm{Cha}_1, \mathrm{Cha}_2, \ldots$ are the N character values of the attribute, which are converted into values between 0 and 1 according to the category of the character-type attribute value;

the continuous-type value conversion process comprises: the continuous-type values in the power communication data are processed by a normalization method, and the processing formula can be expressed as:

$$x_i' = \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}} \qquad (2)$$

in formula (2), $x_i$ and $x_i'$ are the values of continuous-type attribute i of the power communication data before and after processing, respectively, and $x_i^{\max}$ and $x_i^{\min}$ are the upper and lower limits of the values of the continuous-type attribute.

Preferably, manually selecting an initial classification number K for the normalized data, determining an element distance matrix according to the K value, and determining K initial clustering centers specifically includes:

S1021, manually selecting an initial classification number K;

S1022, calculating the distance between every pair of elements according to the Euclidean distance formula:

$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{M} \left(x_{i,m} - x_{j,m}\right)^2} \qquad (3)$$

assuming that the power communication data to be analyzed after data normalization has N items and each item has M attributes, in formula (3) $x_i$ denotes the i-th item of data, $x_{i,j}$ denotes the j-th attribute value of the i-th item of data, m denotes the dimension, and $d(x_i, x_j)$ denotes the distance between data $x_i$ and data $x_j$;

S1023, obtaining an element distance matrix from the distances between the elements, and determining the average value of each row, namely the average distance between the data corresponding to that row and all other data;

S1024, selecting the element with the largest average distance as the first initial clustering center; each remaining initial clustering center is selected so that its average distance to the initial elements already selected is the largest, namely:

$$\max_{i} \; \frac{1}{J} \sum_{j=1}^{J} d\left(x_i, x_{t_j}\right) \qquad (4)$$

in formula (4), J is the number of initial elements already selected and $x_{t_j}$ is the j-th of them; the number of initial elements is increased one by one until the total equals the initial classification number K. If t is an initial clustering center, the initial clustering center is $(x_{t,1}, x_{t,2}, \ldots, x_{t,M})$.

Preferably, selecting an element and determining the classification group to which it belongs by calculating the distance between the element and each initial clustering center specifically includes: calculating the distance between the selected element and each initial clustering center by formula (5), namely:

$$d(x_i, x_t) = \sqrt{\sum_{j=1}^{M} \left(x_{i,j} - x_{t,j}\right)^2} \qquad (5)$$

in formula (5), $d(x_i, x_t)$ is the distance between element i and initial clustering center t; according to the distances between the element and each initial clustering center, element i is clustered into the class with the smallest distance.

Preferably, updating the clustering centers of the classification groups and determining the actual clustering centers of the classification groups specifically includes: when an element is added, the update formula for the j-th attribute value of the center position of the classification group can be expressed as:

$$x_{t,j} = \frac{(N_t - 1)\,x_{t,j}' + x_{i,j}}{N_t} \qquad (6)$$

in formula (6), $x_{t,j}$ is the j-th attribute value of the actual clustering center t' of the classification group after the element is added, $x_{t,j}'$ is the j-th attribute value of the actual clustering center t' before the element is added, $N_t$ is the number of elements in the group after the element is added, and $x_{i,j}$ is the j-th attribute value of the added element.

Preferably, judging whether the initial classification number K meets the optimal classification value specifically includes:

S1061, calculating the distance between the actual clustering centers t' according to the Euclidean distance formula, namely:

$$d(x_{t1}, x_{t2}) = \sqrt{\sum_{j=1}^{M} \left(x_{t1,j} - x_{t2,j}\right)^2} \qquad (7)$$

in formula (7), $d(x_{t1}, x_{t2})$ is the inter-class distance between actual clustering center t1 and actual clustering center t2;

S1062, calculating the minimum of the inter-class distances among all the actual clustering centers t', namely the minimum inter-class distance $TD^{\min}$;

S1063, calculating the average of the inter-class distances among all the actual clustering centers t', namely the average inter-class distance $TD^{ave}$;

S1064, calculating the maximum of the distances between all elements in the same class, namely the maximum intra-class distance $ITD^{\max}$;

S1065, judging whether the minimum inter-class distance $TD^{\min}$ is much smaller than the average inter-class distance $TD^{ave}$; if so, returning to step S102, otherwise executing step S1066;

S1066, judging whether the maximum intra-class distance $ITD^{\max}$ is much larger than the average inter-class distance $TD^{ave}$; if so, returning to step S102, otherwise executing step S1067;

S1067, determining that the initial classification number K meets the optimal classification value, and outputting the classification of the power communication data.

Preferably, after conversion, the character-type, continuous-type and discrete-type values all lie between 0 and 1.

Compared with the prior art, the invention has the following beneficial effects:

The power communication data clustering method based on an improved K-means algorithm improves on the manually given initial classification number K and initial clustering centers of the traditional K-means clustering algorithm. The initial classification number K can be dynamically adjusted according to the clustering result, which improves the clustering result, and the initial elements can be selected more reasonably from the element distance matrix, which improves the rationality of the classification.

Drawings

The present invention is described in further detail below with reference to the accompanying drawings.

FIG. 1 is a flow chart of the traditional K-means algorithm;

FIG. 2 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a first embodiment of the present invention;

FIG. 3 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a second embodiment of the present invention;

FIG. 4 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a third embodiment of the present invention;

FIG. 5 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 2 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a first embodiment of the present invention. As shown in FIG. 2, the method includes:

S101, normalizing the power communication data;

S102, manually selecting an initial classification number K for the normalized data, determining an element distance matrix according to the K value, and determining K initial clustering centers;

S103, selecting an element, and determining the classification group to which it belongs by calculating the distance between the element and each initial clustering center;

S104, updating the clustering centers of the classification groups, and determining the actual clustering centers of the classification groups;

S105, repeating step S103 until the classification groups no longer change, thereby obtaining the classification of the power communication data.

Specifically, on the basis of the traditional K-means clustering algorithm, this embodiment improves the manually given initial classification number K and initial clustering centers. An element distance matrix is determined for the given initial classification number K, and the elements with the largest average distance are selected as the initial clustering centers, which increases the dispersion of the initial elements. Because the remaining initial clustering centers are also selected from the element distance matrix, the selection is more reasonable, which improves both the clustering result and the rationality of the classification.

FIG. 3 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a second embodiment of the present invention. As shown in FIG. 3, on the basis of the first embodiment, the method further includes:

S106, judging whether the initial classification number K meets the optimal classification value.

In this embodiment, the manually selected initial classification number K can be dynamically adjusted according to the clustering result, which improves the clustering result and the rationality of the clustering, and in turn the efficiency of processing redundant data in the power communication network.

Further, in step S101, normalizing the power communication data specifically means converting the power communication data into easy-to-process character-type, continuous-type and discrete-type numerical values; after conversion, the character-type, continuous-type and discrete-type values all lie between 0 and 1.

The character-type value conversion process comprises: the range of character values can be obtained by counting the character-type values; without loss of generality, the character-type values in the power communication data are divided evenly over the value range according to conversion formula (1);

in formula (1), $x_i$ and $x_i'$ are the values of character-type attribute i of the power communication data before and after processing, respectively, and $\mathrm{Cha}_1, \mathrm{Cha}_2, \ldots$ are the N character-type values of the attribute, each of which is converted into a numerical value between 0 and 1 according to the category of the character-type attribute value;

the continuous-type value conversion process comprises: the continuous-type values in the power communication data are processed by a normalization method, and the processing formula can be expressed as:

$$x_i' = \frac{x_i - x_i^{\min}}{x_i^{\max} - x_i^{\min}} \qquad (2)$$

in formula (2), $x_i$ and $x_i'$ are the values of continuous-type attribute i of the power communication data before and after processing, respectively, and $x_i^{\max}$ and $x_i^{\min}$ are the upper and lower limits of the values of the continuous-type attribute;

discrete-type values are processed in a similar way to character-type values: they are also converted according to their possible values.
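The normalization step can be sketched as below. This is an illustration under stated assumptions, not the patent's exact formulas: formula (1) is not reproduced in the text, so the even spacing `k / (N - 1)` used for character-type and discrete-type values is assumed, and the upper and lower limits of a continuous attribute are taken from the observed data. The function names are hypothetical.

```python
def encode_categorical(values, categories):
    """Map each character/discrete value to an evenly spaced number in [0, 1].

    Assumption: category k of N (0-based) maps to k / (N - 1); the source
    only states that the converted values lie between 0 and 1.
    """
    n = len(categories)
    if n == 1:
        return [0.0 for _ in values]
    code = {c: k / (n - 1) for k, c in enumerate(categories)}
    return [code[v] for v in values]


def min_max_normalize(values):
    """Min-max scaling of a continuous attribute to [0, 1].

    Assumption: the limits are the observed minimum and maximum.
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_normalize([10, 20, 30])` yields `[0.0, 0.5, 1.0]`. Putting every attribute on the same 0 to 1 scale keeps any single attribute from dominating the Euclidean distances used in the later steps.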

FIG. 4 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a third embodiment of the present invention. As shown in FIG. 4, on the basis of the second embodiment, manually selecting an initial classification number K for the normalized data, determining an element distance matrix according to the K value, and determining K initial clustering centers specifically includes:

S1021, manually selecting an initial classification number K;

S1022, calculating the distance between every pair of elements according to the Euclidean distance formula:

$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{M} \left(x_{i,m} - x_{j,m}\right)^2} \qquad (3)$$

assuming that the power communication data to be analyzed after data normalization has N items and each item has M attributes, in formula (3) $x_i$ denotes the i-th item of data, $x_{i,j}$ denotes the j-th attribute value of the i-th item of data, and m denotes the dimension; the distance between data is defined as the Euclidean distance over the attribute values, and $d(x_i, x_j)$ denotes the distance between data $x_i$ and data $x_j$;

S1023, obtaining an element distance matrix from the distances between the elements, wherein the matrix is an N×N matrix whose element in the i-th row and j-th column is the distance $d(x_i, x_j)$ between data $x_i$ and data $x_j$, and determining the average value of each row, namely the average distance between the data corresponding to that row and all other data;

S1024, selecting the element with the largest average distance as the first initial clustering center; each remaining initial clustering center is selected so that its average distance to the initial elements already selected is the largest, namely:

$$\max_{i} \; \frac{1}{J} \sum_{j=1}^{J} d\left(x_i, x_{t_j}\right) \qquad (4)$$

in formula (4), J is the number of initial elements already selected and $x_{t_j}$ is the j-th of them; the number of initial elements is increased one by one until the total equals the initial classification number K. Initial clustering centers obtained in this way have the largest average distance from one another, which is most favorable for clustering. The positions of the classification centers are then determined from the selected initial clustering centers: the initial position of each classification center is the attribute values of the corresponding initial element, so if t is an initial clustering center, the initial clustering center is $(x_{t,1}, x_{t,2}, \ldots, x_{t,M})$.
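Steps S1022 to S1024 can be sketched as follows. The code is a minimal illustration that assumes at least two data items and resolves ties by taking the first candidate; the function names are hypothetical, and the greedy reading of formula (4) (maximize the mean distance to the centers chosen so far) follows the prose description.

```python
import math


def distance_matrix(data):
    """N x N matrix of pairwise Euclidean distances (step S1023)."""
    return [[math.dist(a, b) for b in data] for a in data]


def select_initial_centers(data, k):
    """Return the indices of K initial centers chosen by average distance."""
    n = len(data)
    dist = distance_matrix(data)
    # First center: the element with the largest average distance to all others.
    avg = [sum(row) / (n - 1) for row in dist]
    chosen = [max(range(n), key=lambda i: avg[i])]
    # Remaining centers: maximize the average distance to those already chosen.
    while len(chosen) < k:
        candidates = [i for i in range(n) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: sum(dist[i][t] for t in chosen) / len(chosen),
        )
        chosen.append(best)
    return chosen
```

On the one-dimensional data `[[0.0], [0.1], [0.5], [1.0]]` with K = 2, the element `1.0` has the largest average distance and is picked first, followed by `0.0`, the element farthest from it. Note that the distance matrix costs O(N²) time and memory, which may matter for very large data sets.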

Further, in step S103, selecting an element and determining the classification group to which it belongs by calculating the distance between the element and each initial clustering center specifically includes: defining the distance between an element and a class as the Euclidean distance between the element and the initial clustering center of that class, and calculating the distance between the selected element and each initial clustering center, namely:

$$d(x_i, x_t) = \sqrt{\sum_{j=1}^{M} \left(x_{i,j} - x_{t,j}\right)^2} \qquad (5)$$

in formula (5), $d(x_i, x_t)$ is the distance between element i and initial clustering center t; according to the distances between the element and each initial clustering center, element i is clustered into the class with the smallest distance.
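The assignment rule of step S103 amounts to a nearest-center lookup, which might be written as below. The function name `nearest_center` is hypothetical, and ties are resolved in favor of the lowest class index, a detail the source does not specify.

```python
import math


def nearest_center(element, centers):
    """Index of the center with the smallest Euclidean distance to element."""
    distances = [math.dist(element, c) for c in centers]
    return min(range(len(centers)), key=lambda t: distances[t])
```

With centers `[[0.0, 0.0], [1.0, 1.0]]`, the element `[0.2, 0.2]` falls into class 0 and `[0.9, 0.8]` into class 1.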

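Further, in step S104, updating the clustering centers of the classification groups and determining the actual clustering centers of the classification groups, where the actual clustering center is the average of the corresponding attributes of all elements belonging to the class, specifically includes: when an element is added, the update formula for the j-th attribute value of the center position of the classification group can be expressed as:

$$x_{t,j} = \frac{(N_t - 1)\,x_{t,j}' + x_{i,j}}{N_t} \qquad (6)$$

in formula (6), $x_{t,j}$ is the j-th attribute value of the actual clustering center t' of the classification group after the element is added, $x_{t,j}'$ is the j-th attribute value of the actual clustering center t' before the element is added, $N_t$ is the number of elements in the group after the element is added, and $x_{i,j}$ is the j-th attribute value of the added element.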
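Since the actual clustering center is the mean of its members, adding one element can be handled as a running-mean update without revisiting the other members. The sketch below assumes that reading; `add_to_center` and its argument names are hypothetical.

```python
def add_to_center(center, n_before, element):
    """Running-mean update of a class center when one element joins.

    center   : attribute values of the center before the element is added
    n_before : number of elements in the class before the addition
    element  : attribute values of the newly added element
    Returns the new center and the new element count N_t.
    """
    n_t = n_before + 1
    new_center = [
        (c * (n_t - 1) + x) / n_t for c, x in zip(center, element)
    ]
    return new_center, n_t
```

For instance, a class holding the single element `[1.0, 2.0]` that receives `[3.0, 4.0]` gets the new center `[2.0, 3.0]`, the mean of the two elements.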

FIG. 5 is a schematic flow chart of a power communication data clustering method based on an improved K-means algorithm according to a fourth embodiment of the present invention. As shown in FIG. 5, on the basis of the third embodiment, judging whether the initial classification number K meets the optimal classification value specifically includes:

S1061, calculating the distance between the actual clustering centers t' according to the Euclidean distance formula, namely:

$$d(x_{t1}, x_{t2}) = \sqrt{\sum_{j=1}^{M} \left(x_{t1,j} - x_{t2,j}\right)^2} \qquad (7)$$

in formula (7), $d(x_{t1}, x_{t2})$ is the inter-class distance between actual clustering center t1 and actual clustering center t2;

S1062, calculating the minimum of the inter-class distances among all the actual clustering centers t', namely the minimum inter-class distance $TD^{\min}$;

S1063, calculating the average of the inter-class distances among all the actual clustering centers t', namely the average inter-class distance $TD^{ave}$;

S1064, calculating the maximum of the distances between all elements in the same class, namely the maximum intra-class distance $ITD^{\max}$;

S1065, judging whether the minimum inter-class distance $TD^{\min}$ is much smaller than the average inter-class distance $TD^{ave}$; if so, returning to step S102, otherwise executing step S1066;

S1066, judging whether the maximum intra-class distance $ITD^{\max}$ is much larger than the average inter-class distance $TD^{ave}$; if so, returning to step S102, otherwise executing step S1067;

S1067, determining that the initial classification number K meets the optimal classification value, and outputting the classification of the power communication data.

Specifically, if the manually selected initial classification number K is too large, the classification is finer than actually required, and the minimum inter-class distance $TD^{\min}$ is much smaller than the average inter-class distance $TD^{ave}$; conversely, if the initial classification number K is too small, the classification is too coarse to meet the actual requirement, and the maximum intra-class distance $ITD^{\max}$ of some group is much larger than the average inter-class distance $TD^{ave}$.

If the relationship exists:

$$TD^{\min} < mm \cdot TD^{ave} \qquad (8)$$

in formula (8), mm is a given small number that can generally be taken as 0.2; the value of K is then considered too large, K-1 can replace the original value of K, and the method returns to step S102 to cluster again;

if the relationship exists:

$$ITD^{\max} > MM \cdot TD^{ave} \qquad (9)$$

in formula (9), MM is a given large number that can generally be taken as 8; the value of K is then considered too small, K+1 can replace the original value of K, and the method returns to step S102 to cluster again.

When neither relation (8) nor relation (9) holds, the value of the initial classification number K is reasonable, and the result is output.
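The adequacy check on K can be sketched as follows. Two points are assumptions rather than statements of the source: the intra-class distance is taken as the largest distance between two elements of the same class, and the first test uses "less than", following the prose condition that the minimum inter-class distance is much smaller than the average. The function name `check_k` and its return labels are hypothetical, and at least two classes are required.

```python
import math
from itertools import combinations


def check_k(clusters, centers, mm=0.2, MM=8.0):
    """Return 'decrease', 'increase' or 'ok' for the current K.

    clusters : list of classes, each a list of elements (attribute vectors)
    centers  : actual clustering center of each class (needs K >= 2)
    """
    inter = [math.dist(a, b) for a, b in combinations(centers, 2)]
    td_min = min(inter)
    td_ave = sum(inter) / len(inter)
    # Largest distance between two elements of the same class (assumed reading).
    itd_max = max(
        (math.dist(a, b) for c in clusters for a, b in combinations(c, 2)),
        default=0.0,
    )
    if td_min < mm * td_ave:
        return "decrease"  # two classes nearly coincide: K is too large
    if itd_max > MM * td_ave:
        return "increase"  # some class is too spread out: K is too small
    return "ok"
```

A caller would rerun the clustering from step S102 with K-1 on `"decrease"` and K+1 on `"increase"`. In practice an iteration cap is advisable, since nothing in the two thresholds guarantees that the adjustments cannot alternate.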

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A power communication data clustering method based on an improved K-means algorithm, characterized by comprising the following steps:

S101, normalizing the power communication data;

2. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 1, wherein: further comprising:

3. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 1, wherein: the electric power communication data are subjected to normalized processing, specifically, the electric power communication data are converted into character type numerical values, continuous type numerical values and discrete type numerical values which are easy to process;

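in formula (1), $x_i$ and $x_i'$ are the values of character-type attribute i of the power communication data before and after processing, respectively;

in formula (2), $x_i$ and $x_i'$ are the values of continuous-type attribute i of the power communication data before and after processing, respectively, and $x_i^{\max}$ and $x_i^{\min}$ are the upper and lower limits of the values of the continuous-type attribute.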

4. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 1, wherein: the normalized data is subjected to manual selection of an initial classification number K, an element distance matrix is determined according to a K value, and K initial clustering centers are determined, and the method specifically comprises the following steps:

s1021, manually selecting an initial classification number K;

assuming that the power communication data to be analyzed after data normalization has N items and each item has M attributes, in formula (3) $x_i$ denotes the i-th item of data, $x_{i,j}$ denotes the j-th attribute value of the i-th item of data, m denotes the dimension, and $d(x_i, x_j)$ denotes the distance between data $x_i$ and data $x_j$;

in formula (4), J is the number of initial elements already selected, the number of initial elements is increased one by one until the total equals the initial classification number K, and if t is an initial clustering center, the initial clustering center is $(x_{t,1}, x_{t,2}, \ldots, x_{t,M})$.

5. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 1, wherein: selecting an element, and determining a classification group corresponding to the element by calculating the distance between the element and each initial cluster center, specifically comprising: calculating the distance between each selected element and each initial clustering center by the formula (5), namely:

6. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 1, wherein: the updating of the clustering centers of the classification groups and the determination of the actual clustering centers of the classification groups specifically include: when an element is added, the j-th attribute value updating formula of the central position of the classification group can be expressed as:

7. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 2, wherein: the determining whether the initial classification number K satisfies the optimal classification value specifically includes:

8. The power communication data clustering method based on the improved K-means algorithm as claimed in claim 3, wherein: and the converted value ranges of the character type numerical value, the continuous type numerical value and the discrete type numerical value are all data between 0 and 1.