CN114328513A - Big data attribute importance and identification degree early warning method based on clustering

Info

Publication number: CN114328513A
Application number: CN202111561388.9A
Authority: CN
Inventors: 乔亚男; 张兆杰; 孙虹; 刘浩宇; 翟术然; 李刚; 李野; 卢静雅; 赵勇; 陈娟; 董得龙; 何泽昊
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-04-12

Abstract

The invention discloses a clustering-based early warning method for big data attribute importance and identification degree, which addresses the problem that early warning on existing clustered big data is inconvenient. The method comprises the following steps: S1, collecting big data, wherein the collected data set is N; S2, constructing K groups by a partitioning (splitting) method, wherein each group represents a cluster, and dividing the data set into the K groups; S3, grouping the data through a clustering algorithm; S4, presetting the data volume for the K groups and calculating the actual data volume in the K groups; S5, comparing the calculated data volume with the preset data volume, and giving an early warning when the calculated data volume is larger than the preset data volume; S6, counting the growth speed of the data in the K groups. The method enables early warning for big data and helps in grasping the importance and identification degree of the big data.

Description

Big data attribute importance and identification degree early warning method based on clustering

Technical Field

The invention relates to the technical field of big data early warning, in particular to a clustering-based big data attribute importance and identification degree early warning method.

Background

The process of dividing a collection of physical or abstract objects into classes composed of similar objects is called clustering. A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. As the saying goes, "things of a kind come together, and people of a mind fall into the same group"; classification problems abound in both the natural and the social sciences. Cluster analysis, also called group analysis, is a statistical method for studying the classification of samples or indices. It originates from taxonomy, but clustering is not the same as classification: in clustering, the classes into which the data are to be divided are not known in advance. Cluster analysis covers a wide range of techniques, including hierarchical (system) clustering, ordered-sample clustering, dynamic clustering, fuzzy clustering, graph-theoretic clustering and clustering-based forecasting.

In the prior art, it is inconvenient to perform early warning on clustered big data. A clustering-based early warning method for big data attribute importance and identification degree is therefore proposed to solve this problem.

Disclosure of Invention

The invention aims to provide a clustering-based early warning method for big data attribute importance and identification degree, so as to overcome the shortcoming of the prior art that early warning on clustered big data is inconvenient.

In order to achieve the purpose, the invention adopts the following technical scheme:

A clustering-based early warning method for big data attribute importance and identification degree comprises the following steps:

Step S1, collecting big data, wherein the collected data set is N;

Step S2, constructing K groups by a partitioning (splitting) method, wherein each group represents a cluster, and dividing the data set into the K groups;

Step S3, grouping the data through a clustering algorithm;

Step S4, presetting the data volume for the K groups, and calculating the actual data volume in the K groups;

Step S5, comparing the calculated data volume with the preset data volume, and giving an early warning when the calculated data volume is larger than the preset data volume;

Step S6, counting the growth speed of the data in the K groups, and sorting the groups by the counted speed;

Step S7, setting a critical value of the growth speed, and comparing the highest growth speed in the sorted sequence with the critical value;

Step S8, when the highest growth speed in the sorted sequence is larger than the critical value, exporting that group and giving an early warning.

The clustering algorithm is one of a K-MEANS algorithm, a K-MEDOIDS algorithm and a CLARANS algorithm.

In step S3, the clustering algorithm changes the grouping through repeated iteration, so that each improved grouping scheme is better than the previous one, until the grouping of the data is complete.
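The iterative refinement described for step S3 can be illustrated with a minimal K-MEANS sketch. This is an illustration under stated assumptions, not the implementation of the application: the function name `k_means`, the squared Euclidean distance and the deterministic seeding from the first K points are all choices made here for brevity.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], k: int, max_iter: int = 100) -> List[int]:
    """Return a group label (0..k-1) for each point.

    Assumption: the first k points seed the centroids; a production
    system would likely use a more careful initialisation.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # grouping unchanged, so iteration stops
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a group is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


data = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels = k_means(data, k=2)
```

Neither the assignment step nor the update step increases the total squared distance of points to their centroids, which is one way to read the requirement that each grouping scheme be no worse than the previous one.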

In step S4, when the data in a group changes, the data volume of that group is updated.

In step S5, the calculated data volume is compared with the preset data volume by subtracting the preset data volume from the calculated data volume to obtain a difference. When the difference is positive, that is, when the calculated data volume is greater than the preset data volume, an early warning is given; otherwise, no early warning is given.
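Steps S4 and S5 amount to keeping a per-group counter up to date and testing the sign of a difference. The sketch below is one possible way to express this; the class name `GroupVolumeMonitor`, its methods and the sample preset values are hypothetical and are not taken from the application.

```python
from collections import Counter
from typing import Dict, List


class GroupVolumeMonitor:
    """Tracks per-group data volume and flags groups above a preset volume."""

    def __init__(self, preset: Dict[int, int]) -> None:
        self.preset = dict(preset)  # preset data volume per group (S4)
        self.volume: Counter = Counter()  # calculated data volume per group (S4)

    def add(self, group: int, count: int = 1) -> None:
        """Update the volume when records join a group."""
        self.volume[group] += count

    def remove(self, group: int, count: int = 1) -> None:
        """Update the volume when records leave a group (never below zero)."""
        self.volume[group] = max(0, self.volume[group] - count)

    def warnings(self) -> List[int]:
        """Groups whose (calculated - preset) difference is positive (S5)."""
        return sorted(
            g for g, limit in self.preset.items() if self.volume[g] - limit > 0
        )


monitor = GroupVolumeMonitor(preset={0: 3, 1: 5})
for g in [0, 0, 0, 0, 1, 1]:
    monitor.add(g)
flagged = monitor.warnings()        # group 0 holds 4 records against a preset of 3
monitor.remove(0)
flagged_after = monitor.warnings()  # the difference for group 0 is now zero
```

A difference of exactly zero does not trigger a warning here, following the rule that only a positive difference does.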

In step S6, the growth speeds of the data in the K groups are counted and the groups are sorted by the counted speeds, where the growth speed of the data is the amount by which the data increases per unit time.

In step S7, a critical value (threshold) of the growth speed is set, and the highest growth speed in the sorted sequence is compared with the critical value by subtracting the critical value from that highest growth speed.
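The growth-speed check of steps S6 to S8 can be sketched as follows. The function names `growth_rates` and `fastest_group_over`, the two-snapshot way of measuring the increase, and the sample numbers are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple


def growth_rates(
    earlier: Dict[int, int], later: Dict[int, int], elapsed: float
) -> List[Tuple[int, float]]:
    """Growth speed per group (increase in records per unit time),
    sorted from fastest to slowest (S6)."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    rates = {g: (later[g] - earlier.get(g, 0)) / elapsed for g in later}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def fastest_group_over(
    ranked: List[Tuple[int, float]], critical: float
) -> Optional[int]:
    """Return the fastest-growing group if (speed - critical) is positive
    (S7 and S8); otherwise return None."""
    if not ranked:
        return None
    group, rate = ranked[0]
    return group if rate - critical > 0 else None


# Two snapshots of group sizes taken 10 time units apart (made-up numbers).
before = {0: 100, 1: 200, 2: 50}
after = {0: 130, 1: 205, 2: 150}
ranked = growth_rates(before, after, elapsed=10.0)
alert_group = fastest_group_over(ranked, critical=8.0)
no_alert = fastest_group_over(ranked, critical=12.0)
```

Only the head of the sorted sequence is tested against the critical value, as in step S7; checking every group above the critical value would be a different policy.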

In step S3, after grouping, two groups are randomly selected for inspection: a similarity calculation is performed on the data within each selected group, and the calculated similarity value is compared with a preset value to judge the accuracy of the grouping.

When the calculated similarity is lower than the preset similarity, an early warning is given.

After the early warning, the data in the corresponding groups are extracted, the types of the data are analyzed, and the data are regrouped.
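The spot check on grouping accuracy can be sketched as below. The application does not specify the similarity measure, so the sketch assumes one: the mean of 1 / (1 + Euclidean distance) over all pairs in a group. The names `group_similarity` and `inspect_groups` and the preset value 0.5 are likewise hypothetical.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def group_similarity(members: Sequence[Point]) -> float:
    """Mean pairwise similarity in (0, 1]; 1.0 for groups with fewer than 2 members.

    Assumption: similarity = 1 / (1 + Euclidean distance).
    """
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    return sum(1.0 / (1.0 + math.dist(a, b)) for a, b in pairs) / len(pairs)


def inspect_groups(
    groups: Dict[int, List[Point]],
    preset_similarity: float,
    sample_size: int = 2,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Randomly pick groups and return those whose similarity is below the preset."""
    rng = rng or random.Random()
    chosen = rng.sample(sorted(groups), min(sample_size, len(groups)))
    return sorted(g for g in chosen if group_similarity(groups[g]) < preset_similarity)


groups = {
    0: [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)],     # tight group
    1: [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)],   # loose group
}
needs_regrouping = inspect_groups(groups, preset_similarity=0.5)
```

Groups returned by `inspect_groups` would be the ones to extract, analyze and regroup; how the regrouping is carried out is left open here, as it is in the text.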

Compared with the prior art, the invention has the following advantages. The data volume for the K groups is preset and the actual data volume in the K groups is calculated; the calculated data volume is compared with the preset data volume, and an early warning is given when the calculated data volume is larger than the preset data volume.

The scheme also counts the growth speed of the data in the K groups and sorts the groups by the counted speed. A critical value of the growth speed is set and compared with the highest growth speed in the sorted sequence; when that highest growth speed is greater than the critical value, the group is exported and an early warning is given. The scheme is therefore convenient to popularize and apply in industry.

Drawings

Fig. 1 is a flowchart of a first embodiment of a cluster-based big data attribute importance and identification degree early warning method according to the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, elements, modules, components, and/or combinations thereof.

Example one

Referring to fig. 1, the embodiment provides a clustering-based big data attribute importance and identification degree early warning method, which includes the following steps:

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a clustering algorithm;

In this embodiment, the clustering algorithm is one of a K-MEANS algorithm, a K-MEDOIDS algorithm, and a CLARANS algorithm.

In this embodiment, in step S3, the clustering algorithm changes the grouping through repeated iteration, so that each improved grouping scheme is better than the previous one, until the grouping of the data is complete.

In this embodiment, in step S4, when the data in a group changes, the data volume of that group is updated.

In this embodiment, in step S5, the calculated data volume is compared with the preset data volume by subtracting the preset data volume from the calculated data volume to obtain a difference. When the difference is positive, that is, when the calculated data volume is greater than the preset data volume, an early warning is given; otherwise, no early warning is given.

In this embodiment, in step S6, the growth speeds of the data in the K groups are counted and the groups are sorted by the counted speeds, where the growth speed of the data is the amount by which the data increases per unit time.

In this embodiment, in step S7, a critical value (threshold) of the growth speed is set, and the highest growth speed in the sorted sequence is compared with the critical value by subtracting the critical value from that highest growth speed.

In this embodiment, in step S3, after grouping, two groups are randomly selected for inspection: a similarity calculation is performed on the data within each selected group, and the calculated similarity value is compared with a preset value to judge the accuracy of the grouping. When the calculated similarity is lower than the preset similarity, an early warning is given. After the early warning, the data in the corresponding groups are extracted, the types of the data are analyzed, and the data are regrouped.

Example two

Referring to fig. 1, the early warning method for importance and identification of big data attributes based on clustering provided in this embodiment includes the following steps:

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a density-based method;
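Example two does not name a specific density-based algorithm. As one possible illustration, the sketch below uses DBSCAN, a widely used density-based method; the function `dbscan` and the parameters `eps` and `min_pts` are assumptions introduced here and do not come from the application.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
NOISE = -1


def dbscan(points: Sequence[Point], eps: float, min_pts: int) -> List[Optional[int]]:
    """Label each point with a cluster id (0, 1, ...) or NOISE."""

    def neighbours(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Unlike the partitioning methods of Example one, this approach does not take K as an input; the number of groups follows from `eps` and `min_pts`, and isolated points are reported as noise rather than forced into a group.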

Example three

Referring to fig. 1, in this embodiment, a big data attribute importance and identification degree early warning method based on clustering includes the following steps:

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a hierarchical method, which performs hierarchical decomposition of the given data set until a certain condition is satisfied;
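The hierarchical method is described only in general terms. The sketch below shows one common variant, bottom-up (agglomerative) clustering with single linkage, and assumes that the stopping condition is "K clusters remain"; both the variant and the stopping condition are assumptions, as are the names `single_linkage` and `agglomerate`.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def single_linkage(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Distance between two clusters: the distance of their closest pair."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points: Sequence[Point], k: int) -> List[List[Point]]:
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters: List[List[Point]] = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            (
                (a, b)
                for a in range(len(clusters))
                for b in range(a + 1, len(clusters))
            ),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.0, 2.0)]
result = agglomerate(pts, k=2)
```

A top-down (divisive) variant would start from one cluster and split it instead; the text does not say which direction is intended.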

Example four

Referring to fig. 1, a big data attribute importance and identification degree early warning method based on clustering includes the following steps:

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a grid-based method. The grid-based method first divides the data space into a grid structure with a finite number of cells, and all processing is performed on individual cells. A significant advantage of this approach is its fast processing speed, which is generally independent of the number of records in the target database and depends only on the number of cells into which the data space is divided. Representative algorithms include the STING algorithm, the CLIQUE algorithm and the WAVE-CLUSTER algorithm;
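The idea of working on cells rather than records can be sketched as follows. This is a simplified illustration for two-dimensional data, not an implementation of STING, CLIQUE or WAVE-CLUSTER; the names `bin_points` and `grid_clusters`, and the parameters `cell_size` and `min_count`, are assumptions.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Cell = Tuple[int, int]


def bin_points(points: Iterable[Tuple[float, float]], cell_size: float) -> Dict[Cell, int]:
    """Single pass over the records: count how many fall into each grid cell."""
    counts: Dict[Cell, int] = defaultdict(int)
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return dict(counts)


def grid_clusters(counts: Dict[Cell, int], min_count: int) -> List[Set[Cell]]:
    """Group dense cells that share an edge; works on cells only, not records."""
    dense = {cell for cell, n in counts.items() if n >= min_count}
    seen: Set[Cell] = set()
    clusters: List[Set[Cell]] = []
    for start in sorted(dense):
        if start in seen:
            continue
        component: Set[Cell] = set()
        queue = deque([start])
        seen.add(start)
        while queue:  # breadth-first search over adjacent dense cells
            cx, cy = queue.popleft()
            component.add((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(component)
    return clusters


pts = [(0.2, 0.3), (0.7, 0.1), (1.1, 0.4), (1.6, 0.9), (5.5, 5.5), (5.1, 5.9), (9.0, 0.0)]
counts = bin_points(pts, cell_size=1.0)
clusters = grid_clusters(counts, min_count=2)
```

After the single binning pass, `grid_clusters` touches only the occupied cells, which is why the cost of that stage depends on the number of cells rather than on the number of records.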

Example five

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a model-based method, which assumes a model for each cluster and then searches for the data that best fit that model;
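The application does not say which model is assumed for each cluster. One common choice is a Gaussian mixture fitted by expectation-maximization (EM); the sketch below fits two one-dimensional Gaussians. The model choice, the function names, the initialisation from the minimum and maximum values, and the variance floor are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(
    xs: Sequence[float], iterations: int = 50
) -> Tuple[List[int], List[float]]:
    """EM for a two-component 1-D Gaussian mixture.

    Returns (labels, means). Assumption: components start at min(xs)
    and max(xs) with unit variance and equal weight.
    """
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp: List[List[float]] = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in xs:
            dens = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted members.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
            weights[k] = nk / len(xs)
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return labels, means


values = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
labels, means = fit_two_gaussians(values)
```

Each value is assigned to the component that explains it best, which corresponds to finding the data that fit each assumed model.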

Comparative example one

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a clustering algorithm;

Step S5, comparing the calculated data volume with the preset data volume, and giving an early warning when the calculated data volume is larger than the preset data volume.

Comparative example two

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a clustering algorithm;

Step S6, counting the growth speed of the data in the K groups, and sorting the groups by the counted speed.

Comparative example three

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a clustering algorithm;

Step S7, setting a critical value of the growth speed, and comparing the highest growth speed in the sorted sequence with the critical value.

Experimental example one

Experiments were conducted on the accuracy of the data grouping methods set forth in examples one, two, three, four and five and comparative examples one, two and three, and the results are shown in the following table:

The data grouping methods proposed in examples one, two, three, four and five and comparative examples one, two and three group the data as follows:

Technical means not described in detail in the present application are known techniques.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A big data attribute importance and identification degree early warning method based on clustering is characterized by comprising the following steps:

Step S1, collecting big data, wherein the collected data set is N;

Step S3, grouping the data through a clustering algorithm;

2. The early warning method of big data attribute importance and identification degree based on clustering as claimed in claim 1, wherein the clustering algorithm is one of a K-MEANS algorithm, a K-MEDOIDS algorithm and a CLARANS algorithm.

3. The early warning method for importance and identification of big data attribute based on clustering as claimed in claim 1, wherein in said step S3, clustering algorithm changes grouping by iterative method, so that the grouping scheme after each improvement is better than the previous one, and completes the grouping of data.

4. The early warning method for importance and identification of big data attribute based on clustering as claimed in claim 1, wherein in step S4, when there is data change in the group, the data amount is updated.

5. The method as claimed in claim 1, wherein in step S5, the calculated data amount is compared with a preset data amount, that is, the preset data amount is subtracted from the calculated data amount to obtain a difference value, and when the difference value is positive, that is, the calculated data amount is greater than the preset data amount, the early warning is performed, otherwise, no early warning is performed.

6. The early warning method for importance and identification of big data attribute based on clustering as claimed in claim 1, wherein in said step S6, the growth rate of data in K groups is counted and sorted according to the counted growth rate, and the growth rate of data is: the amount of increase of data per unit time.

7. The early warning method for importance and identification of big data attribute based on clustering as claimed in claim 1, wherein in step S7, a critical value (threshold) of the growth speed is set, the highest growth speed in the sorted sequence is compared with the critical value, and the critical value is subtracted from the highest growth speed in the sorted sequence.

8. The early warning method for importance and identification of big data attributes based on clustering according to claim 1, wherein in step S3, after grouping, two groups are randomly selected for inspection, similarity calculation is performed on the data in the group, the calculated similarity value is compared with a preset value, and the accuracy of grouping is judged.

9. The early warning method for importance and identification of big data attributes based on clustering according to claim 8, wherein when the calculated similarity is lower than the preset similarity, the early warning is performed.

10. The early warning method of big data attribute importance and identification degree based on clustering according to claim 9, characterized in that after early warning, data in corresponding groups are extracted, the types of data are analyzed, and the groups are regrouped.