CN113806610A - Data clustering method, system and storage medium - Google Patents

Data clustering method, system and storage medium Download PDF

Info

Publication number
CN113806610A
CN113806610A
Authority
CN
China
Prior art keywords
clustering
data
result
entropy
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111156414.XA
Other languages
Chinese (zh)
Inventor
邓少冬
盛龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Mix Intelligent Technology Co ltd
Original Assignee
Xi'an Mix Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Mix Intelligent Technology Co ltd filed Critical Xi'an Mix Intelligent Technology Co ltd
Priority to CN202111156414.XA priority Critical patent/CN113806610A/en
Priority to PCT/CN2021/123007 priority patent/WO2023050461A1/en
Publication of CN113806610A publication Critical patent/CN113806610A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data clustering method, system, and storage medium, comprising the following steps: determining a data clustering condition; clustering data according to the data clustering condition to obtain at least one first clustering result, and calculating the entropy load of each first clustering result, where the entropy load represents the average amount of information carried by the corresponding first clustering result; and taking the maximum of these entropy loads, whose corresponding first clustering result is the data clustering result. The invention clusters the data as a whole, achieving completeness of data clustering and yielding more complete and accurate clustering results. In addition, the clustering process neither depends on nor specially processes any particular kind of data, and places no restriction on the data type, so the method is generally applicable to clustering any data. The maximum average amount of information carried is used as the basis for determining the clustering result: for a computer system with a given storage space, the more information it can store, the higher its information-expression efficiency.

Description

Data clustering method, system and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data clustering method, a data clustering system and a storage medium.
Background
With the development and popularization of the internet in recent years, both the quantity of data such as images, videos, and texts and the dimensionality used to represent that data have grown. To make use of this massive data, high-dimensional data must be clustered quickly and effectively, which has given rise to a large number of clustering algorithms.
Clustering, one of the important research subjects in machine learning, has been widely applied in fields such as data mining, face recognition, medical image analysis, and image segmentation. Image clustering divides target data with completely unknown labels into different clusters; it is an exploratory technique that groups data by their features, is commonly used to sort image information or to generate training sample labels, and is a common image processing means.
Conventional image clustering methods generally cluster images based on features extracted from them using a traditional clustering algorithm, for example the K-Means clustering algorithm or Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
Taking K-Means as an example, the conventional K-Means algorithm takes as input a sample set, the number of clusters K, and a maximum number of iterations N, and finally outputs a cluster partition. The general process is: select K objects from the data as the initial cluster centers; compute the distance from each object to each cluster center and assign the object accordingly; recompute each cluster center; and evaluate the standard measure function, stopping when the maximum number of iterations is reached and continuing otherwise.
However, based on the above process, the K-Means algorithm has the following major disadvantages:
a. the value of K is difficult to determine, because it is impossible to know in advance into how many categories a given sample set is best divided;
b. K-Means is iterative, so the result obtained is only a locally optimal clustering and lacks completeness;
c. it is sensitive to noise and outliers;
d. the mean of the sample set must exist, which restricts the data types it can handle;
e. the clustering result depends on the initialization of the cluster centers, yet the initial cluster centers are selected at random.
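Drawbacks (b) and (e) in particular are easy to reproduce. The sketch below is a minimal 1-D Lloyd's-style K-Means, written only to illustrate those drawbacks; it is not the patent's method, and the data points and initializations are invented for the example:

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain 1-D Lloyd's K-Means; returns (clusters, centers, sse)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest current center
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: a local optimum (drawback b)
            break
        centers = new_centers
    sse = sum((p - centers[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, centers, sse

data = [1, 2, 10, 11, 20, 21]       # three obvious pairs
bad = kmeans_1d(data, [1, 2, 3])    # all centers in one group (drawback e)
good = kmeans_1d(data, [1, 10, 20]) # one center near each natural group
print(bad[0], bad[2])    # a local optimum: one oversized cluster, high SSE
print(good[0], good[2])  # the natural pairs, low SSE
```

Run with the two initializations above, the same data converge to different partitions, which is exactly the initialization sensitivity and lack of completeness the background describes.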
The applicant has also studied other clustering algorithms thoroughly and found that, like K-Means, other traditional clustering algorithms involve too much dependence on and special processing of particular data, so they are neither universally applicable to nor complete for data clustering, and the field still lacks sufficient exploration of clustering methods that overcome this lack of universal applicability and completeness.
Disclosure of Invention
The invention aims to provide a data clustering method, system, and storage medium that solve the technical problem that traditional clustering algorithms in this field lack completeness and universal applicability.
In order to achieve the above object, an embodiment of the present invention provides a data clustering method, where the method includes:
determining a data clustering condition;
clustering data according to the data clustering condition to obtain at least one first clustering result, wherein each of the at least one first clustering result comprises at least one data set; calculating an entropy load corresponding to each first clustering result, wherein the entropy load represents the average amount of information carried by the corresponding first clustering result;
and taking the maximum of the entropy loads corresponding to the first clustering results, wherein the first clustering result corresponding to the maximum entropy load is the data clustering result.
Preferably, the data clustering condition is determined according to the similarity between data.
Preferably, clustering data according to the data clustering condition includes: clustering the data according to the combination relationship of data of different dimensions.
Further preferably, the combination relationship of data of different dimensions is determined according to the dimensions of interest for data clustering, and includes: fixing the data of the dimensions not of interest, and combining and traversing the data of the dimensions of interest.
Further preferably, clustering the data according to the combination relationship of data of different dimensions specifically includes:
(v1, v2, v3, ……, vj),
vj = {amj} = a1j, a2j, ……, amj
where vj is the data of the j-th dimension; the differences between the data vj, arranged from smallest to largest, form the sequence {amj}; amj is the m-th item of {amj} and represents the maximum difference between the data vj, while a1j represents the minimum difference between the data vj. When vj is a dimension not of interest for data clustering, vj takes at least one arbitrary value from {amj}; when vj is a dimension of interest, vj traverses the items of {amj} in order, and when vj takes a later item for clustering, clustering proceeds further on the basis of the first clustering result obtained when vj took the preceding item.
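For a single dimension of interest, one concrete reading of "taking an item amj and clustering" is to group values whose neighbouring differences do not exceed amj. The sketch below is a hedged interpretation, not the patent's exact rule (which it does not spell out); the function name is invented:

```python
def cluster_1d(values, max_gap):
    """Group sorted values so that consecutive members of a set
    differ by at most max_gap (a single-linkage pass along one dimension)."""
    ordered = sorted(values)
    sets = [[ordered[0]]]
    for v in ordered[1:]:
        if v - sets[-1][-1] <= max_gap:
            sets[-1].append(v)   # still within the allowed difference
        else:
            sets.append([v])     # gap too large: start a new set
    return sets

h = [3, 4, 5, 20, 21, 40]
print(cluster_1d(h, 1))   # [[3, 4, 5], [20, 21], [40]]
print(cluster_1d(h, 20))  # coarser: fewer, larger sets
```

Because a larger difference can only merge existing sets in one dimension, clustering at a later item of {amj} on top of the previous result and clustering from scratch coincide here; the traversal order in the text matches that nesting.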
Preferably, the entropy load is calculated as follows:

$$I = -\sum_{i=1}^{n} p_i \log_a p_i, \qquad p_i = \frac{k_i}{N}$$

where amj is the m-th item of the sequence {amj}, in which the differences between the j-th-dimension data vj are arranged from smallest to largest; a is the base of the logarithm, a > 1; the entropy load I represents the average amount of information carried by the first clustering result obtained when vj takes the m-th item amj of {amj} for clustering; n is the number of data sets contained in that first clustering result; ki is the number of elements in the i-th data set; N is the total number of data; and pi is the ratio of the number of elements in the i-th data set to the total number of data.
Further preferably, a = 2, so that the calculated entropy load is expressed in bits, the binary unit of measure of average information.
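With a = 2 the entropy load is the Shannon entropy of the partition in bits. A minimal sketch (the function name is ours):

```python
from math import log2

def entropy_load(cluster_sizes):
    """Entropy load I = -sum(p_i * log2(p_i)), p_i = k_i / N, in bits."""
    N = sum(cluster_sizes)
    return -sum((k / N) * log2(k / N) for k in cluster_sizes)

# Four equal sets of 2 out of N = 8: each p_i = 1/4, so I = 2 bits.
print(entropy_load([2, 2, 2, 2]))  # 2.0
```

As the text notes, a partition with higher entropy load lets a fixed-length category code carry more average information.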
Preferably, the method comprises a step of forming an information structure tree, comprising:
re-determining the data clustering condition; further clustering a given data set in the data clustering result according to the new data clustering condition, by executing the above clustering method, to obtain a new maximum entropy load, where the clustering result corresponding to the new maximum entropy load comprises several subset combinations whose corresponding information is subdivision information of that data set; and taking that data set as a parent node and the subset combinations as child nodes, so that an information structure tree is formed level by level.
Preferably, the method comprises a step of forming a clustering process tree, comprising:
when vj traverses the items of the sequence {amj} in order for clustering, the first clustering result obtained when vj takes aqj is placed at level q of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained when vj takes amj is the root node of the clustering process tree, and the first clustering result obtained when vj takes a1j consists of the leaf nodes of the tree, whose degree is zero; each set at level q serves as a parent node, and all elements clustered at level q-1 that form that set serve as its child nodes, so that a clustering process tree is formed level by level.
In order to achieve the above object, another embodiment of the present invention provides a data clustering system, which includes a memory, a processor, and a program stored in the memory and executable on the processor, where the program, when executed by the processor, implements the steps of the above data clustering method.
To achieve the above object, another embodiment of the present invention provides a computer-readable storage medium, wherein: the storage medium stores at least one program executable by at least one processor, and the at least one program, when executed by the at least one processor, implements the steps of a method for clustering data as described above.
The data clustering method, the data clustering system and the data clustering storage medium have the following beneficial effects:
(1) The data clustering method, system and storage medium provided by the invention cluster the data as a whole according to the data clustering condition to obtain at least one first clustering result, and obtain the data clustering result from the first clustering result carrying the largest average amount of information, thereby achieving completeness of data clustering and yielding more complete and accurate clustering results. In addition, the clustering process neither depends on nor specially processes any particular kind of data, and places no restriction on the data type, so the method is generally applicable to clustering any data and is highly practical. The maximum average amount of information carried is used as the basis for determining the clustering result: for a computer system with a given storage space, the more information it can store, the higher its information-expression efficiency;
(2) the data clustering method, system and storage medium provided by the invention cluster the at least one first clustering result again on its own basis to obtain its local subdivision information, thereby unifying the completeness and locality of data clustering;
(3) the data clustering method, system and storage medium provided by the invention form an information structure tree; the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a given clustering condition, so the amount of information a computer system with a given storage space can store is the largest, and the information-expression efficiency is likewise the highest;
(4) the data clustering method, system and storage medium provided by the invention also form a clustering process tree during clustering; the clustering process tree clusters and distinguishes the data continuously from coarse to fine according to the granularity of the dimension data of interest, intuitively reflects, level by level, all information of each single data point, and makes all clustering information of the data traceable.
Drawings
FIG. 1 is a schematic flow chart of a data clustering method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application scenario with 12 data points according to the second embodiment of the data clustering method, system and storage medium of the present invention;
FIG. 3 is a schematic diagram of the result of clustering with a data value difference of 1 according to the second embodiment;
FIG. 4 is a schematic diagram of the result of clustering with a data value difference of 2 according to the second embodiment;
FIG. 5 is a schematic diagram of the result of clustering with a data value difference of 3 according to the second embodiment;
FIG. 6 is a schematic diagram of the result of clustering with a data value difference of 4 according to the second embodiment;
FIG. 7 is a schematic diagram of an application scenario with 11 ordered data points according to the third embodiment;
FIG. 8 is a schematic structural diagram of the clustering process tree according to the third embodiment;
FIG. 9 is a schematic diagram of the result of clustering with a data value difference of 1 according to the third embodiment;
FIG. 10 is a schematic diagram of the result of clustering with a data value difference of 2 according to the third embodiment;
FIG. 11 is a schematic diagram of the result of clustering with a data value difference of 3 according to the third embodiment;
FIG. 12 is a schematic diagram of the result of clustering with a data value difference of 4 according to the third embodiment;
FIG. 13 is a schematic diagram of an application scenario according to the fourth embodiment;
FIG. 14 is a schematic diagram of an application scenario with 156 ordered data points according to the fourth embodiment;
FIG. 15 is a schematic structural diagram of the clustering process tree according to the fourth embodiment;
FIG. 16 is a schematic diagram of the result of clustering with a data value difference of 0 according to the fourth embodiment;
FIG. 17 is a schematic diagram of the result of clustering with a data value difference of 1 according to the fourth embodiment;
FIG. 18 is a schematic diagram of the result of clustering with a data value difference of 2 according to the fourth embodiment;
FIG. 19 is a schematic diagram of the result of clustering with a data value difference of 3 according to the fourth embodiment;
FIG. 20 is a schematic diagram of the result of clustering with a data value difference of 4 according to the fourth embodiment;
FIG. 21 is a schematic diagram of the result of clustering the data of the "cup" set with a data value difference of 0 according to the fourth embodiment;
FIG. 22 is a schematic diagram of the result of clustering the data of the "cup" set with a data value difference of 2 according to the fourth embodiment;
FIG. 23 is a schematic diagram of the result of clustering the data of the "cup" set with a data value difference of 4 according to the fourth embodiment;
FIG. 24 is a schematic structural diagram of the information structure tree according to the fourth embodiment of the data clustering method, system and storage medium of the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and specific embodiments. The specific embodiments should not be construed as limiting the invention: those skilled in the art can make several simple deductions or substitutions without departing from the spirit of the invention, and all such variants shall fall within its scope of protection.
Example one
An embodiment of the present invention provides a data clustering method, as shown in fig. 1, including the following steps:
(1) determining a data clustering condition, comprising the following steps:
determining factors affecting similarity between data;
determining a data dimension of interest for a data cluster from a plurality of factors;
determining a combination relation of different dimensional data;
and determining the clustering condition of the data according to the combination relation of the dimension data.
The data clustering condition is determined based on the similarity between data, which is often influenced by factors of multiple dimensions; this embodiment therefore clusters the data according to the following combination relationship of data of different dimensions:
(v1, v2, v3, ……, vj),
vj = {amj} = a1j, a2j, ……, amj
The combination relationship is determined according to the dimensions of interest for data clustering, and includes: fixing the data of the dimensions not of interest, and combining and traversing the data of the dimensions of interest.
Here vj is the data of the j-th dimension; the differences between the data vj, arranged from smallest to largest, form the sequence {amj}; amj is the m-th item of {amj} and represents the maximum difference between the data vj, while a1j represents the minimum difference between the data vj. When vj is a dimension not of interest for data clustering, vj takes at least one arbitrary value from {amj}; when vj is a dimension of interest, vj traverses the items of {amj} in order, and when vj takes a later item for clustering, clustering proceeds further on the basis of the first clustering result obtained when vj took the preceding item.
(2) Cluster the data according to the data clustering condition to obtain at least one first clustering result, each comprising at least one data set, and calculate the entropy load corresponding to each first clustering result, where the entropy load represents the average amount of information carried by the corresponding first clustering result. The entropy load is calculated as follows:
$$I = -\sum_{i=1}^{n} p_i \log_a p_i, \qquad p_i = \frac{k_i}{N}$$

where amj is the m-th item of the sequence {amj}, in which the differences between the j-th-dimension data vj are arranged from smallest to largest; a is the base of the logarithm, a > 1; the entropy load I represents the average amount of information carried by the first clustering result obtained when vj takes the m-th item amj of {amj} for clustering; n is the number of data sets contained in that first clustering result; ki is the number of elements in the i-th data set; N is the total number of data; and pi is the ratio of the number of elements in the i-th data set to the total number of data.
In the first embodiment of the invention, preferably a = 2, so that the calculated entropy load is expressed in bits, the binary unit of measure of average information.
Each clustering yields several data sets, each corresponding to a data category. When a computer system stores the clustering result, each data category corresponds to a fixed-length code; the average amount of information each code can store is fixed, and correspondingly so is each code's information-expression efficiency. It is desirable for the fixed-length code to store the maximum average amount of information, so that information-expression efficiency is highest.
The entropy load I represents the average amount of information carried by the clustering result of the current clustering. The larger I is, the larger the average amount of information of each data category in the current clustering result, the larger the average amount of information the code corresponding to each data category can store, and the higher the information-expression efficiency of each such code; thus, for a computer system with a given storage space, the more information can be stored and the higher the information-expression efficiency.
(3) Take the maximum entropy load Imax among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to Imax:

$$I_{\max} = \max\{I_1, I_2, \ldots, I_m\}$$

where Imax, the maximum entropy load, represents the maximum average amount of information carried by the clustering results obtained under the clustering condition; a computer system with a given storage space then stores the largest amount of information with the highest information-expression efficiency, so the clustering result corresponding to Imax is the desired one.
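Steps (1) to (3) can be sketched end to end for one dimension of interest: traverse the sorted pairwise differences, cluster at each, score each partition by its entropy load, and keep the maximum. This is a hedged reading of the method; the helper names and the gap-based grouping rule are our assumptions, and ties between equal entropy loads are broken arbitrarily:

```python
from math import log2

def entropy_load(sizes):
    """I = -sum(p_i * log2 p_i), p_i = k_i / N (bits)."""
    N = sum(sizes)
    return -sum((k / N) * log2(k / N) for k in sizes)

def cluster_1d(values, max_gap):
    """Group sorted values whose neighbouring gap is at most max_gap."""
    ordered = sorted(values)
    sets = [[ordered[0]]]
    for v in ordered[1:]:
        if v - sets[-1][-1] <= max_gap:
            sets[-1].append(v)
        else:
            sets.append([v])
    return sets

def best_clustering(values):
    """Traverse every distinct pairwise difference, smallest to largest,
    and keep the partition whose entropy load is maximal (Imax)."""
    ordered = sorted(values)
    diffs = sorted({b - a for a in ordered for b in ordered if b > a})
    scored = [(entropy_load([len(s) for s in cluster_1d(values, d)]), d)
              for d in diffs]
    best_I, best_d = max(scored)  # ties broken arbitrarily here
    return best_d, cluster_1d(values, best_d), best_I

d, sets, I = best_clustering([1, 2, 3, 20, 21, 22, 40, 41])
print(sets)  # the three natural groups carry the most information
```

Note that no number of clusters is supplied up front; the partition with maximal entropy load falls out of the traversal, which is the contrast the disclosure draws with K-Means.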
After steps (1), (2) and (3) complete one clustering, the data clustering method of this embodiment may further include a step of forming an information structure tree, specifically:
re-determining the data clustering condition; further clustering a given data set in the data clustering result according to the new data clustering condition, by executing the above clustering method, to obtain a new maximum entropy load, where the clustering result corresponding to the new maximum entropy load comprises several subset combinations whose corresponding information is subdivision information of that data set; and taking that data set as a parent node and the subset combinations as child nodes, so that an information structure tree is formed level by level.
The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
The data clustering method according to the first embodiment may further include a step of forming a clustering process tree, specifically:
when vj traverses the items of the sequence {amj} in order for clustering, the first clustering result obtained when vj takes aqj is placed at level q of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained when vj takes amj is the root node of the clustering process tree, and the first clustering result obtained when vj takes a1j consists of the leaf nodes, whose degree is zero; each set at level q serves as a parent node, and all elements clustered at level q-1 that form that set serve as its child nodes, so the clustering process tree is formed level by level. When vj traverses the items of {amj} in order for clustering, the clustering process tree clusters and distinguishes the data continuously from coarse to fine according to the granularity of the dimension data of interest vj; it intuitively reflects, level by level, all clustering information of each single data point, and makes all clustering information of the data traceable.
Example two
HSV is a color space created according to the intuitive characteristics of color, also known as the hexagonal cone model. Its color parameters are hue (H), saturation (S) and value (V), with ranges H: 0-180, S: 0-255, V: 0-255. An image consists of many data points, each with an H value, an S value and a V value.
As shown in fig. 2, the second embodiment of the present invention provides a data clustering method that clusters the hue h values of 12 scattered, unordered data points, comprising the following steps:
(1) determining the condition of data clustering, specifically:
The similarity between the data in this embodiment is influenced by a factor of only one dimension: the difference Δh between hue h values, so the data clustering condition of this embodiment is to cluster the data according to Δh:
v1 = Δh = {am1} = a11, a21, ……, am1
= 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25;
where v1 is the data of the 1st dimension, Δh; the differences Δh between the data v1, arranged from smallest to largest, form the sequence {am1}; am1 is the m-th item of {am1}; am1 = 25 represents the maximum difference between the data h values, and a11 = 1 represents the minimum difference between them. Δh is the dimension of interest for data clustering in this embodiment; its value traverses the items of {am1} in order, and when Δh takes a later item for clustering, clustering proceeds further on the basis of the clustering result obtained when Δh took the preceding item.
Therefore, the data clustering condition of the second embodiment is: traverse, in order, the items of the sequence of differences between the hue h values of the data points, Δh = {am1} = 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, and when Δh takes a later item for clustering, cluster further on the basis of the clustering result obtained when Δh took the preceding item.
(2) Cluster the data according to the data clustering condition, and after clustering calculate the entropy load:

$$I = -\sum_{i=1}^{n} p_i \log_a p_i, \qquad p_i = \frac{k_i}{N}$$

where am1 is the m-th item of the sequence {am1}, in which the differences Δh between the 1st-dimension data v1 are arranged from smallest to largest; a is the base of the logarithm, a > 1; the entropy load I represents the average amount of information carried by the first clustering result obtained when v1 takes the m-th item am1 of {am1} for clustering; n is the number of data sets contained in that first clustering result; ki is the number of elements in the i-th data set; N is the total number of data; and pi is the ratio of the number of elements in the i-th data set to the total number of data.
The preferred value in the second embodiment of the invention is a = 2; the entropy load calculated in this way is expressed in bits, the binary unit of measure of the average information quantity.
The result obtained by each clustering is a number of data sets, and each data set corresponds to a data category. When a computer system stores the clustering result, each data category corresponds to a code of fixed length; the average information quantity that each code can store is fixed, and correspondingly the information-expression efficiency of each code is also fixed. It is desirable that a code of fixed length store the maximum average information quantity, so that the information-expression efficiency is highest.
The entropy load I_m represents the average information quantity carried by the clustering result obtained by the current clustering. The larger I_m is, the larger the average information quantity of each data category in the current clustering result, the larger the average information quantity that can be stored by the code corresponding to each data category, and the higher the information-expression efficiency of that code; accordingly, for a computer system with a fixed storage space, the larger the amount of information that can be stored, the higher its information-expression efficiency.
Clustering the data according to the data clustering conditions of the second embodiment of the invention is specifically as follows:
S201, cluster with Δh = 1, i.e. cluster data points whose hue h values differ by 1 into one set. A first clustering result is obtained after clustering, which contains five data sets, as shown in fig. 3, where n = 5 and N = 12; the entropy load at this time is

I_1 = -Σ_{i=1}^{5} (k_i/12) · log_2(k_i/12).
S202, on the basis of the clustering result of Δh = 1, cluster with Δh = 2, i.e. cluster data points whose hue h values differ by 2 into one set. A first clustering result is obtained after clustering, which contains three data sets, as shown in fig. 4, where n = 3 and N = 12; the entropy load at this time is

I_2 = -Σ_{i=1}^{3} (k_i/12) · log_2(k_i/12).
S203, on the basis of the clustering result of Δh = 2, cluster with Δh = 3, i.e. cluster data points whose hue h values differ by 3 into one set. A first clustering result is obtained after clustering, which contains two data sets, as shown in fig. 5, where n = 2 and N = 12; the entropy load at this time is

I_3 = -Σ_{i=1}^{2} (k_i/12) · log_2(k_i/12).
S204, on the basis of the clustering result of Δh = 3, cluster with Δh = 4, i.e. cluster data points whose hue h values differ by 4 into one set. A first clustering result is obtained after clustering, which contains one data set, as shown in fig. 6: all the data points are clustered into a single set, where n = 1 and N = 12; the entropy load at this time is

I_4 = -(12/12) · log_2(12/12) = 0.
In this embodiment, clustering with Δh = 5 to 25 gives the same result as clustering with Δh = 4 in step S204, and the same entropy load, so the details are not repeated.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to I_max, specifically:

I_max = max{I_1, I_2, ..., I_m} = I_3;

wherein I_max is the maximum entropy load, representing the maximum average information quantity carried by the clustering results obtained by clustering according to the clustering conditions. I_3, the entropy load obtained by clustering with Δh = 3, is the maximum; for a computer system with a fixed storage space, the clustering corresponding to I_3 can store the largest amount of information and has the highest information-expression efficiency, so the clustering result corresponding to the maximum entropy load I_3 is the expected one.
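The whole procedure of the second embodiment (traverse Δh from the smallest gap upward, further cluster the previous result, score each first clustering result by its entropy load, and keep the maximum) can be sketched as follows; the hue values and the function names are hypothetical illustrations, not the data of figs. 3 to 6.

```python
import math

def entropy_load(sizes, n_total, base=2):
    # I = -sum_i p_i * log_a(p_i), with p_i = k_i / N
    return -sum((k / n_total) * math.log(k / n_total, base) for k in sizes)

def best_clustering_1d(values, base=2):
    """Traverse Δh over the sorted distinct pairwise gaps; at each step,
    further merge adjacent sets of the previous result whose boundary gap
    is <= Δh, and keep the clustering whose entropy load I is maximal."""
    data = sorted(values)
    gaps = sorted({b - a for a, b in zip(data, data[1:])})
    clusters = [[v] for v in data]           # start from single data points
    best_i, best = float("-inf"), [list(data)]
    for dh in gaps:                          # coarse-to-fine traversal of Δh
        merged = [clusters[0]]
        for c in clusters[1:]:
            if c[0] - merged[-1][-1] <= dh:  # boundary gap within Δh: one set
                merged[-1] = merged[-1] + c
            else:
                merged.append(c)
        clusters = merged                    # cluster on top of previous result
        i_val = entropy_load([len(c) for c in clusters], len(data), base)
        if i_val > best_i:
            best_i, best = i_val, [list(c) for c in clusters]
    return best_i, best

# Hypothetical hue h values forming two well-separated groups:
i_max, result = best_clustering_1d([1, 2, 3, 20, 21, 22])
print(i_max, len(result))  # 1.0 2  (the max-entropy result keeps both groups)
```

Merging everything into one set would give an entropy load of 0, so the maximum-entropy criterion stops the coarsening at the two-group clustering.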
In the second embodiment of the present invention, the method of clustering one-dimensional data is illustrated only by the example of the differences Δh between the hue h values of the data points; in essence, the data clustering method, system and storage medium of the present invention are applicable to the clustering of any one-dimensional data.
EXAMPLE III
As shown in fig. 7, a third embodiment of the present invention provides a data clustering method that clusters the data of 11 ordered data points in an orthogonal coordinate system, namely the hue h value, the x coordinate value and the y coordinate value, by the following steps:
(1) determining the condition of data clustering, specifically:
The similarity between the data in the third embodiment of the invention is influenced by factors of two dimensions: the differences Δh between the hue h values and the differences Δx between the x coordinate values. Therefore, the condition for data clustering in the third embodiment is that the data are clustered according to the combination relation of Δh and Δx:

(v_1, v_2),
v_1 = Δh,
v_2 = Δx;

in the third embodiment, the concerned dimension data of the clustering is Δh and the unconcerned dimension data is Δx, so the combination relation is to fix Δx and traverse Δh for clustering. For Δh:

v_1 = Δh = {a_{m1}} = a_{11}, a_{21}, ..., a_{m1} = 0,1,2,3,4,5,6,7;
wherein v_1 is the data of the 1st dimension, Δh; the differences Δh between the data are arranged from small to large as the sequence {a_{m1}}, and a_{m1} is the m-th item of {a_{m1}}; a_{m1} = 7 indicates that the maximum difference between the h values of the data is 7, and a_{11} = 0 indicates that the minimum difference is 0. Δh is the dimension data concerned by the data clustering of the third embodiment of the present invention, so the value of Δh traverses each item of {a_{m1}} in the order of the items; when Δh takes the latter item for clustering, further clustering is performed on the basis of the clustering result obtained when Δh took the former item.
For Δ x:
v_2 = Δx = {a_{m2}} = a_{12}, a_{22}, ..., a_{m2} = 1,2,3,4,5,6,7,8,9,10;

wherein v_2 is the data of the 2nd dimension, Δx; the differences Δx between the data are arranged from small to large as the sequence {a_{m2}}, and a_{m2} is the m-th item of {a_{m2}}; a_{m2} = 10 indicates that the maximum difference between the Δx data is 10, and a_{12} = 1 indicates that the minimum difference is 1. Δx is dimension data not concerned by the data clustering of the third embodiment of the present invention, so the value of Δx may be any one item of the sequence {a_{m2}}; the third embodiment takes the first item of {a_{m2}}, so Δx = 1.
Therefore, the condition for data clustering in the third embodiment of the invention is: fix Δx = 1, and according to the differences Δh between the hue h values of the data points, traverse each item of the sequence Δh = {a_{m1}} = 0,1,2,3,4,5,6,7 in order; when Δh takes the latter item for clustering, further clustering is performed on the basis of the clustering result obtained when Δh took the former item.
When the value of Δh traverses each item of the sequence {a_{m1}} in order for clustering, the clustering result obtained when Δh takes a_{81} = 7 is placed in the 7th level of the clustering process tree and is the root node of the tree; the clustering result obtained when Δh takes a_{11} = 0 is placed in the 1st level of the tree and forms its leaf nodes, whose degree is zero; when a certain set in the 2nd level serves as a parent node, all the elements that form that set in the 1st level serve as its child nodes, and the clustering process tree is formed level by level in this way, as shown in fig. 8. When Δh traverses each item of the sequence {a_{m1}} in order for clustering, the clustering process tree embodies the process of continuously clustering and distinguishing the data from coarse to fine according to the granularity of the concerned dimension data Δh; the tree can intuitively reflect all the information of the level-by-level clustering of single data points, so that all the clustering information of the data can be traced back to its source.
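The clustering process tree described above can be sketched by storing the clustering result of each Δh as one level of the tree; the list-based layout and names are hypothetical, and the merge rule is the simple 1-D gap rule used earlier in this description.

```python
def process_tree_levels(values, thresholds):
    """Level m holds the clustering obtained with the m-th Δh; every set in
    level m+1 contains, as its child nodes, the level-m sets it absorbed,
    so the last level holds the single root set."""
    data = sorted(values)
    levels = []
    clusters = [[v] for v in data]
    for dh in thresholds:                    # traverse Δh from fine to coarse
        merged = [clusters[0]]
        for c in clusters[1:]:
            if c[0] - merged[-1][-1] <= dh:  # within Δh: same set
                merged[-1] = merged[-1] + c
            else:
                merged.append(c)
        clusters = merged
        levels.append([list(c) for c in clusters])
    return levels

levels = process_tree_levels([1, 2, 3, 20, 21, 22], [1, 17])
print(levels[0])  # leaf-side level: [[1, 2, 3], [20, 21, 22]]
print(levels[1])  # root level:      [[1, 2, 3, 20, 21, 22]]
```

Reading the levels from last to first reproduces the coarse-to-fine refinement the tree is meant to make traceable.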
(2) Cluster the data according to the data clustering conditions, and calculate the entropy load after each clustering:

I_m = -Σ_{i=1}^{n} p_i · log_a(p_i),  p_i = k_i / N;

wherein a_{m1} is the m-th item of the sequence {a_{m1}}, and {a_{m1}} is the sequence of the differences Δh of the 1st-dimension data v_1 arranged from small to large; a is the base of the logarithmic function, a > 1; the entropy load I_m denotes the average information quantity carried by the first clustering result obtained when v_1 takes the m-th item a_{m1} of the sequence {a_{m1}} for clustering; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set, N is the total number of data points, and p_i is the ratio of the number of elements in the i-th data set to the total number of data points.
The preferred value in the third embodiment of the invention is a = 2; the entropy load calculated in this way is expressed in bits, the binary unit of measure of the average information quantity.
The result obtained by each clustering is a number of data sets, and each data set corresponds to a data category. When a computer system stores the clustering result, each data category corresponds to a code of fixed length; the average information quantity that each code can store is fixed, and correspondingly the information-expression efficiency of each code is also fixed. It is desirable that a code of fixed length store the maximum average information quantity, so that the information-expression efficiency is highest.
The entropy load I_m represents the average information quantity carried by the clustering result obtained by the current clustering. The larger I_m is, the larger the average information quantity of each data category in the current clustering result, the larger the average information quantity that can be stored by the code corresponding to each data category, and the higher the information-expression efficiency of that code; accordingly, for a computer system with a fixed storage space, the larger the amount of information that can be stored, the higher its information-expression efficiency.
Clustering the data according to the data clustering conditions of the third embodiment of the invention is specifically as follows:
S301, fix Δx = 1 and cluster with Δh = 0, i.e. cluster data points with identical hue h values into one set. No data point satisfies the clustering condition, so no clustering of the data points occurs and the entropy load at this time is I_0 = 0.
S302, fix Δx = 1 and cluster with Δh = 1, i.e. cluster data points whose hue h values differ by 1 into one set. A first clustering result is obtained after clustering, which contains eight data sets, as shown in fig. 9, where n = 8 and N = 11; the entropy load at this time is

I_1 = -Σ_{i=1}^{8} (k_i/11) · log_2(k_i/11).
S303, fix Δx = 1 and, on the basis of the clustering result of Δh = 1, cluster with Δh = 2. A first clustering result is obtained after clustering, which contains four data sets, as shown in fig. 10, where n = 4 and N = 11; the entropy load at this time is

I_2 = -Σ_{i=1}^{4} (k_i/11) · log_2(k_i/11).
S304, fix Δx = 1 and, on the basis of the clustering result of Δh = 2, cluster with Δh = 3. A first clustering result is obtained after clustering, which contains two data sets, as shown in fig. 11, where n = 2 and N = 11; the entropy load at this time is

I_3 = -Σ_{i=1}^{2} (k_i/11) · log_2(k_i/11).
S305, fix Δx = 1 and, on the basis of the clustering result of Δh = 3, cluster with Δh = 4. A first clustering result is obtained after clustering, which contains one data set, as shown in fig. 12, where n = 1 and N = 11; the entropy load at this time is

I_4 = -(11/11) · log_2(11/11) = 0.
In the third embodiment of the present invention, clustering with Δh = 5 to 7 gives the same result as clustering with Δh = 4 in step S305, and the same entropy load, so the details are not repeated.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to I_max, specifically:

I_max = max{I_0, I_1, ..., I_m} = I_2;

wherein I_max is the maximum entropy load, representing the maximum average information quantity carried by the clustering results obtained by clustering according to the clustering conditions. I_2, the entropy load obtained by clustering with "fix Δx = 1 and Δh = 2", is the maximum; for a computer system with a fixed storage space, the amount of information that can be stored is then the largest and the information-expression efficiency is the highest, so the clustering result corresponding to the maximum entropy load I_2 is the expected one.
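A two-dimensional sketch of the third embodiment's condition (fix Δx = 1 and traverse Δh) can be written with a union-find structure; the reading that two points fall into one set when both their x gap is within Δx and their h gap is within Δh is an assumption, as are the point values and names.

```python
import math
from itertools import combinations

class DSU:
    """Minimal union-find over n point indices."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def entropy_load(sizes, n_total, base=2):
    return -sum((k / n_total) * math.log(k / n_total, base) for k in sizes)

def best_clustering_2d(points, dh_values, dx=1, base=2):
    """points: (h, x) pairs. Fix Δx = dx and traverse Δh; since a larger Δh
    only merges further, recomputing per Δh matches 'further clustering on
    the basis of the previous result'. Returns (best Δh, max entropy load)."""
    best_dh, best_i = None, float("-inf")
    n = len(points)
    for dh in dh_values:
        dsu = DSU(n)
        for (i, (h1, x1)), (j, (h2, x2)) in combinations(enumerate(points), 2):
            if abs(x1 - x2) <= dx and abs(h1 - h2) <= dh:
                dsu.union(i, j)              # same set under this condition
        sizes = {}
        for k in range(n):
            r = dsu.find(k)
            sizes[r] = sizes.get(r, 0) + 1
        i_val = entropy_load(sizes.values(), n, base)
        if i_val > best_i:
            best_dh, best_i = dh, i_val
    return best_dh, best_i

# Hypothetical (h, x) points: two hue groups along adjacent x positions.
print(best_clustering_2d([(10, 0), (11, 1), (30, 2), (31, 3)], [1, 19]))
```

Here Δh = 1 splits the four points into two balanced sets (entropy load 1 bit), while Δh = 19 collapses everything into one set with entropy load 0, so the smaller Δh wins.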
In the third embodiment of the present invention, the two-dimensional data clustering method is illustrated only by the example of the differences Δh between hue h values and the differences Δx between x coordinate values; in essence, the data clustering method, system and storage medium of the present invention are applicable to the clustering of any two-dimensional data.
Example four
The fourth embodiment of the present invention takes the image segmentation field as an example to explain the data clustering method of the present invention, and the application scenario of image segmentation is shown in fig. 13.
As shown in fig. 14, the fourth embodiment concerns an image, and the data of the 156 ordered data points in the image, namely the hue h value, the x coordinate value and the y coordinate value, are clustered by the following method:
(1) determining the condition of data clustering, specifically:
The similarity between the data in the fourth embodiment of the invention is influenced only by factors of three dimensions: the differences Δh between the hue h values, the differences Δx between the x coordinate values and the differences Δy between the y coordinate values. Therefore, the condition for data clustering in the fourth embodiment is that the data are clustered according to the combination relation of Δh, Δx and Δy:
(v_1, v_2, v_3),
v_1 = Δh,
v_2 = Δx,
v_3 = Δy;

in the fourth embodiment, the concerned dimension data of the clustering is Δh and the unconcerned dimension data are Δx and Δy, so the combination relation is to fix Δx and Δy and traverse Δh for clustering. For Δh:

v_1 = Δh = {a_{m1}} = a_{11}, a_{21}, ..., a_{m1} = 0,1,2,3,4,5,158,159,160,161,162,163;
wherein v_1 is the data of the 1st dimension, Δh; the differences Δh between the data are arranged from small to large as the sequence {a_{m1}}, and a_{m1} is the m-th item of {a_{m1}}; a_{m1} = 163 indicates that the maximum difference between the h values of the data is 163, and a_{11} = 0 indicates that the minimum difference is 0. Δh is the dimension data concerned by the data clustering of the fourth embodiment of the present invention, so the value of Δh traverses each item of {a_{m1}} in the order of the items; when Δh takes the latter item for clustering, further clustering is performed on the basis of the clustering result obtained when Δh took the former item.
For Δ x:
v_2 = Δx = {a_{m2}} = a_{12}, a_{22}, ..., a_{m2} = 1,2,3,4,5,6,7,8,9,10,11;

wherein v_2 is the data of the 2nd dimension, Δx; the differences Δx between the data are arranged from small to large as the sequence {a_{m2}}, and a_{m2} is the m-th item of {a_{m2}}; a_{m2} = 11 indicates that the maximum difference between the Δx data is 11, and a_{12} = 1 indicates that the minimum difference is 1. Δx is dimension data not concerned by the data clustering of the fourth embodiment of the present invention, so the value of Δx may be any one item of the sequence {a_{m2}}; the fourth embodiment takes the first item of {a_{m2}}, so Δx = 1.
For Δ y:
v_3 = Δy = {a_{m3}} = a_{13}, a_{23}, ..., a_{m3} = 1,2,3,4,5,6,7,8,9,10,11,12;

wherein v_3 is the data of the 3rd dimension, Δy; the differences Δy between the data are arranged from small to large as the sequence {a_{m3}}, and a_{m3} is the m-th item of {a_{m3}}; a_{m3} = 12 indicates that the maximum difference between the Δy data is 12, and a_{13} = 1 indicates that the minimum difference is 1. Δy is dimension data not concerned by the data clustering of the fourth embodiment of the present invention, so the value of Δy may be any one item of the sequence {a_{m3}}; the fourth embodiment takes the first item of {a_{m3}}, so Δy = 1.
Therefore, the condition for data clustering in the fourth embodiment of the invention is: fix Δx = 1 and Δy = 1, and according to the differences Δh between the hue h values of the data points, traverse each item of the sequence Δh = {a_{m1}} = 0,1,2,3,4,5,158,159,160,161,162,163 in order; when Δh takes the latter item for clustering, further clustering is performed on the basis of the clustering result obtained when Δh took the former item.
When the value of Δh traverses each item of the sequence {a_{m1}} in order for clustering, the clustering result obtained when Δh takes 163 is placed in the 163rd level of the clustering process tree and is the root node of the tree; the clustering result obtained when Δh takes 0 is placed in the 1st level of the tree and forms its leaf nodes, whose degree is zero; when a certain set in the 2nd level serves as a parent node, all the elements that form that set in the 1st level serve as its child nodes, and the clustering process tree is formed level by level in this way, as shown in fig. 15. When Δh traverses each item of the sequence {a_{m1}} in order for clustering, the clustering process tree embodies the continuous clustering and distinguishing of the data from coarse to fine according to the granularity of the concerned dimension data Δh; it can intuitively reflect how the data points of a single image are clustered level by level into distinguishable objects, which are further clustered to form all the information of the whole image, so that all the clustering information of the data can be traced back to its source.
(2) Cluster the data according to the data clustering conditions, and calculate the entropy load after each clustering:

I_m = -Σ_{i=1}^{n} p_i · log_a(p_i),  p_i = k_i / N;

wherein a_{m1} is the m-th item of the sequence {a_{m1}}, and {a_{m1}} is the sequence of the differences Δh of the 1st-dimension data v_1 arranged from small to large; a is the base of the logarithmic function, a > 1; the entropy load I_m denotes the average information quantity carried by the first clustering result obtained when v_1 takes the m-th item a_{m1} of the sequence {a_{m1}} for clustering; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set, N is the total number of data points, and p_i is the ratio of the number of elements in the i-th data set to the total number of data points.
In the fourth embodiment of the present invention, a = 2; the entropy load calculated in this way is expressed in bits, the binary unit of measure of the average information quantity, so taking a = 2 is the more appropriate choice.
The result obtained by each clustering is a number of data sets, and each data set corresponds to a data category. When a computer system stores the clustering result, each data category corresponds to a code of fixed length; the average information quantity that each code can store is fixed, and correspondingly the information-expression efficiency of each code is also fixed. It is desirable that a code of fixed length store the maximum average information quantity, so that the information-expression efficiency is highest.
The entropy load I_m represents the average information quantity carried by the clustering result obtained by the current clustering. The larger I_m is, the larger the average information quantity of each data category in the current clustering result, the larger the average information quantity that can be stored by the code corresponding to each data category, and the higher the information-expression efficiency of that code; accordingly, for a computer system with a fixed storage space, the larger the amount of information that can be stored, the higher its information-expression efficiency.
Clustering the data according to the data clustering conditions of the fourth embodiment of the invention is specifically as follows:
S401, fix Δx = 1 and Δy = 1, and cluster with Δh = 0, i.e. cluster data points with identical hue h values into one set. A first clustering result is obtained after clustering, which contains eighteen data sets, as shown in fig. 16, where n = 18 and N = 156; the entropy load at this time is

I_0 = -Σ_{i=1}^{18} (k_i/156) · log_2(k_i/156).
S402, fix Δx = 1 and Δy = 1, and, on the basis of the clustering result of Δh = 0, cluster with Δh = 1, i.e. cluster data points whose hue h values differ by 1 into one set. A first clustering result is obtained after clustering, which contains fifteen data sets, as shown in fig. 17, where n = 15 and N = 156; the entropy load at this time is

I_1 = -Σ_{i=1}^{15} (k_i/156) · log_2(k_i/156).
S403, fix Δx = 1 and Δy = 1, and, on the basis of the clustering result of Δh = 1, cluster with Δh = 2, i.e. cluster data points whose hue h values differ by 2 into one set. A first clustering result is obtained after clustering, which contains nine data sets, as shown in fig. 18, where n = 9 and N = 156; the entropy load at this time is

I_2 = -Σ_{i=1}^{9} (k_i/156) · log_2(k_i/156).
S404, fix Δx = 1 and Δy = 1, and, on the basis of the clustering result of Δh = 2, cluster with Δh = 3, i.e. cluster data points whose hue h values differ by 3 into one set. A first clustering result is obtained after clustering, which contains six data sets, as shown in fig. 19, where n = 6 and N = 156; the entropy load at this time is

I_3 = -Σ_{i=1}^{6} (k_i/156) · log_2(k_i/156).
S405, fix Δx = 1 and Δy = 1, and, on the basis of the clustering result of Δh = 3, cluster with Δh = 4, i.e. cluster data points whose hue h values differ by 4 into one set. A first clustering result is obtained after clustering, which contains four data sets, as shown in fig. 20, where n = 4 and N = 156; the entropy load at this time is

I_4 = -Σ_{i=1}^{4} (k_i/156) · log_2(k_i/156).
S406, fix Δx = 1 and Δy = 1, and, on the basis of the clustering result of Δh = 4, cluster with Δh = 158, i.e. cluster data points whose hue h values differ by 158 into one set. A first clustering result is obtained after clustering, which contains one data set: the background of the clustered image and the other sets on the image form one set corresponding to the entire image, as shown in fig. 14, where n = 1 and N = 156; the entropy load at this time is

I_158 = -(156/156) · log_2(156/156) = 0.
In this embodiment, clustering with Δh = 5 gives the same result as clustering with Δh = 4 in step S405, and clustering with Δh = 159 to 163 gives the same result as clustering with Δh = 158 in step S406, so the details are not repeated.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to I_max, specifically:

I_max = max{I_0, I_1, ..., I_m} = I_4;

wherein I_max is the maximum entropy load, representing the maximum average information quantity carried by the clustering results obtained by clustering according to the clustering conditions. I_4, the average information quantity obtained by clustering with "fix Δx = 1, fix Δy = 1 and Δh = 4", is the maximum; for a computer system with a fixed storage space, the amount of information that can be stored is then the largest and the information-expression efficiency is the highest, so the data clustering result corresponding to the maximum entropy load I_4 is the expected one.
In the fourth embodiment of the invention, the three-dimensional data clustering method is illustrated only by the example of the differences Δh between hue h values, the differences Δx between x coordinate values and the differences Δy between y coordinate values; in essence, the data clustering method, system and storage medium are applicable to the clustering of data in any three or more dimensions. In addition, as can be seen from fig. 20, after clustering with Δh = 4, four clearly distinguishable objects have formed on the image, namely the four sets of safety helmet, water cup, gloves and image background, so that image segmentation is accurately realized.
Performing step (1), step (2) and step (3) completes one clustering, and it can be seen from the corresponding figures that each clustering yields at least one first clustering result, each containing at least one set. As shown in fig. 20, the clustering result corresponding to the maximum entropy load I_4 consists of four sets. Suppose the subdivision information of the "cup" set needs to be known together with its expected maximum entropy load: the fourth embodiment of the present invention then re-determines the data clustering condition and, by repeating step (1), step (2) and step (3), further clusters the data of the "cup" set under the new condition to obtain a new maximum entropy load. The clustering result corresponding to this new maximum entropy load contains two subsets, the cup lid and the cup body, whose corresponding information is the subdivision information of the "cup" set. With the "cup" set as the parent node and the cup-lid and cup-body sets as child nodes, an information structure tree is gradually formed. The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, so for a computer system with a fixed storage space the amount of information that can be stored is the largest and the information-expression efficiency is the highest. Specifically:
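The subdivision step just described can be sketched by re-applying the same max-entropy selection to one set's own data under a new clustering condition; the numeric values and the idea of re-using the simple 1-D rule on a second attribute (here a hypothetical y coordinate) are illustrative assumptions, not the patent's figures.

```python
import math

def max_entropy_split(values, base=2):
    """Return the clustering of the 1-D values whose entropy load is maximal
    (same gap-traversal sketch as for the earlier embodiments)."""
    data = sorted(values)
    gaps = sorted({b - a for a, b in zip(data, data[1:])})
    clusters = [[v] for v in data]
    best_i, best = float("-inf"), [list(data)]
    for dh in gaps:
        merged = [clusters[0]]
        for c in clusters[1:]:
            if c[0] - merged[-1][-1] <= dh:
                merged[-1] = merged[-1] + c
            else:
                merged.append(c)
        clusters = merged
        probs = [len(c) / len(data) for c in clusters]
        i_val = -sum(p * math.log(p, base) for p in probs)
        if i_val > best_i:
            best_i, best = i_val, [list(c) for c in clusters]
    return best

# First clustering: the whole (hypothetical) hue data forms two objects.
objects = max_entropy_split([1, 2, 3, 20, 21, 22])
print(objects)                     # [[1, 2, 3], [20, 21, 22]]

# Subdivision: re-determine the condition for one object and cluster its
# own data again, e.g. by hypothetical y coordinates inside that object;
# the two resulting subsets play the roles of "cup lid" and "cup body"
# as child nodes of the object in the information structure tree.
cup_y = [5, 6, 7, 30, 31]
print(max_entropy_split(cup_y))    # [[5, 6, 7], [30, 31]]
```

Each parent set together with the subsets returned for it forms one branch of the information structure tree, and every branch carries the maximum entropy load of its own clustering condition.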
For the data values of the 6 ordered data points in the "cup" set, namely the hue h value, the x coordinate value and the y coordinate value, the fourth embodiment of the invention determines a new data clustering condition and then repeats step (1), step (2) and step (3) for further clustering, as follows:
(1) determining a new data clustering condition, specifically:
The similarity between these 6 ordered data points is affected only by factors of two dimensions: the differences Δh between the hue h values and the differences Δy between the y coordinate values. Therefore, the condition for clustering the data of the "cup" set is that the data are clustered according to the combination relation of Δh and Δy:

(v_1, v_2),
v_1 = Δh,
v_2 = Δy;
in the fourth embodiment, for the "cup" set, the concerned dimension data of the clustering is Δh and the unconcerned dimension data is Δy, so the combination relation is to fix Δy and traverse Δh for clustering. For Δh:
v_1 = Δh = {a_{m1}} = a_{11}, a_{21}, ..., a_{m1} = 0,2,4;

wherein v_1 is the data of the 1st dimension, Δh; the differences Δh between the data are arranged from small to large as the sequence {a_{m1}}, and a_{m1} is the m-th item of {a_{m1}}; a_{m1} = 4 indicates that the maximum difference between the h values of the data is 4, and a_{11} = 0 indicates that the minimum difference is 0. Δh is the dimension data concerned by the clustering of the "cup" set in the fourth embodiment of the invention, so the value of Δh traverses each item of {a_{m1}} in the order of the items; when Δh takes the latter item for clustering, further clustering is performed on the basis of the clustering result obtained when Δh took the former item.
For Δ y:
v_2 = Δy = {a_{m2}} = a_{12}, a_{22}, ..., a_{m2} = 1,2,3,4,5;

wherein v_2 is the data of the 2nd dimension, Δy; the differences Δy between the data are arranged from small to large as the sequence {a_{m2}}, and a_{m2} is the m-th item of {a_{m2}}; a_{m2} = 5 indicates that the maximum difference between the Δy data is 5, and a_{12} = 1 indicates that the minimum difference is 1. Δy is dimension data not concerned by the clustering of the "cup" set in the fourth embodiment of the invention, so the value of Δy may be any one item of the sequence {a_{m2}}; the fourth embodiment takes the first item of {a_{m2}}, so Δy = 1.
Therefore, the condition for clustering the data of the "cup" set in the fourth embodiment of the invention is: fix Δy = 1, and according to the differences Δh between the hue h values of the data points, traverse each item of the sequence Δh = {a_{m1}} = 0,2,4 in order; when Δh takes the latter item for clustering, further clustering is performed on the basis of the clustering result obtained when Δh took the former item.
(2) Cluster the data according to the data clustering conditions, and calculate the entropy load after each clustering:

I_m = -Σ_{i=1}^{n} p_i · log_a(p_i),  p_i = k_i / N;

wherein a_{m1} is the m-th item of the sequence {a_{m1}}, and {a_{m1}} is the sequence of the differences Δh of the 1st-dimension data v_1 arranged from small to large; a is the base of the logarithmic function, a > 1; the entropy load I_m denotes the average information quantity carried by the first clustering result obtained when v_1 takes the m-th item a_{m1} of the sequence {a_{m1}} for clustering; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set, N is the total number of data points, and p_i is the ratio of the number of elements in the i-th data set to the total number of data points.
In Embodiment 4 of the present invention, a = 2, so the entropy load thus calculated is expressed in bits; a bit is binary and is a unit of measure of the average information amount, so taking a = 2 is most appropriate.
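As a sketch, the entropy load of one clustering result can be computed directly from the set sizes. The helper name below is hypothetical; the formula is the one given above, with a = 2 as in the embodiment.

```python
import math

def entropy_load(clusters, base=2):
    """Entropy load I = -sum_i p_i * log_a(p_i), with p_i = k_i / N,
    where k_i is the size of the i-th data set and N the total data count."""
    n_total = sum(len(c) for c in clusters)
    return -sum((len(c) / n_total) * math.log(len(c) / n_total, base)
                for c in clusters)
```

A single all-inclusive set carries 0 bits, while N singleton sets carry log_2(N) bits, the maximum possible for N data points.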
The result of each clustering is a number of data sets, and each data set corresponds to a data category. When a computer system stores the clustering result, each data category corresponds to a fixed-length code; the average information amount each code can store is fixed, and accordingly the information expression efficiency of each code is also fixed. It is desirable for a fixed-length code to store the maximum average information amount, so that the information expression efficiency is highest.
The entropy load I_{a_m1} represents the size of the average information amount carried by the clustering result obtained by the current clustering. The larger I_{a_m1} is, the larger the average information amount of each data category in the current clustering result, the larger the average information amount that can be stored by the code corresponding to each data category, and the higher the information expression efficiency of that code; accordingly, for a computer system with a given storage space, the larger the amount of information that can be stored, the higher its information expression efficiency.
In Embodiment 4 of the present invention, the 'cup' set data are further clustered according to the new data clustering condition, specifically:
S407: fix Δy = 1 and cluster with Δh = 0, which means clustering the data points in the 'cup' set whose hue h values are identical. A first clustering result is obtained, containing five data sets, as shown in fig. 21, where n = 5 and N = 6; the entropy load I_0 at this time is calculated as:
I_0 = -Σ_{i=1}^{5} (k_i / 6) · log_2(k_i / 6);
S408: fix Δy = 1 and, on the basis of the clustering result for Δh = 0, cluster with Δh = 2, which means clustering into one set the data points in the 'cup' set whose hue h values differ by 2. A first clustering result is obtained, containing two data sets, as shown in fig. 22, where n = 2 and N = 6; the entropy load I_2 at this time is calculated as:
I_2 = -Σ_{i=1}^{2} (k_i / 6) · log_2(k_i / 6);
S409: fix Δy = 1 and, on the basis of the clustering result for Δh = 2, cluster with Δh = 4, which means clustering into one set the data points in the 'cup' set whose hue h values differ by 4. A first clustering result is obtained, containing one data set, as shown in fig. 23, where n = 1 and N = 6; the entropy load I_4 at this time is calculated as:
I_4 = -(6/6) · log_2(6/6) = 0.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to I_max. I_max represents the maximum entropy load obtainable from the clustering results after each clustering is finished:
I_max = max(I_0, I_2, I_4) = I_2;
wherein the maximum entropy load I_max represents the maximum average information amount carried by the clustering results obtained by clustering according to the clustering condition. The entropy load I_2, obtained by clustering the 'cup' set data with 'fixed Δy = 1, Δh = 2', is the largest; the amount of information that can be stored in a computer system with a given storage space is therefore the largest, and the information expression efficiency is also the highest, so the clustering result corresponding to the maximum entropy load I_2 is the expected one.
First, as can be seen from fig. 22, the subdivision information of the 'cup' set data obtained by further clustering under the new clustering condition is: the cup lid and the cup body. The amount of information that can be stored in a computer system with a given storage space is the largest and the information expression efficiency is the highest, so the clustering result corresponding to the maximum entropy load I_2 is the expected subdivision information of the 'cup' set data.
Secondly, the 'cup' set is taken as a parent node, and the cup lid set and the cup body set are taken together as its child nodes, gradually forming an information structure tree, as shown in fig. 24. The information structure tree embodies the information that the original image data are coarsely clustered into the 'helmet' set, the 'cup' set and the 'glove' set according to the granularity of the Δh value, and that the 'cup' set data are further finely clustered and distinguished. The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, so the amount of information that can be stored in a computer system with a given storage space is the largest, and the information expression efficiency is the highest.
Finally, in Embodiment 4 of the present invention, the cup lid and the cup body obtained by further clustering the 'cup' set data are clearly separated, as shown in fig. 24. Compared with the whole image data, the cup, the cup lid and the cup body are only local data, and local data represent incomplete and inaccurate clustering information. It is therefore desirable to first obtain the clustering of the whole image data, and then further cluster the result of that whole-data clustering to obtain local subdivision information, as shown in fig. 20. Accordingly, the method clusters starting from the whole data to obtain at least one first clustering result and obtains the data clustering result from each first clustering result, realizing the integrity of data clustering; and, based on the at least one first clustering result, it clusters that result again to obtain its local subdivision information. The integrity and locality of data clustering are thus coordinated and unified, so that the obtained clustering result is more complete and accurate.
The above four embodiments use only the x coordinate value, the y coordinate value and the hue h value as data for exemplary clustering, in order to illustrate the specific implementation of the present invention. The present invention does not exhaustively enumerate the various combinations of other kinds of data and of data of various dimensions, because it neither depends on nor specially processes any particular data, and is therefore generally applicable to the clustering of any data.
Embodiment 5
An embodiment of the present invention provides a data clustering system, where the system includes: a memory, a processor, and a program stored on the memory and executable on the processor, the program when executed by the processor implementing a method of clustering data, the method comprising the steps of:
(1) determining a data clustering condition, specifically:
The data clustering condition is determined according to the similarity between data, and the similarity between data is often influenced by factors of multiple dimensions; therefore, the data clustering condition in Embodiment 5 of the present invention clusters the data according to the following combination relationship of data of different dimensions:
(v_1, v_2, v_3, ……, v_j),
v_j = {a_mj} = a_1j, a_2j, ……, a_mj;
The combination relationship is determined according to the dimensions of interest for the data clustering, and comprises: fixing the dimension data not of interest, and combining and traversing the dimension data of interest.
wherein v_j is the data of the j-th dimension; the differences between the data v_j are arranged from small to large as the sequence {a_mj}, where a_mj is the m-th item of {a_mj}, a_mj represents the maximum difference between the data v_j, and a_1j represents the minimum difference between the data v_j. When v_j is dimension data not of interest for the data clustering, the value of v_j is any one or more items of the sequence {a_mj}; when v_j is dimension data of interest for the data clustering, the value of v_j traverses each item of the sequence {a_mj} in order, and when v_j takes a later item, clustering is further performed on the basis of the first clustering result obtained when v_j took the previous item.
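The combination relationship — fix the dimensions not of interest, traverse the dimensions of interest — can be sketched as an enumeration of clustering conditions. The helper name is hypothetical, and fixing a not-of-interest dimension to the first (smallest) item of its sequence follows the choice made in Embodiment 4; it is an assumption here, since the patent allows any item.

```python
def clustering_conditions(dims_of_interest, sequences):
    """Enumerate clustering conditions for each traversal step.

    sequences[j] is {a_mj}, the sorted difference sequence of dimension j;
    dims_of_interest is a set of dimension indices to traverse."""
    n_steps = max(len(sequences[j]) for j in dims_of_interest)
    conditions = []
    for step in range(n_steps):
        cond = []
        for j, seq in enumerate(sequences):
            if j in dims_of_interest:
                cond.append(seq[min(step, len(seq) - 1)])  # traverse in order
            else:
                cond.append(seq[0])                        # fixed: first item
        conditions.append(tuple(cond))
    return conditions
```

With Δh = {0, 2, 4} as the dimension of interest and Δy = {1, 2, 3, 4, 5} fixed, this yields the conditions (0, 1), (2, 1), (4, 1) of Embodiment 4.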
(2) Cluster the data according to the data clustering condition to obtain at least one first clustering result, each first clustering result comprising at least one data set; calculate the entropy load corresponding to each first clustering result, the entropy load representing the size of the average information amount carried by the corresponding first clustering result. The entropy load is calculated as follows:
I_{a_mj} = -Σ_{i=1}^{n} p_i · log_a(p_i),
p_i = k_i / N;
wherein a_mj is the m-th item of the sequence {a_mj}, {a_mj} is the sequence of differences of the j-th-dimension data v_j arranged from small to large, a is the base of the logarithmic function, a > 1; the entropy load I_{a_mj} denotes the size of the average information amount carried by the first clustering result obtained by clustering when v_j takes the m-th item a_mj of the sequence {a_mj}; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set, N is the total number of data, and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
In Embodiment 5 of the present invention, the preferred value of a is a = 2; the entropy load thus calculated is expressed in bits, and a bit is binary and is a unit of measure of the average information amount.
The result of each clustering is a number of data sets, and each data set corresponds to a data category. When a computer system stores the clustering result, each data category corresponds to a fixed-length code; the average information amount each code can store is fixed, and accordingly the information expression efficiency of each code is also fixed. It is desirable for a fixed-length code to store the maximum average information amount, so that the information expression efficiency is highest.
The entropy load I_{a_mj} represents the size of the average information amount carried by the clustering result obtained by the current clustering. The larger I_{a_mj} is, the larger the average information amount of each data category in the current clustering result, the larger the average information amount that can be stored by the code corresponding to each data category, and the higher the information expression efficiency of that code; accordingly, for a computer system with a given storage space, the larger the amount of information that can be stored, the higher its information expression efficiency.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to I_max, specifically:
I_max = max(I_{a_1j}, I_{a_2j}, ……, I_{a_mj});
wherein I_max, the maximum entropy load, represents the maximum average information amount carried by the clustering results obtained by clustering according to the clustering condition; for a computer system with a given storage space, the amount of information that can be stored is the largest and the information expression efficiency is the highest, so the clustering result corresponding to the maximum entropy load I_max is the expected one.
After step (1), step (2) and step (3) are executed to complete one clustering, the data clustering method in Embodiment 5 of the present invention may further comprise a step of forming an information structure tree, specifically:
Re-determine the data clustering condition, and further cluster a certain data set in the data clustering result according to the new data clustering condition by executing the above clustering method, obtaining a new maximum entropy load. The clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, and the information corresponding to these diversity combinations is the subdivision information of that data set. Taking that data set as a parent node and the diversity combinations as child nodes, an information structure tree is gradually formed.
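The formation of the information structure tree can be sketched as follows. The `Node` and `refine` names are hypothetical, and the `subdivide` callback stands in for re-clustering under a new condition (which the patent performs via the entropy-load maximization above).

```python
class Node:
    """One node of the information structure tree: a data set plus its subdivisions."""
    def __init__(self, label, members):
        self.label = label
        self.members = set(members)
        self.children = []

def refine(node, subdivide):
    """Attach the diversity sets obtained by re-clustering node.members
    under a new clustering condition as child nodes of `node`."""
    for label, members in subdivide(node.members):
        node.children.append(Node(label, members))
    return node

# Usage: the 'cup' set is the parent node; lid and body become its child nodes.
cup = Node("cup", {1, 2, 3, 4, 5, 6})
refine(cup, lambda m: [("lid", {1, 2}), ("body", {3, 4, 5, 6})])
```

Repeating `refine` on any child grows the tree level by level, mirroring the step-by-step formation described above.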
The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
The data clustering method in Embodiment 5 of the present invention may further comprise a step of forming a clustering process tree, specifically:
When the value of v_j traverses each item of the sequence {a_mj} in order for clustering, the first clustering result obtained when v_j takes a_qj is arranged at the q-th level of the clustering process tree, with 1 ≤ q ≤ m; the first clustering result obtained when v_j takes a_mj is the root node of the clustering process tree, and the first clustering result obtained when v_j takes a_1j consists of the leaf nodes of the clustering process tree, whose degree is zero. Each set at the q-th level is taken as a parent node, and all the elements from the (q-1)-th level clustering that form that set are taken as its child nodes, gradually forming the clustering process tree. When the value of v_j traverses each item of the sequence {a_mj} for clustering, the clustering process tree embodies the continuous clustering and distinguishing of the data from coarse to fine according to the granularity of the dimension data of interest v_j; it can intuitively reflect all the information of each single data point being clustered step by step, so that all the clustering information of the data is traceable.
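A minimal sketch of assembling the clustering process tree from the per-level first clustering results. The helper name is hypothetical; it assumes each level's sets are refinements of the next level's sets, so set containment determines the parent/child links.

```python
def process_tree_edges(levels):
    """Build parent->child links of the clustering process tree.

    levels[0] is the finest first clustering result (v_j = a_1j, the leaves),
    levels[-1] the coarsest (v_j = a_mj, the root level).  Each level-(q-1)
    set becomes a child of the level-q set that contains it."""
    edges = []
    for q in range(1, len(levels)):
        for parent in levels[q]:
            for child in levels[q - 1]:
                if child <= parent:  # subset test: this set merged into parent
                    edges.append((frozenset(parent), frozenset(child)))
    return edges
```

For levels [[{1}, {2}, {3}], [{1, 2}, {3}], [{1, 2, 3}]] this yields five parent/child links: two leaves under {1, 2}, one under {3}, and both level-2 sets under the root.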
Embodiment 6
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium stores at least one program, the program is executable by at least one processor, and the at least one program, when executed by the at least one processor, implements a method for clustering data, where the method for clustering data includes:
(1) determining a data clustering condition, specifically:
The data clustering condition is determined according to the similarity between data, and the similarity between data is often influenced by factors of multiple dimensions; therefore, the data clustering condition in Embodiment 6 of the present invention clusters the data according to the following combination relationship of data of different dimensions:
(v_1, v_2, v_3, ……, v_j),
v_j = {a_mj} = a_1j, a_2j, ……, a_mj;
The combination relationship is determined according to the dimensions of interest for the data clustering, and comprises: fixing the dimension data not of interest, and combining and traversing the dimension data of interest.
wherein v_j is the data of the j-th dimension; the differences between the data v_j are arranged from small to large as the sequence {a_mj}, where a_mj is the m-th item of {a_mj}, a_mj represents the maximum difference between the data v_j, and a_1j represents the minimum difference between the data v_j. When v_j is dimension data not of interest for the data clustering, the value of v_j is any one or more items of the sequence {a_mj}; when v_j is dimension data of interest for the data clustering, the value of v_j traverses each item of the sequence {a_mj} in order, and when v_j takes a later item, clustering is further performed on the basis of the first clustering result obtained when v_j took the previous item.
(2) Cluster the data according to the data clustering condition to obtain at least one first clustering result, each first clustering result comprising at least one data set; calculate the entropy load corresponding to each first clustering result, the entropy load representing the size of the average information amount carried by the corresponding first clustering result. The entropy load is calculated as follows:
I_{a_mj} = -Σ_{i=1}^{n} p_i · log_a(p_i),
p_i = k_i / N;
wherein a_mj is the m-th item of the sequence {a_mj}, {a_mj} is the sequence of differences of the j-th-dimension data v_j arranged from small to large, a is the base of the logarithmic function, a > 1; the entropy load I_{a_mj} denotes the size of the average information amount carried by the first clustering result obtained by clustering when v_j takes the m-th item a_mj of the sequence {a_mj}; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set, N is the total number of data, and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
In Embodiment 6 of the present invention, the preferred value of a is a = 2; the entropy load thus calculated is expressed in bits, and a bit is binary and is a unit of measure of the average information amount.
The result of each clustering is a number of data sets, and each data set corresponds to a data category. When a computer system stores the clustering result, each data category corresponds to a fixed-length code; the average information amount each code can store is fixed, and accordingly the information expression efficiency of each code is also fixed. It is desirable for a fixed-length code to store the maximum average information amount, so that the information expression efficiency is highest.
The entropy load I_{a_mj} represents the size of the average information amount carried by the clustering result obtained by the current clustering. The larger I_{a_mj} is, the larger the average information amount of each data category in the current clustering result, the larger the average information amount that can be stored by the code corresponding to each data category, and the higher the information expression efficiency of that code; accordingly, for a computer system with a given storage space, the larger the amount of information that can be stored, the higher its information expression efficiency.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results calculated in step (2), and obtain the data clustering result according to I_max, specifically:
I_max = max(I_{a_1j}, I_{a_2j}, ……, I_{a_mj});
wherein I_max, the maximum entropy load, represents the maximum average information amount carried by the clustering results obtained by clustering according to the clustering condition; for a computer system with a given storage space, the amount of information that can be stored is the largest and the information expression efficiency is the highest, so the clustering result corresponding to the maximum entropy load I_max is the expected one.
After step (1), step (2) and step (3) are executed to complete one clustering, the data clustering method in Embodiment 6 of the present invention may further comprise a step of forming an information structure tree, specifically:
Re-determine the data clustering condition, and further cluster a certain data set in the data clustering result according to the new data clustering condition by executing the above clustering method, obtaining a new maximum entropy load. The clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, and the information corresponding to these diversity combinations is the subdivision information of that data set. Taking that data set as a parent node and the diversity combinations as child nodes, an information structure tree is gradually formed.
The entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, and the amount of information which can be stored in a computer system with a certain storage space is the maximum, so that the information expression efficiency is the highest.
The data clustering method in Embodiment 6 of the present invention may further comprise a step of forming a clustering process tree, specifically:
When the value of v_j traverses each item of the sequence {a_mj} in order for clustering, the first clustering result obtained when v_j takes a_qj is arranged at the q-th level of the clustering process tree, with 1 ≤ q ≤ m; the first clustering result obtained when v_j takes a_mj is the root node of the clustering process tree, and the first clustering result obtained when v_j takes a_1j consists of the leaf nodes of the clustering process tree, whose degree is zero. Each set at the q-th level is taken as a parent node, and all the elements from the (q-1)-th level clustering that form that set are taken as its child nodes, gradually forming the clustering process tree. When the value of v_j traverses each item of the sequence {a_mj} for clustering, the clustering process tree embodies the continuous clustering and distinguishing of the data from coarse to fine according to the granularity of the dimension data of interest v_j; it can intuitively reflect all the information of each single data point being clustered step by step, so that all the clustering information of the data is traceable.
In summary, the data clustering method, system and storage medium provided by the present invention cluster starting from the whole data according to the data clustering condition to obtain at least one first clustering result, and obtain the data clustering result through the first clustering result carrying the largest average information amount, thereby realizing the integrity of data clustering and making the obtained clustering result more complete and accurate. In addition, the clustering process neither depends on nor specially processes any particular data and does not restrict the data type, so the method is generally applicable to the clustering of any data and has very high practicability. The maximum carried average information amount is used as the basis for determining the clustering result; for a computer system with a given storage space, the larger the amount of information that can be stored, the higher the information expression efficiency.
The data clustering method, system and storage medium provided by the present invention cluster the at least one first clustering result again, based on the at least one first clustering result, to obtain its local subdivision information, thereby coordinating and unifying the integrity and locality of data clustering.
The data clustering method, system and storage medium provided by the present invention form an information structure tree; the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under a certain clustering condition, so the amount of information that can be stored in a computer system with a given storage space is the largest and the information expression efficiency is the highest.
The data clustering method, system and storage medium provided by the present invention also form a clustering process tree during clustering; the clustering process tree continuously clusters and distinguishes the data from coarse to fine according to the granularity of the dimension data of interest, intuitively reflects all the step-by-step clustering information of each single data point, and makes all the clustering information of the data traceable.

Claims (11)

1. A method for clustering data, the method comprising:
determining a data clustering condition;
clustering data according to the data clustering condition to obtain at least one first clustering result, wherein each first clustering result in the at least one first clustering result comprises at least one data set; calculating an entropy load corresponding to each first clustering result, wherein the entropy load represents the size of the average information quantity carried by the corresponding first clustering result;
taking the maximum entropy load among the entropy loads corresponding to the first clustering results, wherein the first clustering result corresponding to the maximum entropy load is the data clustering result.
2. The method according to claim 1, wherein the data clustering condition is determined based on similarity between data.
3. The method of claim 1, wherein clustering data according to the data clustering condition comprises: and clustering the data according to the combination relation of the data with different dimensions.
4. The method according to claim 3, wherein the combination relationship of the data with different dimensions is determined according to the dimension concerned by the data clustering, and comprises: and fixing the dimension data which is not concerned, and combining and traversing the concerned dimension data.
5. The method according to claim 3, wherein the clustering data according to the combination relationship of the data with different dimensions specifically comprises:
(v_1, v_2, v_3, ……, v_j),
v_j = {a_mj} = a_1j, a_2j, ……, a_mj;
wherein v_j is the data of the j-th dimension; the differences between the data v_j are arranged from small to large as the sequence {a_mj}, where a_mj is the m-th item of {a_mj}, a_mj represents the maximum difference between the data v_j, and a_1j represents the minimum difference between the data v_j; when v_j is dimension data not of interest for the data clustering, the value of v_j is any one or more items of the sequence {a_mj}; when v_j is dimension data of interest for the data clustering, the value of v_j traverses each item of the sequence {a_mj} in order, and when v_j takes a later item, clustering is further performed on the basis of the first clustering result obtained when v_j took the previous item.
6. The method for clustering data according to claim 1, wherein the entropy carriers are calculated by:
I_{a_mj} = -Σ_{i=1}^{n} p_i · log_a(p_i),
p_i = k_i / N;
wherein a_mj is the m-th item of the sequence {a_mj}, {a_mj} is the sequence of differences of the j-th-dimension data v_j arranged from small to large, a is the base of the logarithmic function, a > 1; the entropy load I_{a_mj} denotes the size of the average information amount carried by the first clustering result obtained by clustering when v_j takes the m-th item a_mj of the sequence {a_mj}; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set, N is the total number of data, and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
7. The method for clustering data according to claim 6, wherein a = 2, the entropy load obtained by calculation is expressed in bits, and a bit is binary and is a unit of measure of the average information amount.
8. A method for clustering data according to claim 1, the method comprising the step of forming an information structure tree comprising:
re-determining the data clustering condition, and further clustering a certain data set in the data clustering result according to the new data clustering condition by executing the above clustering method, obtaining a new maximum entropy load, wherein the clustering result corresponding to the new maximum entropy load comprises a plurality of diversity combinations, and the information corresponding to the diversity combinations is the subdivision information of that data set; taking that data set as a parent node and the diversity combinations as child nodes, an information structure tree is gradually formed.
9. A method for clustering data according to claim 5, the method comprising a step of forming a clustering process tree comprising:
when the value of said v_j traverses each item of the sequence {a_mj} in order for clustering, the first clustering result obtained when v_j takes a_qj is arranged at the q-th level of the clustering process tree, with 1 ≤ q ≤ m; the first clustering result obtained when v_j takes a_mj is the root node of the clustering process tree, and the first clustering result obtained when v_j takes a_1j consists of the leaf nodes of the clustering process tree, whose degree is zero; each set at the q-th level is taken as a parent node, and all the elements from the (q-1)-th level clustering that form that set are taken as its child nodes, gradually forming the clustering process tree.
10. A system for clustering data, the system comprising a memory, a processor and a program stored in the memory and executable on the processor, the program, when executed by the processor, implementing the steps of the method for clustering data according to any one of claims 1 to 9.
11. A computer-readable storage medium characterized by: the storage medium stores at least one program executable by at least one processor, the at least one program when executed by the at least one processor implementing the steps of the method of clustering data according to any one of claims 1 to 9.
CN202111156414.XA 2021-09-30 2021-09-30 Data clustering method, system and storage medium Pending CN113806610A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111156414.XA CN113806610A (en) 2021-09-30 2021-09-30 Data clustering method, system and storage medium
PCT/CN2021/123007 WO2023050461A1 (en) 2021-09-30 2021-10-11 Data clustering method and system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111156414.XA CN113806610A (en) 2021-09-30 2021-09-30 Data clustering method, system and storage medium

Publications (1)

Publication Number Publication Date
CN113806610A true CN113806610A (en) 2021-12-17

Family

ID=78939055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111156414.XA Pending CN113806610A (en) 2021-09-30 2021-09-30 Data clustering method, system and storage medium

Country Status (2)

Country Link
CN (1) CN113806610A (en)
WO (1) WO2023050461A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678417B (en) * 2012-09-25 2017-11-24 华为技术有限公司 Human-machine interaction data treating method and apparatus
CN107909478A (en) * 2017-11-27 2018-04-13 苏州点对点信息科技有限公司 FOF mutual fund portfolio system and methods based on social network clustering and information gain entropy index
CN109657695A (en) * 2018-11-05 2019-04-19 中国电子科技集团公司电子科学研究院 A kind of fuzzy division clustering method and device based on definitive operation
CN111539443B (en) * 2020-01-22 2024-02-09 北京小米松果电子有限公司 Image recognition model training method and device and storage medium

Also Published As

Publication number Publication date
WO2023050461A1 (en) 2023-04-06

Similar Documents

Publication Publication Date Title
Nasir et al. Deep learning-based classification of fruit diseases: An application for precision agriculture
Moallem et al. Optimal threshold computing in automatic image thresholding using adaptive particle swarm optimization
Trujillo et al. Automated design of image operators that detect interest points
US8015125B2 (en) Multi-scale segmentation and partial matching 3D models
Chen et al. A self-balanced min-cut algorithm for image clustering
CN104616029B (en) Data classification method and device
Al-Sahaf et al. Image descriptor: A genetic programming approach to multiclass texture classification
CN103544499B (en) The textural characteristics dimension reduction method that a kind of surface blemish based on machine vision is detected
CN108334805B (en) Method and device for detecting document reading sequence
CN103745201B (en) A kind of program identification method and device
CN109766469A (en) A kind of image search method based on the study optimization of depth Hash
CN113761259A (en) Image processing method and device and computer equipment
Buvana et al. Content-based image retrieval based on hybrid feature extraction and feature selection technique pigeon inspired based optimization
WO2014195782A2 (en) Differential evolution-based feature selection
CN108764351A (en) A kind of Riemann manifold holding kernel learning method and device based on geodesic distance
CN109271544B (en) Method and device for automatically selecting painter representatives
Rahman et al. Seed-Detective: A Novel Clustering Technique Using High Quality Seed for K-Means on Categorical and Numerical Attributes.
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN104572930B (en) Data classification method and device
CN113806610A (en) Data clustering method, system and storage medium
CN110222778B (en) Online multi-view classification method, system and device based on deep forest
CN109472319B (en) Three-dimensional model classification method and retrieval method
Akhtar et al. Big data mining based on computational intelligence and fuzzy clustering
Ju et al. A novel neutrosophic logic svm (n-svm) and its application to image categorization
Bi-level classification of color indexed image histograms for content based image retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211217