WO2023050461A1 - Data clustering method, system and storage medium - Google Patents

Data clustering method, system and storage medium

Info

Publication number
WO2023050461A1
WO2023050461A1 · PCT/CN2021/123007
Authority
WO
WIPO (PCT)
Prior art keywords
clustering
data
entropy
load
information
Prior art date
Application number
PCT/CN2021/123007
Other languages
English (en)
French (fr)
Inventor
邓少冬
盛龙
Original Assignee
西安米克斯智能技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安米克斯智能技术有限公司
Publication of WO2023050461A1 publication Critical patent/WO2023050461A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees

Definitions

  • the invention relates to the technical field of artificial intelligence, in particular to a data clustering method, system and storage medium.
  • Image clustering divides target data whose labels are completely unknown into different clusters. It is an exploratory technique for grouping data by their features, commonly used to organize image information or to generate training-sample labels, and is a common image-processing method.
  • existing image clustering methods are generally based on image features extracted from images and perform clustering with traditional clustering algorithms, for example the K-Means clustering algorithm or Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
  • the traditional K-Means algorithm takes as input the sample set, the number of clusters K and the maximum number of iterations N, and outputs a partition into clusters.
  • the general process is: select K objects from the data as initial cluster centers; assign each object to a cluster by its distance to the cluster centers; recompute each cluster center; evaluate the standard measure function; stop when the maximum number of iterations is reached, otherwise continue (a minimal sketch of this loop follows the list of disadvantages below).
  • the K-Means algorithm has the following main disadvantages:
  • the value of K is hard to determine, because it is impossible to know in advance into what categories a given sample set should optimally be divided;
  • with the above iterative method, K-Means obtains only a locally optimal clustering result, which lacks integrity;
  • it is sensitive to outliers and isolated points;
  • it requires the sample set to have a mean, which restricts the data types;
  • the clustering effect depends on the initialization of the cluster centers, and the initial cluster centers are chosen at random.
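  • a minimal sketch of the K-Means loop described above, assuming numpy arrays and Euclidean distance (illustrative background only, not the method of the invention):

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Traditional K-Means: pick K initial centers at random, assign each
    point to the nearest center, recompute centers, repeat until the
    centers stop moving or the iteration budget is spent."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # distance of every point to every center, then nearest-center labels
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its members
        new_centers = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged before the budget
            break
        centers = new_centers
    return labels, centers
```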
  • the object of the present invention is to provide a data clustering method, system and storage medium, which solves the technical problem that traditional clustering algorithms in the prior art lack integrity and universal applicability.
  • an embodiment of the present invention provides a data clustering method, characterized in that the method includes:
  • determining a data clustering condition; clustering the data according to the data clustering condition to obtain at least one first clustering result, each of which contains at least one data set; calculating the entropy load corresponding to each first clustering result, where the entropy load represents the average amount of information carried by the corresponding first clustering result;
  • taking the maximum entropy load among the entropy loads corresponding to the first clustering results; the first clustering result corresponding to the maximum entropy load is the data clustering result.
  • the basis for determining the data clustering condition is the similarity between data.
  • clustering the data according to the data clustering condition includes: clustering the data according to a combination relationship of data of different dimensions.
  • the combination relationship of the different-dimension data is determined by the dimensions that the clustering is concerned with, including: fixing the dimension data not of concern, and traversing combinations of the dimension data of concern.
  • clustering the data according to the combination relationship of different-dimension data is specifically: v_j is the data of the j-th dimension, and the differences between the v_j values, arranged in ascending order, form the sequence {a_mj}; a_mj is the m-th item of the sequence and represents the maximum difference between the v_j values, and a_1j represents the minimum difference;
  • when v_j is a dimension not of concern, v_j takes any one item of the sequence {a_mj}; when v_j is a dimension of concern, the value of v_j traverses each item of {a_mj} in order, and clustering with the later item proceeds from the first clustering result obtained with the previous item (see the sketch below).
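  • the text does not spell out the grouping rule itself, so the sketch below assumes a simple one for a single dimension: values whose difference does not exceed the current item a_m of the difference sequence fall into the same set (sorting first means only adjacent gaps need checking):

```python
def cluster_1d(values, max_gap):
    """Group 1-D values into sets: a new set starts wherever the gap
    between adjacent sorted values exceeds max_gap (the current a_m)."""
    groups, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > max_gap:
            groups.append(current)   # gap too large: close the current set
            current = []
        current.append(v)
    if current:
        groups.append(current)
    return groups
```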
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{j} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_mj is the m-th item in the sequence {a_mj}; {a_mj} is the sequence formed by arranging the differences between the data v_j of the j-th dimension in ascending order; a is the base of the logarithm, a > 1; the entropy load I_m^j represents the average amount of information carried by the first clustering result obtained by clustering with v_j set to the m-th item a_mj; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • preferably a = 2, in which case the calculated entropy load is in bits; the bit is the binary unit of measure of average information.
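  • a minimal sketch of this calculation (entropy_load is a hypothetical helper name; base 2 gives the result in bits):

```python
import math

def entropy_load(clusters, base=2):
    """Entropy load of one clustering result:
    I = -sum(p_i * log_a(p_i)) with p_i = k_i / N."""
    total = sum(len(c) for c in clusters)   # N, total number of data
    load = 0.0
    for c in clusters:
        p = len(c) / total                  # p_i = k_i / N
        load -= p * math.log(p, base)
    return load
```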
  • the method includes the step of forming an information structure tree, including:
  • re-determining the data clustering condition and, under the new condition, further clustering a certain data set in the data clustering result to obtain a new maximum entropy load; the clustering result corresponding to the new maximum entropy load includes several sub-sets, whose information is the subdivision information of that data set; taking that data set as the parent node and the sub-sets as child nodes, an information structure tree is formed step by step.
  • the method comprises the step of forming a clustering process tree, comprising:
  • when the value of v_j traverses each item of the sequence {a_mj} in order for clustering, the first clustering result obtained by clustering with a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with a_mj is the root node of the clustering process tree, and the first clustering result obtained with a_1j forms the leaf nodes of the clustering process tree, whose degree is zero;
  • a set at the q-th level serves as the parent node, and all the sets formed by the clustering at level q-1 whose elements make it up are its child nodes, so that the clustering process tree is formed step by step (a sketch follows).
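  • a minimal sketch of assembling such a tree from the hypothetical cluster_1d helper above, assuming distinct values so that each finer cluster is a subset of exactly one coarser cluster:

```python
def process_tree(values, gaps):
    """Build (level, parent, children) triples: list index q-1 holds the
    clusters produced by the q-th smallest gap; each parent's children
    are the clusters one level finer whose elements make it up."""
    levels = [cluster_1d(values, g) for g in sorted(gaps)]
    edges = []
    for q in range(1, len(levels)):
        for parent in levels[q]:
            children = [c for c in levels[q - 1] if set(c) <= set(parent)]
            edges.append((q + 1, tuple(parent), [tuple(c) for c in children]))
    return edges
```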
  • another embodiment of the present invention provides a data clustering system, characterized in that the system includes a memory, a processor, and a program stored in the memory and operable on the processor; when the program is executed by the processor, the steps of the above data clustering method are implemented.
  • another embodiment of the present invention provides a computer-readable storage medium, characterized in that the storage medium stores at least one program executable by at least one processor; when the at least one program is executed by the at least one processor, the above data clustering method is implemented.
  • a data clustering method, system and storage medium provided by the present invention have the following beneficial effects:
  • clustering is carried out from the overall data according to the data clustering conditions to obtain at least one first clustering result, and the data clustering result is taken from the first clustering result that carries the largest average amount of information. This realizes the integrity of data clustering, so the clustering result obtained is more complete and accurate; the clustering process neither depends on nor specially processes any particular data and does not restrict the data type, so it is generally applicable to the clustering of any data and is highly practical;
  • the maximum carried average amount of information is used as the basis for determining the clustering result: the greater the amount of information a computer system with a given storage space can store, the higher the efficiency of information expression;
  • the data clustering method, system and storage medium provided by the present invention cluster the at least one first clustering result again, based on that result, to obtain its local subdivision information, realizing the coordination and unification of the integrity and locality of data clustering;
  • the data clustering method, system and storage medium provided by the present invention form an information structure tree, and the entropy load corresponding to each bifurcation of the tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and express information with the highest efficiency;
  • the data clustering method, system and storage medium provided by the present invention also form a clustering process tree during clustering; the tree clusters and identifies the data continuously from coarse to fine according to the granularity of the dimension data of concern, intuitively reflects all the information of a single data point being gradually clustered, and makes all the clustering information of the data traceable to its source.
  • FIG. 1 is a schematic flow diagram of a data clustering method according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of an application scenario with 12 data points according to Embodiment 2 of the present invention;
  • FIG. 3 is a schematic diagram of the clustering result with a data value difference of 1 according to Embodiment 2 of the present invention;
  • FIG. 4 is a schematic diagram of the clustering result with a data value difference of 2 according to Embodiment 2 of the present invention;
  • FIG. 5 is a schematic diagram of the clustering result with a data value difference of 3 according to Embodiment 2 of the present invention;
  • FIG. 6 is a schematic diagram of the clustering result with a data value difference of 4 according to Embodiment 2 of the present invention;
  • FIG. 7 is a schematic diagram of an application scenario with 11 ordered data points according to Embodiment 3 of the present invention;
  • FIG. 8 is a schematic structural diagram of the clustering process tree according to Embodiment 3 of the present invention;
  • FIG. 9 is a schematic diagram of the clustering result with a data value difference of 1 according to Embodiment 3 of the present invention;
  • FIG. 10 is a schematic diagram of the clustering result with a data value difference of 2 according to Embodiment 3 of the present invention;
  • FIG. 11 is a schematic diagram of the clustering result with a data value difference of 3 according to Embodiment 3 of the present invention;
  • FIG. 12 is a schematic diagram of the clustering result with a data value difference of 4 according to Embodiment 3 of the present invention;
  • FIG. 13 is a schematic diagram of an application scenario according to Embodiment 4 of the present invention;
  • FIG. 14 is a schematic diagram of an application scenario with 156 ordered data points according to Embodiment 4 of the present invention;
  • FIG. 15 is a schematic structural diagram of the clustering process tree according to Embodiment 4 of the present invention;
  • FIG. 16 is a schematic diagram of the clustering result with a data value difference of 0 according to Embodiment 4 of the present invention;
  • FIG. 17 is a schematic diagram of the clustering result with a data value difference of 1 according to Embodiment 4 of the present invention;
  • FIG. 18 is a schematic diagram of the clustering result with a data value difference of 2 according to Embodiment 4 of the present invention;
  • FIG. 19 is a schematic diagram of the clustering result with a data value difference of 3 according to Embodiment 4 of the present invention;
  • FIG. 20 is a schematic diagram of the clustering result with a data value difference of 4 according to Embodiment 4 of the present invention;
  • FIG. 21 is a schematic diagram of the result of clustering the "water cup" set data with a data value difference of 0 according to Embodiment 4 of the present invention;
  • FIG. 22 is a schematic diagram of the result of clustering the "water cup" set data with a data value difference of 2 according to Embodiment 4 of the present invention;
  • FIG. 23 is a schematic diagram of the result of clustering the "water cup" set data with a data value difference of 4 according to Embodiment 4 of the present invention;
  • FIG. 24 is a schematic structural diagram of the information structure tree according to Embodiment 4 of the present invention.
  • Embodiment 1 of the present invention provides a data clustering method, as shown in FIG. 1 , comprising the following steps:
  • the data clustering condition is determined based on the similarity between data, and the similarity between data is often affected by factors in multiple dimensions. Embodiment 1 of the present invention therefore clusters the data according to the following combination relationship of different-dimension data:
  • the combination relationship is determined by the dimensions that the clustering is concerned with, including: fixing the dimension data not of concern, and traversing combinations of the dimension data of concern.
  • v_j is the data of the j-th dimension, and the differences between the v_j values, arranged in ascending order, form the sequence {a_mj}; a_mj is the m-th item of the sequence and represents the maximum difference between the v_j values, and a_1j represents the minimum difference;
  • when v_j is a dimension not of concern, v_j takes any one item of the sequence {a_mj}; when v_j is a dimension of concern, the value of v_j traverses each item of {a_mj} in order, and clustering with the later item proceeds from the first clustering result obtained with the previous item.
  • each first clustering result includes at least one data set.
  • the entropy load corresponding to each first clustering result is calculated, and the entropy load represents the average amount of information carried by the corresponding first clustering result.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{j} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_mj is the m-th item in the sequence {a_mj}; a is the base of the logarithm, a > 1; the entropy load I_m^j represents the average amount of information carried by the first clustering result obtained by clustering with v_j set to the m-th item a_mj; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^j indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • I_max is the maximum entropy load, indicating the maximum average amount of information carried by the clustering results obtained under the clustering conditions.
  • the corresponding clustering result can store the largest amount of information and expresses information with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_max is the one we expect (a sketch of the whole selection loop follows).
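  • a minimal end-to-end sketch for one dimension, reusing the hypothetical cluster_1d and entropy_load helpers above; the traversal of the difference sequence and the max-entropy selection follow the method described, while the grouping rule itself remains an assumption:

```python
def best_clustering(values, base=2):
    """Traverse every item of the ascending difference sequence {a_m},
    cluster with it, score each result by entropy load, keep the maximum."""
    svals = sorted(values)
    gaps = sorted({b - a for a, b in zip(svals, svals[1:])})  # the sequence {a_m}
    best, best_load = [svals], 0.0      # single set carries zero entropy load
    for g in gaps:
        clusters = cluster_1d(svals, g)
        load = entropy_load(clusters, base)
        if load > best_load:
            best, best_load = clusters, load
    return best, best_load
```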
  • a data clustering method in Embodiment 1 of the present invention may also include the step of forming an information structure tree, specifically including:
  • re-determining the data clustering condition and, under the new condition, further clustering a certain data set in the data clustering result to obtain a new maximum entropy load; the clustering result corresponding to the new maximum entropy load includes several sub-sets, whose information is the subdivision information of that data set; taking that data set as the parent node and the sub-sets as child nodes, an information structure tree is formed step by step.
  • the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and expresses information with the highest efficiency.
  • a data clustering method in Embodiment 1 of the present invention may also include the step of forming a clustering process tree, specifically including:
  • the first clustering result obtained by clustering with a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with a_mj is the root node of the clustering process tree, and the first clustering result obtained with a_1j forms the leaf nodes, whose degree is zero; a set at the q-th level serves as the parent node, and all the sets formed by the clustering at level q-1 whose elements make it up are its child nodes, so that the clustering process tree is formed step by step.
  • when the value of v_j traverses each item of the sequence {a_mj} in order for clustering, the clustering process tree reflects the process of clustering and identifying the data continuously from coarse to fine according to the granularity of the dimension data v_j of concern; the tree intuitively reflects all the information of a single data point being gradually clustered, and makes all the clustering information of the data traceable to its source.
  • HSV is a color space created according to the intuitive characteristics of color, also known as the hexagonal cone model.
  • the color parameters in this model are hue (h), saturation (s) and value (v), with ranges H: 0-180, S: 0-255, V: 0-255; an image is composed of several data points, each having an h value, an s value and a v value.
  • Embodiment 2 of the present invention provides a data clustering method. For the data of 12 scattered, disordered data points (their hue h values), clustering is performed by the following method, comprising the following steps:
  • the similarity between the data in this embodiment is affected by only one dimension, the difference Δh between the hue h values, so the data clustering condition in this embodiment is to cluster the data according to Δh:
  • v_1 is the data of the first dimension, Δh; a_m1 is the m-th item in the sequence {a_m1};
  • Δh is the dimension data of concern in Embodiment 2 of the present invention, so the value of Δh traverses each item of the sequence {a_m1} in order, and clustering with the next item proceeds from the clustering result obtained with the previous item.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{1} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_m1 is the m-th item in the sequence {a_m1}; a is the base of the logarithm, a > 1; the entropy load I_m^1 represents the average amount of information carried by the first clustering result obtained by clustering with v_1 set to the m-th item a_m1; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^1 indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • Embodiment 2 of the present invention clusters the data according to the data clustering condition specifically as follows:
  • I_max is the maximum entropy load, indicating the maximum average amount of information carried by the clustering results obtained under the clustering conditions; a worked sketch of this one-dimensional case follows.
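  • a worked sketch of this one-dimensional case, reusing the hypothetical helpers above; the 12 hue values below are illustrative, not the patent's actual data:

```python
# Twelve hypothetical hue h values for the 12 data points.
hues = [10, 11, 12, 13, 40, 41, 42, 90, 91, 92, 93, 94]

clusters, load = best_clustering(hues)
print(len(clusters), "sets, entropy load =", round(load, 3))
# The smallest difference already separates three well-spaced groups
# (around h=10, h=40 and h=90); that split carries the most average
# information, so it is kept as the data clustering result.
```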
  • Embodiment 2 of the present invention only uses the difference Δh between the hue h values of data points to illustrate the clustering of one-dimensional data; the data clustering method, system and storage medium of the present invention are applicable to the clustering of data of any dimensionality.
  • Embodiment 3 of the present invention provides a data clustering method. For the data of 11 ordered data points in the Cartesian coordinate system (hue h value, x coordinate value, y coordinate value), clustering is performed by the following method, comprising the following steps:
  • the similarity between the data in Embodiment 3 of the present invention is jointly affected by factors in two dimensions, the difference Δh between the hue h values and the difference Δx between the x coordinate values, so the data clustering condition in Embodiment 3 is to cluster the data according to the combination relationship of Δh and Δx:
  • the dimension data of concern to the clustering is Δh, and the dimension data not of concern is Δx; the combination relationship is therefore to fix Δx and traverse Δh for clustering.
  • v_1 is the data of the first dimension, Δh; a_m1 is the m-th item in the sequence {a_m1}; Δh is the dimension data of concern in Embodiment 3 of the present invention, so the value of Δh traverses each item of the sequence {a_m1} in order, and clustering with the next item proceeds from the clustering result obtained with the previous item.
  • v_2 is the data of the second dimension, Δx; a_m2 is the m-th item in the sequence {a_m2}.
  • the value of Δh traverses the items of the sequence (5, 6 and 7) in order, and clustering with the later item proceeds from the clustering result obtained with the previous item.
  • the degree of the leaf nodes is zero; a set at the second level serves as the parent node, and all the elements that make up that set at the first level are its child nodes, so that the clustering process tree is formed step by step, as shown in FIG. 8.
  • when the value of Δh traverses each item of the sequence {a_m1} in order for clustering, the clustering process tree reflects the process of clustering and identifying the data continuously from coarse to fine according to the granularity of the dimension data Δh of concern; the tree intuitively reflects all the information of a single data point being gradually clustered, so that all the clustering information of the data can be traced.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{1} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_m1 is the m-th item in the sequence {a_m1}; a is the base of the logarithm, a > 1; the entropy load I_m^1 represents the average amount of information carried by the first clustering result obtained by clustering with v_1 set to the m-th item a_m1; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^1 indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • Embodiment 3 of the present invention clusters the data according to the data clustering condition specifically as follows:
  • I_max is the maximum entropy load, indicating the maximum average amount of information carried by the clustering results obtained under the clustering conditions.
  • Embodiment 3 of the present invention only uses the difference Δh between hue h values and the difference Δx between x coordinate values to illustrate the clustering of two-dimensional data; the data clustering method, system and storage medium of the present invention are applicable to the clustering of any two-dimensional data.
  • Embodiment 4 of the present invention uses the field of image segmentation as an example to illustrate the data clustering method of the present invention.
  • the application scenario of image segmentation is shown in FIG. 13.
  • Embodiment 4 operates on an image; for the data of 156 ordered data points in the image (hue h value, x coordinate value, y coordinate value), clustering is performed by the following method, specifically:
  • the condition for data clustering in Embodiment 4 of the present invention is to cluster the data according to the combination relationship of Δh, Δx and Δy:
  • the dimension data of concern to the clustering is Δh, and the dimension data not of concern are Δx and Δy; the combination relationship is therefore to fix Δx and Δy and traverse Δh for clustering (a sketch follows below).
  • v_1 is the data of the first dimension, Δh; a_m1 is the m-th item in the sequence {a_m1}; Δh is the dimension data of concern in Embodiment 4 of the present invention, so the value of Δh traverses each item of the sequence {a_m1} in order, and clustering with the next item proceeds from the clustering result obtained with the previous item.
  • v_2 is the data of the second dimension, Δx; a_m2 is the m-th item in the sequence {a_m2}.
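  • how the fixed thresholds on Δx and Δy combine with the traversed Δh is not spelled out in this text, so the sketch below assumes the simplest reading: two image points join the same set when all three differences are within their current thresholds (union-find keeps the sets):

```python
def cluster_image_points(points, dh, dx, dy):
    """Group (h, x, y) points: points whose hue difference is <= dh and
    whose coordinate differences are <= dx and <= dy end up in one set.
    dx and dy stay fixed while dh traverses the difference sequence."""
    parent = list(range(len(points)))

    def find(i):                        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (h1, x1, y1) in enumerate(points):
        for j in range(i):
            h2, x2, y2 = points[j]
            if abs(h1 - h2) <= dh and abs(x1 - x2) <= dx and abs(y1 - y2) <= dy:
                parent[find(i)] = find(j)

    sets = {}
    for i in range(len(points)):
        sets.setdefault(find(i), []).append(points[i])
    return list(sets.values())
```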
  • the value of Δh traverses each item of the sequence {a_m1} in order for clustering; the clustering result obtained with Δh set to 163 is placed at the 163rd level of the clustering process tree and is the root node of the clustering process tree;
  • the clustering result obtained with Δh set to 0 is placed at the first level of the clustering process tree and forms its leaf nodes, whose degree is zero;
  • a set at the second level serves as the parent node, and all the elements that make up that set at the first level are its child nodes, so that the clustering process tree is formed step by step, as shown in FIG. 15; when the value of Δh traverses each item of the sequence {a_m1} in order for clustering, this reflects that the clustering process tree clusters and identifies the data continuously from coarse to fine.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{1} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_m1 is the m-th item in the sequence {a_m1}; a is the base of the logarithm, a > 1; the entropy load I_m^1 represents the average amount of information carried by the first clustering result obtained by clustering with v_1 set to the m-th item a_m1; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^1 indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • Embodiment 4 of the present invention clusters the data according to the data clustering condition specifically as follows:
  • I_max is the maximum entropy load, indicating the maximum average amount of information carried by the clustering results obtained under the clustering conditions.
  • Embodiment 4 of the present invention only uses the difference Δh between hue h values, the difference Δx between x coordinate values and the difference Δy between y coordinate values to illustrate the clustering of three-dimensional data.
  • steps (1), (2) and (3) complete one clustering. As can be seen from the corresponding drawings, each clustering yields at least one first clustering result, and each first clustering result contains at least one set. As shown in FIG. 20, the clustering result corresponding to the maximum entropy load I_4 consists of four sets: "hard hat", "water cup", "gloves" and "image background". Suppose Embodiment 4 of the present invention needs the subdivision information of the "water cup" set data, with the expected entropy load being the largest: the data clustering condition is then re-determined, and steps (1), (2) and (3) are repeated under the new condition to further cluster the "water cup" set data and obtain a new maximum entropy load.
  • the clustering result corresponding to the new maximum entropy load includes two sub-sets, "cup lid" and "cup body", whose information is the subdivision information of the "water cup" set data. Taking the "water cup" set as the parent node and its sub-sets "cup lid" and "cup body" as child nodes, an information structure tree is formed step by step (a sketch follows).
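  • a minimal sketch of assembling the information structure tree, reusing the hypothetical best_clustering helper; the node names and member values are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class InfoNode:
    """One node of the information structure tree: a data set plus the
    sub-sets found by re-clustering its members under a new condition."""
    name: str
    members: list
    children: list = field(default_factory=list)

    def refine(self):
        # Re-cluster this node's own members; the max-entropy sub-sets
        # become its child nodes (the subdivision information).
        subsets, _ = best_clustering(self.members)
        self.children = [InfoNode(f"{self.name}/{i}", s) for i, s in enumerate(subsets)]

# e.g. a "water cup" set whose members split into two sub-sets,
# playing the roles of "cup lid" and "cup body".
cup = InfoNode("water cup", [12, 13, 14, 60, 61, 62])
cup.refine()
```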
  • the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and expresses information with the highest efficiency; specifically:
  • Embodiment 4 of the present invention determines a new data clustering condition for the data values of the 6 ordered data points in the "water cup" set (hue h value, x coordinate value, y coordinate value) and then repeats steps (1), (2) and (3) to cluster further, specifically:
  • the similarity between these 6 ordered data points is affected by factors in only two dimensions, the difference Δh between the hue h values and the difference Δy between the y coordinate values, so the condition for clustering the "water cup" set data is to cluster according to the combination relationship of Δh and Δy:
  • the dimension data of concern for clustering the "water cup" set is Δh, and the dimension data not of concern is Δy; the combination relationship is therefore to fix Δy and traverse Δh for clustering.
  • v_1 is the data of the first dimension, Δh; a_m1 is the m-th item in the sequence {a_m1}; Δh is the dimension data of concern for clustering the "water cup" set in Embodiment 4 of the present invention, so the value of Δh traverses each item of the sequence {a_m1} in order, and clustering with the next item proceeds from the clustering result obtained with the previous item.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{1} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_m1 is the m-th item in the sequence {a_m1}; a is the base of the logarithm, a > 1; the entropy load I_m^1 represents the average amount of information carried by the first clustering result obtained by clustering with v_1 set to the m-th item a_m1; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^1 indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • Embodiment 4 of the present invention further clusters the "water cup" set data according to the new data clustering condition as follows:
  • the maximum entropy load I_max represents the maximum average amount of information carried by the clustering results obtained under the clustering conditions.
  • the information structure tree reflects how, according to the granularity of the dimension data of concern, the original image data are coarsely clustered into the "hard hat", "water cup" and "gloves" sets, and how the "water cup" set data are further clustered and identified at a finer granularity. The entropy load corresponding to each fork of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and expresses information with the highest efficiency.
  • if Embodiment 4 of the present invention started by clustering directly from the "water cup" set data, the "cup lid" and the "cup body" would obviously be separated, as shown in FIG. 24; but the "cup lid" and "cup body" are only local data with respect to the entire image, and local data give incomplete and inaccurate clustering information for the whole image. We therefore expect to first obtain the overall clustering of the entire image and then cluster those results further to obtain local subdivision information, as shown in FIG. 20. The present invention accordingly clusters from the overall data to obtain at least one first clustering result and derives the data clustering result from the first clustering results, realizing the integrity of data clustering; and, based on the at least one first clustering result, clusters it again to obtain its local subdivision information, realizing the coordination and unification of the integrity and locality of data clustering, so the clustering results obtained are more complete and accurate.
  • Embodiment 5 of the present invention provides a data clustering system. The system includes a memory, a processor, and a program stored in the memory and operable on the processor; when the program is executed by the processor, a data clustering method is implemented, the method comprising the following steps:
  • in Embodiment 5 of the present invention, the basis for determining the data clustering condition is the similarity between data, and the similarity between data is often affected by factors in multiple dimensions. Embodiment 5 therefore clusters the data according to the following combination relationship of different-dimension data:
  • the combination relationship is determined by the dimensions that the clustering is concerned with, including: fixing the dimension data not of concern, and traversing combinations of the dimension data of concern.
  • v_j is the data of the j-th dimension, and the differences between the v_j values, arranged in ascending order, form the sequence {a_mj}; a_mj is the m-th item of the sequence and represents the maximum difference between the v_j values, and a_1j represents the minimum difference;
  • when v_j is a dimension not of concern, v_j takes any one item of the sequence {a_mj}; when v_j is a dimension of concern, the value of v_j traverses each item of {a_mj} in order, and clustering with the later item proceeds from the first clustering result obtained with the previous item.
  • each first clustering result includes at least one data set.
  • the entropy load corresponding to each first clustering result is calculated, and the entropy load represents the average amount of information carried by the corresponding first clustering result.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{j} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_mj is the m-th item in the sequence {a_mj}; {a_mj} is the sequence formed by arranging the differences between the data v_j of the j-th dimension in ascending order; a is the base of the logarithm, a > 1; the entropy load I_m^j represents the average amount of information carried by the first clustering result obtained by clustering with v_j set to the m-th item a_mj; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^j indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • I_max is the maximum entropy load, indicating the maximum average amount of information carried by the clustering results obtained under the clustering conditions.
  • the corresponding clustering result can store the largest amount of information and expresses information with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_max is the one we expect.
  • the data clustering method may also include the step of forming an information structure tree, specifically including:
  • re-determining the data clustering condition and, under the new condition, further clustering a certain data set in the data clustering result to obtain a new maximum entropy load; the clustering result corresponding to the new maximum entropy load includes several sub-sets, whose information is the subdivision information of that data set; taking that data set as the parent node and the sub-sets as child nodes, an information structure tree is formed step by step.
  • the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and expresses information with the highest efficiency.
  • a data clustering method according to Embodiment 5 of the present invention may also include the step of forming a clustering process tree, specifically including:
  • the first clustering result obtained by clustering with a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with a_mj is the root node of the clustering process tree, and the first clustering result obtained with a_1j forms the leaf nodes, whose degree is zero; a set at the q-th level serves as the parent node, and all the sets formed by the clustering at level q-1 whose elements make it up are its child nodes, so that the clustering process tree is formed step by step.
  • when the value of v_j traverses each item of the sequence {a_mj} in order for clustering, the clustering process tree reflects the process of clustering and identifying the data continuously from coarse to fine according to the granularity of the dimension data v_j of concern; the tree intuitively reflects all the information of a single data point being gradually clustered, and makes all the clustering information of the data traceable to its source.
  • Embodiment 6 of the present invention also provides a computer-readable storage medium. The storage medium stores at least one program executable by at least one processor; when the at least one program is executed by the at least one processor, a data clustering method is implemented, the method comprising the following steps:
  • in Embodiment 6 of the present invention, the basis for determining the data clustering condition is the similarity between data, and the similarity between data is often affected by factors in multiple dimensions. Embodiment 6 therefore clusters the data according to the following combination relationship of different-dimension data:
  • the combination relationship is determined by the dimensions that the clustering is concerned with, including: fixing the dimension data not of concern, and traversing combinations of the dimension data of concern.
  • v_j is the data of the j-th dimension, and the differences between the v_j values, arranged in ascending order, form the sequence {a_mj}; a_mj is the m-th item of the sequence and represents the maximum difference between the v_j values, and a_1j represents the minimum difference;
  • when v_j is a dimension not of concern, v_j takes any one item of the sequence {a_mj}; when v_j is a dimension of concern, the value of v_j traverses each item of {a_mj} in order, and clustering with the later item proceeds from the first clustering result obtained with the previous item.
  • each first clustering result includes at least one data set.
  • the entropy load corresponding to each first clustering result is calculated, and the entropy load represents the average amount of information carried by the corresponding first clustering result.
  • the entropy load is calculated as:

    $$p_i = \frac{k_i}{N}, \qquad I_{m}^{j} = -\sum_{i=1}^{n} p_i \log_a p_i$$

  • where a_mj is the m-th item in the sequence {a_mj}; {a_mj} is the sequence formed by arranging the differences between the data v_j of the j-th dimension in ascending order; a is the base of the logarithm, a > 1; the entropy load I_m^j represents the average amount of information carried by the first clustering result obtained by clustering with v_j set to the m-th item a_mj; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  • each clustering yields several data sets, and each data set corresponds to a data category.
  • each data category has a corresponding fixed-length code.
  • the average amount of information each code can store is fixed, and correspondingly so is each code's information expression efficiency. We expect a fixed-length code to store the largest possible average amount of information, so that the information expression efficiency is highest.
  • the entropy load I_m^j indicates the average amount of information carried by the result of this clustering: the larger its value, the greater the average amount of information of each data category in the clustering result, the greater the average amount of information that the code corresponding to each data category can store, and the higher the information expression efficiency of that code; accordingly, a computer system with a given storage space can store more information and express it more efficiently.
  • I_max is the maximum entropy load, indicating the maximum average amount of information carried by the clustering results obtained under the clustering conditions.
  • the corresponding clustering result can store the largest amount of information and expresses information with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_max is the one we expect.
  • a data clustering method in Embodiment 6 of the present invention may also include the step of forming an information structure tree, specifically including:
  • re-determining the data clustering condition and, under the new condition, further clustering a certain data set in the data clustering result to obtain a new maximum entropy load; the clustering result corresponding to the new maximum entropy load includes several sub-sets, whose information is the subdivision information of that data set; taking that data set as the parent node and the sub-sets as child nodes, an information structure tree is formed step by step.
  • the entropy load corresponding to each branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and expresses information with the highest efficiency.
  • the data clustering method may also include the step of forming a clustering process tree, specifically including:
  • the first clustering result obtained by clustering with a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with a_mj is the root node of the clustering process tree, and the first clustering result obtained with a_1j forms the leaf nodes, whose degree is zero; a set at the q-th level serves as the parent node, and all the sets formed by the clustering at level q-1 whose elements make it up are its child nodes, so that the clustering process tree is formed step by step.
  • when the value of v_j traverses each item of the sequence {a_mj} in order for clustering, the clustering process tree reflects the process of clustering and identifying the data continuously from coarse to fine according to the granularity of the dimension data v_j of concern; the tree intuitively reflects all the information of a single data point being gradually clustered, and makes all the clustering information of the data traceable to its source.
  • the data clustering method, system and storage medium provided by the present invention cluster from the overall data according to the data clustering conditions to obtain at least one first clustering result, and take the data clustering result from the first clustering result carrying the largest average amount of information. This realizes the integrity of data clustering, so the clustering result obtained is more complete and accurate; the clustering process neither depends on nor specially processes any particular data and does not restrict the data type, so it is generally applicable to the clustering of any data and is highly practical;
  • the maximum carried average amount of information is used as the basis for determining the clustering result: the greater the amount of information a computer system with a given storage space can store, the higher the efficiency of information expression;
  • the data clustering method, system and storage medium provided by the present invention cluster the at least one first clustering result again, based on that result, to obtain its local subdivision information, realizing the coordination and unification of the integrity and locality of data clustering;
  • the data clustering method, system and storage medium provided by the present invention form an information structure tree, and the entropy load corresponding to each branch of the tree is the maximum entropy load under the given clustering condition, so a computer system with a given storage space can store the largest amount of information and express information with the highest efficiency;
  • the data clustering method, system and storage medium provided by the present invention also form a clustering process tree during clustering; the tree clusters and identifies the data continuously from coarse to fine according to the granularity of the dimension data of concern, intuitively reflects all the information of a single data point being gradually clustered, and makes all the clustering information of the data traceable to its source.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data clustering method, system and storage medium, comprising the following steps: determining a data clustering condition; clustering the data according to the data clustering condition to obtain at least one first clustering result, and computing the entropy load of each first clustering result, the entropy load expressing the average amount of information carried by the corresponding first clustering result; taking the maximum among these entropy loads, whose corresponding first clustering result is the data clustering result. The invention clusters starting from the data as a whole, achieving integrity of data clustering, so the clustering results obtained are more complete and accurate; the clustering process neither depends on nor specially processes any particular data and places no restriction on data type, so it is universally applicable to clustering any data; the maximum carried average amount of information is used as the basis for determining the clustering result, so a computer system with a given amount of storage can store more information, improving the efficiency of information expression.

Description

Data clustering method, system and storage medium

Technical field

The invention relates to the technical field of artificial intelligence, and in particular to a data clustering method, system and storage medium.
Background

In recent years, with the development and spread of the Internet, the quantity of image, video, text and other data, and the number of dimensions representing such data, keep growing. Exploiting these massive data requires fast and effective clustering of high-dimensional data, which has spawned a large number of clustering algorithms.

Clustering, an important research topic in machine learning, has been widely applied in data mining, face recognition, medical image analysis, image segmentation and other fields. Image clustering partitions target data with completely unknown labels into different clusters; it is an exploratory technique for grouping by data features, commonly used to organize image information or to generate training-sample labels, and is a common image-processing means.

Existing image clustering methods generally extract image features and cluster them with traditional algorithms such as the K-Means clustering algorithm or Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

Taking K-Means as an example, the traditional algorithm takes a sample set, the number of clusters K and a maximum number of iterations N, and outputs a cluster partition. The rough procedure is: select K objects from the data as initial cluster centers; assign each object to a cluster by its distance to the cluster centers; recompute each cluster center; evaluate the standard measure function, stopping when the maximum number of iterations is reached and continuing otherwise.

Based on this procedure, K-Means has the following main disadvantages:

a. K is hard to determine, because the optimal number of categories for a given sample set cannot be known in advance;

b. with the above iteration, K-Means yields only a locally optimal clustering result and lacks integrity;

c. it is sensitive to outliers and isolated points;

d. the sample set must possess a mean, which restricts the data types;

e. the clustering result depends on the initialization of the cluster centers, which are chosen at random.

The applicant has also studied other clustering algorithms thoroughly and found that, apart from K-Means, other traditional algorithms likewise involve too much dependence on and special processing of particular data, so they cluster data without universal applicability or integrity; the field of data clustering lacks sufficient exploration of clustering methods that overcome these shortcomings.
Summary of the invention

The object of the invention is to provide a data clustering method, system and storage medium that solve the technical problem that traditional clustering algorithms in this field lack integrity and universal applicability.

To achieve the above object, an embodiment of the invention provides a data clustering method, characterized in that the method comprises:

determining a data clustering condition;

clustering the data according to the data clustering condition to obtain at least one first clustering result, each first clustering result of the at least one first clustering result containing at least one data set; computing the entropy load corresponding to each first clustering result, the entropy load expressing the average amount of information carried by the corresponding first clustering result;

taking the maximum entropy load among the entropy loads corresponding to the first clustering results; the first clustering result corresponding to the maximum entropy load is the data clustering result.

Preferably, the data clustering condition is determined according to the similarity between data.

Preferably, clustering the data according to the data clustering condition comprises: clustering the data according to the combination relation of data of different dimensions.

Further preferably, the combination relation of data of different dimensions is decided by the dimensions the clustering focuses on, and comprises: fixing the dimension data not focused on, and traversing combinations of the dimension data focused on.

Further preferably, clustering the data according to the combination relation of data of different dimensions is specifically:

(v_1, v_2, v_3, …, v_j),

v_j = {a_mj} = a_1j, a_2j, …, a_mj,

where v_j is the data of the j-th dimension; the differences between the data v_j, sorted in ascending order, form the sequence {a_mj}; a_mj, the m-th item of {a_mj}, is the largest difference between the data v_j, and a_1j the smallest. When v_j is dimension data the clustering does not focus on, v_j takes at least one arbitrary item of {a_mj}; when v_j is dimension data the clustering focuses on, v_j traverses every item of {a_mj} in order, and clustering with the next item proceeds from the first clustering result obtained with the previous item.
Preferably, the entropy load, i.e. the information of a clustering result averaged over its n data categories, is computed as

$I^{a_{mj}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_mj is the m-th item of the sequence {a_mj} obtained by sorting the differences between the j-th-dimension data v_j in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_mj} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_mj of {a_mj}; n is the number of data sets contained in that first clustering result; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.

Further preferably, a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.
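The entropy-load computation above can be sketched in Python; this is a minimal illustration assuming the per-category averaging just described, and entropy_load and its parameter names are illustrative, not from the patent:

```python
import math

def entropy_load(clusters, base=2):
    """Entropy load of one first clustering result.

    clusters: the data sets of the result; base: the logarithm base a > 1
    (base=2 yields bits). Returns the average amount of information
    carried per data category: -(1/n) * sum(p_i * log_a(p_i)) with
    p_i = k_i / N.
    """
    sizes = [len(c) for c in clusters]
    n, total = len(sizes), sum(sizes)
    # p_i is the share of the i-th data set in all N data points
    return -sum((k / total) * math.log(k / total, base) for k in sizes) / n
```

For example, a result of two equal six-point sets gives entropy_load(...) = 0.5 bit, while a single all-embracing set gives 0.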
Preferably, the method comprises the step of forming an information structure tree, comprising:

re-determining the data clustering condition, and executing the clustering method under the new condition to further cluster a certain data set in the data clustering result, obtaining a new maximum entropy load; the clustering result corresponding to the new maximum entropy load comprises several sub-sets, whose information is the subdivision information of that data set; that data set is taken as the parent node and the sub-sets as its child nodes, thereby gradually forming the information structure tree.
Preferably, the method comprises the step of forming a clustering process tree, comprising:

when v_j traverses every item of the sequence {a_mj} in order for clustering, the first clustering result obtained with v_j = a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with v_j = a_mj is the root node, and the first clustering result obtained with v_j = a_1j comprises the leaf nodes, whose degree is zero; each set at level q serves as a parent node, and all the level-(q−1) sets that merged into it are its child nodes, thereby gradually forming the clustering process tree.
To achieve the above object, another embodiment of the invention provides a data clustering system, characterized in that the system comprises a memory, a processor, and a program stored in the memory and runnable on the processor, the program implementing the steps of the above data clustering method when executed by the processor.

To achieve the above object, another embodiment of the invention provides a computer-readable storage medium, characterized in that the storage medium stores at least one program executable by at least one processor, the at least one program implementing the steps of the above data clustering method when executed by the at least one processor.
The data clustering method, system and storage medium provided by the invention have the following beneficial effects:

(1) Clustering proceeds from the data as a whole according to the data clustering condition to obtain at least one first clustering result, and the data clustering result is taken from the first clustering result carrying the largest average amount of information. This achieves integrity of data clustering, so the clustering results are more complete and accurate. The clustering process neither depends on nor specially processes any particular data and places no restriction on data type, so the method is universally applicable to clustering any data and is highly practical. Using the maximum carried average amount of information as the basis for determining the clustering result means that a computer system with a given amount of storage can store more information, improving the efficiency of information expression.

(2) Starting from the at least one first clustering result, re-clustering it yields its local subdivision information, reconciling and unifying the integrity and locality of data clustering.

(3) An information structure tree is formed in which the entropy load at every branch is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and express information with the highest efficiency.

(4) A clustering process tree is also formed during clustering, which clusters and discriminates the data continuously, from coarse to fine, according to the granularity of the focused dimension data; it intuitively shows all the information of the gradual clustering of a single data point, so that all clustering information of the data can be traced back to its source.
Brief description of the drawings

Fig. 1 is a flow diagram of a data clustering method according to Embodiment 1 of the invention;

Fig. 2 is a diagram of the application scenario of the 12 data points in Embodiment 2;

Fig. 3 is a diagram of the result of clustering with a data-value difference of 1 in Embodiment 2;

Fig. 4 is a diagram of the result of clustering with a data-value difference of 2 in Embodiment 2;

Fig. 5 is a diagram of the result of clustering with a data-value difference of 3 in Embodiment 2;

Fig. 6 is a diagram of the result of clustering with a data-value difference of 4 in Embodiment 2;

Fig. 7 is a diagram of the application scenario of the 11 ordered data points in Embodiment 3;

Fig. 8 is a structural diagram of the clustering process tree of Embodiment 3;

Fig. 9 is a diagram of the result of clustering with a data-value difference of 1 in Embodiment 3;

Fig. 10 is a diagram of the result of clustering with a data-value difference of 2 in Embodiment 3;

Fig. 11 is a diagram of the result of clustering with a data-value difference of 3 in Embodiment 3;

Fig. 12 is a diagram of the result of clustering with a data-value difference of 4 in Embodiment 3;

Fig. 13 is a diagram of the application scenario of Embodiment 4;

Fig. 14 is a diagram of the application scenario of the 156 ordered data points in Embodiment 4;

Fig. 15 is a structural diagram of the clustering process tree of Embodiment 4;

Fig. 16 is a diagram of the result of clustering with a data-value difference of 0 in Embodiment 4;

Fig. 17 is a diagram of the result of clustering with a data-value difference of 1 in Embodiment 4;

Fig. 18 is a diagram of the result of clustering with a data-value difference of 2 in Embodiment 4;

Fig. 19 is a diagram of the result of clustering with a data-value difference of 3 in Embodiment 4;

Fig. 20 is a diagram of the result of clustering with a data-value difference of 4 in Embodiment 4;

Fig. 21 is a diagram of the result of clustering the data of the "cup" set with a data-value difference of 0 in Embodiment 4;

Fig. 22 is a diagram of the result of clustering the data of the "cup" set with a data-value difference of 2 in Embodiment 4;

Fig. 23 is a diagram of the result of clustering the data of the "cup" set with a data-value difference of 4 in Embodiment 4;

Fig. 24 is a structural diagram of the information structure tree of Embodiment 4.
Detailed description

The technical solution of the invention is described in further detail below with reference to the drawings and specific embodiments. The following is a further detailed explanation of the invention in combination with specific implementations and must not be taken to confine the implementation of the invention to these descriptions. For those of ordinary skill in the art to which the invention belongs, several simple deductions or substitutions may be made without departing from the concept of the invention, and all shall be regarded as falling within the scope of protection of the invention.
Embodiment 1

Embodiment 1 of the invention provides a data clustering method, as shown in Fig. 1, comprising the following steps:

(1) Determine the data clustering condition, comprising the following steps:

determining the factors that influence the similarity between data;

determining, among these factors, the data dimensions the clustering focuses on;

determining the combination relation of the data of different dimensions;

determining the data clustering condition from the combination relation of the dimension data.

The data clustering condition is determined according to the similarity between data, and that similarity is usually influenced jointly by factors of several dimensions; the clustering condition of Embodiment 1 therefore clusters the data according to the following combination relation of data of different dimensions:

(v_1, v_2, v_3, …, v_j),

v_j = {a_mj} = a_1j, a_2j, …, a_mj.

The combination relation is decided by the dimensions the clustering focuses on and comprises: fixing the dimension data not focused on, and traversing combinations of the dimension data focused on.

Here v_j is the data of the j-th dimension; the differences between the data v_j, sorted in ascending order, form the sequence {a_mj}; a_mj, the m-th item of {a_mj}, is the largest difference between the data v_j, and a_1j the smallest. When v_j is dimension data the clustering does not focus on, v_j takes at least one arbitrary item of {a_mj}; when v_j is dimension data the clustering focuses on, v_j traverses every item of {a_mj} in order, and clustering with the next item proceeds from the first clustering result obtained with the previous item.
(2) Cluster the data according to the data clustering condition to obtain at least one first clustering result, each containing at least one data set. Compute the entropy load corresponding to each first clustering result, the entropy load expressing the average amount of information carried by the corresponding first clustering result. The entropy load is computed as

$I^{a_{mj}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

[Corrected under Rule 91, 26.10.2021]

where a_mj is the m-th item of the sequence {a_mj} obtained by sorting the differences between the j-th-dimension data v_j in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_mj} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_mj; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

In Embodiment 1 the preferred value is a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.
Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_mj} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from I_max, specifically:

$I_{max} = \max\{I^{a_{1j}}, I^{a_{2j}}, …, I^{a_{mj}}\}$,

where I_max, the maximum entropy load, is the largest average amount of information carried by the clustering results obtained under the clustering condition; a computer system with a given amount of storage can then store the largest amount of information and express it with the highest efficiency, so the clustering result corresponding to I_max is the one we want.
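Steps (1) to (3) can be read end to end, for a single focused one-dimensional quantity, as the following sketch; the single-pass threshold merging over the ascending difference sequence is an illustrative interpretation of the traversal, and it reuses entropy_load from the sketch above:

```python
def cluster_by_threshold(values, delta):
    """Group sorted 1-D values so that neighbors differing by at most
    delta fall into one set; a larger delta merges the sets of a smaller
    one, mirroring clustering from the previous first clustering result."""
    ordered = sorted(values)
    clusters, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] <= delta:
            current.append(v)
        else:
            clusters.append(current)
            current = [v]
    clusters.append(current)
    return clusters

def best_clustering(values, base=2):
    """Traverse the ascending difference sequence {a_mj}, cluster at each
    item, and keep the clustering whose entropy load is maximal."""
    diffs = sorted({abs(values[i] - values[j])
                    for i in range(len(values))
                    for j in range(i + 1, len(values))})
    best_load, best_delta, best_clusters = float("-inf"), None, None
    for d in diffs:
        clusters = cluster_by_threshold(values, d)
        load = entropy_load(clusters, base)
        if load > best_load:
            best_load, best_delta, best_clusters = load, d, clusters
    return best_load, best_delta, best_clusters  # I_max and its clustering
```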
After steps (1), (2) and (3) complete one pass of clustering, the data clustering method of Embodiment 1 may further comprise the step of forming an information structure tree, specifically comprising:

re-determining the data clustering condition and executing the clustering method under the new condition to further cluster a certain data set in the data clustering result, obtaining a new maximum entropy load; the clustering result corresponding to the new maximum entropy load comprises several sub-sets, whose information is the subdivision information of that data set; the data set is taken as the parent node and the sub-sets as its child nodes, thereby gradually forming the information structure tree.

The entropy load at every branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and therefore express information with the highest efficiency.
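A hedged sketch of how such an information structure tree could be grown by recursive re-clustering, reusing best_clustering above (the stopping rule and the dictionary node layout are assumptions for illustration):

```python
def info_structure_tree(values, base=2):
    """Take a clustered set as parent node and the sub-sets of its best
    further clustering as child nodes, recursing while a subdivision exists."""
    node = {"members": sorted(values), "children": []}
    if len(values) < 2:
        return node
    load, delta, subsets = best_clustering(values, base)
    if subsets is None or len(subsets) < 2:
        return node  # no informative subdivision under any threshold
    for s in subsets:
        node["children"].append(info_structure_tree(s, base))
    return node
```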
The data clustering method of Embodiment 1 may further comprise the step of forming a clustering process tree, specifically comprising:

when v_j traverses every item of the sequence {a_mj} in order for clustering, the first clustering result obtained with v_j = a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with v_j = a_mj is the root node, and the first clustering result obtained with v_j = a_1j comprises the leaf nodes, whose degree is zero; each set at level q serves as a parent node, and all the elements formed into that set by the clustering at level q−1 are its child nodes, thereby gradually forming the clustering process tree. When v_j traverses every item of {a_mj} in order, this embodies a process in which the clustering process tree clusters and discriminates the data continuously, from coarse to fine, according to the granularity of the focused dimension data v_j; the tree intuitively shows all the information of the gradual clustering of a single data point, so that all clustering information of the data can be traced back to its source.
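Under the same illustrative one-dimensional reading, the clustering process tree can be recorded by keeping every level and linking each set to the containing set one level up; this sketch relies on the fact that threshold clusterings at increasing thresholds are nested:

```python
def clustering_process_tree(values):
    """Level q holds the clustering at the q-th item of the ascending
    difference sequence; the first level is the finest clustering (leaf
    nodes of degree zero), the last the coarsest (root node)."""
    diffs = sorted({abs(values[i] - values[j])
                    for i in range(len(values))
                    for j in range(i + 1, len(values))})
    levels = [[frozenset(c) for c in cluster_by_threshold(values, d)]
              for d in diffs]
    edges = {}  # (level, child set) -> (level + 1, parent set)
    for q in range(len(levels) - 1):
        for child in levels[q]:
            parent = next(p for p in levels[q + 1] if child <= p)
            edges[(q, child)] = (q + 1, parent)
    return levels, edges
```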
Embodiment 2

HSV is a color space created from the intuitive properties of color, also called the hexcone model. Its parameters are hue (h), saturation (s) and value (v), with ranges H: 0–180, S: 0–255, V: 0–255. An image consists of a number of data points, each with an h value, an s value and a v value.

As shown in Fig. 2, Embodiment 2 of the invention provides a data clustering method that clusters the data of 12 scattered, unordered data points, their hue h values, by the following steps:

(1) Determine the data clustering condition, specifically:

In this embodiment the similarity between data is influenced by a factor of only one dimension, the difference Δh between hue h values, so the clustering condition of this embodiment is to cluster the data according to Δh:

v_1 = Δh = {a_m1} = a_11, a_21, …, a_m1 = 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25,

where v_1 is the data of the first dimension, Δh; the differences between the data v_1 = Δh, sorted in ascending order, form the sequence {a_m1}; a_m1 is the m-th item of {a_m1}; a_m1 = 25 means the largest difference between the h values is 25, and a_11 = 1 means the smallest is 1. Δh is the dimension data the clustering of Embodiment 2 focuses on, so Δh traverses every item of {a_m1} in order, and clustering with the next item proceeds from the clustering result obtained with the previous item.

The clustering condition of Embodiment 2 is therefore: traverse, in order, every item of the sequence of differences between the data points' hue h values, Δh = {a_m1} = 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, clustering with each next item from the clustering result obtained with the previous item.
(2) Cluster the data according to the clustering condition and compute the entropy load after each pass:

$I^{a_{m1}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_m1 is the m-th item of the sequence {a_m1} obtained by sorting the differences between the first-dimension data v_1 = Δh in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_m1} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_m1; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

In Embodiment 2 the preferred value is a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.
Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_m1} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.
Embodiment 2 clusters the data according to the clustering condition as follows:

S201. Cluster at Δh = 1, meaning data points whose hue h values differ by 1 gather into one set. The resulting first clustering result contains five data sets, as shown in Fig. 3; n = 5, N = 12. The entropy load $I_1 = -\frac{1}{5}\sum_{i=1}^{5} p_i \log_2 p_i$ is then computed (its numeric value is given in the original as an image).

S202. From the result at Δh = 1, cluster at Δh = 2, meaning data points whose hue h values differ by 2 gather into one set. The resulting first clustering result contains three data sets, as shown in Fig. 4; n = 3, N = 12. The entropy load I_2 is computed in the same way.

S203. From the result at Δh = 2, cluster at Δh = 3, meaning data points whose hue h values differ by 3 gather into one set. The resulting first clustering result contains two data sets, as shown in Fig. 5; n = 2, N = 12. The entropy load I_3 is computed in the same way.

S204. From the result at Δh = 3, cluster at Δh = 4, meaning data points whose hue h values differ by 4 gather into one set. The resulting first clustering result contains one data set, as shown in Fig. 6: all data points have gathered into a single set, so n = 1 and the entropy load $I_4 = -\frac{1}{1}(1 \cdot \log_2 1) = 0$.

In this embodiment, clustering with Δh = 5 to 25 gives the same result and the same entropy load as Δh = 4 in step S204, so it is not repeated here.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from it:

$I_{max} = \max\{I_1, I_2, I_3, I_4\} = I_3$,

where I_max is the largest average amount of information carried by the clustering results obtained under the clustering condition. I_3 being the largest means the clustering obtained with "Δh = 3" carries the most information: a computer system with a given amount of storage stores the largest amount of information with the clustering corresponding to I_3 and expresses it with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_3 is the one we want.

Embodiment 2 illustrates the clustering of one-dimensional data only with the difference Δh between the data points' hue h values; in substance, the data clustering method, system and storage medium of the invention apply to clustering any one-dimensional data.
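The flow of this embodiment can be exercised with the sketches above on stand-in data; the twelve hue values below are hypothetical, since the actual values appear only in the figures:

```python
# Hypothetical hue h values for twelve data points (NOT the patent's data).
h_values = [10, 11, 13, 16, 18, 21, 40, 41, 43, 46, 48, 51]

load, delta, clusters = best_clustering(h_values)
print(f"I_max = {load:.3f} bits at dh = {delta}")
for c in clusters:
    print(sorted(c))
```

On these values the traversal settles on Δh = 3 with two balanced six-point sets and I_max = 0.5 bit, matching the pattern of this embodiment: a middle item of the difference sequence, not the finest or the coarsest, carries the most average information.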
Embodiment 3

As shown in Fig. 7, Embodiment 3 of the invention provides a data clustering method that clusters the data of 11 ordered data points in a rectangular coordinate system, their hue h values, x coordinates and y coordinates, by the following steps:

(1) Determine the data clustering condition, specifically:

In Embodiment 3 the similarity between data is influenced jointly by factors of two dimensions, the difference Δh between hue h values and the difference Δx between x coordinates, so the clustering condition of Embodiment 3 is to cluster the data according to the combination relation of Δh and Δx:

(v_1, v_2),

v_1 = Δh,

v_2 = Δx.

The dimension data the clustering of Embodiment 3 focuses on is Δh and the dimension not focused on is Δx, so the combination relation is: fix Δx and traverse the data Δh for clustering. For Δh:

v_1 = Δh = {a_m1} = a_11, a_21, …, a_m1 = 0, 1, 2, 3, 4, 5, 6, 7,

where v_1 is the data of the first dimension, Δh; the differences between the data v_1 = Δh, sorted in ascending order, form the sequence {a_m1}; a_m1 is the m-th item of {a_m1}; a_m1 = 7 means the largest difference between the h values is 7, and a_11 = 0 means the smallest is 0. Δh is the focused dimension data of Embodiment 3, so Δh traverses every item of {a_m1} in order, clustering with each next item from the clustering result obtained with the previous item.

For Δx:

v_2 = Δx = {a_m2} = a_12, a_22, …, a_m2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,

where v_2 is the data of the second dimension, Δx; the differences between the data v_2 = Δx, sorted in ascending order, form the sequence {a_m2}; a_m2 is the m-th item of {a_m2}; a_m2 = 10 means the largest difference between the x coordinates is 10, and a_12 = 1 means the smallest is 1. Δx is dimension data the clustering of Embodiment 3 does not focus on, so Δx takes at least one arbitrary item of {a_m2}; Embodiment 3 takes the first item, hence Δx = 1.

The clustering condition of Embodiment 3 is therefore: fix Δx = 1 and traverse, in order, every item of the sequence of differences between the data points' hue h values, Δh = {a_m1} = 0, 1, 2, 3, 4, 5, 6, 7, clustering with each next item from the clustering result obtained with the previous item.

As Δh traverses every item of {a_m1} in order, the clustering result obtained with Δh = a_81 = 7 is placed at the eighth, topmost level of the clustering process tree and is its root node; the clustering result obtained with Δh = a_11 = 0 is placed at the first level and comprises the leaf nodes, whose degree is zero. A set at the second level serves as a parent node, and all the first-level elements composing that set are its child nodes, thereby gradually forming the clustering process tree shown in Fig. 8. When Δh traverses every item of {a_m1} in order, this embodies a process in which the clustering process tree clusters and discriminates the data continuously, from coarse to fine, according to the granularity of the focused dimension data Δh; the tree intuitively shows all the information of the gradual clustering of a single data point, so that all clustering information of the data can be traced back to its source.
(2) Cluster the data according to the clustering condition and compute the entropy load after each pass:

$I^{a_{m1}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_m1 is the m-th item of the sequence {a_m1} obtained by sorting the differences between the first-dimension data v_1 = Δh in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_m1} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_m1; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

In Embodiment 3 the preferred value is a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.
Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_m1} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.
Embodiment 3 clusters the data according to the clustering condition as follows:

S301. Fix Δx = 1 and cluster at Δh = 0, meaning data points with equal hue h values gather into one set. No data points satisfy this condition, so no clustering occurs and the entropy load I_0 = 0.

S302. Fix Δx = 1 and cluster at Δh = 1, meaning data points whose hue h values differ by 1 gather into one set. The resulting first clustering result contains eight data sets, as shown in Fig. 9; n = 8, N = 11. The entropy load $I_1 = -\frac{1}{8}\sum_{i=1}^{8} p_i \log_2 p_i$ is then computed (its numeric value is given in the original as an image).

S303. Fix Δx = 1 and, from the result at Δh = 1, cluster at Δh = 2, meaning data points whose hue h values differ by 2 gather into one set. The resulting first clustering result contains four data sets, as shown in Fig. 10; n = 4, N = 11. The entropy load I_2 is computed in the same way.

S304. Fix Δx = 1 and, from the result at Δh = 2, cluster at Δh = 3, meaning data points whose hue h values differ by 3 gather into one set. The resulting first clustering result contains two data sets, as shown in Fig. 11; n = 2, N = 11. The entropy load I_3 is computed in the same way.

S305. Fix Δx = 1 and, from the result at Δh = 3, cluster at Δh = 4, meaning data points whose hue h values differ by 4 gather into one set. The resulting first clustering result contains one data set, as shown in Fig. 12; n = 1, N = 11, so the entropy load $I_4 = -\frac{1}{1}(1 \cdot \log_2 1) = 0$.

In Embodiment 3, clustering with Δh = 5 to 7 gives the same result and the same entropy load as Δh = 4 in step S305, so it is not repeated here.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from it:

$I_{max} = \max\{I_0, I_1, I_2, I_3, I_4\} = I_2$,

where I_max is the largest average amount of information carried by the clustering results obtained under the clustering condition. I_2 being the largest means the clustering obtained with "fix Δx = 1, Δh = 2" carries the most information: a computer system with a given amount of storage can store the largest amount of information and express it with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_2 is the one we want.

Embodiment 3 illustrates the clustering of two-dimensional data only with the difference Δh between hue h values and the difference Δx between x coordinates; in substance, the data clustering method, system and storage medium of the invention apply to clustering any two-dimensional data.
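A hedged sketch of the two-dimensional case of this embodiment (fix Δx, traverse Δh) using a small union-find; the pairwise neighbor rule is one illustrative reading of fixing the non-focused dimension:

```python
def cluster_2d(points, dx, dh):
    """points: list of (x, h) pairs. Two points join one set when both
    |x1 - x2| <= dx and |h1 - h2| <= dh; sets grow transitively, so
    clustering at a larger dh builds on the result at a smaller dh."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, h1), (x2, h2) = points[i], points[j]
            if abs(x1 - x2) <= dx and abs(h1 - h2) <= dh:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(points):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())
```

Traversing Δh then reuses the earlier scaffolding: for each item of the Δh difference sequence, compute entropy_load(cluster_2d(points, 1, dh)) and keep the maximum.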
Embodiment 4

Embodiment 4 illustrates the data clustering method of the invention with the field of image segmentation; the application scenario of image segmentation is shown in Fig. 13.

As shown in Fig. 14, Embodiment 4 works on one image and clusters the data of its 156 ordered data points, their hue h values, x coordinates and y coordinates, by the following method, specifically:
(1) Determine the data clustering condition, specifically:

In Embodiment 4 the similarity between data is influenced by factors of three dimensions only: the difference Δh between hue h values, the difference Δx between x coordinates, and the difference Δy between y coordinates. The clustering condition of Embodiment 4 is therefore to cluster the data according to the combination relation of Δh, Δx and Δy:

(v_1, v_2, v_3),

v_1 = Δh,

v_2 = Δx,

v_3 = Δy.

The focused dimension data of Embodiment 4 is Δh and the non-focused are Δx and Δy, so the combination relation is: fix Δx and Δy and traverse the data Δh for clustering. For Δh:

v_1 = Δh = {a_m1} = a_11, a_21, …, a_m1 = 0, 1, 2, 3, 4, 5, 158, 159, 160, 161, 162, 163,

where v_1 is the data of the first dimension, Δh; the differences between the data v_1 = Δh, sorted in ascending order, form the sequence {a_m1}; a_m1 is the m-th item of {a_m1}; a_m1 = 163 means the largest difference between the h values is 163, and a_11 = 0 means the smallest is 0. Δh is the focused dimension data of Embodiment 4, so Δh traverses every item of {a_m1} in order, clustering with each next item from the clustering result obtained with the previous item.

For Δx:

v_2 = Δx = {a_m2} = a_12, a_22, …, a_m2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,

where v_2 is the data of the second dimension, Δx; the differences between the data v_2 = Δx, sorted in ascending order, form the sequence {a_m2}; a_m2 = 11 means the largest difference between the x coordinates is 11, and a_12 = 1 means the smallest is 1. Δx is not focused on, so Δx takes at least one arbitrary item of {a_m2}; Embodiment 4 takes the first item, hence Δx = 1.

For Δy:

v_3 = Δy = {a_m3} = a_13, a_23, …, a_m3 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,

where v_3 is the data of the third dimension, Δy; the differences between the data v_3 = Δy, sorted in ascending order, form the sequence {a_m3}; a_m3 = 12 means the largest difference between the y coordinates is 12, and a_13 = 1 means the smallest is 1. Δy is not focused on, so Δy takes at least one arbitrary item of {a_m3}; Embodiment 4 takes the first item, hence Δy = 1.

The clustering condition of Embodiment 4 is therefore: fix Δx = 1, fix Δy = 1, and traverse, in order, every item of the sequence of differences between the data points' hue h values, Δh = {a_m1} = 0, 1, 2, 3, 4, 5, 158, 159, 160, 161, 162, 163, clustering with each next item from the clustering result obtained with the previous item.

As Δh traverses every item of the sequence {a_m1} in order, the clustering result obtained with Δh = 163, the twelfth and last item, is placed at the twelfth, topmost level of the clustering process tree and is its root node; the clustering result obtained with Δh = 0 is placed at the first level and comprises the leaf nodes, whose degree is zero. A set at the second level serves as a parent node, and all the first-level elements composing that set are its child nodes, thereby gradually forming the clustering process tree shown in Fig. 15. When Δh traverses every item of {a_m1} in order, this embodies the clustering process tree clustering and discriminating the data continuously, from coarse to fine, according to the granularity of the focused dimension data Δh; it intuitively shows single image data points gradually clustering into distinguishable objects and, clustering further, into the whole image, so that all clustering information of the data can be traced back to its source.
(2) Cluster the data according to the clustering condition and compute the entropy load after each pass:

$I^{a_{m1}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_m1 is the m-th item of the sequence {a_m1} obtained by sorting the differences between the first-dimension data v_1 = Δh in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_m1} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_m1; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

Embodiment 4 takes a = 2; the entropy load thus computed is in bits, the binary unit of measure of the average amount of information, which makes a = 2 the more suitable choice.
Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_m1} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.
Embodiment 4 clusters the data according to the clustering condition as follows:

S401. Fix Δx = 1 and Δy = 1 and cluster at Δh = 0, meaning data points with equal hue h values gather into one set. The resulting first clustering result contains eighteen data sets, as shown in Fig. 16; n = 18, N = 156. The entropy load $I_0 = -\frac{1}{18}\sum_{i=1}^{18} p_i \log_2 p_i$ is then computed (its numeric value is given in the original as an image).

S402. Fix Δx = 1 and Δy = 1 and, from the result at Δh = 0, cluster at Δh = 1, meaning data points whose hue h values differ by 1 gather into one set. The resulting first clustering result contains fifteen data sets, as shown in Fig. 17; n = 15, N = 156. The entropy load I_1 is computed in the same way.

S403. Fix Δx = 1 and Δy = 1 and, from the result at Δh = 1, cluster at Δh = 2. The resulting first clustering result contains nine data sets, as shown in Fig. 18; n = 9, N = 156. The entropy load I_2 is computed in the same way.

S404. Fix Δx = 1 and Δy = 1 and, from the result at Δh = 2, cluster at Δh = 3. The resulting first clustering result contains six data sets, as shown in Fig. 19; n = 6, N = 156. The entropy load I_3 is computed in the same way.

S405. Fix Δx = 1 and Δy = 1 and, from the result at Δh = 3, cluster at Δh = 4. The resulting first clustering result contains four data sets, as shown in Fig. 20; n = 4, N = 156. The entropy load I_4 is computed in the same way.

S406. Fix Δx = 1 and Δy = 1 and, from the result at Δh = 4, cluster at Δh = 158, meaning data points whose hue h values differ by 158 gather into one set. The resulting first clustering result contains one data set: after clustering, the image background and the other sets on the image form a single set corresponding to the whole image, as shown in Fig. 14; n = 1, N = 156, so the entropy load $I_{158} = -\frac{1}{1}(1 \cdot \log_2 1) = 0$.

In this embodiment, clustering with Δh = 5 gives the same result as Δh = 4 in step S405, and clustering with Δh = 159 to 163 gives the same result as Δh = 158 in step S406, so they are not repeated here.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from it:

$I_{max} = \max\{I_0, I_1, I_2, I_3, I_4, I_{158}\} = I_4$,

where I_max is the largest average amount of information carried by the clustering results obtained under the clustering condition. I_4 being the largest means the clustering obtained with "fix Δx = 1, fix Δy = 1, Δh = 4" carries the largest average amount of information: a computer system with a given amount of storage can store the largest amount of information and express it with the highest efficiency, so the data clustering result corresponding to the maximum entropy load I_4 is the one we want.

Embodiment 4 illustrates the clustering of three-dimensional data only with the differences Δh between hue h values, Δx between x coordinates and Δy between y coordinates; in substance, the data clustering method, system and storage medium of the invention apply to clustering any data of three or more dimensions. Moreover, Fig. 20 shows that after clustering at Δh = 4 the image has gathered into four clearly distinguishable objects, the four sets "helmet", "cup", "gloves" and "image background", thereby accurately achieving image segmentation.
Steps (1), (2) and (3) complete one pass of clustering; the corresponding figures show that each pass yields at least one first clustering result, each containing at least one set. As shown in Fig. 20, the clustering result corresponding to the maximum entropy load I_4 consists of four sets: "helmet", "cup", "gloves" and "image background". Suppose Embodiment 4 needs the subdivision information of the "cup" set's data and wants the resulting entropy load to be maximal. The data clustering condition is then re-determined, and steps (1), (2) and (3) are repeated under the new condition to further cluster the "cup" set's data and obtain a new maximum entropy load. The clustering result corresponding to the new maximum entropy load comprises two sub-sets, "lid" and "body", whose information is the subdivision information of the "cup" set's data. Taking the "cup" set as parent node and its sub-sets "lid" and "body" as child nodes gradually forms the information structure tree. The entropy load at every branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and express it with the highest efficiency. Specifically:

For the data values of the 6 ordered data points in the "cup" set, their hue h values, x coordinates and y coordinates, Embodiment 4 determines a new data clustering condition and then repeats steps (1), (2) and (3) to cluster further, specifically:
(1) Determine the new data clustering condition, specifically:

The similarity between these 6 ordered data points is influenced by factors of two dimensions only: the difference Δh between hue h values and the difference Δy between y coordinates. The condition for clustering the data of the "cup" set is therefore to cluster according to the combination relation of Δh and Δy:

(v_1, v_2),

v_1 = Δh,

v_2 = Δy.

The dimension data Embodiment 4 focuses on when clustering the "cup" set is Δh and the non-focused dimension is Δy, so the combination relation is: fix Δy and traverse the data Δh for clustering. For Δh:

v_1 = Δh = {a_m1} = a_11, a_21, …, a_m1 = 0, 2, 4,

where v_1 is the data of the first dimension, Δh; the differences between the data v_1 = Δh, sorted in ascending order, form the sequence {a_m1}; a_m1 = 4 means the largest difference between the h values is 4, and a_11 = 0 means the smallest is 0. Δh is the focused dimension data for clustering the "cup" set, so Δh traverses every item of {a_m1} in order, clustering with each next item from the clustering result obtained with the previous item.

For Δy:

v_2 = Δy = {a_m2} = a_12, a_22, …, a_m2 = 1, 2, 3, 4, 5,

where v_2 is the data of the second dimension, Δy; the differences between the data v_2 = Δy, sorted in ascending order, form the sequence {a_m2}; a_m2 = 5 means the largest difference between the y coordinates is 5, and a_12 = 1 means the smallest is 1. Δy is dimension data not focused on when clustering the "cup" set, so Δy takes at least one arbitrary item of {a_m2}; Embodiment 4 takes the first item for the "cup" set, hence Δy = 1.

The condition for clustering the data of the "cup" set is therefore: fix Δy = 1 and traverse, in order, every item of the sequence of differences between the data points' hue h values, Δh = {a_m1} = 0, 2, 4, clustering with each next item from the clustering result obtained with the previous item.
(2) Cluster the data according to the clustering condition and compute the entropy load after each pass:

$I^{a_{m1}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_m1 is the m-th item of the sequence {a_m1} obtained by sorting the differences between the first-dimension data v_1 = Δh in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_m1} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_m1; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

Embodiment 4 takes a = 2; the entropy load thus computed is in bits, the binary unit of measure of the average amount of information, which makes a = 2 the more suitable choice.
Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_m1} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.
Embodiment 4 further clusters the data of the "cup" set under the new clustering condition as follows:

S407. Fix Δy = 1 and cluster at Δh = 0, meaning data points of the "cup" set with equal hue h values gather into one set. The resulting first clustering result contains five data sets, as shown in Fig. 21; n = 5, N = 6. The entropy load $I_0 = -\frac{1}{5}\sum_{i=1}^{5} p_i \log_2 p_i$ is then computed (its numeric value is given in the original as an image).

S408. Fix Δy = 1 and, from the result at Δh = 0, cluster at Δh = 2, meaning data points of the "cup" set whose hue h values differ by 2 gather into one set. The resulting first clustering result contains two data sets, as shown in Fig. 22; n = 2, N = 6. The entropy load I_2 is computed in the same way.

S409. Fix Δy = 1 and, from the result at Δh = 2, cluster at Δh = 4, meaning data points of the "cup" set whose hue h values differ by 4 gather into one set. The resulting first clustering result contains one data set, as shown in Fig. 23; n = 1, N = 6, so the entropy load $I_4 = -\frac{1}{1}(1 \cdot \log_2 1) = 0$.
(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from it; I_max is the largest entropy load obtainable from the clustering results at the end of each pass:

$I_{max} = \max\{I_0, I_2, I_4\} = I_2$,

where the maximum entropy load I_max is the largest average amount of information carried by the clustering results obtained under the clustering condition. I_2 being the largest means clustering the "cup" set's data with "fix Δy = 1, Δh = 2" carries the most information: a computer system with a given amount of storage can store the largest amount of information and express it with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_2 is the one we want.

First, Fig. 22 shows that further clustering under the new clustering condition yields the subdivision information of the "cup" set's data, "lid" and "body"; for a computer system with a given amount of storage this stores the largest amount of information and expresses it with the highest efficiency, so the clustering result corresponding to the maximum entropy load I_2 is the desired subdivision information of the "cup" set's data.

Second, taking the "cup" set as parent node and its sub-sets "lid" and "body" as child nodes gradually forms the information structure tree shown in Fig. 24. The tree embodies the original image data being clustered at the coarse granularity of the Δh value into the "helmet", "cup" and "gloves" sets, and the "cup" set's data being further clustered and discriminated at fine granularity. The entropy load at every branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and therefore express information with the highest efficiency.

Finally, in Embodiment 4, clustering further from the "cup" set's data evidently separates "lid" from "body", as shown in Fig. 24. Compared with the "cup", however, "lid" and "body" are only local data of the whole image, and local data give incomplete, inaccurate clustering information about the whole image. We therefore want the overall clustering of the whole image first, and then further cluster the overall result to obtain local subdivision information, as shown in Fig. 20. The invention thus clusters from the data as a whole to obtain at least one first clustering result and derives the data clustering result from each first clustering result, achieving the integrity of data clustering; and, based on the at least one first clustering result, re-clusters it to obtain its local subdivision information, reconciling and unifying the integrity and locality of data clustering, so the clustering results obtained are more complete and accurate.
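Expressed with the earlier sketches, the same whole-then-local order could look as follows; the set picked out as the "cup" is hypothetical, for illustration only:

```python
# Cluster the whole data first, then refine one resulting set locally.
i_max, dh, whole = best_clustering(h_values)   # overall first clustering
cup = whole[0]                                 # stand-in for the "cup" set
subtree = info_structure_tree(cup)             # children play "lid"/"body"
```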
The above four embodiments cluster only x coordinates, y coordinates and hue h values as exemplary data to illustrate the concrete implementation of the invention; other kinds of data and the various combination relations of dimension data of all kinds are not enumerated here, because the invention involves no dependence on or special processing of any particular data and is universally applicable to clustering any data.
Embodiment 5

Embodiment 5 of the invention provides a data clustering system comprising a memory, a processor, and a program stored in the memory and runnable on the processor; when executed by the processor, the program implements a data clustering method comprising the following steps:

(1) Determine the data clustering condition, specifically:

The data clustering condition is determined according to the similarity between data, which is usually influenced jointly by factors of several dimensions; the clustering condition of Embodiment 5 therefore clusters the data according to the following combination relation of data of different dimensions:

(v_1, v_2, v_3, …, v_j),

v_j = {a_mj} = a_1j, a_2j, …, a_mj.

The combination relation is decided by the dimensions the clustering focuses on and comprises: fixing the dimension data not focused on, and traversing combinations of the dimension data focused on.

Here v_j is the data of the j-th dimension; the differences between the data v_j, sorted in ascending order, form the sequence {a_mj}; a_mj, the m-th item of {a_mj}, is the largest difference between the data v_j, and a_1j the smallest. When v_j is dimension data the clustering does not focus on, v_j takes at least one arbitrary item of {a_mj}; when v_j is dimension data the clustering focuses on, v_j traverses every item of {a_mj} in order, and clustering with the next item proceeds from the first clustering result obtained with the previous item.

(2) Cluster the data according to the data clustering condition to obtain at least one first clustering result, each containing at least one data set. Compute the entropy load corresponding to each first clustering result, the entropy load expressing the average amount of information carried by the corresponding first clustering result:

$I^{a_{mj}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_mj is the m-th item of the sequence {a_mj} obtained by sorting the differences between the j-th-dimension data v_j in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_mj} expresses the average amount of information carried by the first clustering result obtained by clustering with a_mj; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

In Embodiment 5 the preferred value is a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.

Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_mj} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.

(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from it:

$I_{max} = \max\{I^{a_{1j}}, I^{a_{2j}}, …, I^{a_{mj}}\}$,

where I_max is the largest average amount of information carried by the clustering results obtained under the clustering condition; a computer system with a given amount of storage can then store the largest amount of information and express it with the highest efficiency, so the clustering result corresponding to I_max is the one we want.

After steps (1), (2) and (3) complete one pass of clustering, the data clustering method of Embodiment 5 may further comprise the step of forming an information structure tree, specifically comprising:

re-determining the data clustering condition and executing the clustering method under the new condition to further cluster a certain data set in the data clustering result, obtaining a new maximum entropy load; the clustering result corresponding to the new maximum entropy load comprises several sub-sets, whose information is the subdivision information of that data set; the data set is taken as the parent node and the sub-sets as its child nodes, thereby gradually forming the information structure tree.

The entropy load at every branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and therefore express information with the highest efficiency.

The data clustering method of Embodiment 5 may further comprise the step of forming a clustering process tree, specifically comprising:

when v_j traverses every item of the sequence {a_mj} in order for clustering, the first clustering result obtained with v_j = a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with v_j = a_mj is the root node, and the first clustering result obtained with v_j = a_1j comprises the leaf nodes, whose degree is zero; each set at level q serves as a parent node, and all the elements formed into that set by the clustering at level q−1 are its child nodes, thereby gradually forming the clustering process tree. When v_j traverses every item of {a_mj} in order, this embodies a process in which the clustering process tree clusters and discriminates the data continuously, from coarse to fine, according to the granularity of the focused dimension data v_j; the tree intuitively shows all the information of the gradual clustering of a single data point, so that all clustering information of the data can be traced back to its source.
Embodiment 6

Embodiment 6 of the invention further provides a computer-readable storage medium storing at least one program executable by at least one processor; when executed by the at least one processor, the at least one program implements a data clustering method comprising the following steps:

(1) Determine the data clustering condition, specifically:

The data clustering condition is determined according to the similarity between data, which is usually influenced jointly by factors of several dimensions; the clustering condition of Embodiment 6 therefore clusters the data according to the following combination relation of data of different dimensions:

(v_1, v_2, v_3, …, v_j),

v_j = {a_mj} = a_1j, a_2j, …, a_mj.

The combination relation is decided by the dimensions the clustering focuses on and comprises: fixing the dimension data not focused on, and traversing combinations of the dimension data focused on.

Here v_j is the data of the j-th dimension; the differences between the data v_j, sorted in ascending order, form the sequence {a_mj}; a_mj, the m-th item of {a_mj}, is the largest difference between the data v_j, and a_1j the smallest. When v_j is dimension data the clustering does not focus on, v_j takes at least one arbitrary item of {a_mj}; when v_j is dimension data the clustering focuses on, v_j traverses every item of {a_mj} in order, and clustering with the next item proceeds from the first clustering result obtained with the previous item.

(2) Cluster the data according to the data clustering condition to obtain at least one first clustering result, each containing at least one data set. Compute the entropy load corresponding to each first clustering result, the entropy load expressing the average amount of information carried by the corresponding first clustering result:

$I^{a_{mj}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,

where a_mj is the m-th item of the sequence {a_mj} obtained by sorting the differences between the j-th-dimension data v_j in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_mj} expresses the average amount of information carried by the first clustering result obtained by clustering with a_mj; n is the number of data sets that result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of k_i to N.

In Embodiment 6 the preferred value is a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.

Each clustering yields several data sets, each corresponding to a data category. When a computer system stores a clustering result, every data category has a fixed-length code; the average amount of information each code can store is fixed, and so is its efficiency of information expression. We want a fixed-length code to store the largest average amount of information, so that the efficiency of information expression is highest.

The entropy load I^{a_mj} expresses the average amount of information carried by the clustering result of this pass. The larger it is, the larger the average amount of information of each data category in this result, hence the more information each category's code can store and the more efficiently it expresses information; a computer system with a given amount of storage can then store more information and express it more efficiently.

(3) Take the maximum entropy load I_max among the entropy loads of the first clustering results computed in step (2), and obtain the data clustering result from it:

$I_{max} = \max\{I^{a_{1j}}, I^{a_{2j}}, …, I^{a_{mj}}\}$,

where I_max is the largest average amount of information carried by the clustering results obtained under the clustering condition; a computer system with a given amount of storage can then store the largest amount of information and express it with the highest efficiency, so the clustering result corresponding to I_max is the one we want.

After steps (1), (2) and (3) complete one pass of clustering, the data clustering method of Embodiment 6 may further comprise the step of forming an information structure tree, specifically comprising:

re-determining the data clustering condition and executing the clustering method under the new condition to further cluster a certain data set in the data clustering result, obtaining a new maximum entropy load; the clustering result corresponding to the new maximum entropy load comprises several sub-sets, whose information is the subdivision information of that data set; the data set is taken as the parent node and the sub-sets as its child nodes, thereby gradually forming the information structure tree.

The entropy load at every branch of the information structure tree is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and therefore express information with the highest efficiency.

The data clustering method of Embodiment 6 may further comprise the step of forming a clustering process tree, specifically comprising:

when v_j traverses every item of the sequence {a_mj} in order for clustering, the first clustering result obtained with v_j = a_qj is placed at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with v_j = a_mj is the root node, and the first clustering result obtained with v_j = a_1j comprises the leaf nodes, whose degree is zero; each set at level q serves as a parent node, and all the elements formed into that set by the clustering at level q−1 are its child nodes, thereby gradually forming the clustering process tree. When v_j traverses every item of {a_mj} in order, this embodies a process in which the clustering process tree clusters and discriminates the data continuously, from coarse to fine, according to the granularity of the focused dimension data v_j; the tree intuitively shows all the information of the gradual clustering of a single data point, so that all clustering information of the data can be traced back to its source.
In summary, the data clustering method, system and storage medium provided by the invention cluster from the data as a whole according to the data clustering condition to obtain at least one first clustering result and take the data clustering result from the first clustering result carrying the largest average amount of information, achieving the integrity of data clustering, so the clustering results are more complete and accurate; the clustering process neither depends on nor specially processes any particular data and places no restriction on data type, so the method is universally applicable to clustering any data and is highly practical; the maximum carried average amount of information serves as the basis for determining the clustering result, so a computer system with a given amount of storage can store more information, improving the efficiency of information expression.

The data clustering method, system and storage medium provided by the invention re-cluster the at least one first clustering result to obtain its local subdivision information, reconciling and unifying the integrity and locality of data clustering.

The data clustering method, system and storage medium provided by the invention form an information structure tree in which the entropy load at every branch is the maximum entropy load under the given clustering condition, so a computer system with a given amount of storage can store the largest amount of information and therefore express information with the highest efficiency.

The data clustering method, system and storage medium provided by the invention also form a clustering process tree during clustering, which clusters and discriminates the data continuously, from coarse to fine, according to the granularity of the focused dimension data; it intuitively shows all the information of the gradual clustering of a single data point, so that all clustering information of the data can be traced back to its source.

Claims (11)

  1. A data clustering method, characterized in that the method comprises:
    determining a data clustering condition;
    clustering the data according to the data clustering condition to obtain at least one first clustering result, each first clustering result of the at least one first clustering result containing at least one data set; computing the entropy load corresponding to each first clustering result, the entropy load expressing the average amount of information carried by the corresponding first clustering result;
    taking the maximum entropy load among the entropy loads corresponding to the first clustering results, the first clustering result corresponding to the maximum entropy load being the data clustering result.
  2. The data clustering method according to claim 1, characterized in that the data clustering condition is determined according to the similarity between data.
  3. The data clustering method according to claim 1, characterized in that clustering the data according to the data clustering condition comprises: clustering the data according to the combination relation of data of different dimensions.
  4. The data clustering method according to claim 3, characterized in that the combination relation of data of different dimensions is decided by the dimensions the clustering focuses on, and comprises: fixing the dimension data not focused on, and traversing combinations of the dimension data focused on.
  5. The data clustering method according to claim 3, characterized in that clustering the data according to the combination relation of data of different dimensions is specifically:
    (v_1, v_2, v_3, …, v_j),
    v_j = {a_mj} = a_1j, a_2j, …, a_mj,
    where v_j is the data of the j-th dimension; the differences between the data v_j, sorted in ascending order, form the sequence {a_mj}; a_mj, the m-th item of {a_mj}, is the largest difference between the data v_j, and a_1j the smallest; when v_j is dimension data the clustering does not focus on, v_j takes at least one arbitrary item of {a_mj}; when v_j is dimension data the clustering focuses on, v_j traverses every item of {a_mj} in order, and clustering with the next item proceeds from the first clustering result obtained with the previous item.
  6. The data clustering method according to claim 1, characterized in that the entropy load is computed as:
    $I^{a_{mj}} = -\frac{1}{n}\sum_{i=1}^{n} p_i \log_a p_i$, with $p_i = \frac{k_i}{N}$,
    where a_mj is the m-th item of the sequence {a_mj} formed by sorting the differences between the j-th-dimension data v_j in ascending order; a is the base of the logarithm, a > 1; the entropy load I^{a_mj} expresses the average amount of information carried by the first clustering result obtained by clustering with the m-th item a_mj of {a_mj}; n is the number of data sets that first clustering result contains; k_i is the number of elements in the i-th data set; N is the total number of data; and p_i is the ratio of the number of elements in the i-th data set to the total number of data.
  7. The data clustering method according to claim 6, characterized in that a = 2, and the entropy load thus computed is in bits, the binary unit of measure of the average amount of information.
  8. The data clustering method according to claim 1, characterized in that the method comprises the step of forming an information structure tree, comprising:
    re-determining the data clustering condition, and executing the clustering method under the new condition to further cluster a certain data set in the data clustering result, obtaining a new maximum entropy load; the clustering result corresponding to the new maximum entropy load comprises several sub-sets whose information is the subdivision information of that data set; taking that data set as a parent node and the sub-sets as child nodes, thereby gradually forming the information structure tree.
  9. The data clustering method according to claim 5, characterized in that the method comprises the step of forming a clustering process tree, comprising:
    when v_j traverses every item of the sequence {a_mj} in order for clustering, placing the first clustering result obtained with v_j = a_qj at the q-th level of the clustering process tree, 1 ≤ q ≤ m; the first clustering result obtained with v_j = a_mj is the root node of the clustering process tree, and the first clustering result obtained with v_j = a_1j comprises its leaf nodes, whose degree is zero; each set at level q serves as a parent node, and all the elements formed into that set by the clustering at level q−1 are its child nodes, thereby gradually forming the clustering process tree.
  10. A data clustering system, characterized in that the system comprises a memory, a processor, and a program stored in the memory and runnable on the processor, the program implementing the steps of the data clustering method according to any one of claims 1–9 when executed by the processor.
  11. A computer-readable storage medium, characterized in that the storage medium stores at least one program executable by at least one processor, the at least one program implementing the steps of the data clustering method according to any one of claims 1–9 when executed by the at least one processor.
PCT/CN2021/123007 2021-09-30 2021-10-11 Data clustering method, system and storage medium WO2023050461A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111156414.X 2021-09-30
CN202111156414.XA CN113806610A (zh) 2021-09-30 2021-09-30 Data clustering method, system and storage medium

Publications (1)

Publication Number Publication Date
WO2023050461A1 true WO2023050461A1 (zh) 2023-04-06

Family

ID=78939055

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123007 WO2023050461A1 (zh) 2021-09-30 2021-10-11 Data clustering method, system and storage medium

Country Status (2)

Country Link
CN (1) CN113806610A (zh)
WO (1) WO2023050461A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140108653A1 (en) * 2012-09-25 2014-04-17 Huawei Technologies Co., Ltd. Man-Machine Interaction Data Processing Method and Apparatus
CN107909478A (zh) * 2017-11-27 2018-04-13 苏州点对点信息科技有限公司 FOF fund portfolio system and method based on social network clustering and an information gain entropy index
CN109657695A (zh) * 2018-11-05 2019-04-19 中国电子科技集团公司电子科学研究院 Fuzzy partition clustering method and device based on deterministic annealing
CN111539443A (zh) * 2020-01-22 2020-08-14 北京小米松果电子有限公司 Image recognition model training method and device, and storage medium


Also Published As

Publication number Publication date
CN113806610A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
TWI821671B (zh) Method and device for locating text regions
CN104573130B (zh) Entity resolution method and device based on crowd computing
CN107209860A (zh) Optimizing multi-class image classification using blocked features
CN105589938A (zh) FPGA-based image retrieval system and retrieval method
CN109886334B (zh) Privacy-preserving shared-nearest-neighbor density peak clustering method
CN101710334A (zh) Large-scale image library retrieval method based on image hashing
CN104239553A (zh) Entity recognition method based on the Map-Reduce framework
Hossain et al. Scatter/gather clustering: Flexibly incorporating user feedback to steer clustering results
WO2006055894A2 (en) Data mining of very large spatial dataset
US11822595B2 (en) Incremental agglomerative clustering of digital images
WO2015001416A1 (en) Multi-dimensional data clustering
JP6173754B2 (ja) Image retrieval system, image retrieval device and image retrieval method
Zafari et al. Segmentation of partially overlapping convex objects using branch and bound algorithm
CN111368125B (zh) Distance metric method for image retrieval
Torres-Tramón et al. Topic detection in Twitter using topology data analysis
KR20150112832A (ko) Calculation program, calculation device and calculation method
WO2023050461A1 (zh) Data clustering method, system and storage medium
Dhoot et al. Efficient Dimensionality Reduction for Big Data Using Clustering Technique
CN108268533B (zh) Image feature matching method for image retrieval
Akhtar et al. Big data mining based on computational intelligence and fuzzy clustering
JP4125951B2 (ja) Automatic text classification method and device, program and recording medium
Bi-level classification of color indexed image histograms for content based image retrieval
WO2023134000A1 (zh) Fast detection method for straight lines, planes and hyperplanes in multi-dimensional space
Pakhira et al. Computing approximate value of the pbm index for counting number of clusters using genetic algorithm
Doulamis et al. 3D modelling of cultural heritage objects from photos posted over the Twitter

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21959025

Country of ref document: EP

Kind code of ref document: A1